Course Outline

list High School / Statistics and Data Science II (XCD)

Book
  • College / Advanced Statistics and Data Science (ABCD)
  • College / Statistics and Data Science (ABC)
  • High School / Advanced Statistics and Data Science I (ABC)
  • High School / Statistics and Data Science I (AB)
  • High School / Statistics and Data Science II (XCD)

1.10 Visualizing & Summarizing Categorical Data in R

So far we have focused on examining the distributions of quantitative variables. For categorical variables we will use different methods.

Consider a variable such as Foundation which tells us what kind of foundation a house has (e.g., brick and tile, poured concrete, or cinder blocks).

bar chart of the three different types of foundations; most homes have PouredConcrete foundation

Although we might be tempted to call this a normal distribution, isn’t it a little strange to say the center of this distribution is Poured Concrete? What is the range of this distribution? Is it CinderBlock minus Brick&Tile? These descriptions of this distribution don’t seem to make sense!

We have thus far used histograms to examine the distribution of a variable. But histograms aren’t appropriate for categorical variables. And if R knows a variable is categorical (if, for example, you have specified it as a factor), it won’t even run the histogram, and will give you an error message instead.

Bar Graphs

When a variable is categorical you can visualize the distribution with a bar graph. It looks like a histogram, but it’s not. There is no such thing as bins, for example, in a bar graph. The number of bars in a bar graph will always equal the number of categories in your variable.

Let’s take a look at some categorical variables from the Ames data frame: Neighborhood and GarageType. Both of these have been specified as factors and the levels have been labeled already.

Here’s some code to make a bar graph in R:

gf_bar(~ Neighborhood, data = Ames)

bar chart of neighborhood, more homes in CollegeCreek than in OldTown

Use the code window below to create a bar graph of GarageType.

require(coursekata) # Create a bar graph of GarageType in the Ames data frame. Use the gf_bar() function gf_bar(~ GarageType, data = Ames) ex() %>% check_function("gf_bar") %>% { check_arg(., "data") %>% check_equal(incorrect_msg="Don't forget to set `data = Ames`") check_result(.) %>% check_equal(incorrect_msg = "Did you use `~GarageType`?") }
CK Code: X1_Code_VisualizingCat_01

bar graph of GarageType, most homes have attached garages but some have detached and a few have none

You can change the width of these bars by adding the argument width and setting it to some number between 0 and 1. You can also use the arguments color and fill to change the colors of the bars. Try playing with the width and colors here.

require(coursekata) # Add arguments color and fill and width to this bar graph gf_bar(~ GarageType, data = Ames) # any values of arguments are acceptable gf_bar(~ GarageType, color = "yellow", fill = "navyblue", width = .4, data = Ames) ex() %>% { check_function(., "gf_bar", index = 1) %>% { check_arg(., "color") check_arg(., "fill") check_arg(., "width") } }
CK Code: X1_Code_VisualizingCat_02

gf_props() or gf_percents(). Sometimes, instead of counts, we’d like to see the relative proportions of homes with certain characteristics. For example, from gf_bar() we can see that there are a bit fewer than 150 homes in the College Creek neighborhood versus about 50 homes in Old Town. To show proportions instead of counts on the y-axis, use gf_props() instead of gf_histogram() in the code block below.

require(coursekata) # change this to a bar chart with proportions # don’t change the fill color gf_bar(~ GarageType, data = Ames, fill = "royalblue") gf_props(~ GarageType, data = Ames, fill = "royalblue") ex() %>% check_function("gf_props") %>% check_result() %>% check_equal()
CK Code: X1_Code_VisualizingCat_03

a bar graph showing the proportion of homes in CollegeCreek and OldTown

Responses