1.11 Visualizing & Summarizing Categorical Variables

So far we have focused on examining the distributions of quantitative variables. For categorical variables we will use different methods.

Consider a variable such as Foundation which tells us what kind of foundation a house has (e.g., brick and tile, poured concrete, or cinder blocks).

bar chart of the three different types of foundations; most homes have PouredConcrete foundation

Although we might be tempted to call this a normal distribution, isn’t it a little strange to say the center of this distribution is PouredConcrete? What is the range of this distribution? Is it CinderBlock minus Brick&Tile? These descriptions don’t make sense here, because the categories are labels with no numeric value and no inherent order.

We have thus far used histograms to examine the distribution of a variable. But histograms aren’t appropriate for categorical variables. And if R knows a variable is categorical (if, for example, you have specified it as a factor), it won’t even run the histogram, and will give you an error message instead.
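To see why center and range do not carry over to a categorical variable, here is a minimal sketch in base R. The foundation vector is a made-up toy example (it is not taken from the Ames data frame), and the error is captured with tryCatch() so the script keeps running.

```r
# Toy data invented for illustration; NOT the Ames data frame
foundation <- factor(c("PouredConcrete", "CinderBlock", "PouredConcrete",
                       "Brick&Tile", "PouredConcrete"))

# range() is not defined for an unordered factor, so R signals an error.
# tryCatch() captures the error message instead of stopping the script.
range_result <- tryCatch(
  range(foundation),
  error = function(e) conditionMessage(e)
)
print(range_result)

# What we can still report is the most common category (the mode)
most_common <- names(which.max(table(foundation)))
print(most_common)
```

The most common category is still a meaningful summary, which is one reason counting how many cases fall in each category is the natural starting point for a categorical variable.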

Bar Graphs

When a variable is categorical you can visualize the distribution with a bar graph. It looks like a histogram, but it’s not. There is no such thing as bins, for example, in a bar graph. The number of bars in a bar graph will always equal the number of categories in your variable.
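As a rough illustration of why there are no bins to choose, the sketch below counts the cases in each category with base R. The garage vector is hypothetical toy data, not the Ames data frame; the counts it produces are what the heights of the bars in a bar graph would represent.

```r
# Toy data invented for illustration; NOT the Ames data frame
garage <- factor(c("Attached", "Detached", "Attached",
                   "None", "Attached", "Detached"))

# table() counts how many cases fall in each category
counts <- table(garage)
print(counts)

# One count per category, so a bar graph would have one bar per category
n_categories <- nlevels(garage)
n_bars <- length(counts)
print(c(categories = n_categories, bars = n_bars))
```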

Let’s take a look at some categorical variables from the Ames data frame: Neighborhood and GarageType. Both of these have been specified as factors and the levels have been labeled already.

Here’s some code to make a bar graph in R:

gf_bar(~ Neighborhood, data = Ames)

bar chart of neighborhood, more homes in CollegeCreek than in OldTown

Use the code window below to create a bar graph of GarageType.

require(coursekata)

# Create a bar graph of GarageType in the Ames data frame. Use the gf_bar() function
gf_bar(~ GarageType, data = Ames)

bar graph of GarageType, most homes have attached garages but some have detached and a few have none

You can change the width of these bars by adding the argument width and setting it to some number between 0 and 1. You can also use the arguments color and fill to change the colors of the bars. Try playing with the width and colors here.

require(coursekata)

# Add arguments color and fill and width to this bar graph
gf_bar(~ GarageType, data = Ames)

# any values of arguments are acceptable
gf_bar(~ GarageType, color = "yellow", fill = "navyblue", width = .4, data = Ames)

gf_props() or gf_percents(). Sometimes, instead of counts, we’d like to see the relative proportions of homes with certain characteristics. For example, from the gf_bar() graph of Neighborhood we can see that there are a bit fewer than 150 homes in the College Creek neighborhood versus about 50 homes in Old Town. To show proportions instead of counts on the y-axis, use gf_props() instead of gf_bar() in the code block below. (gf_percents() works the same way but shows percentages on the y-axis.)

require(coursekata)

# change this to a bar chart with proportions
gf_bar(~ GarageType, data = Ames, fill = "royalblue")

# the same bar chart with proportions on the y-axis
gf_props(~ GarageType, data = Ames, fill = "royalblue")

a bar graph showing the proportion of homes in CollegeCreek and OldTown
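The relationship between counts, proportions, and percents can be sketched in base R. This is only an illustration of the arithmetic behind the y-axis, using hypothetical toy data rather than the Ames data frame; it is not a description of how gf_props() or gf_percents() are implemented.

```r
# Toy data invented for illustration; NOT the Ames data frame
garage <- factor(c("Attached", "Detached", "Attached", "None",
                   "Attached", "Detached", "Attached", "Attached"))

# Counts per category (what a count-based bar graph shows)
counts <- table(garage)

# Proportions: each count divided by the total number of cases
props <- prop.table(counts)

# Percents: proportions multiplied by 100
percents <- 100 * props

print(counts)
print(props)
print(percents)
```

Because each count is divided by the same total, the bars keep the same relative heights; only the scale of the y-axis changes.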
