4.5 Faceted Histograms

Scatter plots, jitter plots, and box plots are all ways to visualize hypotheses (such as Thumb = Gender + other stuff) with categorical explanatory variables.

Histograms offer one more way, but this time we split the histogram into two “facets”: one for females and another for males. We do this by using %>% to chain the command gf_facet_grid() onto gf_histogram(). This puts the two histograms of Thumb (one for females and one for males) into a grid.

gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(Gender ~ .)

A faceted histogram of the distribution of Thumb by Gender in Fingers. Two graphs are stacked vertically, one above the other, with the female distribution on top and the male distribution on the bottom. The female thumbs are spread from about 40 to 85, with most clumped between 50 and 60. The male thumbs are spread from about 45 to 90, with many clumped between 55 and 70 and a big spike near 64.

Just as putting a variable before the ~ (tilde) puts it on the y-axis, putting Gender before the ~ in gf_facet_grid(Gender ~ .) stacks the two graphs vertically, one on top of the other, along the y-axis. Putting Gender after the ~ would put the two graphs next to each other in a row along the x-axis.

Note that we used a dot (Gender ~ .) as a placeholder in case we later want to facet the plot by more than one variable, e.g., gf_facet_grid(Gender ~ RaceEthnic).
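To see what each side of the formula does, the sketch below builds the stacked layout, the side-by-side layout, and a two-variable grid. It assumes the coursekata package is installed (it supplies the gf_ functions and the Fingers data); the object names stacked, side_by_side, and grid_2d are made up for illustration.

```r
library(coursekata)  # assumed installed; provides gf_* functions and Fingers

# Gender before the ~ : one row per gender (stacked vertically)
stacked <- gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(Gender ~ .)

# Gender after the ~ : one column per gender (side by side)
side_by_side <- gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(. ~ Gender)

# Both sides filled in: rows for Gender, columns for RaceEthnic
grid_2d <- gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(Gender ~ RaceEthnic)

side_by_side  # print any of the three objects to draw it
```

Storing each plot in an object is only a convenience for comparing layouts; chaining the commands directly, as in the text, draws the same graphs.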

Faceted Density Histograms

In both the faceted histogram and the jitter plot, you may have noticed that there are fewer males than females. This is when a measure like density (rather than count) comes in handy. Remember that density is closely related to proportion: density is the proportion of cases in a bin divided by the binwidth, so the two are exactly equal when the binwidth is 1.
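The relationship between density and proportion can be checked by hand. The sketch below uses a small set of made-up thumb lengths (not the Fingers data) and base R's hist() to compare a hand-computed density with the one R reports; the variable names are only for illustration.

```r
# Hypothetical thumb lengths (mm), made up for illustration; not the Fingers data
thumb <- c(52, 55, 57, 58, 60, 61, 63, 64, 66, 71)

# Bins of width 5 running from 50 to 75; plot = FALSE returns the numbers only
h <- hist(thumb, breaks = seq(50, 75, by = 5), plot = FALSE)

binwidth   <- 5
proportion <- h$counts / sum(h$counts)  # share of cases in each bin
density    <- proportion / binwidth     # density = proportion / binwidth

# Compare with the densities hist() computed, then add up the bar areas
all.equal(density, h$density)
sum(density * binwidth)
```

Because each bar's area (density times binwidth) is the proportion of cases in that bin, the areas sum to 1 for any group, which is what makes density histograms comparable across groups of different sizes.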

Adjust the following code to re-create these histograms as density histograms.

require(coursekata)

# Modify this code to create density histograms
gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(Gender ~ .)

# Solution: gf_dhistogram() plots density instead of count
gf_dhistogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(Gender ~ .)

A faceted density histogram of the distribution of Thumb by Gender in Fingers.

Between-Group and Within-Group Variation

Another way of thinking about Gender explaining variation in Thumb is to say that Thumb is really made up of two different distributions, one for males and one for females. Although the shapes of these two histograms are roughly normal, the distribution of male thumb lengths seems to be centered a little higher on the scale than the distribution of female thumb lengths. It almost seems like the whole male distribution is shifted to the right along the x-axis.

As the distribution of male thumb lengths is shifted to the right, the variation between the groups (the difference between the centers of the two distributions) gets larger. At the same time, however, the variation (or spread) within each group is now smaller than it would be if all the thumbs were together in a single histogram. The variation among members of the same group is called within-group variation, which is smaller than the total variation we started with. It’s as if some of the variation in Thumb has been accounted for by Gender.

Because we can only see the within-group variation after we divide the distribution up by gender, another name for within-group variation is leftover variation. Even though a lot of variation in thumb length is left over after taking out gender, knowing someone’s gender still lets us predict their thumb length a little better. A little better may not be great, but it is better than nothing.
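One way to put numbers on these ideas is to measure variation as a sum of squared deviations from a mean. That choice of measure is an assumption made for this sketch, and the thumb lengths are made up (not the Fingers data). With it, total variation splits exactly into a between-group part and a within-group (leftover) part.

```r
# Hypothetical thumb lengths (mm) for two groups; made-up numbers, not Fingers
thumb  <- c(54, 57, 58, 60, 61, 60, 63, 65, 66, 71)
gender <- rep(c("female", "male"), each = 5)

# Total variation: squared deviations from the single overall mean
ss_total <- sum((thumb - mean(thumb))^2)

# Within-group (leftover) variation: squared deviations from each group's own mean
group_mean <- ave(thumb, gender)  # replaces each value with its group's mean
ss_within  <- sum((thumb - group_mean)^2)

# Between-group variation: how far the group means sit from the overall mean
ss_between <- sum((group_mean - mean(thumb))^2)

c(total = ss_total, within = ss_within, between = ss_between)
```

In this toy example the within-group figure is smaller than the total, and the difference is exactly the between-group figure: the part of the variation accounted for by the grouping variable.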

Try exploring the hypothesis represented by Thumb = Year + other stuff with a faceted histogram in the code block below.

require(coursekata)

# Modify this code
gf_histogram(~ Thumb, data = Fingers, fill = "red") %>%
  gf_facet_grid(Gender ~ .)

# Solution: facet by Year instead of Gender
gf_histogram(~ Thumb, data = Fingers, fill = "red") %>%
  gf_facet_grid(Year ~ .)

A faceted histogram of the distribution of Thumb by Year in Fingers.

What part of the faceted histogram shows us that Year is not as good an explanatory variable as Gender?

Faceting Histograms by a Quantitative Variable

Faceted histograms are for situations in which you have a quantitative outcome and a categorical explanatory variable. Try exploring what happens when we use a quantitative explanatory variable (such as Ring, the length of a student’s ring finger) in a faceted histogram instead of a categorical variable (such as Gender, RaceEthnic, or Job).

require(coursekata)

# Modify this code to explore
# this hypothesis: Thumb = Ring + other stuff
gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(Gender ~ .)

# Solution: facet by Ring instead of Gender
gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(Ring ~ .)

R's attempt to create faceted histograms of the distribution of Thumb by Ring in Fingers.

It doesn’t even look like a faceted histogram. This is because R is trying to create a separate histogram for every value of Ring, and there are many, many values of Ring (such as 42, 66.04, 86, and more)! Faceting works better when there is only a limited set of values for the explanatory variable (as in the case of most categorical variables).
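A quick way to anticipate this problem is to count the distinct values of a variable before faceting by it. The sketch below uses made-up ring-finger lengths and genders (not the Fingers data). The final step, grouping the values with cut(), is one possible workaround and is not something this section covers.

```r
# Hypothetical ring-finger lengths (mm); made-up values, not the Fingers data
ring <- c(42, 66.04, 86, 70, 71.5, 68, 73, 66.04, 75, 80.2)

# Hypothetical categorical variable for comparison
gender <- c("female", "male", "male", "female", "female",
            "male", "female", "male", "female", "male")

# Faceting draws one panel per distinct value of the faceting variable
length(unique(ring))    # nearly one panel per case
length(unique(gender))  # one panel per category

# One possible workaround (not from the text): bin ring into a few groups first
ring_group <- cut(ring, breaks = 3)  # three equal-width intervals
table(ring_group)
```

With real data, the same count on a quantitative variable is typically far larger than on a categorical one, which is why the faceted plot breaks down.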

Extra Features for Histograms

Much of what you have already learned about histograms also applies to faceted histograms. You can adjust bins, add labels, and chain on density plots.

gf_dhistogram( ~ Thumb, data = Fingers, bins = 10) %>%
  gf_facet_grid(Gender ~ .) %>%
  gf_density()

A faceted density histogram of the distribution of Thumb by Gender in Fingers overlaid with density plots.

You can also add box plots to your faceted histograms by using %>% to pipe on the function gf_boxplot().

gf_histogram( ~ Thumb, data = Fingers) %>%
  gf_facet_grid(Gender ~ .) %>%
  gf_boxplot(fill = "purple", width = 3)

A faceted histogram of the distribution of Thumb by Gender in Fingers overlaid with horizontal box plots.
