4.5 Faceted Histograms
Scatter plots, jitter plots, and box plots are all ways to visualize hypotheses (such as Thumb = Sex + other stuff) with categorical explanatory variables.
One more way is to use histograms, but this time we split the histogram into two “facets”: one for females and another for males. We do this by using %>% to chain on the command gf_facet_grid() after gf_histogram(). This puts the two histograms of Thumb (one for females and one for males) into a grid.
gf_histogram(~ Thumb, data = Fingers) %>%
gf_facet_grid(Sex ~ .)
Just as putting a variable before the ~ (tilde) puts it on the y-axis, putting Sex before the ~ in gf_facet_grid(Sex ~ .) stacks the two graphs vertically, one on top of the other, along the y-axis. Putting Sex after the ~ would put the two graphs next to each other in a row along the x-axis.
Note that we used a dot (Sex ~ .) as a placeholder in case we later want to facet the plot by more than one variable, e.g., gf_facet_grid(Sex ~ RaceEthnic).
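To see what difference the position of Sex makes, here is a short sketch that flips the formula so the facets sit side by side, and then swaps the dot for a second variable. It assumes the coursekata package is installed and that Fingers contains the RaceEthnic variable mentioned above; the object names side_by_side and two_way are used only for illustration.

```r
library(coursekata)

# Sex after the ~ : one panel per sex, arranged in a row along the x-axis
side_by_side <- gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(. ~ Sex)
side_by_side

# Replacing the dot with a second variable:
# rows are defined by Sex, columns by RaceEthnic
two_way <- gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(Sex ~ RaceEthnic)
two_way
```

With two faceting variables, each panel holds only the students who share both values, so some panels may contain very few observations.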
Faceted Density Histograms
In both the faceted histogram and the jitter plot, you may have noticed that there are fewer males than females, which makes the raw counts in the two facets hard to compare. This is where a measure like density (rather than count) comes in handy. Remember that density is somewhat like proportion (it’s exactly the same as proportion when binwidth = 1, because density is proportion divided by binwidth).
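The relationship between count, proportion, and density can be checked by hand. The sketch below uses base R's hist() on a small made-up vector (the values in x are hypothetical, not taken from Fingers) and a binwidth of 5, chosen only to keep the arithmetic easy to follow.

```r
# Hypothetical measurements, made up only for this illustration
x <- c(52, 55, 57, 58, 60, 61, 63, 64, 66, 70)

# Bin the values with a binwidth of 5, without drawing the plot
binwidth <- 5
h <- hist(x, breaks = seq(50, 70, by = binwidth), plot = FALSE)

# Proportion of observations that fall in each bin
proportion <- h$counts / sum(h$counts)

# Density is proportion divided by binwidth
dens <- proportion / binwidth

# Compare with the density that hist() computes itself
all.equal(dens, h$density)

# The areas of the bars (density times binwidth) add up to 1
sum(dens * binwidth)
```

Because every group's bar areas add up to 1, density histograms put groups of different sizes on the same scale.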
Adjust the following code to recreate these histograms as density histograms.
require(coursekata)
# Modify this code to create density histograms
gf_histogram(~ Thumb, data = Fingers) %>%
gf_facet_grid(Sex ~ .)
# Solution: use gf_dhistogram() to plot density instead of count
gf_dhistogram(~ Thumb, data = Fingers) %>%
gf_facet_grid(Sex ~ .)
Between-Group and Within-Group Variation
Another way of thinking about Sex explaining variation in Thumb is to say that Thumb is really made up of two different distributions, one for males and one for females. Although the shapes of these two histograms are roughly normal, the distribution of male thumb lengths seems to be centered a little higher on the scale than the distribution of female thumb lengths. It almost seems like the whole male distribution is shifted to the right along the x-axis.
As the distribution of male thumb lengths is shifted to the right, the variation between the groups (the difference between the centers of the two distributions) gets larger. At the same time, however, the variation (or spread) within each group is now smaller than it would be if all the thumbs were together in a single histogram. The variation among members of the same group is called within-group variation, which is smaller than the total variation we started with. It’s as if some of the variation in Thumb has been accounted for by Sex.
Because we can only see the within-group variation after we divide the distribution up by sex, another name for within-group variation is leftover variation. Even though there is still a lot of variation in thumb length left over after taking out sex, it is still true that if we know someone’s sex we can be a little better at predicting their thumb length. A little better may not be great, but it is better than nothing.
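One rough way to put numbers on these ideas is to compare the spread of all thumb lengths with the spread inside each group. The sketch below assumes the coursekata package is installed and uses the standard deviation as the measure of spread; that choice, and the object names, are assumptions made for this illustration and not something the text prescribes.

```r
library(coursekata)

# Total variation: spread of all thumb lengths taken together
total_sd <- sd(Fingers$Thumb, na.rm = TRUE)

# Within-group variation: spread of thumb lengths separately for each sex
within_sd <- tapply(Fingers$Thumb, Fingers$Sex, sd, na.rm = TRUE)

# Between-group variation: how far apart the centers of the groups are
group_means <- tapply(Fingers$Thumb, Fingers$Sex, mean, na.rm = TRUE)

total_sd
within_sd
diff(group_means)
```

If Sex explains some of the variation in Thumb, the within-group values should tend to come out smaller than the total, and the difference between the group means should be noticeably different from zero.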
Try exploring the hypothesis represented by Thumb = Year + other stuff with a faceted histogram in the code block below.
require(coursekata)
# Modify this code
gf_histogram(~ Thumb, data = Fingers, fill = "red") %>%
gf_facet_grid(Sex ~ .)
# Solution: facet by Year instead of Sex
gf_histogram(~ Thumb, data = Fingers, fill = "red") %>%
gf_facet_grid(Year ~ .)
What part of the faceted histogram shows us that Year is not as good an explanatory variable as Sex? One place to look is whether the distributions in the Year facets are shifted relative to one another, the way the male and female distributions were.
Faceting Histograms by a Quantitative Variable
Faceted histograms are for situations in which you have a quantitative outcome and a categorical explanatory variable. Try exploring what happens when we use a quantitative explanatory variable (such as Ring, the length of a student’s ring finger) in a faceted histogram instead of a categorical variable (such as Sex, RaceEthnic, or Job).
require(coursekata)
# Modify this code to explore
# this hypothesis: Thumb = Ring + other stuff
gf_histogram(~ Thumb, data = Fingers) %>%
gf_facet_grid(Sex ~ .)
# Solution: facet by Ring to explore
# this hypothesis: Thumb = Ring + other stuff
gf_histogram(~ Thumb, data = Fingers) %>%
gf_facet_grid(Ring ~ .)
It doesn’t even look like a faceted histogram. This is because R is trying to create a separate histogram for every value of Ring, and there are many, many values of Ring (such as 42, 66.04, 86, and more)! Faceting works better when there is only a limited set of values for the explanatory variable (as in the case of most categorical variables).
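To see why the plot breaks down, it can help to count how many distinct values Ring takes, since each one gets its own facet. The sketch below also shows one possible workaround: cutting Ring into a few intervals before faceting. It assumes the coursekata package is installed; fingers_binned and RingGroup are hypothetical names introduced here, and the choice of 3 equal-width intervals is arbitrary.

```r
library(coursekata)

# How many separate facets would gf_facet_grid(Ring ~ .) need?
length(unique(Fingers$Ring))

# Work on a copy so the original data frame is left untouched
fingers_binned <- Fingers

# RingGroup (hypothetical): Ring cut into 3 equal-width intervals
fingers_binned$RingGroup <- cut(fingers_binned$Ring, breaks = 3)

# Facet by the binned variable instead of the raw quantitative one
gf_histogram(~ Thumb, data = fingers_binned) %>%
  gf_facet_grid(RingGroup ~ .)
```

Binning throws away detail within each interval, so this is a way to make faceting workable and not a substitute for plots designed for two quantitative variables.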
Extra Features for Histograms
Much of what you have already learned about histograms also applies to faceted histograms. You can adjust bins, add labels, and chain on density plots.
gf_dhistogram( ~ Thumb, data = Fingers, bins = 10) %>%
gf_facet_grid(Sex ~ .) %>%
gf_density()
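The code above adjusts the bins and chains on a density plot but does not add labels. As a minimal sketch, labels can be chained on with gf_labs() from the ggformula package, assuming coursekata has loaded it; the title and axis text here are placeholders chosen for illustration.

```r
library(coursekata)

# Same faceted density histogram, with a title and axis labels chained on
labeled <- gf_dhistogram(~ Thumb, data = Fingers, bins = 10) %>%
  gf_facet_grid(Sex ~ .) %>%
  gf_labs(
    title = "Thumb length by sex",
    x = "Thumb length",
    y = "Density"
  )
labeled
```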
You can also add box plots to your faceted histograms by using %>% to pipe on the function gf_boxplot().
gf_histogram( ~ Thumb, data = Fingers) %>%
gf_facet_grid(Sex ~ .) %>%
gf_boxplot(fill = "purple", width = 3)