4.4 Using Box Plots to Explore Relationships
gf_point() and gf_jitter() are useful because they let us see each individual data point. There are times, however, when we want to transcend the individual data points and focus mainly on the overall pattern of a distribution. Box plots, which we have seen before, are helpful in this regard, and are especially useful for comparing the distribution of an outcome variable across different levels of a categorical explanatory variable.
Here’s how we would create a box plot of Thumb length broken down by Sex.
gf_boxplot(Thumb ~ Sex, data = Fingers)
We can also chain together box plots and jitter plots. In ggformula, when we chain on multiple functions, the later functions assume the same variables and data frames so we don’t need to type those in again. Handy!
gf_boxplot(Thumb ~ Sex, data = Fingers) %>%
gf_jitter(height = 0, width = .25)
This code will make a box plot first and then overlay the jitter plot on top of it.
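To make the shortcut concrete, here is a small sketch comparing the chained shorthand with a spelled-out version in which the formula and data frame are repeated for the jitter layer. It assumes the coursekata package is installed and that it provides the Fingers data frame, as in the exercises in this section; the names p_short and p_explicit are just for illustration.

```r
library(coursekata)  # assumed to provide Fingers, gf_boxplot(), and gf_jitter()

# Shorthand: gf_jitter() inherits the formula and data from gf_boxplot()
p_short <- gf_boxplot(Thumb ~ Sex, data = Fingers) %>%
  gf_jitter(height = 0, width = .25)

# Spelled out: the same formula and data are repeated for the jitter layer
p_explicit <- gf_boxplot(Thumb ~ Sex, data = Fingers) %>%
  gf_jitter(Thumb ~ Sex, data = Fingers, height = 0, width = .25)

# Print one of the plots
p_explicit
```

Both versions should draw the same two layers. Because jittering is random, the exact horizontal positions of the points will differ from one run to the next.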
Recall that the shaded rectangle at the center of the box plot shows us where the middle 50% of the data points fall on the scale of the outcome variable. The thick horizontal line inside the box is the median. Think back to the five-number summary. We can get the five-number summary for Thumb broken down by Sex by modifying how we previously used favstats().
favstats(Thumb ~ Sex, data = Fingers)
     Sex min Q1 median     Q3   max     mean       sd   n missing
1 female  39 54     57 63.125 86.36 58.25585 8.034694 112       0
2   male  47 60     64 70.000 90.00 64.70267 8.764933  45       0
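The numbers in this output map directly onto the box plot: the bottom and top of each box are Q1 and Q3, so the height of a box is Q3 minus Q1. As a rough sketch in base R, with the quartiles and medians copied by hand from the output above (the vector names are just for illustration):

```r
# Values copied from the favstats() output above
q1  <- c(female = 54, male = 60)
q3  <- c(female = 63.125, male = 70)
med <- c(female = 57, male = 64)

# Height of each box: the distance from Q1 to Q3
box_height <- q3 - q1
box_height
# female: 9.125, male: 10

# How much higher the male median line sits than the female one
median_gap <- unname(med["male"] - med["female"])
median_gap
# 7
```

The two boxes are about the same height, but the male box sits several millimeters higher on the Thumb scale.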
Looking for Explained Variation in Box Plots
In the box plot of Thumb length by Sex, the box for females is lower down vertically than the box for males.
Let’s replace Sex with a different explanatory variable. Job is a categorical variable with three levels (no job, part-time, and full-time).
Modify this code to produce a box plot for Thumb length by Job (instead of by Sex).
require(coursekata)
# Modify this box plot to look at Thumb length by Job
gf_boxplot(Thumb ~ Sex, data = Fingers) %>%
gf_jitter(height = 0, width = .2, shape = 1, size = 3)
# Solution: box plot of Thumb length by Job
gf_boxplot(Thumb ~ Job, data = Fingers) %>%
gf_jitter(height = 0, width = .2, shape = 1, size = 3)
Notice that in this box plot, the boxes for “Not Working” and “Part-time Job” are at approximately the same vertical position and are about the same height. The box plot for “Full-time Job” looks quite different (and strange).
The full-time box includes only one student, so we wouldn’t want to draw any conclusions about the relationship between working full time and thumb length. Most of the students in the Fingers data frame either work part-time or not at all. The thumbs of students with no job are not much longer or shorter than the thumbs of students with part-time jobs. But within each group, thumb lengths vary a lot. There are long-thumbed and short-thumbed students with part-time jobs and with no jobs.
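Before reading much into a box, it helps to check how many observations sit behind it. The sketch below shows one way to count students at each level of Job. It assumes the coursekata package is installed and provides the Fingers data frame and favstats(), as in the exercises in this section; job_counts is an illustrative name.

```r
library(coursekata)  # assumed to provide the Fingers data frame and favstats()

# Count the students at each level of Job
job_counts <- table(Fingers$Job)
job_counts

# favstats() reports the same group sizes in its n column
favstats(Thumb ~ Job, data = Fingers)
```

A group with a very small n, like the full-time group here, produces a box that says little about the relationship.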
Whiskers and IQR
Now let’s return our attention to the whiskers (the lines) that extend from the box. The whiskers are drawn in relation to the IQR (the interquartile range), but this time the IQR is calculated separately for each level of the explanatory variable (that is, each group).
In gf_boxplot(), outliers, defined as observations more than 1.5 IQRs above the top of the box (Q3) or below the bottom of the box (Q1) for each group, are represented with dots. The ends of the whiskers (the lines that extend above and below the box) represent the maximum and minimum observations in each group that are not defined as outliers.
Any data points that fall beyond the ends of the whiskers are depicted in a box plot as individual points. By convention, these can be considered outliers.
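The 1.5 IQR rule can be worked through by hand for each group. The following sketch uses base R and the summary values from the favstats() output for Thumb by Sex earlier in this section, copied in by hand. The column names lower_fence and upper_fence are illustrative, and the sketch assumes the box plot computes its quartiles the same way favstats() does.

```r
# Summary values copied from the favstats() output for Thumb by Sex
stats <- data.frame(
  Sex = c("female", "male"),
  min = c(39, 47),
  Q1  = c(54, 60),
  Q3  = c(63.125, 70),
  max = c(86.36, 90)
)

# IQR and outlier cutoffs, computed separately for each group
stats$IQR         <- stats$Q3 - stats$Q1
stats$lower_fence <- stats$Q1 - 1.5 * stats$IQR
stats$upper_fence <- stats$Q3 + 1.5 * stats$IQR

# TRUE means at least one observation lies beyond that cutoff
stats$low_outliers  <- stats$min < stats$lower_fence
stats$high_outliers <- stats$max > stats$upper_fence

stats[, c("Sex", "lower_fence", "upper_fence", "low_outliers", "high_outliers")]
```

For females the cutoffs come out to 40.3125 and 76.8125, and both the minimum (39) and the maximum (86.36) fall outside them, so we would expect dots both below and above the female whiskers. For males the cutoffs are 45 and 85, so only the high end (maximum of 90) should show outliers.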