4.4 Using Box Plots to Explore Relationships
gf_point() and gf_jitter() are useful because they let us see each individual data point. There are times, however, when we want to transcend the individual data points and focus mainly on the overall pattern of a distribution.
Box plots, which we have seen before, are helpful in this regard, and are especially useful for comparing the distribution of an outcome variable across different levels of a categorical explanatory variable.
Here’s how we would create a box plot of Thumb length broken down by Gender.
gf_boxplot(Thumb ~ Gender, data = Fingers)
We can also chain together box plots and jitter plots. In ggformula, when we chain on multiple functions, the later functions assume the same variables and data frames so we don’t need to type those in again. Handy!
gf_boxplot(Thumb ~ Gender, data = Fingers) %>%
gf_jitter(height = 0, width = .25)
This code will make a box plot first and then overlay the jitter plot on top of it.
Recall that the shaded rectangle at the center of the box plot shows us where the middle 50% of the data points fall on the scale of the outcome variable. The thick horizontal line inside the box is the median. Think back to the five-number summary. We can get the five-number summary for Thumb broken down by Gender by modifying how we previously used favstats().
favstats(Thumb ~ Gender, data = Fingers)
  Gender min Q1 median     Q3   max     mean       sd   n missing
1 female  39 54     57 63.125 86.36 58.25585 8.034694 112       0
2   male  47 60     64 70.000 90.00 64.70267 8.764933  45       0
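To see what "broken down by" means computationally, here is a minimal sketch in base R. It uses a small made-up data frame (the name toy and its values are hypothetical, not the Fingers data) and computes the five-number summary separately for each group, which is the same idea behind the favstats() call above.

```r
# Hypothetical toy data: five thumb lengths (mm) per group
toy <- data.frame(
  Thumb  = c(50, 54, 58, 62, 66, 56, 60, 64, 68, 72),
  Gender = rep(c("female", "male"), each = 5)
)

# Split Thumb by Gender, then compute min, Q1, median, Q3, max per group
summary_by_group <- t(sapply(split(toy$Thumb, toy$Gender), quantile))

# One row per group, one column per quantile
print(summary_by_group)
```

Each row of the result describes one box in a box plot: the 25% and 75% columns mark the bottom and top of the box, and the 50% column marks the median line.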
Looking for Explained Variation in Box Plots
In the box plot of Thumb length by Gender, the box for females is lower down vertically than the box for males.
Let’s replace Gender with a different explanatory variable. Job is a categorical variable with three levels (no job, part-time, and full-time).
Modify this code to produce a box plot for Thumb length by Job (instead of by Gender).
require(coursekata)
# Modify this box plot to look at Thumb length by Job
gf_boxplot(Thumb ~ Gender, data = Fingers) %>%
gf_jitter(height = 0, width = .2, shape = 1, size = 3)
# Solution: box plot of Thumb length by Job
gf_boxplot(Thumb ~ Job, data = Fingers) %>%
gf_jitter(height = 0, width = .2, shape = 1, size = 3)
Notice that in this box plot, the boxes for “Not Working” and “Part-time Job” are at approximately the same vertical position and are about the same height. The box plot for “Full-time Job” looks quite different (and strange).
The full-time box includes only one student, so we wouldn’t want to draw any conclusions about the relationship between working full time and thumb length. Most of the students in the Fingers data frame either work part-time or not at all. The thumbs of students with no job are not much longer or shorter than the thumbs of students with part-time jobs. But within each group, thumb lengths vary a lot: there are long-thumbed and short-thumbed students both with part-time jobs and with no jobs.
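Why does a box built from a single student look strange? A short sketch in base R, using a hypothetical thumb length (the value 65 is made up for illustration), shows that with only one observation every number in the five-number summary is the same.

```r
# Hypothetical group containing a single student's thumb length (mm)
one_student <- c(65)

# min, Q1, median, Q3, and max all equal the one observed value
five_num <- quantile(one_student)
print(five_num)

# The IQR (Q3 - Q1) is therefore zero, so the box has no height
box_height <- IQR(one_student)
print(box_height)
```

With an IQR of zero, the box collapses into a single horizontal line, which is one more reason not to read much into a group of size one.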
Whiskers and IQR
Now let’s turn our attention back to the whiskers, the lines that extend out from the box. The whiskers are drawn in relation to the IQR (the interquartile range), but this time the IQR is calculated separately for each level of the explanatory variable (that is, for each group).
In gf_boxplot(), outliers, defined as observations more than 1.5 IQRs above or below the box for each group, are represented with dots. The ends of the whiskers (the lines that extend above and below the box) represent the maximum and minimum observations in each group that are not defined as outliers.
Any data points that fall beyond the ends of the whiskers are drawn in the box plot as individual points. By convention, these points are considered outliers.
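To make the 1.5 IQR rule concrete, here is a rough sketch in base R that applies it to the quartiles reported in the favstats() output earlier in this section. The data frame name stats and the column names lower_fence, upper_fence, has_low_outlier, and has_high_outlier are our own labels for illustration. The plotting function may compute its quartiles slightly differently, so treat the cutoffs as approximate.

```r
# Values copied from the favstats(Thumb ~ Gender, data = Fingers) output
stats <- data.frame(
  Gender = c("female", "male"),
  min    = c(39, 47),
  Q1     = c(54, 60),
  Q3     = c(63.125, 70),
  max    = c(86.36, 90)
)

# IQR is computed separately for each group
stats$IQR <- stats$Q3 - stats$Q1

# Cutoffs: 1.5 IQRs below the bottom of the box and above the top of the box
stats$lower_fence <- stats$Q1 - 1.5 * stats$IQR
stats$upper_fence <- stats$Q3 + 1.5 * stats$IQR

# If the min or max lies beyond a cutoff, that group has at least one outlier there
stats$has_low_outlier  <- stats$min < stats$lower_fence
stats$has_high_outlier <- stats$max > stats$upper_fence

print(stats)
```

Under these assumptions, the female group has cutoffs of about 40.3 and 76.8, so both its minimum (39) and maximum (86.36) would be drawn as dots. The male group has cutoffs of 45 and 85, so only its maximum (90) would be, and its lower whisker would end at the minimum of 47.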