High School / Advanced Statistics and Data Science I (ABC)
4.4 Using Box Plots to Explore Relationships
gf_point() and gf_jitter() are useful because they let us see each individual data point. There are times, however, when we want to transcend the individual data points and focus mainly on the overall pattern of a distribution.
Box plots, which we have seen before, are helpful in this regard. They are especially useful for comparing the distribution of an outcome variable across different levels of a categorical explanatory variable.
Here’s how we would create a box plot of Thumb length broken down by Sex.
gf_boxplot(Thumb ~ Sex, data = Fingers)
We can also chain together box plots and jitter plots. In ggformula, when we chain on multiple functions, the later functions assume the same variables and data frame, so we don’t need to type those in again. Handy!
gf_boxplot(Thumb ~ Sex, data = Fingers) %>%
  gf_jitter(height = 0, width = .25)
This code will make a box plot first and then overlay the jitter plot on top of it.
Recall that the shaded rectangle at the center of the box plot shows us where the middle 50% of the data points fall on the scale of the outcome variable. The thick horizontal line inside the box is the median. Think back to the five-number summary. We can get the five-number summary for Thumb broken down by Sex by modifying how we previously used favstats().
favstats(Thumb ~ Sex, data = Fingers)
     Sex min Q1 median     Q3   max     mean       sd   n missing
1 female  39 54     57 63.125 86.36 58.25585 8.034694 112       0
2   male  47 60     64 70.000 90.00 64.70267 8.764933  45       0
Looking for Explained Variation in Box Plots
In the box plot of Thumb length by Sex, the box for females is lower down vertically than the box for males.
Let’s replace Sex with a different explanatory variable. Job is a categorical variable with three levels (no job, part-time, and full-time).
Modify this code to produce a box plot for Thumb length by Job (instead of by Sex).
require(coursekata)
# Modify this box plot to look at Thumb length by Job
gf_boxplot(Thumb ~ Sex, data = Fingers) %>%
  gf_jitter(height = 0, width = .2, shape = 1, size = 3)
# Solution: box plot of Thumb length by Job
gf_boxplot(Thumb ~ Job, data = Fingers) %>%
  gf_jitter(height = 0, width = .2, shape = 1, size = 3)
Notice that in this box plot, the boxes for “Not Working” and “Part-time Job” are at approximately the same vertical position and are about the same height. The box for “Full-time Job” looks quite different (and strange).
The full-time box includes only one student, so we wouldn’t want to draw any conclusions about the relationship between working full time and thumb length. Most of the students in the Fingers data frame either work part-time or not at all. The thumbs of students with no job are not much longer or shorter than the thumbs of students with part-time jobs. But within each group, thumb lengths vary a lot. There are long-thumbed and short-thumbed students with part-time jobs and with no jobs.
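Before reading much into a box, it can help to check how many observations sit behind it. The following is a minimal sketch using base R's table() on a small made-up data frame. The name toy, its values, and the cutoff of 2 are all invented for illustration and are not taken from Fingers.

```r
# Made-up data for illustration only; these are NOT values from Fingers
toy <- data.frame(
  Job = c("Not Working", "Not Working", "Not Working",
          "Part-time Job", "Part-time Job", "Full-time Job"),
  Thumb = c(55, 61, 58, 60, 57, 52)
)

# Count the observations in each group
group_sizes <- table(toy$Job)
print(group_sizes)

# Flag groups with fewer than 2 observations (an arbitrary threshold)
small_groups <- names(group_sizes)[group_sizes < 2]
print(small_groups)
```

With the real data, calling table(Fingers$Job) in the same way would show how many students fall into each Job group.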
Whiskers and IQR
Now let’s return our attention to the whiskers, the lines that extend out from the box. The whiskers are drawn in relation to the IQR (the interquartile range), but this time the IQR is calculated separately for each level of the explanatory variable (that is, for each group).
In gf_boxplot(), outliers are represented with dots. They are defined as observations more than 1.5 IQRs above the top of the box or below the bottom of the box for each group. The ends of the whiskers (the lines that extend above and below the box) represent the maximum and minimum observations in each group that are not defined as outliers.
Any data points that fall beyond the ends of the whiskers are depicted in a box plot as individual points. By convention, these can be considered outliers.
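As a rough check on the 1.5 IQR rule, the sketch below recomputes the outlier cutoffs by hand from the Q1 and Q3 values in the favstats() output shown earlier. The data frame quartiles and the column names lower_cutoff, upper_cutoff, has_low_outlier, and has_high_outlier are made up for this illustration. The sketch also assumes that gf_boxplot() draws its boxes from the same quartiles that favstats() reports.

```r
# Quartiles and extremes copied from the favstats() output for Thumb by Sex
quartiles <- data.frame(
  Sex = c("female", "male"),
  min = c(39, 47),
  Q1  = c(54, 60),
  Q3  = c(63.125, 70),
  max = c(86.36, 90)
)

# IQR is the height of the box, computed separately for each group
quartiles$IQR <- quartiles$Q3 - quartiles$Q1

# Cutoffs sit 1.5 IQRs below the bottom and above the top of each box
quartiles$lower_cutoff <- quartiles$Q1 - 1.5 * quartiles$IQR
quartiles$upper_cutoff <- quartiles$Q3 + 1.5 * quartiles$IQR

# TRUE means at least one observation falls beyond that cutoff
quartiles$has_low_outlier  <- quartiles$min < quartiles$lower_cutoff
quartiles$has_high_outlier <- quartiles$max > quartiles$upper_cutoff

print(quartiles)
```

Under these assumptions, the female cutoffs work out to 40.3125 and 76.8125, and the male cutoffs to 45 and 85. The shortest female thumb (39) and the longest thumb in each group (86.36 and 90) fall beyond those cutoffs, so they would be drawn as dots instead of being reached by a whisker.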