3.7 Box Plots and the Five-Number Summary
The five-number summary gives us a snapshot of the distribution of a
quantitative variable. Box plots (also called box-and-whisker
plots) give us a way to visualize the five-number summary. For example,
here is a box plot that depicts the distribution of Wt from the
MindsetMatters dataset. Let’s see how it relates to the
five-number summary.
The box plot above is a horizontal box plot. (Later we’ll show you
how to make a vertical box plot.) The x-axis, running horizontally,
shows the scale on which Wt is measured (roughly from 90 to
200 lbs). The y-axis doesn’t mean anything on this graph, which is why
we removed the label from it for now.
Below, we have labeled the same box plot of Wt to show
how it visualizes the min, Q1, median, Q3, and max. We can then “read”
the five-number summary off of the box plot.
If we print out the favstats() of Wt, we will see that our estimates
correspond to the values of min, Q1, median, Q3, and max.
favstats(~ Wt, data = MindsetMatters)
 min  Q1 median    Q3 max     mean       sd  n missing
  90 130    145 161.5 196 146.1333 22.46459 75       0
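Base R's quantile() function can also compute a five-number summary. The sketch below uses a small made-up vector (wt_example is hypothetical, not the MindsetMatters data), so its values differ from the output above.

```r
# Hypothetical weights in lbs, invented for illustration only
wt_example <- c(90, 110, 130, 145, 150, 161, 196)

# These five probabilities correspond to min, Q1, median, Q3, and max
five_num <- quantile(wt_example, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five_num)
```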
In the distribution of Wt there are no outliers (defined
as more than 1.5 IQRs above Q3 or below Q1); the whiskers simply end at
the max and min values for Wt. When there are outliers,
they are represented as dots off to the left or right of the
whiskers.
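To check the no-outliers claim by hand, we can compute the 1.5 IQR cutoffs from the favstats() output. This is a minimal sketch in base R; the variable names (q1, lower_cutoff, and so on) are our own, and the numbers are copied from the printed summary.

```r
# Values copied from the favstats() output for Wt
wt_min <- 90
q1 <- 130
q3 <- 161.5
wt_max <- 196

# The IQR is the width of the box
iqr <- q3 - q1

# Outliers are values more than 1.5 IQRs below Q1 or above Q3
lower_cutoff <- q1 - 1.5 * iqr
upper_cutoff <- q3 + 1.5 * iqr

# TRUE if the min or max falls beyond a cutoff
has_outliers <- wt_min < lower_cutoff || wt_max > upper_cutoff

print(c(iqr = iqr, lower = lower_cutoff, upper = upper_cutoff))
print(has_outliers)
```

Because the min (90) is above the lower cutoff (82.75) and the max (196) is below the upper cutoff (208.75), no points are drawn beyond the whiskers.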
Make Your Own Box Plot
Look at the code in the window below and you will see how we created
the box plot above. Go ahead and click <Run> to check that it
works. Notice that putting the ~ before Wt is what makes this a
horizontal box plot with the variable Wt on the x-axis.
Now modify the code to create a box plot for Age from the
MindsetMatters data frame.
require(coursekata)
# Modify this code to create a box plot of Age from MindsetMatters
gf_boxplot(~ Wt, data = MindsetMatters)
# Solution: a box plot of Age from MindsetMatters
gf_boxplot(~ Age, data = MindsetMatters)
Notice how the structure of the gf_boxplot() function is just like
that for gf_histogram(). Try altering the code
above to make a histogram instead of a box plot. Look at the box plot
compared with the histogram. Can you see how the same distribution gets
represented by these two types of plots?
How to Overlay a Box Plot on a Histogram
It’s easier to compare box plots and histograms if you overlay one on
the other. You can overlay a box plot on a histogram of the same
variable using the pipe operator, %>%. Notice that we don’t need to
include any arguments in parentheses for the gf_boxplot() function,
as the arguments for the gf_histogram() function are carried over.
These plotting functions usually make this assumption when one is
chained onto another.
gf_histogram(~ Age, data = MindsetMatters) %>%
  gf_boxplot()
The default box plot is a little hard to see on top of the histogram.
We can get a better effect by adding arguments such as fill and width
to adjust features of the plot.
gf_histogram(~ Age, data = MindsetMatters) %>%
  gf_boxplot(fill = "purple", width = 1)
Take another look at the histogram with box plot below. This time we have put dashed lines at Q1 and Q3 (the hinges of the box).
You can tweak how the box plot looks in a number of ways. Run the
code block below, then try changing some of the arguments. Explore what
happens if you change the width argument from 1 to a
different number. Predict what would happen if you change the number in
front of the ~ Age in gf_boxplot(). In the
code below we set it to 6. What do you think would happen if you set it
to a negative number?
require(coursekata)
# run this code to see what happens
# then try modifying the 6, 1 and "purple" (one at a time)
# what do you think will happen?
gf_histogram(~ Age, data = MindsetMatters) %>%
  gf_boxplot(6 ~ Age, width = 1, fill = "purple") %>%
  gf_labs(y = "count", x = "Age (in Years)")
# One possible modification: a y-position of -1.5, width of 2, and orange fill
gf_histogram(~ Age, data = MindsetMatters) %>%
  gf_boxplot(-1.5 ~ Age, width = 2, fill = "orange") %>%
  gf_labs(x = "Age (in Years)")
The ntile() Function
Now that you have learned about quartiles and the five-number
summary, there is another R function that is often useful:
ntile(). This function arranges the values of a
quantitative variable in order and then divides them up into some number
(n) of equal-sized groups. For example, if we want to
create 4 equal-sized groups, the n is 4. Another way of
saying 4-tiles is “quartiles”.
Here’s an example of how to use this function:
MindsetMatters$AgeQuartile <- ntile(MindsetMatters$Age, 4)
This code, ntile(MindsetMatters$Age, 4), arranges the
Age values in order, cuts them up into four equal-sized
groups, then returns a number from 1 to 4 indicating which quartile each
value belongs to (e.g., 1, 4, 3, 2, 4, 4, etc.). The assignment operator
(<-) is used to save these numbers into a variable
called AgeQuartile in the MindsetMatters data frame.
Here’s what we get when we run the code and print out a random sample of 10 cases from the data frame.
Age AgeQuartile
38 3
34 2
50 4
35 2
33 2
28 1
46 4
21 1
27 1
48 4
You can see that the youngest housekeepers in this sample (ages 21, 27, and 28) are assigned a 1 because they are in the first quartile. The oldest (ages 46, 48, and 50) are assigned a 4.
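To see the idea behind ntile() on a smaller scale, here is a rough sketch in base R. Both the ages vector and the rough_ntile() helper are our own inventions for illustration; the real ntile() may handle ties, missing values, and group sizes that don't divide evenly in a different way.

```r
# Hypothetical ages, invented for illustration only
ages <- c(41, 23, 55, 30, 37, 26, 49, 33)

# rough_ntile() is our own helper, not part of any package:
# rank the values, then cut the ranks into n equal-sized groups
rough_ntile <- function(x, n) {
  ranks <- rank(x, ties.method = "first")
  floor((ranks - 1) * n / length(x)) + 1
}

age_quartile <- rough_ntile(ages, 4)
print(data.frame(ages, age_quartile))
```

With eight values and four groups, each quartile gets exactly two cases: the two youngest ages get a 1 and the two oldest get a 4.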
Let’s try using the new variable to color the bars of the histogram
of Age according to quartiles. In order to use the numbers
from ntile() as a basis for fill= we must
first turn it into a factor. We can do that by adding
factor() to the code we wrote above, like this:
MindsetMatters$AgeQuartile <- factor(ntile(MindsetMatters$Age, 4))
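As a small illustration of what factor() changes, the sketch below converts a few quartile numbers (the example values 1, 4, 3, 2, 4, 4 mentioned earlier) and compares the before and after. The variable names are our own. The point, as we understand it, is that a factor is treated as a set of categories rather than as a continuous number.

```r
# Example quartile numbers like those returned by ntile()
quartile_numbers <- c(1, 4, 3, 2, 4, 4)

# factor() turns the numbers into category labels
quartile_factor <- factor(quartile_numbers)

print(class(quartile_numbers))  # numeric
print(class(quartile_factor))   # factor
print(levels(quartile_factor))  # the distinct categories, in order
```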
In the code below, we’ve created AgeQuartile for you and
made it a factor. Use that variable to fill the bars of the
Age histogram. (Note: Leave the fill of the
box plot “white”. And remember, to use a variable to fill
the bars, you’ll need the tilde, ~.)
require(coursekata)
# This creates AgeQuartile and makes it a factor
MindsetMatters$AgeQuartile <- factor(ntile(MindsetMatters$Age, 4))
# Fill the bars based on AgeQuartile
gf_histogram(~ Age, data = MindsetMatters, fill = "black") %>%
  gf_boxplot(fill = "white", width = 1)
# Solution: this creates AgeQuartile and makes it a factor
MindsetMatters$AgeQuartile <- factor(ntile(MindsetMatters$Age, 4))
# Fill the bars based on AgeQuartile
gf_histogram(~ Age, data = MindsetMatters, fill = ~AgeQuartile) %>%
  gf_boxplot(fill = "white", width = 1)