High School / Advanced Statistics and Data Science I (ABC)
3.7 Box Plots and the Five-Number Summary
The five-number summary gives us a snapshot of the distribution of a quantitative variable. Box plots (also called box-and-whisker plots) give us a way to visualize the five-number summary. For example, here is a box plot that depicts the distribution of Wt from the MindsetMatters data set. Let’s see how it relates to the five-number summary.
The box plot above is a horizontal box plot. (Later we’ll show you how to make a vertical box plot.) The x-axis, running horizontally, shows the scale on which Wt is measured (roughly from 90 to 200 lbs). The y-axis doesn’t mean anything on this graph, which is why we removed the label from it for now.
Below, we have labeled the same box plot of Wt to show how it visualizes the min, Q1, median, Q3, and max. We can then “read” the five-number summary off of the box plot.
If we print out the favstats() of Wt, we will see that the estimates we read off the box plot correspond to the values of min, Q1, median, Q3, and max.
favstats(~ Wt, data = MindsetMatters)
min  Q1 median    Q3 max     mean       sd  n missing
 90 130    145 161.5 196 146.1333 22.46459 75       0
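As a rough illustration of where five numbers like these come from, base R's quantile() function returns the minimum, the quartiles (including the median), and the maximum of a numeric vector. The vector toy_values below is made up for this sketch and is not part of MindsetMatters.

```r
# toy_values is a made-up vector used only for illustration
toy_values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# With its default settings, quantile() returns the
# 0%, 25%, 50%, 75%, and 100% points of the vector
five_numbers <- quantile(toy_values)
print(five_numbers)
```

The 0% and 100% points are the min and max, the 50% point is the median, and the 25% and 75% points are Q1 and Q3.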
In the distribution of Wt there are no outliers (defined as more than 1.5 IQRs above Q3 or below Q1); the whiskers simply end at the max and min values for Wt. When there are outliers, they are represented as dots off to the left or right of the whiskers.
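The outlier rule can be checked by hand with the numbers from the favstats() output. The sketch below uses only base R arithmetic; the variable names are made up for this illustration.

```r
# Values copied from the favstats() output for Wt
wt_min <- 90
wt_q1  <- 130
wt_q3  <- 161.5
wt_max <- 196

# The interquartile range (IQR) is the width of the box
iqr <- wt_q3 - wt_q1                # 31.5

# Values beyond these cutoffs would be drawn as outlier dots
lower_cutoff <- wt_q1 - 1.5 * iqr   # 82.75
upper_cutoff <- wt_q3 + 1.5 * iqr   # 208.75

# Both comparisons are FALSE, so no values of Wt count as outliers
wt_min < lower_cutoff
wt_max > upper_cutoff
```

Because the min (90) is above the lower cutoff and the max (196) is below the upper cutoff, the whiskers extend all the way to the min and max.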
Make Your Own Box Plot
Look at the code in the window below and you will see how we created the box plot above. Go ahead and click <Run> to check that it works. Notice that putting the ~ before Wt is what makes this a horizontal box plot with the variable Wt on the x-axis.
Now modify the code to create a box plot for Age from the MindsetMatters data frame.
require(coursekata)
# Modify this code to create a box plot of Age from MindsetMatters
gf_boxplot(~ Wt, data = MindsetMatters)
# Solution: a box plot of Age from MindsetMatters
gf_boxplot(~ Age, data = MindsetMatters)
Notice how the structure of the gf_boxplot() function is just like that for gf_histogram(). Try altering the code above to make a histogram instead of a box plot. Look at the box plot compared with the histogram. Can you see how the same distribution gets represented by these two types of plots?
How to Overlay a Box Plot on a Histogram
It’s easier to compare box plots and histograms if you overlay one on the other. You can overlay a box plot on a histogram of the same variable using %>%. Notice that we don’t need to include any arguments in parentheses for the gf_boxplot() function as the arguments for the gf_histogram() function are carried over. R usually makes this assumption when chaining one function onto another.
gf_histogram(~ Age, data = MindsetMatters) %>%
  gf_boxplot()
The default box plot is a little hard to see on top of the histogram. We can get a better effect by adding arguments such as fill and width to adjust features of the plot.
gf_histogram(~ Age, data = MindsetMatters) %>%
  gf_boxplot(fill = "purple", width = 1)
Take another look at the histogram with box plot below. This time we have put dashed lines at Q1 and Q3 (the hinges of the box).
You can tweak how the box plot looks in a number of ways. Run the code block below, then try changing some of the arguments. Explore what happens if you change the width argument from 1 to a different number. Predict what would happen if you change the number in front of the ~ Age in gf_boxplot(). In the code below we set it to 6. What do you think would happen if you set it to a negative number?
require(coursekata)
# run this code to see what happens
# then try modifying the 6, 1 and "purple" (one at a time)
# what do you think will happen?
gf_histogram(~ Age, data = MindsetMatters) %>%
  gf_boxplot(6 ~ Age, width = 1, fill = "purple") %>%
  gf_labs(y = "count", x = "Age (in Years)")
# Example modification: the 6 is changed to -1.5, the width to 2,
# and the fill to "orange"
gf_histogram(~ Age, data = MindsetMatters) %>%
  gf_boxplot(-1.5 ~ Age, width = 2, fill = "orange") %>%
  gf_labs(x = "Age (in Years)")
The ntile() Function
Now that you have learned about quartiles and the five-number summary, there is another R function that is often useful: ntile(). This function arranges the values of a quantitative variable in order and then divides them up into some number (n) of equal-sized groups. For example, if we want to create 4 equal-sized groups, the n is 4. Another way of saying 4-tiles is “quartiles”.
Here’s an example of how to use this function:
MindsetMatters$AgeQuartile <- ntile(MindsetMatters$Age, 4)
The code ntile(MindsetMatters$Age, 4) arranges the Age values in order, cuts them up into four equal-sized groups, then returns a number from 1 to 4 for each value indicating which quartile that value belongs to (e.g., 1, 4, 3, 2, 4, 4, etc.). The assignment operator (<-) is used to save these numbers into a variable called AgeQuartile in the MindsetMatters data frame.
Here’s what we get when we run the code and print out a random sample of 10 cases from the data frame.
Age AgeQuartile
38 3
34 2
50 4
35 2
33 2
28 1
46 4
21 1
27 1
48 4
You can see that the youngest housekeepers in this sample (ages 21, 27, and 28) are assigned a 1 because they are in the first quartile. The oldest (ages 46, 48, and 50) are assigned a 4.
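To get a feel for what ntile() is doing, here is a rough base-R sketch of the same idea: rank the values, then split the ranks into n equal-sized groups. The helper rough_ntile() and the vector toy_ages are hypothetical names and data made up for this illustration. This is not how ntile() is actually implemented, and the results may differ from ntile() when the values do not divide evenly into groups.

```r
# toy_ages is a made-up vector of 8 ages, used only for illustration
toy_ages <- c(30, 22, 45, 27, 52, 36, 41, 25)

# rough_ntile() is a hypothetical helper, not part of any package
rough_ntile <- function(x, n) {
  # rank() gives each value its position in sorted order (1 = smallest)
  position <- rank(x, ties.method = "first")
  # Scale the positions to the range 1..n and round up to get a group number
  ceiling(position * n / length(x))
}

# Show each age next to the group it falls into
toy_result <- data.frame(Age = toy_ages, Quartile = rough_ntile(toy_ages, 4))
print(toy_result)
```

With 8 values and 4 groups, each group gets exactly two values: the two youngest ages (22 and 25) land in group 1 and the two oldest (45 and 52) land in group 4.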
Let’s try using the new variable to color the bars of the histogram of Age according to quartiles. In order to use the numbers from ntile() as a basis for fill= we must first turn it into a factor. We can do that by adding factor() to the code we wrote above, like this:
MindsetMatters$AgeQuartile <- factor(ntile(MindsetMatters$Age, 4))
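What factor() changes can be seen in a small base-R sketch. The vector quartile_numbers is made up for illustration; it stands in for the kind of numbers ntile() returns.

```r
# quartile_numbers is a made-up stand-in for the output of ntile()
quartile_numbers <- c(1, 4, 3, 2, 4, 4)

# As plain numbers, R treats these values as quantities
class(quartile_numbers)            # "numeric"

# factor() turns them into labels for four distinct groups
quartile_groups <- factor(quartile_numbers)
class(quartile_groups)             # "factor"
levels(quartile_groups)            # "1" "2" "3" "4"
```

Once the values are a factor, each level can be treated as its own category, which is what a fill color for each quartile needs.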
In the code below, we’ve created AgeQuartile for you and made it a factor. Use that variable to fill the bars of the Age histogram. (Note: Leave the fill of the box plot “white”. And remember, to use a variable to fill the bars, you’ll need the tilde, ~.)
require(coursekata)
# This creates AgeQuartile and makes it a factor
MindsetMatters$AgeQuartile <- factor(ntile(MindsetMatters$Age, 4))
# Fill the bars based on AgeQuartile
gf_histogram(~ Age, data = MindsetMatters, fill = "black") %>%
  gf_boxplot(fill = "white", width = 1)
# Solution: create AgeQuartile as a factor, then use it to fill the bars
MindsetMatters$AgeQuartile <- factor(ntile(MindsetMatters$Age, 4))
gf_histogram(~ Age, data = MindsetMatters, fill = ~AgeQuartile) %>%
  gf_boxplot(fill = "white", width = 1)