3.7 Box Plots and the Five-Number Summary

The five-number summary gives us a snapshot of the distribution of a quantitative variable. Box plots (also called box-and-whisker plots) give us a way to visualize the five-number summary. For example, here is a box plot that depicts the distribution of Wt from the MindsetMatters data set. Let’s see how it relates to the five-number summary.

A box plot of the distribution of Wt in MindsetMatters. The box plot is made up of a few parts. There is a teal-colored rectangular box in the center divided (with a vertical line) into two parts, a left and a right. There are horizontal lines, called whiskers, that extend out to the left and right of the box. Another name for box plot is box-and-whisker plot.

The box plot above is a horizontal box plot. (Later we’ll show you how to make a vertical box plot.) The x-axis, running horizontally, shows the scale on which Wt is measured (roughly from 90 to 200 lbs). The y-axis doesn’t mean anything on this graph, which is why we removed the label from it for now.

Below, we have labeled the same box plot of Wt to show how it visualizes the min, Q1, median, Q3, and max. We can then “read” the five-number summary off of the box plot.

A box plot of the distribution of Wt in MindsetMatters.

If we print out the favstats() of Wt, we will see that our estimates correspond to the values of min, Q1, median, Q3, and max.

favstats(~ Wt, data = MindsetMatters)
 min  Q1 median    Q3 max     mean       sd  n missing
  90 130    145 161.5 196 146.1333 22.46459 75       0
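
If you want to see where five numbers like these come from, here is a minimal sketch using base R's quantile() function on a small made-up set of values (not the MindsetMatters data). The vector x and its values are assumptions for illustration only. Different functions can use slightly different rules for computing quartiles, so results from other tools may not always match exactly.

```r
# Made-up values, for illustration only
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15)

# Ask for the 0th, 25th, 50th, 75th, and 100th percentiles
five_num <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
names(five_num) <- c("min", "Q1", "median", "Q3", "max")

print(five_num)
```

With nine sorted values, the median is the fifth value, and Q1 and Q3 fall on the third and seventh values.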

In the distribution of Wt there are no outliers (defined as more than 1.5 IQRs above Q3 or below Q1); the whiskers simply end at the max and min values for Wt. When there are outliers, they are represented as dots off to the left or right of the whiskers.
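
We can check the no-outliers claim by hand. The sketch below uses only the values from the favstats() output for Wt; the variable names are our own, chosen for illustration.

```r
# Values copied from the favstats() output for Wt
q1 <- 130
q3 <- 161.5
wt_min <- 90
wt_max <- 196

iqr <- q3 - q1                  # interquartile range: 31.5
lower_fence <- q1 - 1.5 * iqr   # 82.75
upper_fence <- q3 + 1.5 * iqr   # 208.75

# TRUE would mean at least one value lies beyond that fence
has_low_outlier  <- wt_min < lower_fence
has_high_outlier <- wt_max > upper_fence

print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(c(low = has_low_outlier, high = has_high_outlier))
```

Because the min (90) is above 82.75 and the max (196) is below 208.75, every value of Wt falls inside the fences, so the whiskers end at the min and max.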

Make Your Own Box Plot

Look at the code below and you will see how we created the box plot above. Go ahead and run it to check that it works. Notice that putting the ~ before Wt is what makes this a horizontal box plot with the variable Wt on the x-axis.

Now modify the code to create a box plot for Age from the MindsetMatters data frame.

require(coursekata)

# Starter code: modify it to create a box plot of Age from MindsetMatters
gf_boxplot(~ Wt, data = MindsetMatters)

# Solution: a box plot of Age
gf_boxplot(~ Age, data = MindsetMatters)

A box plot of the distribution of Age in MindsetMatters.

Notice how the structure of the gf_boxplot() function is just like that for gf_histogram(). Try altering the code above to make a histogram instead of a box plot. Look at the box plot compared with the histogram. Can you see how the same distribution gets represented by these two types of plots?

A histogram of the distribution of Age in MindsetMatters.

How to Overlay a Box Plot on a Histogram

It’s easier to compare box plots and histograms if you overlay one on the other. You can overlay a box plot on a histogram of the same variable using the pipe operator, %>%. Notice that we don’t need to include any arguments in parentheses for the gf_boxplot() function, because the arguments for the gf_histogram() function are carried over. The gf_ plotting functions usually make this assumption when one is chained onto another.

gf_histogram(~ Age, data = MindsetMatters) %>%
  gf_boxplot()

A box plot of the distribution of Age in MindsetMatters overlaid on a histogram of the same distribution.

The default box plot is a little hard to see on top of the histogram. We can get a better effect by adding arguments such as fill and width to adjust features of the plot.

gf_histogram(~ Age, data = MindsetMatters) %>%
  gf_boxplot(fill = "purple", width = 1)

A thicker purple box plot of the distribution of Age in MindsetMatters overlaid on a histogram of the same distribution.

Take another look at the histogram with box plot below. This time we have put dashed lines at Q1 and Q3 (the hinges of the box).

A histogram of Age overlaid with a box plot, but this time there are two vertical lines corresponding to Q1 and Q3.

You can tweak how the box plot looks in a number of ways. Run the code block below, then try changing some of the arguments. Explore what happens if you change the width argument from 1 to a different number. Predict what would happen if you change the number in front of the ~ Age in gf_boxplot(). In the code below we set it to 6. What do you think would happen if you set it to a negative number?

require(coursekata)

# Run this code to see what happens,
# then try modifying the 6, the 1, and "purple" (one at a time).
# What do you think will happen?
gf_histogram(~ Age, data = MindsetMatters) %>%
  gf_boxplot(6 ~ Age, width = 1, fill = "purple") %>%
  gf_labs(y = "count", x = "Age (in Years)")

# One possible modification: a negative position, a wider box, and a different fill
gf_histogram(~ Age, data = MindsetMatters) %>%
  gf_boxplot(-1.5 ~ Age, width = 2, fill = "orange") %>%
  gf_labs(x = "Age (in Years)")

The ntile() Function

Now that you have learned about quartiles and the five-number summary, there is another R function that is often useful: ntile(). This function arranges the values of a quantitative variable in order and then divides them into some number (n) of equal-sized groups. For example, if we want to create 4 equal-sized groups, n is 4. Another name for 4-tiles is “quartiles”.

Here’s an example of how to use this function:

MindsetMatters$AgeQuartile <- ntile(MindsetMatters$Age, 4)

The code ntile(MindsetMatters$Age, 4) arranges the Age values in order, cuts them into four equal-sized groups, and then returns, for each value, a number from 1 to 4 indicating which quartile that value belongs to (e.g., 1, 4, 3, 2, 4, 4, etc.). The assignment operator (<-) saves these numbers into a variable called AgeQuartile in the MindsetMatters data frame.

Here’s what we get when we run the code and print out a random sample of 10 cases from the data frame.

 Age   AgeQuartile 
  38             3 
  34             2 
  50             4 
  35             2 
  33             2 
  28             1 
  46             4  
  21             1 
  27             1 
  48             4 

You can see that the youngest housekeepers in this sample (ages 21, 27, and 28) are assigned a 1 because they are in the first quartile. The oldest (ages 46, 48, and 50) are assigned a 4.
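
To get a feel for what ntile() is doing, here is a simplified sketch in base R. The helper simple_ntile() is hypothetical (it is not part of any package), and it may split groups differently from ntile() when the number of values is not evenly divisible by n.

```r
# Hypothetical helper, for illustration only; not the ntile() used in the course
simple_ntile <- function(x, n) {
  ranks <- rank(x, ties.method = "first")  # position of each value once sorted
  ceiling(ranks * n / length(x))           # map positions 1..N onto groups 1..n
}

# Eight ages taken from the sample printed above (a small illustrative subset)
ages <- c(38, 34, 50, 35, 33, 28, 46, 21)
age_group <- simple_ntile(ages, 4)

print(data.frame(Age = ages, Group = age_group))
```

Each group ends up with two ages. The group numbers here are based on these eight values only, so they need not match the AgeQuartile values computed from the full data frame.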

Let’s try using the new variable to color the bars of the histogram of Age according to quartiles. In order to use the numbers from ntile() as a basis for fill= we must first turn it into a factor. We can do that by adding factor() to the code we wrote above, like this:

MindsetMatters$AgeQuartile <- factor(ntile(MindsetMatters$Age, 4))
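
Why does it matter whether the variable is a factor? Here is a rough sketch, using the AgeQuartile values from the sample printed earlier. As numbers, R treats 1 through 4 as quantities; as a factor, they become four category labels. The variable names below are our own.

```r
# Quartile numbers from the sample of 10 cases, typed in for illustration
quartile_numbers <- c(3, 2, 4, 2, 2, 1, 4, 1, 1, 4)
quartile_factor <- factor(quartile_numbers)

print(is.numeric(quartile_numbers))  # stored as numbers
print(is.factor(quartile_factor))    # stored as categories
print(levels(quartile_factor))       # the category labels
print(table(quartile_factor))        # how many cases fall in each category
```

A plot can then give each of the four levels its own fill color, instead of treating the quartile number as a continuous quantity.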

In the code below, we’ve created AgeQuartile for you and made it a factor. Use that variable to fill the bars of the Age histogram. (Note: Leave the fill of the box plot “white”. And remember, to use a variable to fill the bars, you’ll need the tilde, ~.)

require(coursekata)

# This creates AgeQuartile and makes it a factor
MindsetMatters$AgeQuartile <- factor(ntile(MindsetMatters$Age, 4))

# Starter code: fill the bars based on AgeQuartile
gf_histogram(~ Age, data = MindsetMatters, fill = "black") %>%
  gf_boxplot(fill = "white", width = 1)

# Solution: use the tilde so that fill maps to the AgeQuartile variable
gf_histogram(~ Age, data = MindsetMatters, fill = ~AgeQuartile) %>%
  gf_boxplot(fill = "white", width = 1)

A histogram of Age overlaid with a box plot, but this time the values of Age are colored differently by quartile.
