3.7 Boxplots and the Five-Number Summary

The five-number summary gives us a snapshot of the distribution of a quantitative variable. Boxplots (also called box-and-whisker plots) give us a way to visualize the five-number summary. For example, here is a boxplot that depicts the distribution of Wt from the MindsetMatters data set. Let’s see how it relates to the five-number summary.

A boxplot of the distribution of Wt in MindsetMatters. The boxplot is made up of a few parts. There is a teal-colored rectangular box in the center divided (with a vertical line) into two parts, a left and a right. There are horizontal lines, called whiskers, that extend out to the left and right of the box. Another name for boxplot is box-and-whisker plot.

The boxplot above is a horizontal boxplot. (Later we’ll show you how to make a vertical boxplot.) The x-axis, running horizontally, shows the scale on which Wt is measured (roughly from 90 to 200 lbs). The y-axis doesn’t mean anything on this graph, which is why we removed the label from it for now.

Below, we have labeled the same boxplot of Wt to show how it visualizes the min, Q1, median, Q3, and max. We can then “read” the five-number summary off of the boxplot.

A boxplot of the distribution of Wt in MindsetMatters.

If we print out the favstats() of Wt, we will see that our estimates correspond to the values of min, Q1, median, Q3, and max.

favstats(~ Wt, data = MindsetMatters)
 min  Q1 median    Q3 max     mean       sd  n missing
  90 130    145 161.5 196 146.1333 22.46459 75       0
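To see where these five numbers come from, here is a minimal sketch using base R's quantile() function on a small made-up vector of weights. The vector wt_example and the name box_width are hypothetical and used for illustration only; they are not the MindsetMatters data, so the numbers differ from the favstats() output above.

```r
# Made-up weights (in lbs) for illustration only; NOT the MindsetMatters data
wt_example <- c(90, 120, 130, 140, 145, 150, 160, 170, 196)

# With its default settings, quantile() returns the 0%, 25%, 50%, 75%,
# and 100% points: that is, min, Q1, median, Q3, and max
five_num <- quantile(wt_example)
print(five_num)

# The width of the box in a boxplot is the IQR: Q3 minus Q1
box_width <- five_num[["75%"]] - five_num[["25%"]]
print(box_width)
```

In a boxplot of these made-up values, the whiskers would end at the first and last numbers printed, and the box would span the second through fourth.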

In the distribution of Wt there are no outliers (defined as more than 1.5 IQRs above Q3 or below Q1); the whiskers simply end at the max and min values for Wt. When there are outliers, they are represented as dots off to the left or right of the whiskers.
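We can check the "no outliers" claim by hand with the 1.5 IQR rule, using the min, Q1, Q3, and max values printed by favstats() above. This is only a sketch of the arithmetic; the variable names (lower_fence, upper_fence, has_outliers) are our own and are not part of any package.

```r
# Values copied from the favstats() output for Wt
wt_min <- 90
wt_q1  <- 130
wt_q3  <- 161.5
wt_max <- 196

# IQR is the distance between Q1 and Q3
wt_iqr <- wt_q3 - wt_q1

# A value counts as an outlier if it is more than 1.5 IQRs
# below Q1 or above Q3
lower_fence <- wt_q1 - 1.5 * wt_iqr
upper_fence <- wt_q3 + 1.5 * wt_iqr

# If the min and max are both inside these cutoffs, there are no outliers
has_outliers <- wt_min < lower_fence || wt_max > upper_fence

print(c(lower_fence, upper_fence))
print(has_outliers)
```

Because the smallest weight (90) is above the lower cutoff and the largest (196) is below the upper cutoff, every data point falls inside the range, and the whiskers extend all the way to the min and max.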

Make Your Own Boxplot

Look at the code in the window below and you will see how we created the boxplot above. Go ahead and click <Run> to check that it works. Notice that putting the ~ before Wt is what makes this a horizontal boxplot with the variable Wt on the x-axis.

Now modify the code to create a boxplot for Age from the MindsetMatters data frame.

require(coursekata)

# Modify this code to create a boxplot of Age from MindsetMatters
gf_boxplot(~ Wt, data = MindsetMatters)

# Modified version: a boxplot of Age from MindsetMatters
gf_boxplot(~ Age, data = MindsetMatters)

A boxplot of the distribution of Age in MindsetMatters.

Notice how the structure of the gf_boxplot() function is just like that for gf_histogram(). Try altering the code above to make a histogram instead of a boxplot. Look at the boxplot compared with the histogram. Can you see how the same distribution gets represented by these two types of plots?

A histogram of the distribution of Age in MindsetMatters.

How to Overlay a Boxplot on a Histogram

It’s easier to compare boxplots and histograms if you overlay one on the other. You can overlay a boxplot on a histogram of the same variable by chaining the two functions with %>%. Notice that we don’t need to include any arguments in the parentheses for gf_boxplot(): the variable and data frame given to gf_histogram() are carried over. The gf_ plotting functions generally make this assumption when one is chained onto another.

gf_histogram(~ Age, data = MindsetMatters) %>%
  gf_boxplot()

A boxplot of the distribution of Age in MindsetMatters overlaid on a histogram of the same distribution.

The default boxplot is a little hard to see on top of the histogram. We can get a better effect by adding arguments such as fill and width to adjust features of the plot.

gf_histogram(~ Age, data = MindsetMatters) %>%
  gf_boxplot(fill = "purple", width = 1)

A thicker purple boxplot of the distribution of Age in MindsetMatters overlaid on a histogram of the same distribution.

Take another look at the histogram with boxplot below. This time we have put dashed lines at Q1 and Q3 (the hinges of the box).

A histogram of Age overlaid with a boxplot, but this time there are two vertical lines corresponding to Q1 and Q3.

You can tweak how the boxplot looks in a number of ways. Run the code block below, then try changing some of the arguments. Explore what happens if you change the width argument from 1 to a different number. Predict what would happen if you change the number in front of the ~ Age in gf_boxplot(). In the code below we set it to 6. What do you think would happen if you set it to a negative number?

require(coursekata)

# Run this code to see what happens,
# then try modifying the 6, 1, and "purple" (one at a time).
# What do you think will happen?
gf_histogram(~ Age, data = MindsetMatters) %>%
  gf_boxplot(6 ~ Age, width = 1, fill = "purple") %>%
  gf_labs(y = "count", x = "Age (in Years)")

# One possible modification: a negative number in front of ~ Age,
# a different width, and a different fill color
gf_histogram(~ Age, data = MindsetMatters) %>%
  gf_boxplot(-1.5 ~ Age, width = 2, fill = "orange") %>%
  gf_labs(x = "Age (in Years)")
