4.4 Using Box Plots to Explore Relationships

gf_point() and gf_jitter() are useful because they let us see each individual data point. There are times, however, when we want to transcend the individual data points and focus mainly on the overall pattern of a distribution.

Box plots, which we have seen before, are helpful in this regard, and are especially useful for comparing the distribution of an outcome variable across different levels of a categorical explanatory variable.

Here’s how we would create a box plot of Thumb length broken down by Gender.

gf_boxplot(Thumb ~ Gender, data = Fingers)

box plots of the distribution of Thumb by Gender in Fingers.

We can also chain together box plots and jitter plots. In ggformula, when we chain multiple functions together, the later functions assume the same variables and data frame as the first, so we don’t need to type those in again. Handy!

gf_boxplot(Thumb ~ Gender, data = Fingers) %>%
  gf_jitter(height = 0, width = .25)

This code will make a box plot first and then overlay the jitter plot on top of it.

box plots of the distribution of Thumb by Gender in Fingers overlaid with corresponding jitter plots.

Recall that the shaded rectangle at the center of the box plot shows us where the middle 50% of the data points fall on the scale of the outcome variable. The thick horizontal line inside the box is the median. Think back to the five-number summary. We can get the five-number summary for Thumb broken down by Gender by modifying how we previously used favstats().

favstats(Thumb ~ Gender, data = Fingers)
  Gender min Q1 median     Q3   max     mean       sd   n missing
1 female  39 54     57 63.125 86.36 58.25585 8.034694 112       0
2   male  47 60     64 70.000 90.00 64.70267 8.764933  45       0
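To connect these numbers to the picture, here is a minimal sketch in base R that computes the height of each box (Q3 minus Q1) and the gap between the two medians. The values are copied by hand from the favstats() output above; the data frame name stats and the variable median_gap are made up for this illustration and are not part of the course code.

```r
# Quartiles and medians copied from the favstats() output for Thumb ~ Gender
stats <- data.frame(
  Gender = c("female", "male"),
  Q1     = c(54, 60),
  median = c(57, 64),
  Q3     = c(63.125, 70)
)

# The box spans Q1 to Q3, so its height is the interquartile range (IQR)
stats$IQR <- stats$Q3 - stats$Q1

# Vertical distance between the two median lines (male minus female)
median_gap <- stats$median[stats$Gender == "male"] -
  stats$median[stats$Gender == "female"]

print(stats)
print(median_gap)
```

Under these assumptions, the two boxes are similar in height (about 9 and 10 mm), while the male median sits 7 mm above the female median.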

Looking for Explained Variation in Box Plots

In the box plot of Thumb length by Gender, the box for females is lower down vertically than the box for males.

Let’s replace Gender with a different explanatory variable. Job is a categorical variable with three levels (no job, part-time, and full-time).

Modify this code to produce a box plot for Thumb length by Job (instead of by Gender).

require(coursekata)

# Modify this box plot to look at Thumb length by Job
gf_boxplot(Thumb ~ Gender, data = Fingers) %>%
  gf_jitter(height = 0, width = .2, shape = 1, size = 3)

# Solution: box plot of Thumb length by Job
gf_boxplot(Thumb ~ Job, data = Fingers) %>%
  gf_jitter(height = 0, width = .2, shape = 1, size = 3)

box plots of the distribution of Thumb by Job in Fingers overlaid with a jitter plot.

Notice that in this box plot, the boxes for “Not Working” and “Part-time Job” are at approximately the same vertical position and are about the same height. The box plot for “Full-time Job” looks quite different (and strange).

The full-time box includes only one student, so we wouldn’t want to draw any conclusions about the relationship between working full time and thumb length. Most of the students in the Fingers data frame either work part-time or not at all. The thumbs of students with no job are not much longer or shorter than the thumbs of students with part-time jobs. But within each group, thumb lengths vary a lot: there are long-thumbed and short-thumbed students both among those with part-time jobs and among those with no job.

Whiskers and IQR

Now let’s return our attention to the whiskers (the lines that extend out from the box). The whiskers are drawn in relation to the IQR, the interquartile range, but this time the IQR is calculated separately for each level of the explanatory variable (that is, for each group).

In gf_boxplot(), outliers are defined as observations more than 1.5 IQRs above the top of the box (Q3) or below the bottom of the box (Q1) for each group, and they are represented with dots. The ends of the whiskers represent the maximum and minimum observations in each group that are not defined as outliers.

In other words, any data points that fall beyond the ends of the whiskers are depicted in a box plot as individual points. By convention, these can be considered outliers.
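The 1.5 IQR rule can be worked through by hand. The sketch below uses base R and the quartiles from the favstats() output for Thumb by Gender shown earlier; the names groups, lower_fence, and upper_fence are made up for this illustration. It assumes the box edges match the Q1 and Q3 that favstats() reports, which may differ slightly from the values the plotting function computes.

```r
# Extremes and quartiles copied from the favstats() output for Thumb ~ Gender
groups <- data.frame(
  Gender = c("female", "male"),
  min    = c(39, 47),
  Q1     = c(54, 60),
  Q3     = c(63.125, 70),
  max    = c(86.36, 90)
)

# IQR is computed separately for each group
iqr <- groups$Q3 - groups$Q1

# Cutoffs: 1.5 IQRs below the bottom of the box and above the top of the box
groups$lower_fence <- groups$Q1 - 1.5 * iqr
groups$upper_fence <- groups$Q3 + 1.5 * iqr

# Does each group's smallest or largest value fall beyond its cutoff?
groups$low_outlier  <- groups$min < groups$lower_fence
groups$high_outlier <- groups$max > groups$upper_fence

print(groups)
```

With these numbers, the female cutoffs are 40.3125 and 76.8125, and the male cutoffs are 45 and 85. The female group therefore has at least one outlier at each end, while the male group has at least one high outlier and none at the low end. The ends of the whiskers themselves cannot be recovered from this summary alone, because they are the most extreme observations that lie inside the cutoffs.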
