3.8 Outliers on a Box Plot

A Box Plot from HappyPlanetIndex

We’ve been making box plots of variables in the MindsetMatters data frame. Modify the code in the window below to create a box plot for Population from the HappyPlanetIndex data frame.

    require(coursekata)

    HappyPlanetIndex$Region <- recode(
      HappyPlanetIndex$Region,
      '1' = "Latin America",
      '2' = "Western Nations",
      '3' = "Middle East and North Africa",
      '4' = "Sub-Saharan Africa",
      '5' = "South Asia",
      '6' = "East Asia",
      '7' = "Former Communist Countries"
    )

    # Starter code: modify this to create a box plot of Population from HappyPlanetIndex
    gf_boxplot(~ Age, data = MindsetMatters)

    # Modified code: a box plot of Population from HappyPlanetIndex
    gf_boxplot(~ Population, data = HappyPlanetIndex)

A box plot of the distribution of Population in HappyPlanetIndex.

Wow, this is a strange-looking box plot. You can hardly see the box — it’s squished over on the left. And there are all these points (dots) to the right of the whisker.

Outliers on a Box Plot

The points that appear beyond a whisker on a box plot are the outliers. If they appear to the right of the right whisker, R has found these values to be greater than \(\text{Q3} + 1.5*\text{IQR}\). If they appear to the left of the left whisker, R has found these values to be smaller than \(\text{Q1} - 1.5*\text{IQR}\). When there are outliers, the end of the whisker marks the largest or smallest value that is not considered an outlier.
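To make the rule concrete, here is a minimal sketch in base R. The vector x holds made-up values (not the HappyPlanetIndex data), and the names lower_fence and upper_fence are ours, chosen only for illustration. Quartiles are computed with quantile() at its default settings.

```r
# Hypothetical values, invented for illustration (not real population data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 50)

# Quartiles and IQR, using quantile() with its default method
q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Boundaries beyond which a value counts as an outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values beyond either boundary are the outliers (drawn as points)
outliers <- x[x < lower_fence | x > upper_fence]

# The whiskers end at the smallest and largest values that are NOT outliers
whisker_ends <- range(x[x >= lower_fence & x <= upper_fence])

outliers      # 50
whisker_ends  # 1 8
```

Here Q1 is 3 and Q3 is 7, so the IQR is 4 and the upper boundary is 7 + 1.5*4 = 13. The value 50 lies beyond it, so it would be drawn as a point, and the right whisker would stop at 8.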

There are a lot of large outlier countries. No wonder the histogram of Population (see below) put so many countries into the same bin! By having to put the outlier countries on the same scale, the rest of the countries get squished together on the lower end of the population scale. This makes it look as though most countries have populations of 0 million people.

A histogram of the distribution of Population in HappyPlanetIndex. The histogram is right-skewed.

Excluding Outliers to See the Rest of the Distribution More Clearly

In a distribution like this, we might want to exclude the outliers from the histogram in order to get a clearer look at the distribution of most countries’ populations, without having them squished together because of the largest countries. To do this, we first need to find the point on the population scale above which we would consider a country an outlier: \(\text{Q3} + 1.5*\text{IQR}\).
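As a rough sketch of the idea, the code below computes such a boundary and keeps only the rows under it. The data frame Countries and its values are hypothetical, invented for illustration. The sketch uses base R's subset() so that it runs without any packages; filter() plays the same role in the course's code.

```r
# Hypothetical data: populations in millions (not the real HappyPlanetIndex values)
Countries <- data.frame(
  Country    = paste("Country", LETTERS[1:9]),
  Population = c(2, 4, 6, 8, 10, 12, 14, 16, 1300)
)

# Quartiles of Population, using quantile() with its default method
q1 <- unname(quantile(Countries$Population, 0.25))
q3 <- unname(quantile(Countries$Population, 0.75))

# Upper boundary: Q3 + 1.5 * IQR
upper_boundary <- q3 + 1.5 * (q3 - q1)

# Keep only the countries whose population is below the boundary
SmallerExample <- subset(Countries, Population < upper_boundary)

# Compare the range of the scale before and after excluding the outlier
range(Countries$Population)       # 2 1300
range(SmallerExample$Population)  # 2 16
```

With these made-up numbers, Q1 is 6 and Q3 is 14, so the boundary is 14 + 1.5*8 = 26. One very large value stretches the scale to 1300, while the remaining eight values span only 2 to 16.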

In the following code window, use filter() to save a new version of the data frame that includes only countries from the HappyPlanetIndex data frame with populations smaller than this upper boundary. Name this new version of the data frame SmallerCountries. Then run the code to produce a histogram of Population that only includes the smaller countries.

    require(coursekata)

    HappyPlanetIndex$Region <- recode(
      HappyPlanetIndex$Region,
      '1' = "Latin America",
      '2' = "Western Nations",
      '3' = "Middle East and North Africa",
      '4' = "Sub-Saharan Africa",
      '5' = "South Asia",
      '6' = "East Asia",
      '7' = "Former Communist Countries"
    )

    # this calculates the Q3 + 1.5*IQR
    upper_boundary <- 31.225 + 1.5*(31.225-4.455)

    # use the filter function to create a data frame that only includes
    # countries with population sizes less than the upper_boundary
    # (in the exercise, the right-hand side of this assignment is left blank)
    SmallerCountries <- filter(HappyPlanetIndex, Population < upper_boundary)

    # this makes a histogram of the smaller countries' populations
    gf_histogram(~ Population, data = SmallerCountries) %>%
      gf_labs(x = "Population (in millions)", title = "Population of Countries (Excludes Outliers)")

A histogram of the distribution of Population in SmallerCountries. It is still right-skewed like the one above, but only ranges from 0 to about 70 million.

This is a very different histogram from the one that included outliers. Here we get a sense of how countries that previously got lumped together in one bin at the bottom of the distribution actually vary in their population size. Notice, too, that the scale on the x-axis has changed, from about 0 to 1500 (millions) to about 0 to 70 (millions).

Now let’s re-make the box plot of Population for just the countries in the SmallerCountries data frame to see what that looks like. Modify the code below to make the box plot.

    require(coursekata)

    HappyPlanetIndex$Region <- recode(
      HappyPlanetIndex$Region,
      '1' = "Latin America",
      '2' = "Western Nations",
      '3' = "Middle East and North Africa",
      '4' = "Sub-Saharan Africa",
      '5' = "South Asia",
      '6' = "East Asia",
      '7' = "Former Communist Countries"
    )

    pop_stats <- favstats(~ Population, data = HappyPlanetIndex)
    SmallerCountries <- filter(
      HappyPlanetIndex,
      Population < (pop_stats$Q3 + 1.5*(pop_stats$Q3 - pop_stats$Q1))
    )

    # Assume the data frame SmallerCountries has been made for you
    # Starter code: modify this to make a box plot of Population for just the SmallerCountries
    gf_boxplot(~ Population, data = HappyPlanetIndex)

    # Modified code: a box plot of Population for just the SmallerCountries
    gf_boxplot(~ Population, data = SmallerCountries)

A box plot of the distribution of Population in SmallerCountries, with a longer whisker and some outliers in higher values of population.
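It may seem surprising that a box plot of the filtered data still shows outliers. One likely reason is that Q1, Q3, and the IQR are recomputed from the values that remain, so the boundary moves. The sketch below illustrates this with a made-up vector (not the HappyPlanetIndex data). The helper upper_bound() is a hypothetical function written only for this example.

```r
# Hypothetical helper for this example: Q3 + 1.5 * IQR,
# using quantile() with its default method
upper_bound <- function(v) {
  q1 <- unname(quantile(v, 0.25))
  q3 <- unname(quantile(v, 0.75))
  q3 + 1.5 * (q3 - q1)
}

# Hypothetical right-skewed values, invented for illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 25, 30, 80, 90, 100, 200)

# First pass: boundary and outliers for the full data
first_bound    <- upper_bound(x)        # 67.5
first_outliers <- x[x > first_bound]    # 80 90 100 200

# Exclude those outliers, then recompute the boundary on what remains
smaller <- x[x < first_bound]
second_bound    <- upper_bound(smaller)            # 19
second_outliers <- smaller[smaller > second_bound] # 20 25 30
```

In this example the first boundary is 67.5, which flags 80, 90, 100, and 200. Once those are removed, Q3 drops from 30 to 10 and the boundary drops to 19, so 20, 25, and 30 now count as outliers relative to the smaller group.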

To check our intuitions about the relationship between histograms and box plots, try chaining (with the pipe operator, %>%) the code for a box plot right onto our histogram of Population from SmallerCountries. Also feel free to change the fill or width arguments to make the box plot more easily visible.

    require(coursekata)

    HappyPlanetIndex$Region <- recode(
      HappyPlanetIndex$Region,
      '1' = "Latin America",
      '2' = "Western Nations",
      '3' = "Middle East and North Africa",
      '4' = "Sub-Saharan Africa",
      '5' = "South Asia",
      '6' = "East Asia",
      '7' = "Former Communist Countries"
    )

    pop_stats <- favstats(~ Population, data = HappyPlanetIndex)
    SmallerCountries <- filter(
      HappyPlanetIndex,
      Population < (pop_stats$Q3 + 1.5*(pop_stats$Q3 - pop_stats$Q1))
    )

    # Assume the data frame SmallerCountries has been made for you
    # Starter code: modify this to chain the box plot onto our histogram
    gf_histogram(~ Population, data = SmallerCountries)

    # Modified code: the box plot chained onto the histogram
    gf_histogram(~ Population, data = SmallerCountries) %>%
      gf_boxplot()

A histogram overlaid with a box plot of the distribution of Population in SmallerCountries, with a longer whisker and some outliers in higher values of population.

We set our fill to "white" and the width to 4 to get this visualization. The right-skewed distribution we see in the histogram shows up in the box plot as a wider right half of the box (from the median to Q3) and a longer whisker on the right. The outliers are represented as a “tail” in the histogram and as “dots” in the box plot.
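The same features can be read off as numbers. As a hedged sketch, the code below uses base R's boxplot.stats() on a made-up right-skewed vector (not the SmallerCountries data). Note that boxplot.stats() computes the box edges as hinges, which can differ slightly from quantile-based Q1 and Q3. The names given to the five values are ours.

```r
# Hypothetical right-skewed values, invented for illustration
x <- c(1, 2, 3, 4, 6, 8, 11, 15, 20, 26, 33, 41, 70)

# boxplot.stats() returns the five numbers a box plot draws, plus the outliers
# (default coef = 1.5, i.e. the 1.5 * IQR rule applied to the hinges)
bp <- boxplot.stats(x)

parts <- bp$stats
names(parts) <- c("left_whisker", "box_left", "median", "box_right", "right_whisker")

# Widths of the two halves of the box
left_half  <- parts[["median"]] - parts[["box_left"]]     # 7
right_half <- parts[["box_right"]] - parts[["median"]]    # 15

# Lengths of the two whiskers
left_whisker_length  <- parts[["box_left"]] - parts[["left_whisker"]]    # 3
right_whisker_length <- parts[["right_whisker"]] - parts[["box_right"]]  # 15

# Values that would be drawn as dots beyond the whisker
bp$out  # 70
```

For these values the right half of the box and the right whisker are both longer than their left counterparts, and 70 is the single point that would appear as a dot.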
