
Statistics and Data Science: A Modeling Approach

4.8 From Quantitative to Categorical Explanatory Variables

Okay, let’s go back to where we were, explaining the variation in thumb length using the variable Sex.

THUMB LENGTH = SEX + OTHER STUFF

Let’s look again at the histograms and scatterplot for this word equation. They showed that the overall variation in thumb length could be partially explained by taking Sex into account.

gf_dhistogram( ~ Thumb, data = Fingers, fill = "orange") %>%
gf_facet_grid(Sex ~ .)
gf_point(Thumb ~ Sex, data = Fingers, color = "orange", size = 5, alpha = .5)

One histogram and one scatterplot showing relationship between Thumb length and Sex

Let’s now see if we can take the same approach for a different explanatory variable: Height. First, let’s write a word equation to represent the relationship we want to explore:

THUMB LENGTH = HEIGHT + OTHER STUFF

We might expect that people who are taller have longer thumbs.

We could actually use the same approach with Height as we did with Sex. But notice that whereas Sex is a categorical variable, Height is quantitative. We can construct a new categorical variable by cutting Height into two categories: short and tall. You can do that using the function ntile().

Recall that quartiles are created by sorting a quantitative variable in order and then dividing the observations into four groups of equal size. In the same way, we could create tertiles (three equal-sized groups), quintiles (five groups), deciles (10 groups), and so on. The ntile() function lets you divide observations into any number (n) of groups (-tiles).

Running the code below will divide the students into two groups of equal size (or as close to equal as possible): those taller than the middle student, and those shorter. Students who belong to the shorter group will get a 1 and those in the taller group will get a 2.

ntile(Fingers$Height, 2)
  [1] 2 1 1 2 2 2 2 2 1 1 2 2 1 2 1 1 2 2 1 2 1 1 1 2 1 1 2 2 1 2 1 2 1 1 1 1 1
 [38] 2 1 2 1 1 2 2 1 1 1 2 1 1 1 2 1 1 2 1 2 1 1 2 1 1 1 2 1 2 2 2 2 2 2 2 2 2
 [75] 1 2 1 1 2 1 1 1 1 2 2 2 1 2 1 1 1 2 1 2 1 2 1 2 2 2 2 1 1 1 2 2 1 1 2 1 1
[112] 2 2 2 2 1 2 2 2 2 2 2 1 1 1 2 1 2 2 1 1 1 2 2 2 1 1 2 1 1 1 2 2 2 1 1 1 2
[149] 1 2 2 2 2 1 1 1 2
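To see how ntile() assigns group numbers, it can help to try it on a handful of values. The sketch below uses six made-up heights (hypothetical values, not taken from the Fingers data frame) and assumes ntile() comes from the dplyr package, which is loaded as part of the tidyverse.

```r
library(dplyr)  # ntile() is provided by dplyr (part of the tidyverse)

# Six made-up heights in inches (hypothetical, not from Fingers)
toy_height <- c(60, 72, 65, 68, 70, 63)

# Two groups: the three shortest values get 1, the three tallest get 2
toy_group2 <- ntile(toy_height, 2)
toy_group2
# [1] 1 2 1 2 2 1

# Three groups of two values each
toy_group3 <- ntile(toy_height, 3)
toy_group3
# [1] 1 3 2 2 3 1
```

Notice that the group numbers come back in the original order of the data, not in sorted order, so each number lines up with the observation it describes.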

As with everything else in R, if you don’t save the result, this work will go to waste. Use ntile() to create the shorter and taller groups again, but this time save the result in Fingers as a new variable called Height2Group.

require(tidyverse)
require(mosaic)
require(supernova)

# Use ntile() to cut Height into 2 groups
Fingers$Height2Group <- ntile(Fingers$Height, 2)

# This prints out a few observations of Height and Height2Group
head(select(Fingers, Height, Height2Group))

  Height Height2Group
1   70.5            2
2   64.8            1
3   64.0            1
4   70.0            2
5   68.0            2
6   68.0            2

Now we can try looking at the data the same way as we did for Sex, which also had two levels.

Create histograms in a grid to look at variability in Thumb based on Height2Group.

require(tidyverse)
require(mosaic)
require(Lock5withR)
require(supernova)

# Here we create the variable Height2Group
Fingers$Height2Group <- ntile(Fingers$Height, 2)

# Create histograms of Thumb in a grid by Height2Group
gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(Height2Group ~ .)

Faceted histogram showing the relationship between Thumb length and the two groups that were created based on Height

Is there a difference between groups 1 and 2? Does the taller group have longer thumbs than the shorter group? It would be more helpful if instead of groups 1 and 2, these visualizations were labeled “short” and “tall”.

The variable Height2Group is categorical because the numbers are stand-ins for categories. In this case, the number 1 stands for “short”. This differs from quantitative variables for which the numbers actually stand for quantities. For instance, in the variable Thumb, 60 stands for 60 mm.

Before, we learned to use the factor() function to turn a numeric variable into a factor: factor(Fingers$Height2Group). We can use the same function to label the levels of a categorical variable.

factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))

This looks complicated. But you can think of the input to the factor() function as having three parts (what we call arguments): the variable name, the levels, and the labels.
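The three arguments are easier to see on a tiny example. The sketch below uses a made-up vector of group codes (hypothetical values standing in for what ntile() might return) and only base R.

```r
# Made-up group codes (hypothetical), like those ntile() might return
toy_group <- c(2, 1, 1, 2)

# The three arguments: the variable, its levels, and a label for each level
toy_labeled <- factor(toy_group, levels = c(1, 2), labels = c("short", "tall"))

# Each 1 is now shown as "short" and each 2 as "tall"
as.character(toy_labeled)
# [1] "tall"  "short" "short" "tall"

# The labels are stored in the same order as the levels
levels(toy_labeled)
# [1] "short" "tall"
```

The order matters: the first label is attached to the first level, the second label to the second level, and so on.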

As always, if we want this change to stick around, we have to save it back into a variable. Use <- (the assignment operator, which looks like an arrow) to save the result of the factor() function back into Fingers$Height2Group.

require(tidyverse)
require(mosaic)
require(Lock5withR)
require(supernova)

# This code will cut Height from Fingers into 2 categories
Fingers$Height2Group <- ntile(Fingers$Height, 2)

# Use factor() to label the groups "short" and "tall"
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))

# This code recreates the faceted histogram
gf_histogram(~ Thumb, data = Fingers) %>%
  gf_facet_grid(Height2Group ~ .)

Two histograms showing Thumb length by height groupings of Short and Tall

To get a different perspective on the same data, let’s also try looking at these distributions with a scatterplot and boxplot.

require(tidyverse)
require(mosaic)
require(Lock5withR)
require(supernova)

# Recreate Height2Group with the labels "short" and "tall"
Fingers <- supernova::Fingers %>%
  mutate(Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall")))

# Create a scatterplot of Thumb by Height2Group
gf_point(Thumb ~ Height2Group, data = Fingers)

# Create boxplots of Thumb by Height2Group
gf_boxplot(Thumb ~ Height2Group, data = Fingers)

Hint: gf_boxplot() can make multiple boxplots on the same graph without a grid. You can specify the categorical explanatory variable in the formula, as in y ~ x.

Scatterplot and boxplots of Thumb Length by height groupings of Short and Tall

Similar to what we found for Sex, where there was a lot of variability within the female group and male group, there is a lot of variability within the short and tall groups. But there is less variability within each group than there would be in the overall distribution we would get if we just combined both groups together. Again, it is useful to think about this within-group variation as the leftover variation after explaining some of the variation with Height2Group.
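One way to make this comparison concrete is to compute the spread of the combined data and the spread within each group. The sketch below uses eight made-up thumb lengths (hypothetical values, not from Fingers) and uses the variance as the measure of spread, which is an assumption chosen for illustration; any measure of variability would tell a similar story.

```r
# Made-up thumb lengths in mm (hypothetical), four per height group
toy_thumb <- c(55, 58, 60, 57, 63, 66, 64, 67)
toy_group <- factor(rep(c("short", "tall"), each = 4))

# Spread of all eight values combined (about 19.4)
overall_var <- var(toy_thumb)

# Spread within each group separately (about 4.3 for short, 3.3 for tall)
within_var <- tapply(toy_thumb, toy_group, var)

overall_var
within_var
```

In this toy data, each group's variance is much smaller than the overall variance. What remains within each group is the leftover variation that the grouping does not explain.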

See if you can break Height into three categories (let’s call the new variable Height3Group) and then compare the distribution of Thumb across all three categories with a scatterplot. Create boxplots as well.

require(tidyverse)
require(mosaic)
require(Lock5withR)
require(supernova)

# Break Height into 3 categories: "short", "medium", and "tall"
# (don't forget to set the number of categories to 3)
Fingers$Height3Group <- ntile(Fingers$Height, 3)
Fingers$Height3Group <- factor(Fingers$Height3Group, levels = 1:3, labels = c("short", "medium", "tall"))

# Create a scatterplot of Thumb by Height3Group
gf_point(Thumb ~ Height3Group, data = Fingers)

# Create boxplots of Thumb by Height3Group
gf_boxplot(Thumb ~ Height3Group, data = Fingers)

Scatterplot of Thumb Length by height groupings Short, Medium and Tall

Boxplots of Thumb Length by height groupings Short, Medium and Tall

Side-by-side boxplots of Thumb length by two height groupings versus three height groupings

Looking at these two sets of boxplots, we have an intuition that the three-group version of Height explains more variation in thumb length than does the two-group version. Although there is still a lot of variation within each group in the three-group version, the within-group variation appears smaller in the three-group version than in the two-group version. Or, to put it another way, there is less variation left over after taking out the variation due to height.
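This intuition can be checked with numbers. The sketch below is only one possible way to do so, not a method this section has introduced: it adds up the squared deviations of each thumb length from its own group's mean. The data are six made-up observations (hypothetical, not from Fingers), within_ss() is a helper written for this illustration, and ntile() is assumed to come from dplyr.

```r
library(dplyr)  # for ntile()

# Made-up data (hypothetical): thumb length tends to rise with height
toy <- data.frame(
  Height = c(60, 62, 64, 66, 68, 70),
  Thumb  = c(54, 52, 59, 57, 66, 63)
)
toy$Height2Group <- ntile(toy$Height, 2)
toy$Height3Group <- ntile(toy$Height, 3)

# Illustrative helper: sum of squared deviations from each group's own mean
within_ss <- function(y, group) {
  sum(tapply(y, group, function(v) sum((v - mean(v))^2)))
}

ss_total  <- sum((toy$Thumb - mean(toy$Thumb))^2)     # 141.5, no groups at all
ss_2group <- within_ss(toy$Thumb, toy$Height2Group)   # 68, left over with 2 groups
ss_3group <- within_ss(toy$Thumb, toy$Height3Group)   # 8.5, left over with 3 groups
```

In this toy data, the leftover variation shrinks as the number of height groups grows. Real data will not always behave so neatly, but the comparison works the same way.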