Course Outline

list High School / Advanced Statistics and Data Science I (ABC)

Book
  • High School / Advanced Statistics and Data Science I (ABC)
  • High School / Statistics and Data Science I (AB)
  • High School / Statistics and Data Science II (XCD)
  • College / Statistics and Data Science (ABC)
  • College / Advanced Statistics and Data Science (ABCD)

7.10 Extending to a Three-Group Model

You have now learned how to specify a model with a single categorical explanatory variable consisting of two groups. It’s actually pretty simple to extend this idea to a categorical variable with three groups.

First, a New Two-Group Model

Let’s use a new explanatory variable to explain variation in thumb length: Height. Height, in our data set, is a quantitative variable measured in inches. But we can make a new variable that turns Height into a categorical variable with two categories: short and tall.

We can do this easily in R. We will use the ntile() function to cut the sample up into two equal-sized groups based on Height and save the result into a new variable called Height2Group.

Fingers$Height2Group <- ntile(Fingers$Height, 2)
head(select(Fingers, Thumb, Height, Height2Group), 10)

We will then use head() and select() to look at the first 10 rows of the relevant variables – Thumb, Height, and Height2Group:

   Thumb Height Height2Group
1  66.00   70.5            2
2  64.00   64.8            1
3  56.00   64.0            1
4  58.42   70.0            2
5  74.00   68.0            2
6  60.00   68.0            2
7  70.00   69.0            2
8  55.00   65.7            2
9  60.00   62.5            1
10 52.00   63.4            1

In the code window below, use the factor() function to add labels to Height2Group so that the 1s are labeled as short and the 2s are labeled as tall. Just a reminder: Use the larger data frame, Fingers.

require(coursekata) Fingers <- Fingers %>% mutate( Height2Group = ntile(Height, 2) ) # this creates Height2Group, a numeric variable Fingers$Height2Group <- ntile(Fingers$Height, 2) # this is how we used factor() before: Fingers$Sex <- factor(Fingers$Sex, levels = c(1,2), labels = c("female", "male")) # modify this line so that 1s are labeled as "short" and 2s are labeled as "tall" Fingers$Height2Group <- factor() # this prints out 10 rows of Fingers for the selected columns head(select(Fingers, Thumb, Height, Height2Group), 10) Fingers$Height2Group <- factor(Fingers$Height2Group, levels = 1:2, labels = c("short", "tall")) head(select(Fingers, Thumb, Height, Height2Group), 10) ex() %>% { check_object(., "Fingers") %>% check_column("Height2Group") %>% check_equal() check_output_expr(., "head(select(Fingers, Thumb, Height, Height2Group), 10)") }
CK Code: ch7-16
   Thumb Height Height2Group
1  66.00   70.5         tall
2  64.00   64.8        short
3  56.00   64.0        short
4  58.42   70.0         tall
5  74.00   68.0         tall
6  60.00   68.0         tall
7  70.00   69.0         tall
8  55.00   65.7         tall
9  60.00   62.5        short
10 52.00   63.4        short 

Using the same approach we used for sex, we can write the model for Height2Group like this:

\[Y_{i}=b_{0}+b_{1}X_{i}+e_{i}\]

Go ahead and fit the Height2Group model and take a look at the parameter estimates.

require(coursekata) Fingers <- Fingers %>% mutate( Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall")) ) # fit a model for Thumb ~ Height2Group Height2Group_model <- # this prints out the estimates Height2Group_model Height2Group_model <- lm(formula = Thumb ~ Height2Group, data = Fingers) Height2Group_model ex() %>% { check_function(., "lm") %>% check_arg("formula") %>% check_equal() check_object(., "Height2Group_model") %>% check_equal() check_output_expr(., "Height2Group_model") }
CK Code: ch7-17
Call:
lm(formula = Thumb ~ Height2Group, data = Fingers)

Coefficients:
     (Intercept)  Height2Grouptall  
          57.818             4.601

Now go ahead and run supernova() to print the complete ANOVA table for the Height2Group_model.

require(coursekata) Fingers <- Fingers %>% mutate( Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall")) ) Height2Group_model <- lm(Thumb ~ Height2Group, data = Fingers) # run supernova to print the ANOVA table for Height2Group_model # run supernova to print the ANOVA table supernova(Height2Group_model) ex() %>% check_output_expr("supernova(Height2Group_model)")
CK Code: ch7-18
Analysis of Variance Table (Type III SS)
Model: Thumb ~ Height2Group

                               SS  df      MS      F    PRE     p
----- --------------- | --------- --- ------- ------ ------ -----
Model (error reduced) |   830.880   1 830.880 11.656 0.0699 .0008
Error (from model)    | 11049.331 155  71.286                    
----- --------------- | --------- --- ------- ------ ------ -----
Total (empty model)   | 11880.211 156  76.155                    

Let’s now compare the Sex model with the Height2Group model. Both of these are two-group models, and both have the same outcome variable (Thumb). What differs is the explanatory variable (Sex vs. Height2Group). We’ve pasted in the supernova table for both models below:

Sex Model

Analysis of Variance Table (Type III SS)
Model: Thumb ~ Sex

                               SS  df       MS      F    PRE     p
----- --------------- | --------- --- -------- ------ ------ -----
Model (error reduced) |  1334.203   1 1334.203 19.609 0.1123 .0000
Error (from model)    | 10546.008 155   68.039                    
----- --------------- | --------- --- -------- ------ ------ -----
Total (empty model)   | 11880.211 156   76.155                    

Height2Group Model

Analysis of Variance Table (Type III SS)
Model: Thumb ~ Height2Group

                               SS  df      MS      F    PRE     p
----- --------------- | --------- --- ------- ------ ------ -----
Model (error reduced) |   830.880   1 830.880 11.656 0.0699 .0008
Error (from model)    | 11049.331 155  71.286                    
----- --------------- | --------- --- ------- ------ ------ -----
Total (empty model)   | 11880.211 156  76.155                    

A Three-Group Model

Now let’s try this same approach with three height groups: short, medium, and tall.

Revise the code below to make a new variable called Height3Group that divides the sample into three categories based on Height, each with an equal number of students. Label the levels (1,2,3) as short, medium, and tall.

require(coursekata) Fingers <- Fingers %>% mutate( Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall")) ) Height2Group.model <- lm(Thumb ~ Height2Group, data = Fingers) # modify these two lines of code to create 3 Height groups with the labels "short", "medium", and "tall" # make sure you save to a new variable in Fingers called Height3Group Fingers$Height2Group <- ntile(Fingers$Height, 2) Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall")) # this prints out 10 rows of Fingers for selected columns head(select(Fingers, Thumb, Height, Height3Group), 10) Fingers$Height3Group <- ntile(Fingers$Height, 3) Fingers$Height3Group <- factor(Fingers$Height3Group, levels = c(1,2,3), labels = c("short", "medium", "tall")) head(select(Fingers, Thumb, Height, Height3Group), 10) ex() %>% { check_object(., "Fingers") %>% check_column("Height3Group") %>% check_equal() check_output_expr(., "head(select(Fingers, Thumb, Height, Height3Group),10)") }
CK Code: ch7-19
   Thumb Height Height3Group
1  66.00   70.5         tall
2  64.00   64.8       medium
3  56.00   64.0        short
4  58.42   70.0         tall
5  74.00   68.0         tall
6  60.00   68.0         tall
7  70.00   69.0         tall
8  55.00   65.7       medium
9  60.00   62.5        short
10 52.00   63.4        short 

Calculate and print out the group means of Thumb for the three height groups.

require(coursekata) Fingers <- Fingers %>% mutate( Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall")), Height3Group = factor(ntile(Height, 3), 1:3, c("short", "medium", "tall")) ) Height2Group.model <- lm(Thumb ~ Height2Group, data = Fingers) # use favstats() to print the group means of Thumb length for the three height groups you created earlier favstats() favstats(Thumb ~ Height3Group, data = Fingers) ex() %>% check_function("favstats") %>% check_result() %>% check_equal()
CK Code: ch7-20
  Height3Group   min    Q1 median    Q3   max     mean       sd  n missing
1        short 39.00 51.00     55 58.42 79.00 56.07113 7.499937 53       0
2       medium 45.00 55.00     60 64.00 86.36 60.22375 8.490406 52       0
3         tall 44.45 59.75     64 68.25 90.00 64.09365 8.388113 52       0 

Fitting the Height3Group Model

Now use the code window below to fit the Height3Group model to the data, and print out the model estimates.

require(coursekata) Fingers <- Fingers %>% mutate( Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall")), Height3Group = factor(ntile(Height, 3), 1:3, c("short", "medium", "tall")) ) Height2Group_model <- lm(Thumb ~ Height2Group, data = Fingers) # modify this code to fit the model Height3Group_model <- lm(Thumb ~ ) # this prints out the estimates Height3Group_model Height3Group_model <- lm(Thumb ~ Height3Group, data = Fingers) Height3Group_model ex() %>% { check_function(., "lm") %>% check_arg("formula") %>% check_equal() check_object(., "Height3Group_model") %>% check_equal() check_output_expr(., "Height3Group_model") }
CK Code: ch7-21
Call:
lm(formula = Thumb ~ Height3Group, data = Fingers)

Coefficients:
       (Intercept)  Height3Groupmedium    Height3Grouptall  
            56.071               4.153               8.023 

The three-group model is written like this using General Linear Model notation:

\[Y_{i}=b_{0}+b_{1}X_{1i}+b_{2}X_{2i}+e_{i}\]

Whereas fitting the two-group model involved constructing two parameter estimates (\(b_{0}\) and \(b_{1}\)), the three-group model adds a third parameter estimate (\(b_{2}\)).

Interpreting the Height3Group Model

\(b_{0}\) is the mean of the short group. \(b_{1}\) is the increment you have to add to the short group to get the mean of the medium group. And \(b_{2}\) is the increment you have to add to the short group to get the mean of the tall group.

We can substitute in the parameter estimates into the model, like this:

\[Y_{i}=56.07+4.15X_{1i}+8.02X_{2i}+e_{i}\]

Just as before, it is useful to think through exactly how the X variables are coded. Notice, first, that we now have two of these in the model: \(X_{1i}\) and \(X_{2i}\). The new subscripts (1 and 2) just distinguish between the two variables; instead of giving them different names, we call them X-sub-1 and X-sub-2.

The sub-i indicates these are not parameters, but variables, which means that each individual in the data set will have their own scores on the two variables. As before, it’s a little tricky to figure out what all the possible scores are on these two variables, and also how scores are assigned for each individual.

R doesn’t necessarily use the same numbers you do to code a variable. For the Height3Group model we put in a single categorical explanatory variable (Height3Group, with level 1 representing short, 2 representing medium, and 3 representing tall). But R turns this one variable into two new variables, \(X_{1}\) and \(X_{2}\), both of which are dummy coded, which means they can either have a value of 0 or 1 for each person in the data set.

Here’s how dummy coding works: For someone in the short group, the model needs to assign them a score of 56.07, the mean for the short group. You can think of \(X_{1}\) as a variable asking, “Is this person medium?” and 0 means no and 1 means yes. By the same reasoning, \(X_{2}\) represents whether someone is tall or not. For short people, \(X_{1}\) and \(X_{2}\) are both 0 because they are not medium and not tall.

For the people in the medium group, \(X_{1}\) should be 1 (because they are in the medium group), and \(X_{2}\) should be 0 (because they are not in the tall group). So the model will give them a predicted thumb length of 56.07 + 4.15 which is equal to 60.22 mm.

And notice from favstats that the average thumb length of the medium group is 60.22!

  Height3Group   min    Q1 median    Q3   max     mean      sd  n missing
1        short 39.00 51.00     55 58.42 79.00 56.07113 7.499937 53       0
2       medium 45.00 55.00     60 64.00 86.36 60.22375 8.490406 52       0
3         tall 44.45 59.75     64 68.25 90.00 64.09365 8.388113 52       0

Dummy coding takes categorical variables and turns them into a series of binary codes. As you can see from the table below, just giving each person a 0 or 1 on \(X_{1}\) and \(X_{2}\) can uniquely categorize them as short, medium, or tall.

Category (Group) \(X_1\) Code \(X_2\) Code
Short 0 0
Medium 1 0
Tall 0 1

You may wonder why you need to go through all the details of how R assigns dummy codes for the categorical explanatory variable. It’s important because it gives you a very concrete understanding of how to interpret the model parameters. In this course, we don’t often ask you to calculate numbers on your own. Instead, we want you to focus on thinking about what, exactly, a number means. This will help you do that.

Examining the Three-Group Model Fit

You have already done the following: created the Height3Group categorical explanatory variable; examined the mean thumb lengths of students in each of the three groups; fit the Height3Group model using lm() and interpreted the model parameter estimates; and learned how to represent the three-group model using notation of the GLM.

The final step is to take a look at the ANOVA table so you can compare the fit of the Height3Group model to the empty model. Of course, you know how to do this using supernova(). Go ahead and get the ANOVA table for the Height3Group model.

require(coursekata) Fingers <- Fingers %>% mutate( Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall")), Height3Group = factor(ntile(Height, 3), 1:3, c("short", "medium", "tall")) ) Height2Group.model <- lm(Thumb ~ Height2Group, data = Fingers) Height3Group.model <- lm(Thumb ~ Height3Group, data = Fingers) # use supernova() to print the ANOVA table for Height3Group.model # use supernova() to print the ANOVA table for Height3Group.model supernova(Height3Group.model) ex() %>% check_output_expr("supernova(Height3Group.model)")
CK Code: ch7-22

Here’s the ANOVA table for the Height3Group model. Just for comparison, we pasted in the table for the Height2Group model right above it.

Height2Group Model

Analysis of Variance Table (Type III SS)
Model: Thumb ~ Height2Group

                               SS  df      MS      F    PRE     p
----- --------------- | --------- --- ------- ------ ------ -----
Model (error reduced) |   830.880   1 830.880 11.656 0.0699 .0008
Error (from model)    | 11049.331 155  71.286                    
----- --------------- | --------- --- ------- ------ ------ -----
Total (empty model)   | 11880.211 156  76.155                    

Height3Group Model

Analysis of Variance Table (Type III SS)
Model: Thumb ~ Height3Group

                               SS  df      MS      F    PRE     p
----- --------------- | --------- --- ------- ------ ------ -----
Model (error reduced) |  1690.440   2 845.220 12.774 0.1423 .0000
Error (from model)    | 10189.770 154  66.167                    
----- --------------- | --------- --- ------- ------ ------ -----
Total (empty model)   | 11880.211 156  76.155                    

In more advanced classes you will learn how to compare these two models directly. But for our class, we will only compare each model to the empty model.

Responses