Course Outline

segmentGetting Started (Don't Skip This Part)

segmentStatistics and Data Science: A Modeling Approach

segmentPART I: EXPLORING VARIATION

segmentChapter 1  Welcome to Statistics: A Modeling Approach

segmentChapter 2  Understanding Data

segmentChapter 3  Examining Distributions

segmentChapter 4  Explaining Variation

segmentPART II: MODELING VARIATION

segmentChapter 5  A Simple Model

segmentChapter 6  Quantifying Error

segmentChapter 7  Adding an Explanatory Variable to the Model

7.10 Extending to a ThreeGroup Model

segmentChapter 8  Models with a Quantitative Explanatory Variable

segmentPART III: EVALUATING MODELS

segmentChapter 9  The Logic of Inference

segmentChapter 10  Model Comparison with F

segmentChapter 11  Parameter Estimation and Confidence Intervals

segmentChapter 12  What You Have Learned

segmentFinishing Up (Don't Skip This Part!)

segmentResources
list High School / Advanced Statistics and Data Science I (ABC)
7.10 Extending to a ThreeGroup Model
You have now learned how to specify a model with a single categorical explanatory variable consisting of two groups. It’s actually pretty simple to extend this idea to a categorical variable with three groups.
First, a New TwoGroup Model
Let’s use a new explanatory variable to explain variation in thumb length: Height
. Height
, in our data set, is a quantitative variable measured in inches. But we can make a new variable that turns Height
into a categorical variable with two categories: short
and tall
.
We can do this easily in R. We will use the ntile()
function to cut the sample up into two equalsized groups based on Height
and save the result into a new variable called Height2Group
.
Fingers$Height2Group < ntile(Fingers$Height, 2)
head(select(Fingers, Thumb, Height, Height2Group), 10)
We will then use head()
and select()
to look at the first 10 rows of the relevant variables – Thumb
, Height
, and Height2Group
:
Thumb Height Height2Group
1 66.00 70.5 2
2 64.00 64.8 1
3 56.00 64.0 1
4 58.42 70.0 2
5 74.00 68.0 2
6 60.00 68.0 2
7 70.00 69.0 2
8 55.00 65.7 2
9 60.00 62.5 1
10 52.00 63.4 1
In the code window below, use the factor()
function to add labels to Height2Group
so that the 1s are labeled as short
and the 2s are labeled as tall
. Just a reminder: Use the larger data frame, Fingers
.
require(coursekata)
Fingers < Fingers %>% mutate(
Height2Group = ntile(Height, 2)
)
# this creates Height2Group, a numeric variable
Fingers$Height2Group < ntile(Fingers$Height, 2)
# this is how we used factor() before:
Fingers$Sex < factor(Fingers$Sex, levels = c(1,2), labels = c("female", "male"))
# modify this line so that 1s are labeled as "short" and 2s are labeled as "tall"
Fingers$Height2Group < factor()
# this prints out 10 rows of Fingers for the selected columns
head(select(Fingers, Thumb, Height, Height2Group), 10)
Fingers$Height2Group < factor(Fingers$Height2Group, levels = 1:2, labels = c("short", "tall"))
head(select(Fingers, Thumb, Height, Height2Group), 10)
ex() %>% {
check_object(., "Fingers") %>% check_column("Height2Group") %>% check_equal()
check_output_expr(., "head(select(Fingers, Thumb, Height, Height2Group), 10)")
}
Thumb Height Height2Group
1 66.00 70.5 tall
2 64.00 64.8 short
3 56.00 64.0 short
4 58.42 70.0 tall
5 74.00 68.0 tall
6 60.00 68.0 tall
7 70.00 69.0 tall
8 55.00 65.7 tall
9 60.00 62.5 short
10 52.00 63.4 short
Using the same approach we used for sex, we can write the model for Height2Group
like this:
\[Y_{i}=b_{0}+b_{1}X_{i}+e_{i}\]
Go ahead and fit the Height2Group
model and take a look at the parameter estimates.
require(coursekata)
Fingers < Fingers %>% mutate(
Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall"))
)
# fit a model for Thumb ~ Height2Group
Height2Group_model <
# this prints out the estimates
Height2Group_model
Height2Group_model < lm(formula = Thumb ~ Height2Group, data = Fingers)
Height2Group_model
ex() %>% {
check_function(., "lm") %>% check_arg("formula") %>% check_equal()
check_object(., "Height2Group_model") %>% check_equal()
check_output_expr(., "Height2Group_model")
}
Call:
lm(formula = Thumb ~ Height2Group, data = Fingers)
Coefficients:
(Intercept) Height2Grouptall
57.818 4.601
Now go ahead and run supernova()
to print the complete ANOVA table for the Height2Group_model
.
require(coursekata)
Fingers < Fingers %>% mutate(
Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall"))
)
Height2Group_model < lm(Thumb ~ Height2Group, data = Fingers)
# run supernova to print the ANOVA table for Height2Group_model
# run supernova to print the ANOVA table
supernova(Height2Group_model)
ex() %>% check_output_expr("supernova(Height2Group_model)")
Analysis of Variance Table (Type III SS)
Model: Thumb ~ Height2Group
SS df MS F PRE p
        
Model (error reduced)  830.880 1 830.880 11.656 0.0699 .0008
Error (from model)  11049.331 155 71.286
        
Total (empty model)  11880.211 156 76.155
Let’s now compare the Sex
model with the Height2Group
model. Both of these are twogroup models, and both have the same outcome variable (Thumb
). What differs is the explanatory variable (Sex
vs. Height2Group
). We’ve pasted in the supernova table for both models below:
Sex Model
Analysis of Variance Table (Type III SS)
Model: Thumb ~ Sex
SS df MS F PRE p
        
Model (error reduced)  1334.203 1 1334.203 19.609 0.1123 .0000
Error (from model)  10546.008 155 68.039
        
Total (empty model)  11880.211 156 76.155
Height2Group Model
Analysis of Variance Table (Type III SS)
Model: Thumb ~ Height2Group
SS df MS F PRE p
        
Model (error reduced)  830.880 1 830.880 11.656 0.0699 .0008
Error (from model)  11049.331 155 71.286
        
Total (empty model)  11880.211 156 76.155
A ThreeGroup Model
Now let’s try this same approach with three height groups: short, medium, and tall.
Revise the code below to make a new variable called Height3Group
that divides the sample into three categories based on Height
, each with an equal number of students. Label the levels (1,2,3) as short
, medium
, and tall
.
require(coursekata)
Fingers < Fingers %>% mutate(
Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall"))
)
Height2Group.model < lm(Thumb ~ Height2Group, data = Fingers)
# modify these two lines of code to create 3 Height groups with the labels "short", "medium", and "tall"
# make sure you save to a new variable in Fingers called Height3Group
Fingers$Height2Group < ntile(Fingers$Height, 2)
Fingers$Height2Group < factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))
# this prints out 10 rows of Fingers for selected columns
head(select(Fingers, Thumb, Height, Height3Group), 10)
Fingers$Height3Group < ntile(Fingers$Height, 3)
Fingers$Height3Group < factor(Fingers$Height3Group, levels = c(1,2,3), labels = c("short", "medium", "tall"))
head(select(Fingers, Thumb, Height, Height3Group), 10)
ex() %>% {
check_object(., "Fingers") %>% check_column("Height3Group") %>% check_equal()
check_output_expr(., "head(select(Fingers, Thumb, Height, Height3Group),10)")
}
Thumb Height Height3Group
1 66.00 70.5 tall
2 64.00 64.8 medium
3 56.00 64.0 short
4 58.42 70.0 tall
5 74.00 68.0 tall
6 60.00 68.0 tall
7 70.00 69.0 tall
8 55.00 65.7 medium
9 60.00 62.5 short
10 52.00 63.4 short
Calculate and print out the group means of Thumb
for the three height groups.
require(coursekata)
Fingers < Fingers %>% mutate(
Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall")),
Height3Group = factor(ntile(Height, 3), 1:3, c("short", "medium", "tall"))
)
Height2Group.model < lm(Thumb ~ Height2Group, data = Fingers)
# use favstats() to print the group means of Thumb length for the three height groups you created earlier
favstats()
favstats(Thumb ~ Height3Group, data = Fingers)
ex() %>% check_function("favstats") %>% check_result() %>% check_equal()
Height3Group min Q1 median Q3 max mean sd n missing
1 short 39.00 51.00 55 58.42 79.00 56.07113 7.499937 53 0
2 medium 45.00 55.00 60 64.00 86.36 60.22375 8.490406 52 0
3 tall 44.45 59.75 64 68.25 90.00 64.09365 8.388113 52 0
Fitting the Height3Group Model
Now use the code window below to fit the Height3Group
model to the data, and print out the model estimates.
require(coursekata)
Fingers < Fingers %>% mutate(
Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall")),
Height3Group = factor(ntile(Height, 3), 1:3, c("short", "medium", "tall"))
)
Height2Group_model < lm(Thumb ~ Height2Group, data = Fingers)
# modify this code to fit the model
Height3Group_model < lm(Thumb ~ )
# this prints out the estimates
Height3Group_model
Height3Group_model < lm(Thumb ~ Height3Group, data = Fingers)
Height3Group_model
ex() %>% {
check_function(., "lm") %>% check_arg("formula") %>% check_equal()
check_object(., "Height3Group_model") %>% check_equal()
check_output_expr(., "Height3Group_model")
}
Call:
lm(formula = Thumb ~ Height3Group, data = Fingers)
Coefficients:
(Intercept) Height3Groupmedium Height3Grouptall
56.071 4.153 8.023
The threegroup model is written like this using General Linear Model notation:
\[Y_{i}=b_{0}+b_{1}X_{1i}+b_{2}X_{2i}+e_{i}\]
Whereas fitting the twogroup model involved constructing two parameter estimates (\(b_{0}\) and \(b_{1}\)), the threegroup model adds a third parameter estimate (\(b_{2}\)).
Interpreting the Height3Group Model
\(b_{0}\) is the mean of the short group. \(b_{1}\) is the increment you have to add to the short group to get the mean of the medium group. And \(b_{2}\) is the increment you have to add to the short group to get the mean of the tall group.
We can substitute in the parameter estimates into the model, like this:
\[Y_{i}=56.07+4.15X_{1i}+8.02X_{2i}+e_{i}\]
Just as before, it is useful to think through exactly how the X variables are coded. Notice, first, that we now have two of these in the model: \(X_{1i}\) and \(X_{2i}\). The new subscripts (1 and 2) just distinguish between the two variables; instead of giving them different names, we call them Xsub1 and Xsub2.
The subi indicates these are not parameters, but variables, which means that each individual in the data set will have their own scores on the two variables. As before, it’s a little tricky to figure out what all the possible scores are on these two variables, and also how scores are assigned for each individual.
R doesn’t necessarily use the same numbers you do to code a variable. For the Height3Group
model we put in a single categorical explanatory variable (Height3Group
, with level 1 representing short, 2 representing medium, and 3 representing tall). But R turns this one variable into two new variables, \(X_{1}\) and \(X_{2}\), both of which are dummy coded, which means they can either have a value of 0 or 1 for each person in the data set.
Here’s how dummy coding works: For someone in the short group, the model needs to assign them a score of 56.07, the mean for the short group. You can think of \(X_{1}\) as a variable asking, “Is this person medium?” and 0 means no and 1 means yes. By the same reasoning, \(X_{2}\) represents whether someone is tall or not. For short people, \(X_{1}\) and \(X_{2}\) are both 0 because they are not medium and not tall.
For the people in the medium group, \(X_{1}\) should be 1 (because they are in the medium group), and \(X_{2}\) should be 0 (because they are not in the tall group). So the model will give them a predicted thumb length of 56.07 + 4.15 which is equal to 60.22 mm.
And notice from favstats that the average thumb length of the medium group is 60.22!
Height3Group min Q1 median Q3 max mean sd n missing
1 short 39.00 51.00 55 58.42 79.00 56.07113 7.499937 53 0
2 medium 45.00 55.00 60 64.00 86.36 60.22375 8.490406 52 0
3 tall 44.45 59.75 64 68.25 90.00 64.09365 8.388113 52 0
Dummy coding takes categorical variables and turns them into a series of binary codes. As you can see from the table below, just giving each person a 0 or 1 on \(X_{1}\) and \(X_{2}\) can uniquely categorize them as short, medium, or tall.
Category (Group)  \(X_1\) Code  \(X_2\) Code 

Short  0  0 
Medium  1  0 
Tall  0  1 
You may wonder why you need to go through all the details of how R assigns dummy codes for the categorical explanatory variable. It’s important because it gives you a very concrete understanding of how to interpret the model parameters. In this course, we don’t often ask you to calculate numbers on your own. Instead, we want you to focus on thinking about what, exactly, a number means. This will help you do that.
Examining the ThreeGroup Model Fit
You have already done the following: created the Height3Group
categorical explanatory variable; examined the mean thumb lengths of students in each of the three groups; fit the Height3Group
model using lm()
and interpreted the model parameter estimates; and learned how to represent the threegroup model using notation of the GLM.
The final step is to take a look at the ANOVA table so you can compare the fit of the Height3Group
model to the empty model. Of course, you know how to do this using supernova()
. Go ahead and get the ANOVA table for the Height3Group
model.
require(coursekata)
Fingers < Fingers %>% mutate(
Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall")),
Height3Group = factor(ntile(Height, 3), 1:3, c("short", "medium", "tall"))
)
Height2Group.model < lm(Thumb ~ Height2Group, data = Fingers)
Height3Group.model < lm(Thumb ~ Height3Group, data = Fingers)
# use supernova() to print the ANOVA table for Height3Group.model
# use supernova() to print the ANOVA table for Height3Group.model
supernova(Height3Group.model)
ex() %>% check_output_expr("supernova(Height3Group.model)")
Here’s the ANOVA table for the Height3Group
model. Just for comparison, we pasted in the table for the Height2Group
model right above it.
Height2Group Model
Analysis of Variance Table (Type III SS)
Model: Thumb ~ Height2Group
SS df MS F PRE p
        
Model (error reduced)  830.880 1 830.880 11.656 0.0699 .0008
Error (from model)  11049.331 155 71.286
        
Total (empty model)  11880.211 156 76.155
Height3Group Model
Analysis of Variance Table (Type III SS)
Model: Thumb ~ Height3Group
SS df MS F PRE p
        
Model (error reduced)  1690.440 2 845.220 12.774 0.1423 .0000
Error (from model)  10189.770 154 66.167
        
Total (empty model)  11880.211 156 76.155
In more advanced classes you will learn how to compare these two models directly. But for our class, we will only compare each model to the empty model.