*list*

# Statistics and Data Science: A Modeling Approach

## 7.8 Extending to a Three-Group Model

You have now learned how to specify a model with a single categorical explanatory variable consisting of two groups. It’s actually pretty simple to extend this idea to a categorical variable with three groups.

### First, a New Two-Group Model

Let’s use a new explanatory variable to explain variation in thumb length: **height**. Of course **height**, in our data set, is a quantitative variable measured in inches. But let’s make a new variable that turns height into a categorical variable with two categories: *short* and *tall*. We can do this easily in R. Name the variable **Height2Group**.

We will use the `ntile()`

function to cut a quantitative variable up into groups, and then use `head()`

and `select()`

to look at the first 10 rows of the relevant variables:

```
Fingers$Height2Group <- ntile(Fingers$Height, 2)
head(select(Fingers, Thumb, Height, Height2Group), 10)
```

```
Thumb Height Height2Group
1 66.00 70.5 2
2 64.00 64.8 1
3 56.00 64.0 1
4 58.42 70.0 2
5 74.00 68.0 2
6 60.00 68.0 2
7 70.00 69.0 2
8 55.00 65.7 2
9 60.00 62.5 1
10 52.00 63.4 1
```

Use the `factor()`

function to modify this variable so that the 1s are labeled as “short” and the 2s are labeled as “tall”. Just a reminder—use the larger data frame **Fingers**.

```
require(tidyverse)
require(mosaic)
require(lsr)
require(Lock5Data)
require(supernova)
Fingers <- Fingers %>% mutate(
Height2Group = ntile(Height, 2)
)
```

```
# this is how we used factor() before:
Fingers$Sex <- factor(Fingers$Sex, levels = c(1,2), labels = c("female", "male"))
# modify this line so that 1s are labeled as "short" and 2s are labeled as "tall"
Fingers$Height2Group <- factor()
# this prints out 10 rows of Fingers for the selected columns
head(select(Fingers, Thumb, Height, Height2Group), 10)
```

```
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = 1:2, labels = c("short", "tall"))
head(select(Fingers, Thumb, Height, Height2Group), 10)
```

```
ex() %>% {
check_object(., "Fingers") %>% check_column("Height2Group") %>% check_equal()
check_output_expr(., "head(select(Fingers, Thumb, Height, Height2Group), 10)")
}
```

`Height2Group`

column from the `Fingers`

data using `$`

when converting it to a factor```
Thumb Height Height2Group
1 66.00 70.5 tall
2 64.00 64.8 short
3 56.00 64.0 short
4 58.42 70.0 tall
5 74.00 68.0 tall
6 60.00 68.0 tall
7 70.00 69.0 tall
8 55.00 65.7 tall
9 60.00 62.5 short
10 52.00 63.4 short
```

Using the same approach we used for sex, we can write the model for **Height2Group** like this:

\[Y_{i}=b_{0}+b_{1}X_{i}+e_{i}\]

Go ahead and fit the **Height2Group** model and take a look at the parameter estimates.

```
require(tidyverse)
require(mosaic)
require(lsr)
require(Lock5Data)
require(supernova)
Fingers <- Fingers %>% mutate(
Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall"))
)
```

```
# fit a model for Thumb ~ Height2Group
Height2Group.model <-
# this prints out the estimates
Height2Group.model
```

```
Height2Group.model <- lm(formula = Thumb ~ Height2Group, data = Fingers)
Height2Group.model
```

```
ex() %>% {
check_function(., "lm") %>% check_arg("formula") %>% check_equal()
check_object(., "Height2Group.model") %>% check_equal()
check_output_expr(., "Height2Group.model")
}
```

`formula = Thumb ~ Height2Group`

and `data = Fingers`

```
Call:
lm(formula = Thumb ~ Height2Group, data = Fingers)
Coefficients:
(Intercept) Height2Grouptall
57.818 4.601
```

Now go ahead and run `supernova()`

to print the complete ANOVA table for the **Height2Group.model**.

```
require(tidyverse)
require(mosaic)
require(lsr)
require(Lock5Data)
require(supernova)
Fingers <- Fingers %>% mutate(
Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall"))
)
Height2Group.model <- lm(Thumb ~ Height2Group, data = Fingers)
```

```
# run supernova to print the ANOVA table for Height2Group.model
```

```
# run supernova to print the ANOVA table
supernova(Height2Group.model)
```

```
ex() %>% check_output_expr("supernova(Height2Group.model)")
```

`supernova()`

function from previous activities```
Analysis of Variance Table
Outcome variable: Thumb
Model: lm(formula = Thumb ~ Height2Group, data = Fingers)
SS df MS F PRE p
----- ----------------- --------- --- ------- ------ ------ -----
Model (error reduced) | 830.880 1 830.880 11.656 0.0699 .0008
Error (from model) | 11049.331 155 71.286
----- ----------------- --------- --- ------- ------ ------ -----
Total (empty model) | 11880.211 156 76.155
```

Let’s now compare the **Sex** model with the **Height2Group** model. Both of these are two-group models, and both have the same outcome variable (**Thumb**). What differs is the explanatory variable (**Sex** vs. **Height2Group**). We’ve pasted in the supernova table for both models below:

**Sex Model**

```
Analysis of Variance Table
Outcome variable: Thumb
Model: lm(formula = Thumb ~ Sex, data = Fingers)
SS df MS F PRE p
----- ----------------- --------- --- -------- ------ ------ -----
Model (error reduced) | 1334.203 1 1334.203 19.609 0.1123 .0000
Error (from model) | 10546.008 155 68.039
----- ----------------- --------- --- -------- ------ ------ -----
Total (empty model) | 11880.211 156 76.155
```

**Height2Group Model**

```
Analysis of Variance Table
Outcome variable: Thumb
Model: lm(formula = Thumb ~ Height2Group, data = Fingers)
SS df MS F PRE p
----- ----------------- --------- --- ------- ------ ------ -----
Model (error reduced) | 830.880 1 830.880 11.656 0.0699 .0008
Error (from model) | 11049.331 155 71.286
----- ----------------- --------- --- ------- ------ ------ -----
Total (empty model) | 11880.211 156 76.155
```

### A Three-Group Model

Now let’s try this same approach with three height groups: short, medium, and tall.

Revise the code below to make a new variable called **Height3Group** that divides height into three categories, each with an equal number of students. Label the levels (1,2,3) as short, medium, and tall.

```
require(tidyverse)
require(mosaic)
require(lsr)
require(Lock5Data)
require(supernova)
Fingers <- Fingers %>% mutate(
Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall"))
)
Height2Group.model <- lm(Thumb ~ Height2Group, data = Fingers)
```

```
# modify these two lines of code to create 3 Height groups with the labels "short", "medium", and "tall"
# make sure you save to a new variable in Fingers called Height3Group
Fingers$Height2Group <- ntile(Fingers$Height, 2)
Fingers$Height2Group <- factor(Fingers$Height2Group, levels = c(1,2), labels = c("short", "tall"))
# this prints out 10 rows of Fingers for selected columns
head(select(Fingers, Thumb, Height, Height3Group), 10)
```

```
Fingers$Height3Group <- ntile(Fingers$Height, 3)
Fingers$Height3Group <- factor(Fingers$Height3Group, levels = c(1,2,3), labels = c("short", "medium", "tall"))
head(select(Fingers, Thumb, Height, Height3Group), 10)
```

```
ex() %>% {
check_object(., "Fingers") %>% check_column("Height3Group") %>% check_equal()
check_output_expr(., "head(select(Fingers, Thumb, Height, Height3Group),10)")
}
```

`Height2Group`

to `Height3Group`

```
Thumb Height Height3Group
1 66.00 70.5 tall
2 64.00 64.8 medium
3 56.00 64.0 short
4 58.42 70.0 tall
5 74.00 68.0 tall
6 60.00 68.0 tall
7 70.00 69.0 tall
8 55.00 65.7 medium
9 60.00 62.5 short
10 52.00 63.4 short
```

Calculate and print out the group means of **Thumb** length for the three height groups.

```
require(tidyverse)
require(mosaic)
require(lsr)
require(Lock5Data)
require(supernova)
Fingers <- Fingers %>% mutate(
Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall")),
Height3Group = factor(ntile(Height, 3), 1:3, c("short", "medium", "tall"))
)
Height2Group.model <- lm(Thumb ~ Height2Group, data = Fingers)
```

```
# use favstats() to print the group means of Thumb length for the three height groups you created earlier
favstats()
```

```
favstats(Thumb ~ Height3Group, data = Fingers)
```

```
ex() %>% check_function("favstats") %>% check_result() %>% check_equal()
```

`?favstats`

to check which arguments it takes```
Height3Group min Q1 median Q3 max mean sd n missing
1 short 39.00 51.00 55 58.42 79.00 56.07113 7.499937 53 0
2 medium 45.00 55.00 60 64.00 86.36 60.22375 8.490406 52 0
3 tall 44.45 59.75 64 68.25 90.00 64.09365 8.388113 52 0
```

#### Fitting the Height3Group Model

Now use the DataCamp window below to fit the **Height3Group** model to the data, and print out the model estimates.

```
require(tidyverse)
require(mosaic)
require(lsr)
require(Lock5Data)
require(supernova)
Fingers <- Fingers %>% mutate(
Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall")),
Height3Group = factor(ntile(Height, 3), 1:3, c("short", "medium", "tall"))
)
Height2Group.model <- lm(Thumb ~ Height2Group, data = Fingers)
```

```
# modify this code to fit the model
Height3Group.model <- lm(Thumb ~ )
# this prints out the estimates
Height3Group.model
```

```
Height3Group.model <- lm(Thumb ~ Height3Group, data = Fingers)
Height3Group.model
```

```
ex() %>% {
check_function(., "lm") %>% check_arg("formula") %>% check_equal()
check_object(., "Height3Group.model") %>% check_equal()
check_output_expr(., "Height3Group.model")
}
```

```
Call:
lm(formula = Thumb ~ Height3Group, data = Fingers)
Coefficients:
(Intercept) Height3Groupmedium Height3Grouptall
56.071 4.153 8.023
```

The three-group model is written like this using General Linear Model notation:

\[Y_{i}=b_{0}+b_{1}X_{1i}+b_{2}X_{2i}+e_{i}\]

Whereas fitting the two-group model involved estimating two parameters (\(b_{0}\) and \(b_{1}\)), the three-group model adds a third parameter (\(b_{2}\)).

#### Interpreting the Height3Group Model

\(b_{0}\) is the mean of the short group. \(b_{1}\) is the increment you have to add to the short group to get the mean of the medium group. And \(b_{2}\) is the increment you have to add to the short group to get the mean of the tall group.

We can substitute in the parameter estimates into the model, like this:

\[Y_{i}=56.07+4.15X_{1i}+8.02X_{2i}+e_{i}\]

Just as before, it is useful to think through exactly how the X variables are coded. Notice, first, that we now have two of these in the model: \(X_{1i}\) and \(X_{2i}\). The sub-1 and sub-2 just distinguish between the two variables; instead of giving them different names, we call them X-sub-1 and X-sub-2.

The sub-i indicates these are not parameters, but variables, which means that each individual in the data set will have their own scores on the two variables. As before, it’s a little tricky to figure out what all the possible scores are on these two variables, and also how scores are assigned for each individual.

R doesn’t necessarily use the same numbers you do to code a variable. For the **Height3Group** model we put in a single categorical explanatory variable (**Height3Group**, with levels 1 representing short, 2 representing medium, and 3 representing tall). But R turns this one variable into two new variables, \(X_{1}\) and \(X_{2}\), both of which are “dummy coded,” which means they can either have a value of 0 or 1 for each person in the data set.

So here is how it works: For someone in the short group, the model needs to assign them a score of 56.07, the mean for the short group. You can think of \(X_{1}\) as a variable asking, “Is this person medium?” and 0 means no and 1 means yes. By the same reasoning, \(X_{2}\) represents whether someone is tall or not. For short people, \(X_{1}\) and \(X_{2}\) are both 0 because they are *not* medium and *not* tall.

For the people in the medium group, \(X_{1}\) should be 1 (because they are in the medium group), and \(X_{2}\) should be 0 (because they are *not* in the tall group). So the model will give them a predicted thumb length of 56.07 + 4.15 which is equal to 60.22 mm.

And notice from favstats that the average thumb length of the medium group is 60.22!

```
Height3Group min Q1 median Q3 max mean sd n missing
1 short 39.00 51.00 55 58.42 79.00 56.07113 7.499937 53 0
2 medium 45.00 55.00 60 64.00 86.36 60.22375 8.490406 52 0
3 tall 44.45 59.75 64 68.25 90.00 64.09365 8.388113 52 0
```

Dummy coding takes categorical variables and turns them into a series of binary codes. As you can see from the table below, just giving each person a 0 or 1 on \(X_{1}\) and \(X_{2}\) can uniquely categorize them as short, medium, or tall.

Category (Group) | \(X_1\) Code | \(X_2\) Code |
---|---|---|

Short | 0 | 0 |

Medium | 1 | 0 |

Tall | 0 | 1 |

You may wonder why you need to go through all the details of how R assigns dummy codes for the categorical explanatory variable. It’s important because it gives you a very concrete understanding of how to interpret the model parameters. In this course, we don’t often ask you to calculate numbers on your own. Instead, we want you to focus on thinking about what, exactly, a number means. This will help you do that.

#### Examining the Three-Group Model Fit

You have already done the following: created the **Height3Group** categorical explanatory variable; examined the mean thumb lengths of students in each of the three groups; fit the **Height3Group** model using `lm()`

and interpreted the model parameter estimates; and learned how to represent the three-group model using notation of the GLM.

The final step is to take a look at the ANOVA table so you can compare the fit of the **Height3Group** model to the empty model. Of course, you know how to do this using `supernova()`

. Go ahead and get the ANOVA table for the **Height3Group** model.

```
require(tidyverse)
require(mosaic)
require(lsr)
require(Lock5Data)
require(supernova)
Fingers <- Fingers %>% mutate(
Height2Group = factor(ntile(Height, 2), 1:2, c("short", "tall")),
Height3Group = factor(ntile(Height, 3), 1:3, c("short", "medium", "tall"))
)
Height2Group.model <- lm(Thumb ~ Height2Group, data = Fingers)
Height3Group.model <- lm(Thumb ~ Height3Group, data = Fingers)
```

```
# use supernova() to print the ANOVA table for Height3Group.model
```

```
# use supernova() to print the ANOVA table for Height3Group.model
supernova(Height3Group.model)
```

```
ex() %>% check_output_expr("supernova(Height3Group.model)")
```

Here’s the ANOVA table for the **Height3Group** model. Just for comparison, we pasted in the table for the **Height2Group** model right above it.

**Height2Group Model**

```
Analysis of Variance Table
Outcome variable: Thumb
Model: lm(formula = Thumb ~ Height2Group, data = Fingers)
SS df MS F PRE p
----- ----------------- --------- --- ------- ------ ------ -----
Model (error reduced) | 830.880 1 830.880 11.656 0.0699 .0008
Error (from model) | 11049.331 155 71.286
----- ----------------- --------- --- ------- ------ ------ -----
Total (empty model) | 11880.211 156 76.155
```

**Height3Group Model**

```
Analysis of Variance Table
Outcome variable: Thumb
Model: lm(formula = Thumb ~ Height3Group, data = Fingers)
SS df MS F PRE p
----- ----------------- -------- --- ------- ------ ------ -----
Model (error reduced) | 1690.4 2 845.220 12.774 0.1423 .0000
Error (from model) | 10189.8 154 66.167
----- ----------------- -------- --- ------- ------ ------ -----
Total (empty model) | 11880.2 156 76.155
```

In more advanced classes you will learn how to compare these two models directly. But for our class, we will only compare each model to the empty model.