7.3 GLM Notation for the Group Model
Most of us are content to describe the Gender model simply as two means. But as with the empty model, it will be helpful to learn how a two-group model is represented in the notation of the General Linear Model, especially as we develop more complicated models.
The Gender Model Using GLM Notation
The full GLM equation for the Gender model incorporates both \(b_0\) and \(b_1\). There are a few ways to write this model, but we will write it like this:
\[Y_i=b_0+b_1X_i+e_i\]
We can also write it in a way more specific to the Gender model of Thumb, like this:
\[\text{Thumb}_i=b_0+b_1\text{Gendermale}_i+e_i\]
Using the output from lm(), we can substitute the estimates into the model.
Call:
lm(formula = Thumb ~ Gender, data = Fingers)

Coefficients:
(Intercept)   Gendermale
     58.256        6.447
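Written out, the substitution would look like this, with 58.256 (the Intercept estimate) in place of \(b_0\) and 6.447 (the Gendermale estimate) in place of \(b_1\):
\[\text{Thumb}_i=58.256+6.447\,\text{Gendermale}_i+e_i\]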
It’s important to notice, first, that both the empty model and the two-group Gender model start with \(Y_i\) to the left of the equals sign and end with \(e_i\). In both models, \(Y_i\) represents the thumb length for student \(i\), and \(e_i\) represents the error, or residual: the difference between the actual thumb length and the predicted thumb length for student \(i\).
For the two-group model, the MODEL part of DATA = MODEL + ERROR is now more complicated: \(b_0+b_1X_i\) instead of simply \(b_0\) (for the empty model). In both cases, though, the model can be thought of as a function that produces a predicted value on the outcome variable for each observation (in this case, student).
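To make DATA = MODEL + ERROR concrete, here is a minimal sketch in R. It does not use the real Fingers data: the data frame `toy` and its six thumb lengths are made up purely for illustration. The sketch fits the two-group model, pulls out the predicted values (the MODEL part) and the residuals (the ERROR part), and checks that they add back up to the data.

```r
# Hypothetical toy data: six made-up students (NOT the real Fingers data)
toy <- data.frame(
  Thumb  = c(56, 58, 60, 62, 64, 66),
  Gender = factor(c("female", "female", "female", "male", "male", "male"))
)

# Fit the two-group model
group_model <- lm(Thumb ~ Gender, data = toy)

# MODEL part: one predicted thumb length per student
model_part <- fitted(group_model)

# ERROR part: actual thumb length minus predicted thumb length
error_part <- resid(group_model)

# Look at DATA, MODEL, and ERROR side by side
data.frame(toy, predicted = model_part, residual = error_part)

# DATA = MODEL + ERROR holds for every student
all.equal(unname(model_part + error_part), toy$Thumb)
```

Notice that the predicted column takes only two distinct values, one per group, even though every student has their own residual.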
Note that the \(b_0\) parameter estimate has a different meaning than it does in the empty model. It is the first parameter in both models. But for the empty model, which has only one parameter, it represents the mean of Thumb for the whole sample of data, whereas for the two-group model (with two parameters), it represents the mean of the first group (in this case, female).
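We can check the two meanings of \(b_0\) with a small sketch. Again, the data frame `toy` is hypothetical (made-up numbers, not the Fingers data); only lm(), coef(), and mean() are real R functions here. The sketch fits both models and compares each intercept to the mean it is supposed to represent.

```r
# Hypothetical toy data: six made-up students (NOT the real Fingers data)
toy <- data.frame(
  Thumb  = c(56, 58, 60, 62, 64, 66),
  Gender = factor(c("female", "female", "female", "male", "male", "male"))
)

# Empty model: one parameter, b0
empty_model <- lm(Thumb ~ 1, data = toy)
b0_empty <- unname(coef(empty_model)["(Intercept)"])

# Two-group model: two parameters, b0 and b1
group_model <- lm(Thumb ~ Gender, data = toy)
b0_group <- unname(coef(group_model)["(Intercept)"])

# In the empty model, b0 matches the mean of the whole sample
b0_empty
mean(toy$Thumb)

# In the two-group model, b0 matches the mean of the first group (female)
b0_group
mean(toy$Thumb[toy$Gender == "female"])
```

R treats female as the first group here simply because factor levels are ordered alphabetically by default.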
You might find it confusing to use the same symbol to represent two different ideas. But this flexibility is what makes the General Linear Model so powerful and so… general.
Unlike the empty model, this more complicated model (\(b_0 + b_1X_i\)) is able to generate two different predictions depending on whether a student is female or male.
Interpreting \(X_i\)
We have developed the idea that \(b_0\) is the mean of the first group, and \(b_0 + b_1\) is the mean of the second group. But the function that results in a predicted value for each observation under the two-group model is this: \(b_0 + b_1 X_i\). In this model, what does the \(X_i\) do?
It turns out we need the \(X_i\) in order for the model to actually compute two predicted scores. Here’s how it works. \(X_i\) represents the grouping variable (our explanatory variable, Gender), but in a special way. It is called a dummy variable, which means that R creates it specifically to make the model work.
R takes the variable Gender and recodes it into a new variable (\(X_i\)) that can take only one of two values: 0 or 1. In the two-group model, \(X_i\) is coded 1 if the student is in the second group (male), and it is coded 0 if the student is not in the second group (i.e., not male).
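The recoding described above can be sketched in R. The four students in `toy` are hypothetical, and using model.matrix() to peek at the dummy variable is our own choice for illustration; lm() builds the same kind of column behind the scenes. The values of `b0` and `b1` are copied from the lm() output earlier in this section.

```r
# Hypothetical toy data: four made-up students (NOT the real Fingers data)
toy <- data.frame(
  Gender = factor(c("female", "male", "male", "female"))
)

# model.matrix() shows the columns that R builds from Gender
design <- model.matrix(~ Gender, data = toy)

# The dummy variable X: 1 = male, 0 = not male
X <- design[, "Gendermale"]
X

# Estimates copied from the lm() output for the Fingers data
b0 <- 58.256
b1 <- 6.447

# The model's prediction for each student: b0 + b1 * X
predicted <- b0 + b1 * X
predicted
```

When \(X_i\) is 0, the \(b_1\) term drops out and the prediction is just \(b_0\). When \(X_i\) is 1, the prediction is \(b_0 + b_1\). That is how one equation produces two different predictions.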
Although in these data, saying a student is not male is the same as saying the student is female, it’s important to think of \(X_i = 0\) as meaning the student is not male. Keeping this subtle distinction in mind will help us understand how dummy variables work when we have models with more than two groups.
The \(b_0\) estimate is called (Intercept) in the lm() output because it is the predicted thumb length when \(X_i\) is equal to 0, in other words, when the Gender is not male. The estimate that R called Gendermale (\(b_1\)), by this line of reasoning, is kind of like the slope of a line. It is the adjustment in predicted thumb length for a 1-unit increase in \(X_i\), that is, for going from not male (0) to male (1).
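One way to see the slope idea is to compute the change in the prediction divided by the change in \(X_i\), using the estimates from the lm() output above. The predicted thumb length is 58.256 when \(X_i = 0\) and \(58.256 + 6.447 = 64.703\) when \(X_i = 1\), so:
\[\frac{64.703-58.256}{1-0}=6.447=b_1\]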