7.3 GLM Notation for the Group Model
Most of us are content to describe the Gender model simply as two means. But, as with the empty model, it will be helpful to learn how a two-group model is represented in the notation of the General Linear Model, especially as we develop more complicated models.
The Gender Model Using GLM Notation
The full GLM equation for the Gender model incorporates both \(b_0\) and \(b_1\). There are a few ways you could write this model, but we will write it like this:
\[Y_i=b_0+b_1X_i+e_i\]
We can also write it in a way more specific to the Gender model of Thumb, like this:
\[\text{Thumb}_i=b_0+b_1\text{Gendermale}_i+e_i\]
Using the output from lm(), we can substitute the estimates into the model.
Call:
lm(formula = Thumb ~ Gender, data = Fingers)
Coefficients:
(Intercept)   Gendermale
     58.256        6.447
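Substituting the two estimates from this output for \(b_0\) and \(b_1\) gives the following version of the Gender model (a sketch built directly from the coefficients shown above):
\[\text{Thumb}_i=58.256+6.447\text{Gendermale}_i+e_i\]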
It’s important to notice, first, that both the empty model and the two-group Gender model start with \(Y_i\) to the left of the equals sign and end with \(e_i\). In both models, \(Y_i\) represents the thumb length for student \(i\), and \(e_i\) represents the error, or residual: the difference between the actual thumb length and the predicted thumb length for student \(i\).
For the two-group model, the MODEL part of DATA = MODEL + ERROR is now more complicated: \(b_0+b_1X_i\) instead of simply \(b_0\) (for the empty model). In both cases, though, the model can be thought of as a function that produces a predicted value on the outcome variable for each observation (in this case, student).
Note that the \(b_0\) parameter estimate has a different meaning than it does in the empty model. It is the first parameter in both models. But for the empty model, which only has one parameter, it represents the mean of Thumb for the whole sample of data, whereas for the two-group model (with two parameters), it represents the mean of the first group (in this case, female).
You might find it confusing to use the same symbol to represent two different ideas. But this flexibility is what makes the General Linear Model so powerful and so… general.
Unlike the empty model, this more complicated model (\(b_0 + b_1X_i\)) is able to generate two different predictions depending on whether a student is female or male.
Interpreting \(X_i\)
We have developed the idea that \(b_0\) is the mean of the first group, and \(b_0 + b_1\) is the mean of the second group. But the function that results in a predicted value for each observation under the two-group model is this: \(b_0 + b_1 X_i\). In this model, what does the \(X_i\) do?
It turns out we need the \(X_i\) in order for the model to actually compute two predicted scores. Here’s how it works. \(X_i\) represents the grouping variable (our explanatory variable, Gender), but in a special way. It is called a dummy variable, which means that R creates it specifically to make the model work.
R takes the variable Gender and recodes it into a new variable (\(X_i\)) that can only be assigned one of two values: 0 or 1. In the two-group model, \(X_i\) is coded 1 if the student is in the second group (male), and it is coded 0 if the student is not in the second group (i.e., not male).
Although in this data, saying a student is not male is the same as saying the student is female, it’s important to think of \(X_i = 0\) as meaning the student is not male. Keeping this subtle distinction in mind will help us understand how dummy variables work when we have models with more than 2 groups.
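To see how the dummy variable produces two different predictions, we can plug each possible value of \(X_i\) into \(b_0 + b_1X_i\), using the estimates from the lm() output above. (The symbol \(\hat{Y}_i\) is introduced here only as shorthand for the predicted thumb length; it is an assumption of this worked example.)
\[\hat{Y}_i=58.256+6.447(0)=58.256 \quad \text{when } X_i=0 \text{ (not male)}\]
\[\hat{Y}_i=58.256+6.447(1)=64.703 \quad \text{when } X_i=1 \text{ (male)}\]
When \(X_i = 0\), the \(b_1\) term drops out and the prediction is just \(b_0\). When \(X_i = 1\), the full value of \(b_1\) is added on.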
The \(b_0\) estimate is called Intercept in the lm() output because it is the predicted thumb length when \(X_i\) is equal to 0; in other words, when Gender is not male. The estimate that R called Gendermale (\(b_1\)), by this line of reasoning, is kind of like the slope of a line. It is the adjustment in predicted thumb length for a 1-unit increase in Gender (that is, moving from \(X_i = 0\) to \(X_i = 1\)).
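One way to check these ideas for yourself is with a small sketch in R. The data frame below (toy) is made up for illustration and is not the Fingers data; the names toy, X, group_means, and by_hand are likewise assumptions of this example. It fits a two-group model, pulls out the dummy variable R created, and rebuilds the predictions by hand from \(b_0 + b_1X_i\).

```r
# Hypothetical toy data (NOT the Fingers data set used in the text)
toy <- data.frame(
  Thumb  = c(56, 58, 60, 62, 64, 66),
  Gender = factor(c("female", "female", "female", "male", "male", "male"),
                  levels = c("female", "male"))
)

# Fit the two-group model and extract the estimates b0 and b1
model <- lm(Thumb ~ Gender, data = toy)
b <- coef(model)

# The dummy variable R builds from Gender: 0 for not male, 1 for male
X <- model.matrix(model)[, "Gendermale"]

# Group means, for comparison with the estimates:
# b0 should match the female mean, and b0 + b1 the male mean
group_means <- tapply(toy$Thumb, toy$Gender, mean)

# Predictions computed by hand from b0 + b1 * X
by_hand <- b["(Intercept)"] + b["Gendermale"] * X

print(b)
print(X)
print(group_means)
print(by_hand)
```

In this toy data the female mean is 58 and the male mean is 64, so the intercept should come out as 58 and the Gendermale estimate as 6. The hand-computed predictions should match what predict(model) returns.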