High School / Advanced Statistics and Data Science I (ABC)
7.3 GLM Notation for the Group Model
Most of us are content to describe the Sex model simply as two means. But as with the empty model, it will be helpful to learn how a two-group model is represented in the notation of the General Linear Model, especially as we develop more complicated models.
The Sex Model Using GLM Notation
The full GLM equation for the Sex model incorporates both \(b_0\) and \(b_1\). There are a few ways to write this model, but we will write it like this:
\[Y_i=b_0+b_1X_i+e_i\]
We can also write it in a way more specific to the Sex model of Thumb, like this:
\[\text{Thumb}_i=b_0+b_1\text{Sexmale}_i+e_i\]
Using the output from lm(), we can substitute the estimates into the model.
Call:
lm(formula = Thumb ~ Sex, data = Fingers)
Coefficients:
(Intercept)      Sexmale
     58.256        6.447
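As a worked substitution, using only the two coefficients printed above (and their rounding), the fitted two-group model can be written as:

\[\text{Thumb}_i=58.256+6.447\,\text{Sexmale}_i+e_i\]

Here 58.256 takes the place of \(b_0\) and 6.447 takes the place of \(b_1\).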
It’s important to notice, first, that both the empty model and the two-group Sex model start with \(Y_i\) to the left of the equals sign and end with \(e_i\). In both models, \(Y_i\) represents the thumb length for student \(i\), and \(e_i\) represents the error, or residual: the difference between the actual thumb length and the predicted thumb length for student \(i\).
For the two-group model, the MODEL part of DATA = MODEL + ERROR is now more complicated: \(b_0+b_1X_i\) instead of simply \(b_0\) (for the empty model). In both cases, though, the model can be thought of as a function that produces a predicted value on the outcome variable for each observation (in this case, each student).
Note that the \(b_0\) parameter estimate has a different meaning in the two-group model than it does in the empty model. It is the first parameter in both models. But for the empty model, which has only one parameter, it represents the mean of Thumb for the whole sample of data, whereas for the two-group model (with two parameters), it represents the mean of the first group (in this case, female).
You might find it confusing to use the same symbol to represent two different ideas. But this flexibility is what makes the General Linear Model so powerful and so… general.
Unlike the empty model, this more complicated model (\(b_0 + b_1X_i\)) is able to generate two different predictions depending on whether a student is female or male.
Interpreting \(X_i\)
We have developed the idea that \(b_0\) is the mean of the first group, and \(b_0 + b_1\) is the mean of the second group. But the function that results in a predicted value for each observation under the two-group model is this: \(b_0 + b_1 X_i\). In this model, what does the \(X_i\) do?
It turns out we need the \(X_i\) in order for the model to actually compute two predicted scores. Here’s how it works. \(X_i\) represents the grouping variable (our explanatory variable, Sex), but in a special way. It is called a dummy variable, which means that R creates it specifically to make the model work.
R takes the variable Sex and recodes it into a new variable (\(X_i\)) that can only be assigned one of two values: 0 or 1. In the two-group model, \(X_i\) is coded 1 if the student is in the second group (male), and it is coded 0 if the student is not in the second group (i.e., not male).
Although in this data set, saying a student is not male is the same as saying the student is female, it’s important to think of \(X_i = 0\) as meaning the student is not male. Keeping this subtle distinction in mind will help us understand how dummy variables work when we have models with more than two groups.
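The dummy coding described above can be inspected directly in R. The sketch below is only an illustration: it uses a small made-up data frame called toy (a hypothetical stand-in, not the Fingers data), and it assumes R's default treatment coding for factors, in which the first level (here female) is coded 0.

```r
# Hypothetical toy data, invented for illustration only.
# These four rows are NOT taken from the Fingers data set.
toy <- data.frame(
  Thumb = c(56, 60, 62, 66),
  Sex = factor(c("female", "female", "male", "male"),
               levels = c("female", "male"))
)

# model.matrix() shows the columns that lm() builds from the formula.
# The Sexmale column is the dummy variable X_i: 1 if male, 0 if not male.
X <- model.matrix(Thumb ~ Sex, data = toy)
print(X)

# Fit the two-group model and pull out the estimates b0 and b1.
fit <- lm(Thumb ~ Sex, data = toy)
b <- coef(fit)
print(b)

# Rebuild the predictions by hand as b0 + b1 * X_i.
predicted <- b[["(Intercept)"]] + b[["Sexmale"]] * X[, "Sexmale"]
print(predicted)
```

In this toy example the Sexmale column is 0 for the two female rows and 1 for the two male rows, so the hand-built predictions take only two distinct values, one per group.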
The \(b_0\) estimate is called Intercept in the lm() output because it is the predicted thumb length when \(X_i\) is equal to 0, in other words, when Sex is not male. The estimate that R calls Sexmale (\(b_1\)), by this line of reasoning, is kind of like the slope of a line. It is the adjustment in predicted thumb length for a 1-unit increase in the dummy-coded Sex variable (that is, when \(X_i\) goes from 0 to 1).
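As a worked check of this interpretation, the estimates from the lm() output earlier (58.256 and 6.447) can be plugged into \(b_0 + b_1X_i\) for each value of \(X_i\). The symbol \(\hat{Y}_i\) is used here only as shorthand for the predicted thumb length; it is an assumption of this example, not notation introduced in this section.

\[\hat{Y}_i = 58.256 + 6.447(0) = 58.256 \quad \text{when } X_i = 0 \text{ (not male)}\]
\[\hat{Y}_i = 58.256 + 6.447(1) = 64.703 \quad \text{when } X_i = 1 \text{ (male)}\]

The two predictions differ by exactly 6.447, the value of \(b_1\), which is the sense in which \(b_1\) behaves like a slope.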