7.3 GLM Notation for the Group Model

Most of us are content to describe the Sex model simply as two means. But, as with the empty model, it helps to learn how a two-group model is represented in the notation of the General Linear Model, especially as we develop more complicated models.

The Sex Model Using GLM Notation

The full GLM equation for the Sex model incorporates both \(b_0\) and \(b_1\). There are a few ways to write this model, but we will write it like this:

\[Y_i=b_0+b_1X_i+e_i\]

We can also write it more specifically for the Sex model of Thumb:

\[\text{Thumb}_i=b_0+b_1\text{Sexmale}_i+e_i\]

Using the output from lm(), we can substitute the estimates into the model.

Call:
lm(formula = Thumb ~ Sex, data = Fingers)

Coefficients:
(Intercept)      Sexmale  
     58.256        6.447 
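
Substituting the two estimates from this output for \(b_0\) and \(b_1\) gives the fitted model:

\[\text{Thumb}_i=58.256+6.447\text{Sexmale}_i+e_i\]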

It’s important to notice, first, that both the empty model and the two-group Sex model start with \(Y_i\) to the left of the equals sign and end with \(e_i\). In both models, \(Y_i\) represents the thumb length for student i, and \(e_i\) represents the error or residual between the predicted thumb length and the actual thumb length for student i.

For the two-group model, the MODEL part of DATA = MODEL + ERROR is now more complicated: \(b_0+b_1X_i\) instead of simply \(b_0\) (for the empty model). In both cases, though, the model can be thought of as a function that produces a predicted value on the outcome variable for each observation (in this case, student).

Note that the \(b_0\) parameter estimate has a different meaning than it does in the empty model. It is the first parameter in both models. But for the empty model, which only has one parameter, it represents the mean of Thumb for the whole sample of data, whereas for the two-group model (with two parameters), it represents the mean of the first group (in this case, female).

You might find it confusing to use the same symbol to represent two different ideas. But this flexibility is what makes the General Linear Model so powerful and so… general.

Unlike the empty model, this more complicated model (\(b_0 + b_1X_i\)) is able to generate two different predictions depending on whether a student is female or male.

Interpreting \(X_i\)

We have developed the idea that \(b_0\) is the mean of the first group, and \(b_0 + b_1\) is the mean of the second group. But the function that results in a predicted value for each observation under the two-group model is this: \(b_0 + b_1 X_i\). In this model, what does the \(X_i\) do?

It turns out we need the \(X_i\) in order for the model to actually compute two predicted scores. Here’s how it works. \(X_i\) represents the grouping variable – our explanatory variable, Sex – but in a special way. It is called a dummy variable, which means that R creates it specifically to make the model work.

R takes the variable Sex and recodes it into a new variable (\(X_i\)) that can only be assigned one of two values: 0 or 1. In the two-group model, \(X_i\) is coded 1 if the student is in the second group (male), and it is coded 0 if the student is not in the second group (i.e., not male).
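
Plugging the two possible values of \(X_i\) into the model, using the estimates from the lm() output above, shows how a single equation yields two predictions:

\[\text{not male } (X_i=0):\quad 58.256+6.447(0)=58.256\]

\[\text{male } (X_i=1):\quad 58.256+6.447(1)=64.703\]

The first prediction is simply \(b_0\); the second is \(b_0+b_1\).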

Although in this data set, saying a student is not male is the same as saying the student is female, it’s important to think of \(X_i = 0\) as meaning the student is not male. Keeping this subtle distinction in mind will help us understand how dummy variables work when we have models with more than 2 groups.
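
The recoding described above can be inspected directly in R. The sketch below is only an illustration: the data frame toy and its six values are hypothetical, invented for this example, and are not the Fingers data. It uses model.matrix() to display the 0/1 dummy variable that lm() works with, then checks that \(b_0\) matches the mean of the first group and \(b_0 + b_1\) matches the mean of the second.

```r
# Hypothetical toy data, invented for illustration (NOT the Fingers data)
toy <- data.frame(
  Thumb = c(58, 60, 56, 65, 67, 63),
  Sex   = factor(c("female", "female", "female", "male", "male", "male"))
)

# The design matrix lm() builds from the formula; its Sexmale column is
# the dummy variable: 1 if the student is male, 0 if not male
X <- model.matrix(Thumb ~ Sex, data = toy)
dummy <- X[, "Sexmale"]
print(dummy)

# Fit the two-group model and pull out the two parameter estimates
fit <- lm(Thumb ~ Sex, data = toy)
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["Sexmale"]]

# Compare the estimates with the two group means
group_means <- tapply(toy$Thumb, toy$Sex, mean)
print(all.equal(b0, unname(group_means["female"])))
print(all.equal(b0 + b1, unname(group_means["male"])))
```

Because female comes first alphabetically among the factor levels, R treats it as the first group, which is why the dummy variable is named Sexmale rather than Sexfemale.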

The \(b_0\) estimate is called Intercept in the lm() output because it is the predicted thumb length when \(X_i\) is equal to 0 – in other words, when the Sex is not male. The estimate that R called Sexmale (\(b_1\)), by this line of reasoning, is kind of like the slope of a line. It is the adjustment in predicted thumb length for a 1-unit increase in \(X_i\), that is, for going from not male (0) to male (1).
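
To see why \(b_1\) behaves like a slope, compare the model's predictions at the two values of \(X_i\):

\[(b_0+b_1\cdot 1)-(b_0+b_1\cdot 0)=b_1\]

With the estimates from the lm() output above, that difference is 6.447, the value labeled Sexmale.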
