Statistics and Data Science (ABC)
7.1 Specifying the Model
Reviewing the Empty Model
In the previous chapters we introduced the idea of a statistical model as a function that generates a predicted score for each observation. We developed what we called the empty model, in which we use the mean as the predicted score for each observation.
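We represented this model in GLM notation like this:
$$Y_i = \beta_0 + \epsilon_i$$
where $\beta_0$ is the true mean of the population and $\epsilon_i$ is each observation’s deviation from that mean.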
It is important to remember that
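The empty model is called a one-parameter model because we only need to estimate one parameter ($\beta_0$) to generate a predicted score for each observation.
Because we don’t know for sure what the true mean ($\beta_0$) is, we write the model we estimate from our data like this:
$$Y_i = b_0 + e_i$$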
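In the case of thumb length, this model states that the DATA (each data point, represented as $Y_i$, a person’s measured thumb length) = MODEL (represented as $b_0$, the Grand Mean of thumb length in the sample) + ERROR (represented as $e_i$, that person’s deviation from the Grand Mean).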
(Note: We use the term Grand Mean to refer to the mean of everyone in the sample in order to distinguish it clearly from other means, such as the mean for males or the mean for females.)
When we use the notation of the General Linear Model, we must define the meaning of each symbol in context. Because the outcome variable $Y_i$ is Thumb length here, we can also write the empty model like this:
$$\text{Thumb}_i = b_0 + e_i$$
NOTE: In this chapter and the next, we will generally use the non-Greek notation because our focus will be on estimating models from data. Later in the book we will return to the world of the DGP and ask how we can use the models we estimate to help us make inferences about the Data Generating Process.
It’s useful to illustrate the null model (or empty model) with our TinyFingers data set. TinyFingers, you will recall, contains six people’s thumb lengths, three males and three females, randomly selected from our complete Fingers data set.
TinyFingers
Sex Thumb
1 female 56
2 female 60
3 female 61
4 male 63
5 male 64
6 male 68
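To follow along in R, the six rows above can be typed in by hand. This is only a minimal sketch: the text treats TinyFingers as an existing data set, so building it with `data.frame()` is an assumption made here to keep the example self-contained.

```r
# Recreate the six-row TinyFingers table by hand
# (values copied from the table above; building it this way is an assumption).
TinyFingers <- data.frame(
  Sex   = factor(c("female", "female", "female", "male", "male", "male"),
                 levels = c("female", "male")),
  Thumb = c(56, 60, 61, 63, 64, 68)
)

# Print the data frame and count the rows in each group.
print(TinyFingers)
print(table(TinyFingers$Sex))
```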
We could represent the distribution of the six thumb lengths, broken down by Sex, using a faceted histogram (left panel, below). Or, we could use a scatterplot (gf_point(), as shown in the right panel), which might be clearer with such a small data set.
In the scatterplot below, the Grand Mean of the distribution, ignoring sex, is represented by the blue line. The Grand Mean is the model prediction for all observations under the empty model. Whether someone is male or female, their predicted thumb length is the same: the Grand Mean, 62 mm.
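The empty model's single prediction can be checked with a few lines of base R. This sketch assumes the hand-built `TinyFingers` data frame below and fits an intercept-only model with `lm(Thumb ~ 1)`; the object names `grand_mean` and `empty_model` are illustrative choices, not names taken from the text.

```r
# Hand-built copy of the TinyFingers table (an assumption for self-containment).
TinyFingers <- data.frame(
  Sex   = factor(c("female", "female", "female", "male", "male", "male"),
                 levels = c("female", "male")),
  Thumb = c(56, 60, 61, 63, 64, 68)
)

# The Grand Mean ignores Sex entirely.
grand_mean <- mean(TinyFingers$Thumb)
print(grand_mean)

# An intercept-only model estimates one parameter, b0.
empty_model <- lm(Thumb ~ 1, data = TinyFingers)
print(coef(empty_model))

# Every person receives the same predicted thumb length.
print(predict(empty_model))
```

The estimate of $b_0$ and every predicted value should match the Grand Mean of the six thumb lengths.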
Adding an Explanatory Variable to the Model
Now let’s add an explanatory variable, Sex, into the model. In the Sex model, which includes sex as an explanatory variable, we model the variation in thumb length with two numbers: the mean for males (65), and the mean for females (59). The model still generates a predicted thumb length for each person, but now the model generates a different prediction for a male than it does for a female.
Error is still measured the same way, as the deviation of each person’s measured thumb length from their predicted thumb length. But this time, the error is calculated from each person’s group mean (male or female) instead of from the Grand Mean (see figure above).
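As a rough check on the two group means and the errors measured from them, the calculation can be sketched in base R. The data frame is rebuilt by hand here, and the names `group_means`, `predicted`, and `error` are illustrative assumptions.

```r
# Hand-built copy of the TinyFingers table (an assumption for self-containment).
TinyFingers <- data.frame(
  Sex   = factor(c("female", "female", "female", "male", "male", "male"),
                 levels = c("female", "male")),
  Thumb = c(56, 60, 61, 63, 64, 68)
)

# Mean thumb length within each group.
group_means <- tapply(TinyFingers$Thumb, TinyFingers$Sex, mean)
print(group_means)

# Each person's prediction is the mean of their own group.
predicted <- as.numeric(group_means[as.character(TinyFingers$Sex)])

# Error is the measured value minus the predicted value.
error <- TinyFingers$Thumb - predicted
print(data.frame(TinyFingers, predicted = predicted, error = error))
```

Within each group the errors sum to zero, which is what we expect when deviations are measured from that group's own mean.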
Whereas the empty model was a one-parameter model (we only had to estimate one parameter, the Grand Mean), the Sex model is a two-parameter model. One of the parameters is the mean for males, the other is the mean for females.
There are actually a few ways you could write this model; we will write the model of the DGP like this:
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$
Replacing the parameters with their estimates, we write the model we are estimating with our data like this:
$$Y_i = b_0 + b_1 X_i + e_i$$
We can also write it like this:
In this equation,
Interpreting the GLM Notation for a Two-Group Model
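In the two-group model, the model statement, $b_0 + b_1 X_i$, generates one of two possible predictions for each person, depending on their sex.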
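Let’s unpack the GLM notation to see how it generates the two possible model predictions. In the two-group model, $b_0$ represents the mean for females, which, in the TinyFingers data frame, is 59.
If $b_0$ is the mean for females, then $b_1$ is the increment that must be added to it to get the mean for males. In the TinyFingers data, that increment is 6 mm, meaning that if you add 6 mm to the mean for females you will get the mean for males.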
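The two estimates described above (59 and 6) can be recovered by fitting the model in R. This is a sketch under assumptions: the data frame is rebuilt by hand, `female` is set as the first factor level so that it serves as the reference group, and base R's `lm()` is used for the fit.

```r
# Hand-built copy of the TinyFingers table (an assumption for self-containment).
# Listing "female" first makes it the reference group (coded 0).
TinyFingers <- data.frame(
  Sex   = factor(c("female", "female", "female", "male", "male", "male"),
                 levels = c("female", "male")),
  Thumb = c(56, 60, 61, 63, 64, 68)
)

# Fit the two-group model: Thumb explained by Sex.
sex_model <- lm(Thumb ~ Sex, data = TinyFingers)
b <- coef(sex_model)
print(b)

# b0 is the mean for females; b0 + b1 is the mean for males.
female_mean <- unname(b["(Intercept)"])
male_mean   <- unname(b["(Intercept)"] + b["Sexmale"])
print(c(female = female_mean, male = male_mean))
```

R labels the second estimate `Sexmale` because it is the increment added when Sex is male.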
Here is how we would rewrite the model to include the estimated parameters:
$$Y_i = 59 + 6X_i + e_i$$
You can also write Thumb and Sex instead of writing Y and X:
$$\text{Thumb}_i = 59 + 6\,\text{Sex}_i + e_i$$
Now let’s see how these two parameter estimates (59 and 6) work inside the overall model ($Y_i = 59 + 6X_i + e_i$). $X_i$ represents the explanatory variable, Sex. The subscript i reminds us that each person will have their own value on this variable (i.e., each person is either female or male). In order to make the model add up, we code $X_i$ as 0 for females and 1 for males.
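Let’s see what would happen if we used this model to predict a female’s thumb length. If someone is female, she would be coded 0 for $X_i$, so her predicted thumb length would be $59 + 6(0) = 59$, the mean for females.
Let’s consider what would happen if we asked this model to predict a male’s thumb length. If someone is male, he would be coded 1 for $X_i$, so his predicted thumb length would be $59 + 6(1) = 65$, the mean for males.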
Under this model, as expected, males would all end up with one predicted thumb length (65, the mean for males, or $b_0 + b_1$), and females would all end up with another (59, the mean for females, or $b_0$).
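The 0/1 coding can be applied by hand to see the predictions row by row. In this sketch the estimates 59 and 6 are plugged in directly, and the column name `X` and the use of `ifelse()` for the coding are illustrative assumptions rather than the text's own procedure.

```r
# Hand-built copy of the TinyFingers table (an assumption for self-containment).
TinyFingers <- data.frame(
  Sex   = factor(c("female", "female", "female", "male", "male", "male"),
                 levels = c("female", "male")),
  Thumb = c(56, 60, 61, 63, 64, 68)
)

# Parameter estimates taken from the text.
b0 <- 59
b1 <- 6

# Code Sex numerically: 0 for female, 1 for male.
X <- ifelse(TinyFingers$Sex == "male", 1, 0)

# Model prediction for each person: b0 + b1 * X.
predicted <- b0 + b1 * X
print(data.frame(TinyFingers, X = X, predicted = predicted))
```

Because $X_i$ takes only two values, the model can only ever produce two distinct predictions.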
Comparing the Two-Group Model With the Empty Model
We’ve discussed what happens when $X_i$ is 0 (female) and when $X_i$ is 1 (male). Now compare the form of the two models:
Two-group model: $Y_i = b_0 + b_1 X_i + e_i$
Empty model: $Y_i = b_0 + e_i$
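In the two-group model, if $b_1$ is 0, the term $b_1 X_i$ drops out and the model makes the same prediction, $b_0$, for everyone.
Notice, then, that if $b_1 = 0$, the two-group model reduces to the empty model.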
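A small sketch can make the comparison concrete. The helper `predict_thumb()` is hypothetical (it is not a function from the text or from any package), and using 62, the Grand Mean of TinyFingers, as $b_0$ in the second call is an assumption chosen for illustration.

```r
# Hypothetical helper: the model part of the two-group model, b0 + b1 * X.
predict_thumb <- function(b0, b1, x) {
  b0 + b1 * x
}

# 0/1 coding for the two groups.
x <- c(female = 0, male = 1)

# With the estimates from the text, the groups get different predictions.
two_group <- predict_thumb(b0 = 59, b1 = 6, x = x)
print(two_group)

# With an increment of 0, both groups get the same prediction,
# which is exactly what the empty model does.
no_increment <- predict_thumb(b0 = 62, b1 = 0, x = x)
print(no_increment)
```

When the increment is 0, the value of $X_i$ no longer matters, so a single number serves as the prediction for everyone.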
A Final Note on Error
Note that we are broadening our definition of error from the way we thought of it for the empty model. For the empty model, error was the residual from the mean (i.e., the Grand Mean). Now we need to expand our thinking a bit, seeing error as the residual from the predicted score, which may not necessarily be the Grand Mean.
Of course, under both models (empty and two-group) the error is the residual from the predicted score under the model. It so happens that in the empty model, that predicted score is the Grand Mean. In the two-group model, the error is the residual from the male mean if you are male, or the female mean if you are female.
No matter how complex our models become, error is always defined as the residual, for each data point, from its predicted score under the model.
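The general definition of error can be checked against both models at once. This sketch assumes the hand-built data frame below and base R's `lm()`, `predict()`, and `resid()`; the column names `error_empty` and `error_sex` are illustrative.

```r
# Hand-built copy of the TinyFingers table (an assumption for self-containment).
TinyFingers <- data.frame(
  Sex   = factor(c("female", "female", "female", "male", "male", "male"),
                 levels = c("female", "male")),
  Thumb = c(56, 60, 61, 63, 64, 68)
)

# Fit both models.
empty_model <- lm(Thumb ~ 1, data = TinyFingers)
sex_model   <- lm(Thumb ~ Sex, data = TinyFingers)

# Error = measured value minus the score predicted by the model.
errors <- data.frame(
  Sex         = TinyFingers$Sex,
  Thumb       = TinyFingers$Thumb,
  error_empty = round(TinyFingers$Thumb - predict(empty_model), 2),
  error_sex   = round(TinyFingers$Thumb - predict(sex_model), 2)
)
print(errors)

# resid() should report the same values for each model.
print(round(resid(empty_model), 2))
print(round(resid(sex_model), 2))
```

The same subtraction produces both error columns; only the predicted score being subtracted changes from one model to the other.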