7.4 How the Model Makes Predictions
Here’s the best-fitting model of thumb length using gender as an explanatory variable:
\[\text{Thumb}_i=b_0+b_1\text{Gendermale}_i+e_i\]
\[\text{Thumb}_i=58.3+6.4\text{Gendermale}_i+e_i\]
Armed with this understanding of how \(\text{Gendermale}_i\) (a.k.a. \(X_i\)) works, you can now use the model to make predictions. If \(X_i=1\), then the student is male. In this case, the predicted thumb length would be \(b_0+b_1*1\). (A reminder: the asterisk represents multiplication in R, so we're using it to notate multiplication in the GLM as well.)
If \(X_i = 0\), meaning the student is not male, then the second parameter estimate won’t get added in, because \(b_1\) times 0 is equal to 0. With that term dropped, the prediction is simply \(b_0\), which is the mean of female thumb lengths.
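As a worked check, plugging the two possible values of \(X_i\) into the fitted model above (using the estimates 58.3 and 6.4 already shown) gives the two predictions the model can make:
\[X_i=1 \text{ (male)}: \quad 58.3+6.4*1=64.7\]
\[X_i=0 \text{ (not male)}: \quad 58.3+6.4*0=58.3\]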
\(X_i\) is a variable, meaning it can take a different value for different students in the data frame. We show that this is a variable by putting a subscript \(i\) after the \(X\). Each student has a value of either 0 or 1 for \(X_i\) because each student in this dataset either is or is not male.
How Does R Know Which Gender to Represent with \(X_i\)?
The answer to this question is: R doesn’t know; it’s just taking whatever group comes first alphabetically (in this case, female) and making it the reference group. The mean of the reference group is the first parameter estimate (\(b_0\) or the Intercept in the lm() output).
R then takes the second group (in this case, male) and represents it with the dummy variable \(X_i\). If \(X_i\) is coded 1 then the student is male. If it is coded 0, then the student is not male.
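To see the reference-group behavior for yourself, here is a minimal sketch in R. The data frame toy and its six thumb lengths are made up for illustration (they are not the course's data), so the estimates will not match the 58.3 and 6.4 shown earlier.

```r
# Hypothetical toy data (NOT the course data set): six thumb lengths in mm
toy <- data.frame(
  Thumb  = c(56, 58, 60, 63, 65, 67),
  Gender = factor(c("female", "female", "female", "male", "male", "male"))
)

# factor() sorts the levels alphabetically by default, so "female" is first
levels(toy$Gender)

# Fit the two-group model; the first level is used as the reference group
fit <- lm(Thumb ~ Gender, data = toy)
coef(fit)

# Compare with the group means computed directly:
# the intercept should match the female mean, and the Gendermale
# estimate should match the male mean minus the female mean
female_mean <- mean(toy$Thumb[toy$Gender == "female"])
male_mean   <- mean(toy$Thumb[toy$Gender == "male"])
female_mean
male_mean - female_mean
```

In this toy example the female mean is 58 and the male mean is 65, so the estimates come out as 58 for the Intercept and 7 for Gendermale.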
Let’s say, just for fun, that you changed the label for male to man and the label for female to woman in the data frame. Because man comes before woman alphabetically, man becomes the reference group, and its mean is now the estimate for the intercept (\(b_0\)).
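The relabeling experiment can be sketched in R as follows. Again, the data frame toy, the column Gender2, and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical toy data (NOT the course data set)
toy <- data.frame(
  Thumb  = c(56, 58, 60, 63, 65, 67),
  Gender = c("female", "female", "female", "male", "male", "male")
)

# Relabel: male -> man, female -> woman, and store the result as a factor
toy$Gender2 <- factor(ifelse(toy$Gender == "male", "man", "woman"))

# "man" now sorts before "woman", so it becomes the first level
levels(toy$Gender2)

# Refit: the intercept is now the mean of the "man" group, and the
# second estimate (named Gender2woman) is the woman mean minus the man mean
fit2 <- lm(Thumb ~ Gender2, data = toy)
coef(fit2)
```

With these made-up values the Intercept becomes 65 (the mean for man) and the Gender2woman estimate becomes -7. The two predictions the model can make are still 65 and 58; only the choice of reference group has changed.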
As long as R knows that a variable is categorical (e.g., a factor), it doesn’t really care how you code it. You can code the categories under Gender with any characters you choose (e.g., female or woman), or with any numbers you choose (e.g., 1 and 2, 0 and 1, or 1 and 500). However you code it, R will take the category that comes first in sorted order as the reference group, whose mean is the intercept (\(b_0\)), and then represent the next one with the dummy explanatory variable, which R will code 0 or 1.
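Here is a small sketch of the numeric-coding case, assuming a made-up data frame toy in which the hypothetical column Code uses 1 for female and 500 for male. The key step is wrapping the codes in factor() so that R treats them as categories rather than as quantities.

```r
# Hypothetical toy data (NOT the course data set); 1 = female, 500 = male
toy <- data.frame(
  Thumb = c(56, 58, 60, 63, 65, 67),
  Code  = factor(c(1, 1, 1, 500, 500, 500))
)

# The smaller code sorts first, so "1" is the reference group
levels(toy$Code)

fit3 <- lm(Thumb ~ Code, data = toy)
coef(fit3)

# model.matrix() shows the dummy variable R actually built:
# the Code500 column holds 0 for the reference group and 1 for the other group
model.matrix(fit3)
```

Even though the original codes were 1 and 500, the Code500 column contains only 0s and 1s, and the estimates (58 and 7 for this toy data) are the same as with the female and male labels.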