7.3 Fitting a Model With an Explanatory Variable
Now that you have learned how to specify a model with an explanatory variable, let’s learn how to fit the model using R.
Fitting a model, as a reminder, simply means calculating the parameter estimates. We use the word “fitting” because we want to calculate the best estimate, the one that will result in the least amount of error. For the tiny data set, we could calculate the parameter estimates in our head—it’s just a matter of calculating the mean for males and the mean for females. But when the data set is larger, it is much easier to use R.
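The "in our head" calculation can also be written out in R. The sketch below is illustrative only: the six Thumb values are assumed stand-ins for the tiny data set (they are not taken from this page), chosen so that the two group means are whole numbers.

```r
# Hypothetical stand-in for the tiny data set: six students, three per group.
# The Thumb values are assumed for illustration.
TinyFingers <- data.frame(
  Sex   = factor(rep(c("female", "male"), each = 3)),
  Thumb = c(56, 60, 61, 63, 64, 68)
)

# Mean thumb length within each group
group_means <- tapply(TinyFingers$Thumb, TinyFingers$Sex, mean)
group_means
```

With these assumed values the female mean is 59 and the male mean is 65, so the difference between the groups is 6.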
Using R, we will first fit the Sex model to the tiny data set, just so you can see that R gives you the same parameter estimates you got before. After that we will fit it to the complete data set.
Note that the parts that are going to be different for each person ($Y_i$, $X_i$, and $e_i$) carry the subscript $i$.
We do not need to estimate the variables. Each student in the data set already has a score for the outcome variable (Thumb) and for the explanatory variable (Sex).
We do need to estimate the parameters because, as discussed previously, they are features of the population, and thus are unknown. The parameter estimates we calculate are those that best fit our particular sample of data. But we would have gotten different estimates if we had a different sample. Thus, it is important to keep in mind that these estimates are only that, and they are undoubtedly wrong. Calling them estimates keeps us humble!
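To see why a different sample would give different estimates, here is a small simulation sketch. Everything in it is hypothetical: the population means (58 and 64), the standard deviation (8), the sample size, and the helper name draw_sample() are made up for illustration and are not values from the Fingers data.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical helper: draw one random sample from an assumed population
draw_sample <- function(n_per_group = 20) {
  data.frame(
    Sex   = factor(rep(c("female", "male"), each = n_per_group)),
    Thumb = c(rnorm(n_per_group, mean = 58, sd = 8),
              rnorm(n_per_group, mean = 64, sd = 8))
  )
}

# Group means from two independent samples of the same population
sample_1 <- draw_sample()
sample_2 <- draw_sample()
est1 <- tapply(sample_1$Thumb, sample_1$Sex, mean)
est2 <- tapply(sample_2$Thumb, sample_2$Sex, mean)

rbind(sample_1 = est1, sample_2 = est2)
```

The two rows come from the same assumed population, yet the estimates differ, which is the sense in which any one set of estimates is only an estimate.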
Parameter estimates don’t vary from person to person, so they don’t carry the subscript $i$.
Fitting the Sex Model to the Tiny Data Set
We will refer to this more complex model (more complex than the empty model, that is) as the Sex model. It has one explanatory variable, Sex. We will fit the model using R’s lm() (linear model) function.
To fit the model we run this R code, and get the results below:
lm(Thumb ~ Sex, data=TinyFingers)
Call:
lm(formula = Thumb ~ Sex, data = TinyFingers)
Coefficients:
(Intercept) Sexmale
59 6
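Note that the estimates are exactly what you should have expected: the first estimate, for $b_0$, is 59, the mean for females; the second, for $b_1$, is 6, the amount you add to the female mean to get the male mean of 65.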
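Notice that the estimate for $b_0$ is labeled (Intercept) in the output, while the estimate for $b_1$ is labeled Sexmale.
The reason the estimate for $b_1$ is labeled Sexmale is that it is the increment added to the prediction only when a student’s Sex is male.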
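The relationship between the two coefficients and the two group means can be checked directly. This is a minimal sketch; the Thumb values are assumed stand-ins for the tiny data set, and the object names fit and b are chosen only for this example.

```r
# Hypothetical stand-in for the tiny data set (assumed values)
TinyFingers <- data.frame(
  Sex   = factor(rep(c("female", "male"), each = 3)),
  Thumb = c(56, 60, 61, 63, 64, 68)
)

fit <- lm(Thumb ~ Sex, data = TinyFingers)
b   <- coef(fit)  # named vector: "(Intercept)" and "Sexmale"

female_mean <- mean(TinyFingers$Thumb[TinyFingers$Sex == "female"])
male_mean   <- mean(TinyFingers$Thumb[TinyFingers$Sex == "male"])

# The intercept equals the mean of the reference group (female)
c(intercept = b[["(Intercept)"]], female_mean = female_mean)

# The intercept plus the Sexmale estimate equals the male mean
c(sum = b[["(Intercept)"]] + b[["Sexmale"]], male_mean = male_mean)
```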
If you want (and it’s a good idea), you can save the results of this model fit in an R object. Here’s the code to save the model fit in an object called Tiny_Sex_model:
Tiny_Sex_model <- lm(Thumb ~ Sex, data = TinyFingers)
Once you’ve saved the model, you can see its estimates by typing the name of the model; you will get the same output as above:
Tiny_Sex_model
Call:
lm(formula = Thumb ~ Sex, data = TinyFingers)
Coefficients:
(Intercept) Sexmale
59 6
Now that we have estimates for the two parameters, we can put them in our model statement to yield:
$$Y_i = 59 + 6X_i + e_i$$
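With the estimates 59 and 6 in place, the model acts as a prediction rule: 59 for a student coded 0 and 59 + 6 = 65 for a student coded 1. The sketch below applies that rule by hand and compares it with R’s predict(). The Thumb values are assumed stand-ins for the tiny data set, and the names X, predicted, and residual are chosen only for this example.

```r
# Hypothetical stand-in for the tiny data set (assumed values)
TinyFingers <- data.frame(
  Sex   = factor(rep(c("female", "male"), each = 3)),
  Thumb = c(56, 60, 61, 63, 64, 68)
)

# Dummy variable: 1 if the student is male, 0 otherwise
X <- as.numeric(TinyFingers$Sex == "male")

# Prediction and leftover error for each student
predicted <- 59 + 6 * X
residual  <- TinyFingers$Thumb - predicted
data.frame(TinyFingers, X, predicted, residual)

# The hand-built predictions agree with the fitted model's predictions
fit <- lm(Thumb ~ Sex, data = TinyFingers)
all.equal(unname(predict(fit)), predicted)
```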
How Does R Know Which Sex to Represent with the Dummy Variable?
The answer to this question is: R doesn’t know; it just takes whichever group comes first alphabetically (in this case, female) and makes it the reference group. The mean of the reference group is the first parameter estimate (labeled (Intercept) in the lm() output).
R then takes the second group (in this case, male) and represents it with a dummy variable (labeled Sexmale in the output). If the dummy variable is coded 1, then the student is male. If it is coded 0, then the student is not male.
Let’s say, just for fun, that you changed the code for female into woman in the data frame. Because male now comes first in the alphabet, male becomes the reference group, and its mean is now the estimate for the intercept.
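The effect of relabeling can be tried out directly. This is a sketch under assumptions: the Thumb values are stand-ins for the tiny data set, and Recoded and recoded_fit are names invented for the example. The factor is rebuilt from the new labels so that R re-sorts the levels alphabetically.

```r
# Hypothetical stand-in for the tiny data set (assumed values)
TinyFingers <- data.frame(
  Sex   = factor(rep(c("female", "male"), each = 3)),
  Thumb = c(56, 60, 61, 63, 64, 68)
)

# Relabel female as woman, rebuilding the factor from the new labels
Recoded <- data.frame(
  Sex   = factor(ifelse(TinyFingers$Sex == "female", "woman", "male")),
  Thumb = TinyFingers$Thumb
)
levels(Recoded$Sex)  # male is now the first level

# male is the reference group, so the intercept is the male mean
recoded_fit <- lm(Thumb ~ Sex, data = Recoded)
coef(recoded_fit)
```

With these assumed values the intercept becomes 65 and the second estimate, now labeled Sexwoman, becomes -6. The predictions for each group are unchanged; only the group that serves as the starting point has switched.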
As long as R knows that a variable is categorical (e.g., a factor), it doesn’t really care how you code it. You can code the categories under Sex with any characters you choose (e.g., woman or female), or with any numbers you choose (e.g., 1 and 2, 0 and 1, or 1 and 500). However you code it, R will take the category that comes first as the reference group, and the mean of that group becomes the intercept.
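The condition "as long as R knows that a variable is categorical" matters when the codes are numbers. The sketch below uses the codes 1 and 500 on assumed stand-in data; the column name SexCode and the object names are invented for this example. It fits the model once with the codes wrapped in factor() and once with the codes left as plain numbers.

```r
# Hypothetical stand-in data with numeric codes: 1 = female, 500 = male
Coded <- data.frame(
  SexCode = rep(c(1, 500), each = 3),
  Thumb   = c(56, 60, 61, 63, 64, 68)
)

# Treated as categorical: same estimates as with text labels
factor_fit <- lm(Thumb ~ factor(SexCode), data = Coded)
coef(factor_fit)

# Treated as a quantity: the slope is now "per one unit of SexCode"
numeric_fit <- lm(Thumb ~ SexCode, data = Coded)
coef(numeric_fit)
```

In the first fit the estimates are 59 and 6, just as before. In the second fit R spreads the same 6 mm difference over the 499 units between the two codes, so the slope is 6/499, which is why the variable needs to be a factor for the group interpretation to hold.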
Fitting the Sex Model to the Complete Data Set
Now that you have looked in detail at the tiny set of data, find the best estimates for our bigger set of data (found in the data frame called Fingers) by modifying the lm() code used above so that it reads data = Fingers. The output should look like this:
Call:
lm(formula = Thumb ~ Sex, data = Fingers)
Coefficients:
(Intercept) Sexmale
58.256 6.447
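These two estimates can be read the same way as in the tiny data set. The short sketch below only does the arithmetic on the numbers printed above; the names b0, b1, and X are chosen for this example, and it assumes that female is again the reference group, as the Sexmale label suggests.

```r
b0 <- 58.256  # (Intercept): estimate for the reference group
b1 <- 6.447   # Sexmale: increment added when the student is male

# Dummy coding: 0 for female, 1 for male
X <- c(female = 0, male = 1)

# Model prediction for each group, in the same units as Thumb
predicted <- b0 + b1 * X
predicted
```

The prediction for females is 58.256 and the prediction for males is 58.256 + 6.447 = 64.703.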