5.5 Fitting the Empty Model
This section might seem overly complicated at first, but it will provide the foundation for understanding subsequent concepts. The simple model we have started with—using the mean to model the distribution of a quantitative variable—is sometimes called the empty model or null model. In the context of students’ thumb lengths, we could write the empty model with a word equation like this:
Thumb = Mean + Error
Note that the model is “empty” because it doesn’t have any explanatory variables in it yet. The empty model does not explain any variation; it merely reveals the variation in the outcome variable (the Error) that could be explained. This empty model will serve as a sort of baseline model to which we can compare more complex models in the future.
If the mean is our model, then fitting the model to data simply means calculating the mean of the distribution, which we could do using favstats().
favstats(~ Thumb, data = Fingers)
min Q1 median Q3 max mean sd n missing
39 55 60 65 90 60.10366 8.726695 157 0
When we fit a model, we are finding the particular number that minimizes error; this is what we mean by the “best-fitting” model. The mean of a distribution is one such number because it balances the residuals: the deviations above the mean exactly cancel out the deviations below it. In later chapters, we will say more about why this value is the best-fitting one.
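To make the idea of balancing the residuals concrete, here is a minimal sketch (assuming the coursekata package and its Fingers data frame are available, as in the other code on this page; residuals_from_mean is just a name made up here). The positive and negative residuals cancel each other out, so they sum to essentially zero.
require(coursekata)
# residuals: each thumb length minus the mean thumb length
residuals_from_mean <- Fingers$Thumb - mean(Fingers$Thumb)
# the positive and negative residuals balance out, so their sum is
# (essentially) zero, apart from tiny rounding error
sum(residuals_from_mean)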
It’s easy to fit the empty model—it’s just the mean (60.1 in this case). But later you will learn to fit more complex models to your data. We are going to teach you a way to fit models in R that you can use now for fitting the empty model, but that will also work later for fitting more complex models.
The R function we are going to use is lm(), which stands for “linear model.” (We’ll say more about why it’s called that in a later chapter.) Here’s the code we use to fit the empty model, followed by the output.
lm(Thumb ~ NULL, data = Fingers)
Call:
lm(formula = Thumb ~ NULL, data = Fingers)
Coefficients:
(Intercept)
60.1
Although the output seems a little strange, with words like “Coefficients” and “Intercept,” the lm() function does give you back the mean of the distribution (60.1), as expected. lm() thus “fits” the empty model in that it finds the best-fitting number for our model. The word “NULL” is another word for “empty” (as in “empty model”).
It will be helpful to save the results of this model fit in an R object. Here’s code that uses lm() to fit the empty model, then saves the results in an R object called empty_model:
empty_model <- lm(Thumb ~ NULL, data = Fingers)
If you want to see the contents of the model, you can just type the name of the R object where you saved it (i.e., just type empty_model). Try it in the code block below.
require(coursekata)
# fit the empty model of Thumb and save it as empty_model
empty_model <- lm(Thumb ~ NULL, data = Fingers)
# print out the model estimates
empty_model
When we save the result of lm() as empty_model, we are saving a new type of R object called a model, which is different from data frames or vectors. When R saves a model it saves all sorts of information inside the object, which we won’t get into here. But some functions we will learn about specifically take models as their input.
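For a concrete sense of what it means for a function to take a model as input, here is a small sketch using two base R functions, coef() and predict(), neither of which the text has introduced yet:
require(coursekata)
empty_model <- lm(Thumb ~ NULL, data = Fingers)
# coef() pulls out the estimate(s) stored in the model; for the
# empty model this is just the mean of Thumb
coef(empty_model)
# predict() returns the model's prediction for each student; the
# empty model predicts the same value (the mean) for everyone
head(predict(empty_model))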
One such function is gf_model(), which lets us overlay a model (e.g., empty_model) onto various types of plots, including histograms, box plots, jitter plots, and scatter plots.
For example, here is how to chain the empty_model predictions (using the pipe operator, %>%) onto a histogram of thumb lengths.
gf_histogram(~ Thumb, data = Fingers) %>%
gf_model(empty_model)
As we now know, the empty model predicts the mean thumb length for everyone. This model prediction is represented by the single blue vertical line right at the average thumb length of 60.1 mm. Let’s try adding gf_model(empty_model) to some other plots below.
Try running the code block below and take a look at the resulting graphs. Then use the pipe operator (%>%) to add the empty model prediction to both the scatter plot of thumb lengths by height and the jitter plot of thumb lengths by gender. (The empty model has already been fit and saved as empty_model.)
require(coursekata)
# the best-fitting empty model has already been fit and saved
empty_model <- lm(Thumb ~ NULL, data = Fingers)
# add gf_model to this scatter plot
gf_point(Thumb ~ Height, data = Fingers) %>%
  gf_model(empty_model)
# add gf_model to this jitter plot
gf_jitter(Thumb ~ Gender, data = Fingers, width = .1) %>%
  gf_model(empty_model)
[Figure: scatter plot of Thumb ~ Height (left) and jitter plot of Thumb ~ Gender (right), each with the empty model overlaid as a horizontal line]
When we overlaid the empty model prediction on the histogram of Thumb, it was represented as a vertical line because Thumb was on the x-axis. With the scatter plot and jitter plot above, Thumb is on the y-axis. For this reason, the empty model prediction (i.e., the mean of Thumb) is represented as a horizontal line, right at the mean of Thumb.
The empty model prediction on the scatter plot (on the left, above), as represented by the horizontal line, indicates that regardless of how tall a student is, the empty model will assign them a predicted thumb length of 60.1 mm.
The Mean as an Estimate
If it seems like we’re making a big deal about having calculated the mean, you’re right! The mean is a concept that will come up throughout the course, and your understanding of it and its uses will deepen over time. One point worth making now, however, is that the goal of statistics is to understand the Data Generating Process (DGP). The mean of the data distribution is an estimate of the mean of the population that results from the DGP, which is why the numbers generated by lm() are called “estimates” (or sometimes, “coefficients”).
The mean may not be a very good estimate (after all, it is based on only a limited amount of data), but it is the best one we can come up with from the available data. The mean is also considered an unbiased estimate, meaning that it is just as likely to be too high as it is to be too low.
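To make the idea of an estimate a little more concrete, here is a small sketch. It uses base R’s sample() function, which is not part of the course toolkit yet, to grab a random subset of the thumb lengths; the mean of that smaller subset is also an estimate of the same DGP mean, just one based on less information.
require(coursekata)
# the mean of all 157 thumb lengths: our best estimate of the DGP mean
mean(Fingers$Thumb)
# the mean of a random subset of just 30 thumb lengths is also an
# estimate of the same DGP mean, but it is based on less data
mean(sample(Fingers$Thumb, 30))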