High School / Advanced Statistics and Data Science I (ABC)
7.4 Generating Predictions From the Model
Predicting Future Observations
Now that you have fit the Sex model, you can use your estimates to make predictions about future observations. Doing this requires you to use your model as a function. Think of a function like a machine: you put something in, you get something out. In this case, you will put in a value (e.g., “female”) for your explanatory variable (Sex), and get out a predicted thumb length.
We can think about how to use the Tiny_Sex_model as a function. Recall that our model, once fit, looked like this:
\[Y_{i}=59+6X_{i}+e_{i}\]
To turn this into a function, we remove the error term. If our goal is to model the variation, we want the error term there. But if our goal is to predict, we are going to ignore error and just do our best! We also change the \(Y_{i}\) to \(\hat{Y}_{i}\), which indicates a predicted score for person i. Our prediction function, then, looks like this:
\[\hat{Y}_{i}=59+6X_{i}\]
We leave out the error term because every person will have a different error. If we knew a person's error, we could predict their score exactly. But we are predicting a new observation, so we don't know it; all we can do is predict their score based on their sex.
This prediction function is straightforward to use. Here \(X_{i}\) is 0 for a female and 1 for a male. If the next student sampled is female, their predicted thumb length is 59 + 6(0), or 59. If they are male, the prediction is 59 + 6(1), or 65.
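The "model as a machine" idea can be sketched directly in R. This is only an illustration: predict_thumb is a hypothetical helper name (not part of coursekata), and the estimates 59 and 6 are typed in by hand from the fitted model above, assuming \(X_{i}\) is 0 for "female" and 1 for "male".

```r
# Hand-written prediction function for Y-hat = 59 + 6 * X.
# predict_thumb is a hypothetical name used only for this sketch.
predict_thumb <- function(sex) {
  # Dummy-code the explanatory variable: 1 for "male", 0 otherwise
  x <- ifelse(sex == "male", 1, 0)
  # No error term here: we are predicting, not modeling variation
  59 + 6 * x
}

predict_thumb("female")                       # 59
predict_thumb("male")                         # 65
predict_thumb(c("male", "female", "female"))  # 65 59 59
```

Put a value of Sex in, get a predicted thumb length out. Every female gets the same prediction, and every male gets the same prediction.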
Using R to Predict Future Observations
When the numbers are easy to add, you can do this in your head. For more complex models, though, it is easier to let R do the work.
We have a couple of new R functions that you can use to make it easier to generate a prediction: b0() and b1(). Run the three lines of R code in the window below and see if you can figure out what these new functions do.
require(coursekata)
TinyFingers <- data.frame(
  Sex = rep(c("female", "male"), each = 3),
  Thumb = c(56, 60, 61, 63, 64, 68)
)
# This creates the Tiny_Sex_model
Tiny_Sex_model <- lm(Thumb ~ Sex, data = TinyFingers)
# Run this code
Tiny_Sex_model
b0(Tiny_Sex_model)
b1(Tiny_Sex_model)
The b0() function takes a model as its input and returns the parameter estimate for the first parameter, which in this case is the mean Thumb of females. The function b1() returns the parameter estimate for the second parameter, the increment from the mean of females to the mean of males.
In the window below, see if you can write a single line of R code that will use both of these new functions (b0() and b1()) to return the predicted value for a new male’s thumb length.
require(coursekata)
TinyFingers <- data.frame(
  Sex = rep(c("female", "male"), each = 3),
  Thumb = c(56, 60, 61, 63, 64, 68)
)
# This creates the Tiny_Sex_model
Tiny_Sex_model <- lm(Thumb ~ Sex, data = TinyFingers)
# Write a line of R code that uses both b0() and b1() functions to return the predicted Thumb length of a male
b0(Tiny_Sex_model) + b1(Tiny_Sex_model)
[1] 65
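For readers working outside the course environment, the same prediction can be sketched with base R alone. This example swaps in coef() for b0() and b1() (an assumption for illustration, not the course's approach) and then uses predict() with a newdata argument. The name new_students is hypothetical.

```r
TinyFingers <- data.frame(
  Sex = rep(c("female", "male"), each = 3),
  Thumb = c(56, 60, 61, 63, 64, 68)
)
Tiny_Sex_model <- lm(Thumb ~ Sex, data = TinyFingers)

# coef() returns the named estimates: "(Intercept)" and "Sexmale"
estimates <- coef(Tiny_Sex_model)

# Predicted thumb length for a new male: intercept plus increment
male_prediction <- unname(estimates["(Intercept)"] + estimates["Sexmale"])
male_prediction

# predict() with newdata generates predictions for new observations
# (new_students is a made-up data frame for this sketch)
new_students <- data.frame(Sex = c("female", "male"))
predict(Tiny_Sex_model, newdata = new_students)
```

The first result should match b0(Tiny_Sex_model) + b1(Tiny_Sex_model), and the predict() call returns one prediction per row of new_students.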
Generating “Predicted” Values for the Sample Data
As we did in Chapter 5, we will also want to generate model predictions for our sample data. It may seem odd to predict values when we already know the actual values, but it is very useful to do so, because we can then calculate residuals from the model predictions.
To get predicted values from the Tiny_Sex_model, we use the predict() function:
predict(Tiny_Sex_model)
1 2 3 4 5 6
59 59 59 65 65 65
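To see why these predictions are useful, here is a small sketch of how residuals could be computed from them. It is a preview under simple assumptions: a residual is taken to be the observed Thumb minus the predicted Thumb, and the variable names predicted and residual are chosen just for this example.

```r
TinyFingers <- data.frame(
  Sex = rep(c("female", "male"), each = 3),
  Thumb = c(56, 60, 61, 63, 64, 68)
)
Tiny_Sex_model <- lm(Thumb ~ Sex, data = TinyFingers)

# Model prediction for each person in the sample
predicted <- predict(Tiny_Sex_model)

# Residual: what is left over after subtracting the prediction
residual <- TinyFingers$Thumb - predicted
residual

# Base R's resid() should give the same values
all.equal(residual, resid(Tiny_Sex_model))
```

For example, the first female has a thumb length of 56 and a prediction of 59, so her residual is -3.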
Let’s say you want to save these predicted values for each person as a variable called Sex_predicted (in the TinyFingers data frame). See if you can complete the R code to do this.
require(coursekata)
TinyFingers <- data.frame(
  Sex = rep(c("female", "male"), each = 3),
  Thumb = c(56, 60, 61, 63, 64, 68)
)
Tiny_Sex_model <- lm(Thumb ~ Sex, data = TinyFingers)
# save the model predictions as a new variable in TinyFingers
TinyFingers$Sex_predicted <- predict(Tiny_Sex_model)
# this prints the TinyFingers data frame
TinyFingers
Sex Thumb Sex_predicted
1 female 56 59
2 female 60 59
3 female 61 59
4 male 63 65
5 male 64 65
6 male 68 65
Notice that our predictions are a single number for each person: 59 for each female and 65 for each male. Each person gets a single predicted thumb length; we never predict both of these values for a single person. But different people will get different predicted outcomes based on their sex.
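Because the model's estimates are the female mean and the increment to the male mean, each person's prediction should simply be the mean Thumb of their own group. The sketch below checks this with base R's ave(), which returns each row's group mean. The name group_means is an assumption made for this example.

```r
TinyFingers <- data.frame(
  Sex = rep(c("female", "male"), each = 3),
  Thumb = c(56, 60, 61, 63, 64, 68)
)
Tiny_Sex_model <- lm(Thumb ~ Sex, data = TinyFingers)
TinyFingers$Sex_predicted <- predict(Tiny_Sex_model)

# ave() computes the mean Thumb within each Sex, one value per row
group_means <- ave(TinyFingers$Thumb, TinyFingers$Sex)
group_means

# Each prediction should equal that person's group mean
all.equal(unname(TinyFingers$Sex_predicted), group_means)
```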
Try the function predict() on the full data set. Recall that you fit the model to the full data set, Fingers. You saved the model as Sex_model. Now see if you can generate predictions from the model and save the predictions as a variable in the Fingers data frame.
require(coursekata)
# here is the model we fit before
Sex_model <- lm(Thumb ~ Sex, data = Fingers)
# generate a prediction from Sex_model for every row of Fingers
Fingers$Sex_predicted <- predict(Sex_model)
# this will print out 10 lines of Fingers
head(select(Fingers, Sex, Thumb, Sex_predicted), 10)
Sex Thumb Sex_predicted
1 male 66.00 64.70267
2 female 64.00 58.25585
3 female 56.00 58.25585
4 male 58.42 64.70267
5 female 74.00 58.25585
6 female 60.00 58.25585
7 male 70.00 64.70267
8 female 55.00 58.25585
9 female 60.00 58.25585
10 female 52.00 58.25585
We’ve learned how to specify and fit models. We then used those models (as functions) to make predictions for future observations, and also to generate predictions for each person in our sample data. Next, we examine the residuals from our model: the variation left over after we subtract out the model’s predictions.