
Statistics and Data Science: A Modeling Approach

7.3 Generating Predictions From the Model

Predicting Future Observations

Now that you have fit the Sex model, you can use your estimates to make predictions about future observations. Doing this requires you to use your model as a function. Think of a function like a machine: you put something in, you get something out. In this case, you will put in a value (e.g., “female”) for your explanatory variable (Sex), and get out a predicted thumb length.

We can think about how to use the TinySex.model as a function. Recall that our model, once fit, looked like this:

\[Y_{i}=59+6X_{i}+e_{i}\]

To turn this into a function, we remove the error term. If our goal is to model the variation, we want the error term there. But if our goal is to predict, we are going to ignore error and just do our best! We also change the \(Y_{i}\) to \(\hat{Y}_{i}\), which indicates a predicted score for person i. Our prediction function, then, looks like this:

\[\hat{Y}_{i}=59+6X_{i}\]

We leave out the error term because every person will have a different error term. If we knew their error, we could predict their score exactly. But since we don’t—because remember, we are predicting a new observation—all we can do is predict their score based on their sex.

This prediction function is straightforward to use. If we want to predict what the next observed thumb length will be, we can see that if the next student sampled is female, their predicted thumb length is 59. If they are male, the prediction is (59 + 6), or 65.
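To make the "model as a function" idea concrete, here is a minimal sketch that writes the prediction function out by hand in R. The name predict_thumb is hypothetical (it is not part of any package), and the sketch assumes the dummy coding implied by the predictions above: X is 0 for female and 1 for male.

```r
# Hypothetical helper: the fitted prediction function, Y-hat = 59 + 6 * X
predict_thumb <- function(sex) {
  # Assumed dummy coding: female = 0, male = 1
  x <- ifelse(sex == "male", 1, 0)
  59 + 6 * x
}

predict_thumb("female")            # prediction for a new female: 59
predict_thumb(c("female", "male")) # predictions for one of each: 59 65
```

Put in a value of Sex, get out a predicted thumb length. No error term appears anywhere in the function.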

Using R to Predict Future Observations

If the numbers are easy to add or subtract, it’s not that hard to do this in your head. But of course, we won’t always want to do it in our heads for more complex models.

We have a couple of new R functions that you can use to make it easier to generate a prediction: b0() and b1(). Run the three lines of R code in the window below and see if you can figure out what these new functions do.

require(tidyverse)
require(mosaic)
require(supernova)

TinyFingers <- data.frame(
  Sex = rep(c("female", "male"), each = 3),
  Thumb = c(56, 60, 61, 63, 64, 68)
)

# This creates the TinySex.model
TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers)

# Run this code
TinySex.model
b0(TinySex.model)
b1(TinySex.model)

The b0() function takes a model as its input and returns the parameter estimate for the first parameter, which in this case is the mean Thumb of females. The function b1() returns the parameter estimate for the second parameter, the increment from the mean of females to the mean of males.
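If b0() and b1() are not available in your R session, the same two estimates can be read from the fitted model with base R's coef(). The sketch below is an alternative route to the same numbers, not a description of how b0() and b1() are implemented. The coefficient name "Sexmale" is what lm() generates here, assuming "female" is the reference group (the first level alphabetically).

```r
# Rebuild the tiny data set and model so this block runs on its own
TinyFingers <- data.frame(
  Sex = rep(c("female", "male"), each = 3),
  Thumb = c(56, 60, 61, 63, 64, 68)
)
TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers)

# coef() returns all parameter estimates as a named vector
estimates <- coef(TinySex.model)

estimates[["(Intercept)"]]  # first parameter: mean Thumb of females
estimates[["Sexmale"]]      # second parameter: increment from females to males
```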

In the window below, see if you can write a single line of R code that will use both of these new functions (b0() and b1()) to return the predicted value for a new male’s thumb length.

require(tidyverse)
require(mosaic)
require(supernova)

TinyFingers <- data.frame(
  Sex = rep(c("female", "male"), each = 3),
  Thumb = c(56, 60, 61, 63, 64, 68)
)

# This creates the TinySex.model
TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers)

# Write a line of R code that uses both b0() and b1() functions
# to return the predicted Thumb length of a male
b0(TinySex.model) + b1(TinySex.model)

[1] 65
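Adding the estimates by hand works for a model this small. As a sketch of another way to get the same prediction, base R's predict() accepts a newdata argument holding the explanatory values of the new observations. The data frame name new_students is hypothetical, chosen only for this illustration.

```r
# Rebuild the tiny data set and model so this block runs on its own
TinyFingers <- data.frame(
  Sex = rep(c("female", "male"), each = 3),
  Thumb = c(56, 60, 61, 63, 64, 68)
)
TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers)

# Hypothetical new observations: we know their Sex but not their Thumb
new_students <- data.frame(Sex = c("female", "male"))

# predict() applies the fitted prediction function to each new row
new_predictions <- predict(TinySex.model, newdata = new_students)
new_predictions
```

The column name in newdata must match the explanatory variable used when the model was fit (here, Sex).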

Generating “Predicted” Values for the Sample Data

As we did in Chapter 5, we also will want to generate model predictions for our sample data. It seems odd to predict values when we already know the actual values. But it’s actually very useful to do so, because then we can calculate residuals from the model predictions.

To get predicted values from the TinySex.model, we use the predict() function:

predict(TinySex.model)
 1  2  3  4  5  6
59 59 59 65 65 65
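The reason for generating these predictions is that residuals follow directly from them. The short sketch below, which assumes a residual is the observed value minus the predicted value, computes them by subtraction and compares the result with base R's resid(). The variable name manual_residuals is hypothetical.

```r
# Rebuild the tiny data set and model so this block runs on its own
TinyFingers <- data.frame(
  Sex = rep(c("female", "male"), each = 3),
  Thumb = c(56, 60, 61, 63, 64, 68)
)
TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers)

# Residual = observed thumb length minus the model's prediction
manual_residuals <- TinyFingers$Thumb - predict(TinySex.model)
manual_residuals

# Check that this matches what R reports for the fitted model
all.equal(unname(manual_residuals), unname(resid(TinySex.model)))
```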

Let’s say you want to save these predicted values for each person as a variable called Sex.predicted (in the TinyFingers data frame). See if you can complete the R code to do this.

require(tidyverse)
require(mosaic)
require(Lock5Data)
require(supernova)

TinyFingers <- data.frame(
  Sex = rep(c("female", "male"), each = 3),
  Thumb = c(56, 60, 61, 63, 64, 68)
)
TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers)

# save the model's predictions as a new variable
TinyFingers$Sex.predicted <- predict(TinySex.model)

# this prints the TinyFingers data frame
TinyFingers

     Sex Thumb Sex.predicted
1 female    56            59
2 female    60            59
3 female    61            59
4   male    63            65
5   male    64            65
6   male    68            65

Notice that our predictions are a single number for each person: 59 for each female and 65 for each male. Each person gets a single predicted thumb length; we never predict both of these values for a single person. But different people will get different predicted outcomes based on their sex.
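One way to see why each group shares a single prediction is to compare the predictions with the group means. The following sketch uses base R's tapply() for the comparison. The variable name group_means is hypothetical.

```r
# Rebuild the tiny data set and model so this block runs on its own
TinyFingers <- data.frame(
  Sex = rep(c("female", "male"), each = 3),
  Thumb = c(56, 60, 61, 63, 64, 68)
)
TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers)
TinyFingers$Sex.predicted <- predict(TinySex.model)

# Mean thumb length within each sex
group_means <- tapply(TinyFingers$Thumb, TinyFingers$Sex, mean)
group_means

# The distinct predictions the model makes, one per group
unique(TinyFingers$Sex.predicted)
```

For a model with one two-group explanatory variable, the prediction for each person is their group's mean.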

Try the function predict() on the full data set. Recall that you fit the model to the full data set, Fingers. You saved the model as Sex.model. Now see if you can generate predictions from the model and save the predictions as a variable in the Fingers data frame.

require(tidyverse)
require(mosaic)
require(Lock5Data)
require(supernova)

# here is the model we fit before
Sex.model <- lm(Thumb ~ Sex, data = Fingers)

# generate all possible predictions from Sex.model
Fingers$Sex.predicted <- predict(Sex.model)

# this will print out 10 lines of Fingers
head(select(Fingers, Sex, Thumb, Sex.predicted), 10)

      Sex Thumb Sex.predicted
1    male 66.00      64.70267
2  female 64.00      58.25585
3  female 56.00      58.25585
4    male 58.42      64.70267
5  female 74.00      58.25585
6  female 60.00      58.25585
7    male 70.00      64.70267
8  female 55.00      58.25585
9  female 60.00      58.25585
10 female 52.00      58.25585

We’ve learned how to specify and fit models. We then took those models and used them (as functions) to make predictions for future observations, and also to generate predictions for each person in our sample data. We turn next to examine the residuals from our model—the variation left over after we subtract out our model.