# Statistics and Data Science: A Modeling Approach

## 8.3 Using the Regression Model to Make Predictions

The specific regression line, defined by its slope and intercept, is the one that fits our data *best*. By this we mean that this model reduces leftover error to the smallest level possible given our variables. Specifically, the sum of squared deviations around this line is lower than for any other line we could have used instead.

As with the empty and group models, error around the regression line is balanced. You can almost imagine each data point pulling on the regression line, with the best-fitting line balancing the “pulls” above and below it.
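Both properties can be checked numerically. The sketch below uses a small made-up data frame (`toy`, not the **Fingers** data) and base R's `coef()` and `resid()`; the helper `sse()` is a name introduced here for illustration only. It compares the sum of squared errors for the fitted line against two nearby lines, then adds up the residuals.

```r
# Hypothetical toy data, NOT the Fingers data set
toy <- data.frame(
  Height = c(62, 64, 65, 67, 68, 70, 71, 73),
  Thumb  = c(55, 58, 57, 62, 61, 66, 63, 69)
)

# Fit the regression line and pull out its intercept and slope
toy.model <- lm(Thumb ~ Height, data = toy)
best <- unname(coef(toy.model))  # best[1] = intercept, best[2] = slope

# Sum of squared errors around any candidate line (intercept a, slope b)
sse <- function(a, b) {
  sum((toy$Thumb - (a + b * toy$Height))^2)
}

sse_best    <- sse(best[1], best[2])         # the fitted line
sse_steeper <- sse(best[1], best[2] + 0.05)  # same intercept, steeper slope
sse_higher  <- sse(best[1] + 1, best[2])     # same slope, higher intercept

# Balance: errors above and below the fitted line cancel out
sum_resid <- sum(resid(toy.model))

c(sse_best, sse_steeper, sse_higher)
sum_resid
```

If the fitted line really is the best-fitting one, `sse_best` should come out smaller than both alternatives, and `sum_resid` should be zero apart from rounding error.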

This regression model is also our **best estimate** of the relationship between height and thumb length in the population. As with other models of the population, we can use the regression model to predict future observations. To do so, we must turn it into a function, one that will predict thumb length based on height.

Here is the fitted model for using **Height** to predict **Thumb** based on the complete **Fingers** data set:

\[Thumb_{i}= -3.33+.96*Height_{i}+e_{i}\]

Remember, a function takes in some input and spits out a prediction based on a model. Here is the function we can use to predict a thumb length based on a person’s height:

predicted Thumb \(=-3.33+.96*Height_{i}\)

We can write this more generally by replacing the variable **Thumb** with \(Y_i\) and the variable **Height** with \(X_i\). And since \(Y_i\) actually represents the collected data for **Thumb**, we put a little hat over the Y, like this \(\hat{Y}_i\), to indicate that these are the *predicted* thumb lengths.

\[\hat{Y}_{i}=-3.33+.96*X_{i}\]
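To make the idea of a function concrete, here is a minimal sketch in R. It hard-codes the rounded estimates from the equation above; `predict_thumb()`, `b0_est`, and `b1_est` are names made up for this illustration, not part of any package.

```r
# Rounded parameter estimates copied from the fitted equation above
b0_est <- -3.33
b1_est <- 0.96

# Input: a height; output: the predicted thumb length
predict_thumb <- function(height) {
  b0_est + b1_est * height
}

predict_thumb(60)         # one prediction
predict_thumb(c(60, 70))  # R arithmetic is vectorized, so several heights work too
```

For a height of 60, this works out to -3.33 + .96 * 60 = 54.27. Because the estimates here are rounded, the result may differ slightly from what the fitted model itself would return.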

With the two-group model it was easy to make predictions from the model: no calculation was required to see that if the person was short, the prediction would be the mean for short people; and if the person was tall, the prediction would be the mean for tall people. But with the regression model it’s harder to do the calculation in your head.

Remember the `b0()` and `b1()` functions we used on page 7.3? We can use them to pull out the parameter estimates from our best fitting model. For example, this code will return the \(b_0\) from our model.

`b0(Height.model)`

If we wanted to generate a prediction from the **Height.model** for someone who was 60 inches tall, we could write:

`b0(Height.model) + b1(Height.model)*60`

Using the window below, and the `b0()` and `b1()` functions, see if you can write a line of R code that would return the predicted thumb length for someone who is 73.5 inches tall based on the parameter estimates of the **Height.model**.

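```
# load the packages used in this course
require(tidyverse)
require(mosaic)
require(Lock5Data)
require(supernova)
# keep only the rows with Thumb between 33 and 100 (inclusive)
Fingers <- filter(Fingers, Thumb >= 33 & Thumb <= 100)
```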

```
# this creates the best fitting Height.model
Height.model <- lm(Thumb ~ Height, data = Fingers)
# What would the Height.model predict for the Thumb length
# of someone who is 73.5 inches tall?
b0(Height.model) + b1(Height.model) * 73.5
```

`[1] 67.36895`
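As a rough check, we can plug 73.5 into the rounded equation by hand:

\[\hat{Y}_{i}=-3.33+.96*73.5=-3.33+70.56=67.23\]

This is close to, but not exactly, the 67.36895 that R returned. The small gap is presumably because `b0()` and `b1()` return the estimates with more decimal places than the rounded -3.33 and .96 shown in the equation.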

This code works fine for making individual predictions, but to check our model against the data, we would want to generate predictions for each student in the **Fingers** data frame. As we’ve said before, we really don’t need predictions when we already know their actual thumb lengths. But this is a way to see how well (or how poorly) the model would have predicted the thumb lengths for the students in our data set.

We will use the `predict()` function, which you have used before, to make a new variable with the predictions based on **Height.model**. We’ll save those predictions as **Height.pred**.

`Fingers$Height.pred <- predict(Height.model)`

Then we’ll print out the first 10 rows of the data frame, showing only the variables **Thumb**, **Height**, and the predicted thumb length from the **Height.model** (**Height.pred**).

`head(select(Fingers, Thumb, Height, Height.pred), 10)`

```
   Thumb Height Height.pred
1  66.00   70.5    64.48330
2  64.00   64.8    59.00056
3  56.00   64.0    58.23105
4  58.42   70.0    64.00235
5  74.00   68.0    62.07859
6  60.00   68.0    62.07859
7  70.00   69.0    63.04047
8  55.00   65.7    59.86625
9  60.00   62.5    56.78823
10 52.00   63.4    57.65392
```
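One way to read this table is to subtract each prediction from the actual thumb length, which gives the leftover error for that student. The sketch below types in the 10 rows printed above, so it runs without the **Fingers** data; the data frame name `first10` and the column name `Resid` are made up for this illustration.

```r
# The 10 rows printed above, entered by hand
first10 <- data.frame(
  Thumb       = c(66.00, 64.00, 56.00, 58.42, 74.00,
                  60.00, 70.00, 55.00, 60.00, 52.00),
  Height      = c(70.5, 64.8, 64.0, 70.0, 68.0,
                  68.0, 69.0, 65.7, 62.5, 63.4),
  Height.pred = c(64.48330, 59.00056, 58.23105, 64.00235, 62.07859,
                  62.07859, 63.04047, 59.86625, 56.78823, 57.65392)
)

# Error (residual) = actual thumb length minus predicted thumb length
first10$Resid <- first10$Thumb - first10$Height.pred

first10
```

A positive value means the student's thumb is longer than the model predicted; a negative value means it is shorter. For the first student, for example, the error is 66.00 - 64.48330 = 1.5167.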

We’ve added the code to calculate **Height.pred** in the DataCamp window below. Add code that uses `gf_point()` to create a scatter plot of **Height.pred** (y-axis) by **Height** (x-axis).

```
# this creates predicted thumb lengths from Height.model
Fingers$Height.pred <- predict(Height.model)
# write code to create a scatter plot of Height.pred by Height
gf_point(Height.pred ~ Height, data = Fingers)
```
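Since this plot shows only the model's predictions, it helps to be clear about what `predict()` computes. The sketch below is a hedged illustration using a small made-up data frame (`toy`, not the **Fingers** data) and base R's `coef()`; the names `est`, `by_hand`, and `new_heights` are introduced here for illustration.

```r
# Hypothetical toy data, NOT the Fingers data set
toy <- data.frame(
  Height = c(62, 64, 65, 67, 68, 70, 71, 73),
  Thumb  = c(55, 58, 57, 62, 61, 66, 63, 69)
)
toy.model <- lm(Thumb ~ Height, data = toy)
est <- unname(coef(toy.model))  # est[1] = intercept, est[2] = slope

# With no new data, predict() returns one prediction per row used in the fit
toy$Height.pred <- predict(toy.model)

# Each prediction is just intercept + slope * Height for that row
by_hand <- est[1] + est[2] * toy$Height
all.equal(unname(toy$Height.pred), by_hand)

# predict() also accepts new heights through its newdata argument
new_heights <- data.frame(Height = c(60, 73.5))
new_pred <- predict(toy.model, newdata = new_heights)
new_pred
```

Because every prediction is the intercept plus the slope times **Height**, plotting the predictions against **Height** should place all of the points on a single straight line.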
