2.9 Quantifying Total Error Around a Model
In this last part of this chapter, we will dig deeper into the ERROR part of our DATA = MODEL + ERROR framework.
The goal of the statistical enterprise is to explain variation. Once we have created a statistical model of our data, we can define what it means to explain variation in a more specific way: as reducing error around the model. When we add an explanatory variable to the model, it will reduce error. But to know how much error it has reduced, we need to know how much error we had to start with.
We have already learned how to calculate a residual, which is the error for an individual data point. Now we will consider how to aggregate all these individual errors together to find out how much total error there is around the empty model.
To make this concrete, let’s consider the empty model for home prices in Ames. Recall that we saved our model in the object empty_model:
empty_model
Call:
lm(formula = PriceK ~ NULL, data = Ames)
Coefficients:
(Intercept)
181.4
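Notice that the only coefficient in the empty model is the intercept, 181.4. As a quick sanity check (assuming the Ames data frame is loaded, as it is in the code windows below), you can verify that this intercept is just the mean of the outcome variable:
# the empty model's single parameter is the mean of PriceK
mean(Ames$PriceK)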
Saving Predictions and Residuals
We will start by calculating the individual errors (residuals) between our model’s predictions and the actual home prices for each house in the data set. To get the predictions from our empty model, we can use the predict() function, putting empty_model as the input inside the parentheses. Give it a try in the code window below.
require(coursekata)
empty_model <- lm(PriceK ~ NULL, data = Ames)

# generate predictions from this model
predict(empty_model)
Whoa – that’s a lot of 181.4s! What you see here is the prediction the empty model made for each of the 185 homes in our dataset. Our simple model gave the same prediction for every home: the mean price of about $181,428.
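If you want to confirm that the model really made an identical prediction for every home, one quick check (using the same empty_model object as above) is to ask R for the unique values among the predictions:
# there is only one distinct prediction: the mean of PriceK
unique(predict(empty_model))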
Usually, these predictions are “off” from the actual sale price of the home. How “off” are they? To calculate all the residuals (error) from these predictions, we can use the resid() function: resid(empty_model). Try it in the code block below.
require(coursekata)
empty_model <- lm(PriceK ~ NULL, data = Ames)

# generate residuals from this model's predictions
resid(empty_model)
It’s kind of hard to look at the residuals like this. To see the DATA = MODEL + ERROR (or PriceK = Mean + Residual) relationship more clearly, run the code below, which will save the predictions and the residuals back into the Ames data frame as new variables. Then modify the select() function so you can see the actual home prices, the predicted prices from the empty model, and the residuals from the model for the first six rows of the data set.
require(coursekata)
empty_model <- lm(PriceK ~ NULL, data = Ames)

# save the predictions and residuals from the empty model
Ames$empty_predict <- predict(empty_model)
Ames$empty_resid <- resid(empty_model)

# show the first 6 rows of the actual prices, predictions, and residuals
head(select(Ames, PriceK, empty_predict, empty_resid))
PriceK empty_predict empty_resid
1 260 181.4281 78.57191
2 210 181.4281 28.57191
3 155 181.4281 -26.42809
4 125 181.4281 -56.42809
5 110 181.4281 -71.42809
6 100 181.4281 -81.42809
Notice that on each row, DATA = MODEL + ERROR: the home price (PriceK) is the sum of the model prediction and the home’s residual (or error) from that prediction. For example, if you look at the first house, the sale price (260) is equal to the prediction plus the residual (181.43 + 78.57).
Notice, also, that the residuals for the first two homes (78.57 and 28.57) are positive. This is because the actual prices of these two houses were higher than the model prediction (181.43).
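Because we saved the predictions and residuals into Ames, you can check this relationship directly: adding the prediction column to the residual column should reproduce the actual prices. Here is a quick check you can run:
# MODEL + ERROR should give back DATA (the actual prices)
head(Ames$empty_predict + Ames$empty_resid)
head(Ames$PriceK)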
The residuals for the first six homes in the data set can also be depicted as vertical lines from the empty model prediction.
Total Error: Sum of Squared Residuals (SS)
We have now saved a residual for each home in the Ames data frame. How might we put these residuals together to get a measure of total error around the empty model?
One approach might be to just add all the residuals together. The problem with this approach, which we explained earlier, is that if you add together the residuals around the mean, the total will be 0 because the residuals are perfectly balanced around the mean.
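You can see this for yourself by summing the residuals you saved earlier; the result is 0 (or a number vanishingly close to 0, due to rounding):
# the residuals around the mean cancel each other out
sum(Ames$empty_resid)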
One of the most common measures of total error around a model in statistics, and the one we will use in this book, is the sum of the squared residuals, or simply, Sum of Squares.
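In symbols, if we write $Y_i$ for each home’s actual price and $\hat{Y}_i$ for the model’s prediction (which, for the empty model, is always the mean $\bar{Y}$), the sum of squares is just the squared residuals added up across all $n$ homes:

$$SS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$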
To calculate the sum of squares in R, let’s start by adding a new column to Ames that contains the squared residual for each home. In R, we represent exponents with the caret symbol (^, usually above the 6 on a standard keyboard). So we can use this code to create the new column:
Ames$empty_resid_sqrd <- Ames$empty_resid^2
Run the code in the window below, adding some code to get the sum of the squared residuals (the Sum of Squares).
require(coursekata)
# don't delete this part
empty_model <- lm(PriceK ~ NULL, data = Ames)
Ames$empty_resid <- resid(empty_model)

# this creates the squared residuals
Ames$empty_resid_sqrd <- Ames$empty_resid^2

# write code to sum these squared residuals
sum(Ames$empty_resid_sqrd)
You should have gotten a number like this:
633717.215434616
This is the total sum of squares for the empty model of PriceK. We call it sum of squares because we literally turned all those residual lines in the figure above into squares. Here we show the same 6 data points as above, this time with their residuals squared.
The sum of squares gives us a quantitative indicator of how much total error there is in the outcome variable, PriceK. When we add explanatory variables to our model in the next chapter, we will reduce the total sum of squares. The amount by which we reduce it will tell us how good our new model is.
Is the SS better than other measures of total error? We’ll explore why statisticians use sum of squares in the next section.