2.9 Quantifying Total Error Around a Model
In this last part of this chapter, we will dig deeper into the ERROR part of our DATA = MODEL + ERROR framework.
The goal of the statistical enterprise is to explain variation. Once we have created a statistical model of our data, we can define what it means to explain variation in a more specific way: as reducing error around the model. When we add an explanatory variable to the model, it will reduce error. But to know how much error it has reduced, we need to know how much error we had to start with.
We have already learned how to calculate a residual, which is the error for an individual data point. Now we will consider how to aggregate all these individual errors together to find out how much total error there is around the empty model.
To make this concrete, let’s consider the empty model for home prices in Ames. Recall that we saved our model in the object empty_model:
empty_model
Call:
lm(formula = PriceK ~ NULL, data = Ames)
Coefficients:
(Intercept)
181.4
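Notice that the only coefficient in the empty model is the intercept, 181.4. As a quick sanity check (assuming the Ames data frame is loaded, as it is in the code windows below), you can verify that this intercept is just the mean of the outcome variable:
# the empty model's single parameter is the mean of PriceK
mean(Ames$PriceK)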
Saving Predictions and Residuals
We will start by calculating the individual errors (residuals) between our model’s predictions and the actual home prices for each house in the data set. To get the predictions from our empty model, we can use the predict() function, putting empty_model as the input inside the parentheses. Give it a try in the code window below.
require(coursekata)
empty_model <- lm(PriceK ~ NULL, data = Ames)

# generate predictions from this model
predict(empty_model)
Whoa – that’s a lot of 181.4s! What you see here is the prediction the empty model made for each of the 185 homes in our dataset. Our simple model gave the same prediction for every home: the mean price of about $181,428.
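If you want to confirm that the model really made an identical prediction for every home, one quick check (using the same empty_model object as above) is to ask R for the unique values among the predictions:
# there is only one distinct prediction: the mean of PriceK
unique(predict(empty_model))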
Usually, these predictions are “off” from the actual sale price of the home. How “off” are they? To calculate all the residuals (error) from these predictions, we can use the resid() function: resid(empty_model). Try it in the code block below.
require(coursekata)
empty_model <- lm(PriceK ~ NULL, data = Ames)

# generate residuals from this model's predictions
resid(empty_model)
It’s kind of hard to look at the residuals like this. To see the DATA = MODEL + ERROR (or PriceK = Mean + Residual) relationship more clearly, run the code below, which will save the predictions and the residuals back into the Ames data frame as new variables. Then modify the select() function so you can see the actual home prices, the predicted prices from the empty model, and the residuals from the model for the first six rows of the data set.
require(coursekata)
empty_model <- lm(PriceK ~ NULL, data = Ames)

# save the predictions and residuals from the empty model
Ames$empty_predict <- predict(empty_model)
Ames$empty_resid <- resid(empty_model)

# show the first 6 rows of the actual prices, predictions, and residuals
head(select(Ames, PriceK, empty_predict, empty_resid))
PriceK empty_predict empty_resid
1 260 181.4281 78.57191
2 210 181.4281 28.57191
3 155 181.4281 -26.42809
4 125 181.4281 -56.42809
5 110 181.4281 -71.42809
6 100 181.4281 -81.42809
Notice that on each row, DATA = MODEL + ERROR: the home price (PriceK) is the sum of the model prediction and the home’s residual (or error) from that prediction. For example, if you look at the first house, the sale price (260) is equal to the prediction plus the residual (181.43 + 78.57).
Notice, also, that the residuals for the first two homes (78.57 and 28.57) are positive. This is because the actual prices of these two houses were higher than the model prediction (181.43).
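Because we saved the predictions and residuals into Ames, you can check this relationship directly: adding the prediction column to the residual column should reproduce the actual prices. Here is a quick check you can run:
# MODEL + ERROR should give back DATA (the actual prices)
head(Ames$empty_predict + Ames$empty_resid)
head(Ames$PriceK)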
The residuals for the first six homes in the data set can also be depicted as vertical lines from the empty model prediction.
Total Error: Sum of Squared Residuals (SS)
We have now saved a residual for each home in the Ames data frame. How might we put these residuals together to get a measure of total error around the empty model?
One approach might be to just add all the residuals together. The problem with this approach, which we explained earlier, is that if you add together the residuals around the mean, the total will be 0 because the residuals are perfectly balanced around the mean.
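You can see this for yourself by summing the residuals you saved earlier; the result is 0 (or a number vanishingly close to 0, due to rounding):
# the residuals around the mean cancel each other out
sum(Ames$empty_resid)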
One of the most common measures of total error around a model in statistics, and the one we will use in this book, is the sum of the squared residuals, or simply, Sum of Squares.
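In symbols, if we write $Y_i$ for each home’s actual price and $\hat{Y}_i$ for the model’s prediction (which, for the empty model, is always the mean $\bar{Y}$), the sum of squares is just the squared residuals added up across all $n$ homes:

$$SS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$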
To calculate the sum of squares in R, let’s start by adding a new column to Ames that contains the squared residual for each home. In R, we represent exponents with the caret symbol (^, usually above the 6 on a standard keyboard). So we can use this code to create the new column:
Ames$empty_resid_sqrd <- Ames$empty_resid^2
Run the code in the window below, adding some code to get the sum of the squared residuals (the Sum of Squares).
require(coursekata)
# don't delete this part
empty_model <- lm(PriceK ~ NULL, data = Ames)
Ames$empty_resid <- resid(empty_model)

# this creates the squared residuals
Ames$empty_resid_sqrd <- Ames$empty_resid^2

# write code to sum these squared residuals
sum(Ames$empty_resid_sqrd)
You should have gotten a number like this:
633717.215434616
This is the total sum of squares for the empty model of PriceK. We call it sum of squares because we literally turned all those residual lines in the figure above into squares. Here we show the same 6 data points as above, this time with their residuals squared.
The sum of squares gives us a quantitative indicator of how much total error there is in the outcome variable, PriceK. When we add explanatory variables to our model in the next chapter, we will reduce the total sum of squares. The amount by which we reduce it will tell us how good our new model is.
Is the SS better than other measures of total error? We’ll explore why statisticians use sum of squares in the next section.