## Course Outline

• segmentGetting Started (Don't Skip This Part)
• segmentStatistics and Data Science: A Modeling Approach
• segmentPART I: EXPLORING VARIATION
• segmentChapter 1 - Welcome to Statistics: A Modeling Approach
• segmentChapter 2 - Understanding Data
• segmentChapter 3 - Examining Distributions
• segmentChapter 4 - Explaining Variation
• segmentPART II: MODELING VARIATION
• segmentChapter 5 - A Simple Model
• segmentChapter 6 - Quantifying Error
• segmentChapter 7 - Adding an Explanatory Variable to the Model
• segmentChapter 8 - Digging Deeper into Group Models
• segmentChapter 9 - Models with a Quantitative Explanatory Variable
• segmentPART III: EVALUATING MODELS
• segmentChapter 10 - The Logic of Inference
• segmentChapter 11 - Model Comparison with F
• segmentChapter 12 - Parameter Estimation and Confidence Intervals
• segmentChapter 13 - What You Have Learned
• segmentFinishing Up (Don't Skip This Part!)
• segmentResources

### list High School / Advanced Statistics and Data Science I (ABC)

Book
• High School / Advanced Statistics and Data Science I (ABC)
• High School / Statistics and Data Science I (AB)
• High School / Statistics and Data Science II (XCD)
• College / Statistics and Data Science (ABC)
• College / Advanced Statistics and Data Science (ABCD)
• College / Accelerated Statistics and Data Science (XCDCOLLEGE)
• Skew the Script: Jupyter

## 9.5 Error from the Height Model

Regardless of model, error from the model (or residuals) are always calculated the same way for each data point:

$\text{residual}=\text{observed value} - \text{predicted value}$

For regression models, the predicted value of $$Y_i$$ will be right on the regression line. Error, therefore, is calculated based on the vertical gap between a data point’s value on the Y axis (the outcome variable) and its predicted value based on the regression line.

Below we have depicted just 6 data points (in black) and their residuals (firebrick vertical lines) from the Height model.

Note that a negative residual (shown below the regression line) means that the data (e.g., Thumb) was lower than the predicted thumb length given the student’s height.

The residuals from the Height model represent the variation in Thumb that is leftover after we take out the part that can be explained by Height. As an example, take a particular student (highlighted in the plot below) with a thumb length of 63 mm. The residual of 7 means that the student’s thumb is 7 mm longer than would have been predicted for this student based on their height.

Another way we could say this is: controlling for height, this student’s thumb is 7 mm longer than expected. A positive residual indicates that a thumb is long for a person of that height. A negative residual indicates that a thumb is shorter than expected based on the person’s height. These residuals may prompt us to ask, what other stuff besides Height might account for these differences?

### SS Error for the Height Model

Just like for other models we have seen (e.g., the empty model and the group model), the metric we use for quantifying total error from the Height model is the sum of squared residuals from the model, or SS Error. SS Error is calculated from the residuals in the same way as it is for a group model, by squaring and then summing the residuals (see figure below).

Compare the sums of squares for the empty model (SST) and the height model (SSE) for 6 data points in the figures below.

SS Total, Sum of Squared Residuals from empty model SS Error, Sum of Squared Residuals from height model

### Using R to Compare SS for the Height Model and the Empty Model

Just like we did with the group models (e.g., Height2Group model and Sex model), we can use the resid() function to get the residuals from the Height model. We can then square them and sum them to get the SS Error from the model like this:

Height_model <- lm(Thumb ~ Height, data = Fingers)
sum(resid(Height_model)^2)

The code below will calculate SST from the empty model and SSE from the height model. Run it to check our predictions about SSE.

require(coursekata) # this calculates SST empty_model <- lm(Thumb ~ NULL, data=Fingers) print("SST") sum(resid(empty_model)^2) # this calculates SSE Height_model <- lm(Thumb ~ Height, data = Fingers) print("SSE") sum(resid(Height_model)^2) # no test ex() %>% check_error()
CK Code: B5_Code_Error_01
[1] "SST"
11880.2109191083

[1] "SSE"
10063.3491457795

Notice that the SST is the same as it was for the Height2Group model: 11,880. This is because the empty model hasn’t changed; SST is still based on residuals from the grand mean of Thumb. The SSE (10,063) is smaller than SST because the residuals are smaller (as represented in the figures above by shorter lines). Squaring and summing smaller residuals results in a smaller SSE. The total error has been reduced by this regression model.

### Comparing the Regression Line with the Mean

Recall that the mean is the middle of a univariate distribution. It is the balancing point of the distribution, where the residuals are perfectly balanced above and below. In a similar way, the regression line is the middle of a bivariate distribution between two quantitative variables. Just as the sum of the residuals around the mean add up to 0, so too the sum of the residuals around the regression line also add up to 0.

Here’s another cool relationship between the mean and regression line. It turns out that the best-fitting regression line will always pass through a point that is mean of both variables (this is called the point of means). So, if someone’s height is exactly at the mean, their predicted thumb length will also be exactly at the mean.

Finally, just as the mean is the point in the univariate distribution at which the SS Error is minimized, the same is true of error around the regression line. The sum of the squared deviations of the observed points is at its lowest possible value around the best-fitting regression line.