2.10 The Beauty of Sum of Squares
As it turns out, the sum of squares (SS) has a special relationship to the mean: SS is minimized exactly at the mean. Let's look at why this occurs and why it's important.
In any distribution of a quantitative variable, the mean is the point in the distribution at which SS is lower than at any other single point. Because our goal in statistical modeling is to reduce error, and because we are going to measure total error in sums of squares, this is quite convenient. The sum of squares from the empty model is the least amount of error we can attain without adding in an explanatory variable.
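To see this concretely, here is a minimal base-R sketch (using made-up numbers, not the Ames data) that computes SS around several candidate points. The mean gives the smallest value; any other point gives a larger SS.

```r
# Made-up scores, just to illustrate the idea
scores <- c(2, 4, 4, 7, 8)

# Sum of squared deviations around any candidate point
ss_around <- function(point) sum((scores - point)^2)

ss_around(mean(scores))  # 24: SS is smallest at the mean (which is 5)
ss_around(4)             # 29: larger at any other point
ss_around(6)             # 29
```

Try a few other candidate points yourself; every one of them will produce an SS larger than the SS around the mean.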
Finding Sum of Squares
Hopefully we have convinced you that SS goes hand in hand with the mean. More generally, SS will serve us well as a measure of error around any model in the General Linear Model (GLM) family. So far, we have explored only one member of that family, the empty model, in which our prediction for \(Y\) is a single value, \(b_0\): the sample mean.
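As a quick check of that claim, this base-R sketch (with made-up prices, not the Ames data) fits an empty model and confirms that its single estimate \(b_0\) is the sample mean.

```r
# Made-up prices, just to illustrate
prices <- c(150, 200, 250, 300)

# The empty model: no explanatory variable, just an intercept b0
empty_model <- lm(prices ~ NULL)

coef(empty_model)  # (Intercept) = 225
mean(prices)       # 225, the same value
```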
R has a handy way of helping us find the sum of squared errors (SS) from a particular model. Remember that we stored our empty model for home prices in Ames in the object empty_model:

empty_model <- lm(PriceK ~ NULL, data = Ames)
Once we have this model, we can pass it into a function called supernova() to create an ANOVA table that shows us the error from this model. ANOVA stands for ANalysis Of VAriance. Analysis means "to break down," and later we will use this function to break down the variation into parts. For now, though, we will use supernova() just to figure out how much error there is around the empty model, measured in sums of squares.
require(coursekata)
# we’ve created the empty model
empty_model <- lm(PriceK ~ NULL, data = Ames)
# generate the ANOVA table
supernova(empty_model)
Analysis of Variance Table (Type III SS)
Model: PriceK ~ NULL

                                SS  df       MS   F PRE   p
----- --------------- | ---------- --- -------- --- --- ---
Model (error reduced) |        --- ---      --- --- --- ---
Error (from model)    |        --- ---      --- --- --- ---
----- --------------- | ---------- --- -------- --- --- ---
Total (empty model)   | 633717.215 184 3444.115
There are a bunch of other things in this output that we will talk about soon. But for now, focus your attention on the row labeled "Total (empty model)" and the column labeled "SS". We see the same value (633717.2) that we previously calculated for the sum of squares by using resid(), squaring, and summing.
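If you want to verify that equivalence yourself, here is a small base-R sketch (with made-up prices, not the Ames data) showing that summing the squared residuals from an empty model gives the same SS as summing the squared deviations from the mean.

```r
# Made-up prices, just to illustrate
prices <- c(150, 200, 250, 300)
empty_model <- lm(prices ~ NULL)

# SS the "long way": resid(), square, sum
sum(resid(empty_model)^2)       # 12500

# Same as the squared deviations from the mean
sum((prices - mean(prices))^2)  # 12500
```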
We are actually going to introduce different kinds of SS later in the course. To help us keep track, we will start calling the SS from the empty model the SS Total.
The supernova() function saves us from writing a bunch of commands to get the SS Total using multiple R functions. As we build more complex models, it will be very helpful to us.