2.10 The Beauty of Sum of Squares

As it turns out, sum of squares (SS) has a special relationship to the mean: the SS is minimized exactly at the mean. Take a look at the video below that explains why this occurs and why it’s important:

Video Transcript

In any distribution of a quantitative variable, the mean is the point in the distribution at which SS is lower than at any other single point. Because our goal in statistical modeling is to reduce error, and because we are going to measure total error in sums of squares, this is quite convenient. The sum of squares from the empty model is the least amount of error we can attain without adding in an explanatory variable.
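To see this idea concretely, here is a small sketch (not from the text, using a made-up set of numbers) that computes the SS around several candidate values and shows that it is smallest at the mean:

# a made-up set of numbers and some candidate "centers" to try
scores <- c(2, 4, 4, 7, 8)
candidates <- seq(1, 9, by = 0.5)

# sum of squared deviations around each candidate value
ss_around <- sapply(candidates, function(center) sum((scores - center)^2))

# the candidate with the smallest SS turns out to be the mean (5)
candidates[which.min(ss_around)]
mean(scores)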

Finding Sum of Squares

Hopefully we have convinced you that SS goes hand-in-hand with the mean. Even more generally, the SS is going to serve us well as a measure of error around any model that is part of the General Linear Model (GLM) family. So far, we have only explored one member of that family—the empty model—in which our prediction for \(Y\) is a single value, \(b_0\), the sample mean.

R has a handy way of helping us find the sum of squared errors (SS) from a particular model. Remember that we stored our empty model for home prices in Ames in this object: empty_model.

empty_model <- lm(PriceK ~ NULL, data = Ames)
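As a quick check (just a sketch, using the same Ames data frame and PriceK variable as above), the single estimate \(b_0\) from this model is simply the sample mean:

coef(empty_model)    # the single estimate b0 (labeled "(Intercept)")
mean(Ames$PriceK)    # the same number: the sample mean of home prices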

Once we have this model, we can pass that model into a function called supernova() to create an ANOVA table that shows us the error from this model. ANOVA stands for ANalysis Of VAriance. Analysis means “to break down”, and later we will use this function to break down the variation into parts. But for now, we will use supernova() just to figure out how much error there is around the empty model, measured in sum of squares.

require(coursekata)

# we’ve created the empty model
empty_model <- lm(PriceK ~ NULL, data = Ames)

# generate the ANOVA table
supernova(empty_model)
Analysis of Variance Table (Type III SS)
Model: PriceK ~ NULL
 
                                SS  df       MS   F PRE   p
----- --------------- | ---------- --- -------- --- --- ---
Model (error reduced) |        --- ---      --- --- --- ---
Error (from model)    |        --- ---      --- --- --- ---
----- --------------- | ---------- --- -------- --- --- ---
Total (empty model)   | 633717.215 184 3444.115  

There are a bunch of other things in this output that we will talk about soon. But for now, focus your attention on the row labeled “Total (empty model)” and the column labeled “SS”. We see the same value (633717.2) that we previously calculated for the sum of squares by using resid(), squaring, and summing.
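If you want to verify this for yourself, the “by hand” calculation the paragraph refers to looks something like this (a sketch, assuming the empty_model object created above):

# square each residual from the empty model, then add them up
sum(resid(empty_model)^2)    # 633717.2, the same value as SS Total above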

We are actually going to introduce different kinds of SS later in the course. To help us keep track, we will start calling the SS from the empty model the SS Total.

The supernova() function saves us from writing a bunch of commands to get the SS Total using multiple R functions. As we make more complex models, this function will be very helpful to us.
