6.2 The Beauty of Sum of Squares
As it turns out, sum of squares (SS) has a special relationship to the mean. In the previous chapter we extolled the virtues of the mean. Now it’s time to start appreciating the beauty of sum of squares!
The most obvious advantage of SS as a measure of total error is that it is minimized exactly at the mean. And because our goal in statistical modeling is to reduce error, this is a good thing. In any distribution of a quantitative variable, the mean is the point at which the SS is lower than at any other point. (Be sure to watch the video in the previous section for more explanation on this point.)
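To see this in action, here is a minimal sketch in base R (the toy scores and candidate values are ours, purely for illustration). It computes the SS around a few candidate numbers and shows that the smallest SS occurs at the mean.
# toy data, just for illustration; the mean of these scores is 5
scores <- c(2, 4, 4, 6, 9)
# a few candidate "models": some arbitrary numbers plus the mean itself
candidates <- c(3, 4, mean(scores), 6)
# sum of squared deviations around each candidate
sapply(candidates, function(m) sum((scores - m)^2))
The result is 48, 33, 28, 33: the smallest SS (28) is at the mean (5).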
It is worth pointing out that this advantage of SS holds only when our model is the mean. If we were to choose another number, such as the median, as our model of a distribution, we would probably choose a different measure of error; the median, for example, minimizes the sum of absolute deviations (see the sketch below). But our focus in this course is primarily on the mean.
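Here is a similar sketch (again with made-up scores) showing that the sum of absolute deviations, a different measure of total error, is smallest at the median rather than the mean.
# toy data; the median of these scores is 4 and the mean is 5
scores <- c(2, 4, 4, 6, 9)
candidates <- c(3, median(scores), mean(scores), 6)
# sum of absolute deviations around each candidate
sapply(candidates, function(m) sum(abs(scores - m)))
The result is 12, 9, 10, 11: the smallest total absolute error (9) is at the median (4), not the mean.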
At first glance, many of the topics in statistics seem like part of some endless list of unrelated formulas—the mean, the sum of squares, linear models. But hopefully you are starting to see that these fit together. The relationship between the mean and the SS is actually just a peek at the interlocking relationships between all these concepts. The sum of squares will link up with other ideas in statistics later.
It is somewhat like the Pythagorean Theorem. You learned in school that the square of the hypotenuse of a right triangle is equal to the sum of the squares of the two sides. Thus, \(a^2+b^2=c^2\). Squaring the sides makes everything add up and fit together. But if you don’t square them, the theorem no longer holds: \(a+b\neq{c}\). By using sum of squares as a quantification of total error, lots of things will fit together that otherwise would not.
Finding Sum of Squares
Hopefully we have convinced you that SS goes hand-in-hand with the mean. Even more generally, it goes with the General Linear Model (GLM). So far, we have only explored one model—the empty model—in which \(b_0\) represents the sample mean (which is also our estimate of the parameter, the population mean).
\[Y_{i}=b_{0}+e_{i}\]
R has a handy way of helping us find the sum of squared errors (SS) from a particular model. Remember that we used lm() to create a model based on our TinyFingers data. We called that model Tiny_empty_model.
Tiny_empty_model <- lm(Thumb ~ NULL, data = TinyFingers)
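As a quick check (this sketch assumes TinyFingers is still in your environment from the previous chapter), the single estimate in this model, \(b_0\), is just the sample mean of Thumb.
# the estimate b0 from the empty model...
coef(Tiny_empty_model)
# ...matches the sample mean of Thumb
mean(TinyFingers$Thumb)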
Once we have this model, we can use a function called supernova() to create an ANOVA table that lets us look at the error from this model. ANOVA stands for ANalysis Of VAriance. Analysis means “to break down”, and later we will use this function to break down the variation into parts. But for now, we will use supernova() just to figure out how much error there is around the model, measured in sum of squares.
supernova(Tiny_empty_model)
Analysis of Variance Table (Type III SS)
Model: Thumb ~ NULL

                            SS  df     MS   F PRE   p
----- ----------------- ------ --- ------ --- --- ---
Model (error reduced) |    --- ---    --- --- --- ---
Error (from model)    |    --- ---    --- --- --- ---
----- ----------------- ------ --- ------ --- --- ---
Total (empty model)   | 82.000   5 16.400
There are a bunch of other things in this output that we will talk about soon. But for now, focus your attention on the row labeled “Total (empty model)” and the column labeled “SS”. We see the same value (82.000) that we previously calculated with the longer sequence of R commands in which we calculated the residuals, squared them, and then summed the squared residuals.
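As a reminder, that longer sequence looked something like this (a sketch; the column names Residual and SquaredResidual are ours).
# find each residual from the empty model, then square it
TinyFingers$Residual <- resid(Tiny_empty_model)
TinyFingers$SquaredResidual <- TinyFingers$Residual^2
# summing the squared residuals gives the same SS: 82
sum(TinyFingers$SquaredResidual)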
Try creating a NULL (or empty) model of Thumb length using the larger Fingers data frame, and then look at the SS by using supernova().
require(coursekata)
# create an empty model of Thumb length from Fingers
empty_model <- lm(Thumb ~ NULL, data = Fingers)
# analyze the model with supernova() to get the SS
supernova(empty_model)
Analysis of Variance Table (Type III SS)
Model: Thumb ~ NULL

                               SS  df     MS   F PRE   p
----- ----------------- --------- --- ------ --- --- ---
Model (error reduced) |       --- ---    --- --- --- ---
Error (from model)    |       --- ---    --- --- --- ---
----- ----------------- --------- --- ------ --- --- ---
Total (empty model)   | 11880.211 156 76.155
Let’s try calculating the sum of squares a different way, and see if we get the same result. Try running the code below.
require(coursekata)
empty_model <- lm(Thumb ~ NULL, data = Fingers)
# try running this code
# will this result in the same SS?
sum(resid(empty_model)^2)
11880.2109191083
This lines up with the output we got from supernova(). Notice, however, that the supernova() output rounded the SS to three places after the decimal point, whereas this alternative calculation printed many more digits.
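If you want the two results to print identically, you can round the hand-computed SS yourself (round() is base R; three digits simply matches what supernova() displayed).
# round the hand-computed SS to match supernova()'s display
round(sum(resid(empty_model)^2), digits = 3)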