Statistics and Data Science: A Modeling Approach

6.1 The Beauty of Sum of Squares

As it turns out, sum of squares (SS) has a special relationship to the mean. In the previous chapter we extolled the virtues of the mean. Now it’s time to start appreciating the beauty of sum of squares!

The most obvious advantage of SS as a measure of total error is that it is minimized exactly at the mean. And because our goal in statistical modeling is to reduce error, this is a good thing. In any distribution of a quantitative variable, the mean is the point in the distribution at which SS is lower than at any other point. (Be sure to watch the video in the previous section for more explanation on this point.)
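One way to see why this is true is the short sketch below. It uses notation that does not appear elsewhere in this section: \(\bar{Y}\) stands for the mean of the \(n\) scores, and \(c\) stands for any candidate number we might use as the model.

\[\sum_{i=1}^{n}(Y_{i}-c)^2=\sum_{i=1}^{n}(Y_{i}-\bar{Y})^2+2(\bar{Y}-c)\sum_{i=1}^{n}(Y_{i}-\bar{Y})+n(\bar{Y}-c)^2=\sum_{i=1}^{n}(Y_{i}-\bar{Y})^2+n(\bar{Y}-c)^2\]

The middle term drops out because deviations from the mean always sum to zero. The last term, \(n(\bar{Y}-c)^2\), is zero when \(c=\bar{Y}\) and positive for every other choice of \(c\), so SS is smallest exactly at the mean.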

It is worth pointing out that this advantage of SS holds only when our model is the mean. If we were to choose another number, such as the median, as our model of a distribution, SS would generally not be at its lowest there, so we would probably choose a different measure of error. But our focus in this course is primarily on the mean.
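To make this concrete, here is a small sketch comparing the mean and the median as models. The data vector and the helper functions ss() and sad() are made up for illustration. The sum of absolute deviations is used here as one example of a "different measure of error"; the text itself does not name one.

```r
# Hypothetical skewed data (not from the text), chosen so the mean and median differ
y <- c(1, 2, 3, 4, 20)

# Two ways to total up error around a candidate model value (illustrative helpers)
ss  <- function(center) sum((y - center)^2)    # sum of squared deviations
sad <- function(center) sum(abs(y - center))   # sum of absolute deviations

results <- data.frame(
  model = c("mean", "median"),
  value = c(mean(y), median(y)),
  SS    = c(ss(mean(y)), ss(median(y))),
  SAD   = c(sad(mean(y)), sad(median(y)))
)
print(results)
```

For these numbers, SS is 250 around the mean and 295 around the median, while the sum of absolute deviations is 28 around the mean and 21 around the median. Each measure of error favors a different model.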

There are other things about SS that have attracted statisticians over the years. Most of them will be hard to appreciate until you get further into the course. But trust us when we say that the sum of squares will prove its utility, not just because it is minimized at the mean, but because of the way it fits mathematically into the statistics landscape.

At first glance, many of the topics in statistics seem like part of some endless list of unrelated formulas: the mean, the sum of squares, linear models. But hopefully you are starting to see that these fit together. The relationship between the mean and SS is just a peek at the interlocking relationships among all these concepts. Squared deviations will link up with other ideas in statistics later.

It is somewhat like the Pythagorean Theorem. You learned in school that the square of the hypotenuse of a right triangle is equal to the sum of the squares of the other two sides. Thus, \(a^2+b^2=c^2\). Squaring the sides makes everything add up and fit together. But if you don’t square them, the relationship no longer holds: \(a+b\neq{c}\). By using sum of squares as a quantification of total error, lots of things will fit together that otherwise would not.

Finding Sum of Squares

Hopefully we have convinced you that SS goes hand-in-hand with the mean. Even more generally, it goes with the General Linear Model (GLM). So far, we have only explored one model—the empty model—in which \(b_0\) represents the sample mean (which is also our estimate of the parameter, the population mean).

\[Y_{i}=b_{0}+e_{i}\]

R has a handy way of helping us find the sum of squared errors (SS) from a particular model. Remember that we used lm() to create a model based on our TinyFingers data. We called that model TinyEmpty.model.

TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)
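As a quick check that \(b_0\) in the empty model really is the sample mean, here is a minimal sketch. The Thumb values below are an assumption: this section does not list the TinyFingers data, so six values consistent with the output shown later (5 degrees of freedom, SS of 82) are used for illustration.

```r
# Assumed TinyFingers values (illustrative; not listed in this section)
TinyFingers <- data.frame(Thumb = c(56, 60, 61, 63, 64, 68))

# Fit the empty model: Thumb is modeled by a single number, b0
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)

# Pull out the one estimate and compare it with the sample mean
b0 <- unname(coef(TinyEmpty.model))
b0
mean(TinyFingers$Thumb)
```

Both lines should print the same number, because the least-squares estimate for an intercept-only model is the mean of the outcome.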

Once we have this model, we can use a function called anova() to look at the error from this model. ANOVA stands for ANalysis Of VAriance. Analysis means “to break down”, and later we will use this function to break down the variation into parts. But for now, we will use anova() just to figure out how much error there is around the model, measured in sum of squares.

anova(TinyEmpty.model)
Analysis of Variance Table

Response: Thumb
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals  5     82    16.4

There are a bunch of other things in this output that we will talk about soon. But for now, focus your attention on the column labeled “Sum Sq”. We see the same value (82) that we previously calculated with the longer sequence of R commands in which we calculated the residuals, squared them, and then summed the squared residuals.
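The two routes to SS can be put side by side in a short sketch. As before, the TinyFingers values are assumed for illustration, and the names ss_long and ss_anova are made up here.

```r
# Assumed TinyFingers values (illustrative; not listed in this section)
TinyFingers <- data.frame(Thumb = c(56, 60, 61, 63, 64, 68))
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)

# The longer route: get the residuals, square them, then sum the squares
residuals_from_mean <- resid(TinyEmpty.model)
ss_long <- sum(residuals_from_mean^2)

# The shortcut: read the Sum Sq entry from the anova() table
ss_anova <- anova(TinyEmpty.model)["Residuals", "Sum Sq"]

ss_long
ss_anova
```

With these assumed values the residuals are -6, -2, -1, 1, 2, and 6, so both routes give 82.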

Try creating a NULL or empty model of Thumb length using the larger Fingers data frame, and then look at the SS by using anova().

require(tidyverse)
require(mosaic)
require(Lock5Data)
require(supernova)

# create an empty model of Thumb length from Fingers
Empty.model <- lm(Thumb ~ NULL, data = Fingers)

# analyze the model with anova() to get the SS
anova(Empty.model)

Analysis of Variance Table

Response: Thumb
           Df Sum Sq Mean Sq F value Pr(>F)
Residuals 156  11880  76.155

Let’s try calculating the sum of squares a different way, and see if we get the same result.

Try running this code, and see what the result is.

require(tidyverse)
require(mosaic)
require(Lock5Data)
require(supernova)

Empty.model <- lm(Thumb ~ NULL, data = Fingers)

# try running this code
# will this result in the same SS?
sum(resid(Empty.model)^2)

[1] 11880.21

This lines up with the output we got from anova(). Notice, however, that the anova() table displayed the value rounded to the nearest whole number (11880), whereas this alternative calculation shows two places after the decimal point (11880.21). The underlying value is the same; only the printed precision differs.
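If you want the unrounded SS directly from the anova() table, one option is to index into the table rather than read the printout. The sketch below uses a small made-up data frame (not the Fingers data) with non-integer values; the names thumbs, ss_direct, and ss_table are hypothetical.

```r
# Hypothetical measurements (not the Fingers data), with non-integer values
thumbs <- data.frame(Thumb = c(56.2, 60.1, 61.7, 63.3, 64.9, 68.4))
model <- lm(Thumb ~ NULL, data = thumbs)

# SS computed directly from the residuals
ss_direct <- sum(resid(model)^2)

# SS pulled from the anova() table as a stored number, not printed text
ss_table <- anova(model)["Residuals", "Sum Sq"]

# Compare the stored values, then print one with more digits than the table shows
all.equal(ss_direct, ss_table)
print(ss_table, digits = 10)
```

The table stores the full value, so the comparison does not depend on how many digits the printed table happens to show.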