Statistics and Data Science: A Modeling Approach

6.2 The Beauty of Sum of Squares

As it turns out, sum of squares (SS) has a special relationship to the mean. In the previous chapter we extolled the virtues of the mean. Now it’s time to start appreciating the beauty of sum of squares!

The most obvious advantage of SS as a measure of total error is that it is minimized exactly at the mean. And because our goal in statistical modeling is to reduce error, this is a good thing. In any distribution of a quantitative variable, the mean is the point in the distribution at which SS is lower than at any other point. (Be sure to watch the video in the previous section for more explanation on this point.)
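One way to see why, sketched here with assumed notation (\(\bar{Y}\) for the mean, \(n\) for the number of observations, and \(c\) for any candidate number we might use as a model instead), is to split the sum of squares around \(c\) into two parts:

\[\sum_{i=1}^{n}(Y_{i}-c)^{2}=\sum_{i=1}^{n}(Y_{i}-\bar{Y})^{2}+n(\bar{Y}-c)^{2}\]

The cross term drops out because deviations from the mean always sum to zero. The first part on the right is the SS around the mean, and the second part can never be negative; it equals zero only when \(c=\bar{Y}\). So any other choice of \(c\) can only make the SS larger.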

It is worth pointing out that this advantage of SS holds only when our model is the mean. If we were to choose another number, such as the median, as our model of a distribution, we would probably choose a different measure of error. But our focus in this course is primarily on the mean.
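To make this concrete, here is a small sketch in base R. The numbers are hypothetical, chosen only so that the mean and the median differ, and the helper functions sum_sq() and sum_abs() are names made up for this illustration. The sum of absolute deviations is used as one example of a "different measure of error"; it is an assumption on our part, not something introduced in the text above.

```r
# Hypothetical scores, chosen so the mean (4) and the median (3) differ
y <- c(1, 2, 3, 4, 10)

# Two possible measures of total error around a model value m
# (illustrative helper names, not part of any package)
sum_sq  <- function(m) sum((y - m)^2)   # sum of squares
sum_abs <- function(m) sum(abs(y - m))  # sum of absolute deviations

# Sum of squares is smaller at the mean than at the median
sum_sq(mean(y))     # 50
sum_sq(median(y))   # 55

# Sum of absolute deviations is smaller at the median than at the mean
sum_abs(mean(y))    # 12
sum_abs(median(y))  # 11
```

With these values, SS favors the mean while the sum of absolute deviations favors the median, which is why the choice of model and the choice of error measure go together.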

At first glance, many of the topics in statistics seem like part of some endless list of unrelated formulas—the mean, the sum of squares, linear models. But hopefully you are starting to see that these fit together. The relationship between the mean and the SS is actually just a peek at the interlocking relationships between all these concepts. The sum of squares will link up with other ideas in statistics later.

It is somewhat like the Pythagorean Theorem. You learned in school that the square of the hypotenuse of a right triangle is equal to the sum of the squares of the other two sides. Thus, \(a^2+b^2=c^2\). Squaring the sides makes everything add up and fit together. But if you don’t square them, the theorem no longer holds: \(a+b\neq{c}\). For example, in a right triangle with sides 3, 4, and 5, \(3^2+4^2=5^2\), but \(3+4\neq{5}\). By using sum of squares as a quantification of total error, lots of things will fit together that otherwise would not.

Finding Sum of Squares

Hopefully we have convinced you that SS goes hand-in-hand with the mean. Even more generally, it goes with the General Linear Model (GLM). So far, we have only explored one model—the empty model—in which \(b_0\) represents the sample mean (which is also our estimate of the parameter, the population mean).

\[Y_{i}=b_{0}+e_{i}\]
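Rearranging this model gives each residual as \(e_{i}=Y_{i}-b_{0}\). With that, the sum of squares for the empty model can be written as follows (the summation index and \(n\) for the number of observations are assumed notation):

\[SS=\sum_{i=1}^{n}e_{i}^{2}=\sum_{i=1}^{n}(Y_{i}-b_{0})^{2}\]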

R has a handy way of helping us find the sum of squared errors (SS) from a particular model. Recall that we used lm() to fit a model to our TinyFingers data and saved it as Tiny_empty_model.

Tiny_empty_model <- lm(Thumb ~ NULL, data = TinyFingers)

Once we have this model, we can use a function called supernova() to create an ANOVA table that allows us to look at the error from this model. ANOVA stands for ANalysis Of VAriance. Analysis means “to break down”, and later we will use this function to break down the variation into parts. But for now, we will use supernova() just to figure out how much error there is around the model, measured in sum of squares.

supernova(Tiny_empty_model)
Analysis of Variance Table (Type III SS)
Model: Thumb ~ NULL

                            SS  df     MS   F PRE   p
----- ----------------- ------ --- ------ --- --- ---
Model (error reduced) |    --- ---    --- --- --- ---
Error (from model)    |    --- ---    --- --- --- ---
----- ----------------- ------ --- ------ --- --- ---
Total (empty model)   | 82.000   5 16.400            

There are a bunch of other things in this output that we will talk about soon. But for now, focus your attention on the row labeled “Total (empty model)” and the column labeled “SS”. We see the same value (82.000) that we previously calculated with the longer sequence of R commands in which we calculated the residuals, squared them, and then summed the squared residuals.
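As a reminder of what that longer sequence looks like, here is a minimal sketch in base R. The six thumb lengths are hypothetical values, chosen so that the total comes out to 82; they are not copied from the TinyFingers data frame, and the variable names are made up for this illustration.

```r
# Hypothetical thumb lengths in mm (illustrative values, not taken from the text)
thumb <- c(56, 60, 61, 63, 64, 68)

# The empty model predicts the mean for every observation
b0 <- mean(thumb)

# Residuals: how far each observation is from the model's prediction
res <- thumb - b0

# Square the residuals, then add them up to get the sum of squares
ss_total <- sum(res^2)

b0        # 62
ss_total  # 82
```

The supernova() function reports this same kind of quantity in the SS column without our having to compute the residuals ourselves.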

Try creating a NULL or empty model of Thumb length using the larger Fingers data frame, and then look at the SS by using supernova().

require(coursekata)

# create an empty model of Thumb length from Fingers
empty_model <- lm(Thumb ~ NULL, data = Fingers)

# analyze the model with supernova() to get the SS
supernova(empty_model)
Analysis of Variance Table (Type III SS)
Model: Thumb ~ NULL

                               SS  df     MS   F PRE   p
----- ----------------- --------- --- ------ --- --- ---
Model (error reduced) |       --- ---    --- --- --- ---
Error (from model)    |       --- ---    --- --- --- ---
----- ----------------- --------- --- ------ --- --- ---
Total (empty model)   | 11880.211 156 76.155            

Let’s try calculating the sum of squares a different way, and see if we get the same result.

Try running this code, and see what the result is.

require(coursekata)

empty_model <- lm(Thumb ~ NULL, data = Fingers)

# try running this code
# will this result in the same SS?
sum(resid(empty_model)^2)
11880.2109191083

This matches the output we got from supernova(). Notice, however, that supernova() displays the SS to three places after the decimal point (11880.211), whereas this alternative calculation prints many more.
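If you want to confirm that the two numbers agree, one option is to format the longer value to three decimal places yourself. The sketch below uses base R's formatC(); the variable names are made up for this illustration, and the number is simply the value printed above.

```r
# The SS printed by sum(resid(empty_model)^2) above
ss_full <- 11880.2109191083

# Format with exactly three digits after the decimal point
ss_text <- formatC(ss_full, format = "f", digits = 3)

ss_text  # "11880.211"
```

The result is the same 11880.211 that appears in the SS column of the supernova() table.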
