7.6 Quantifying Model Fit With Sums of Squares

In the empty model, you will recall, we used the mean as the model, i.e., as the predicted score for every observation. We developed the intuition that the mean is a better-fitting model (that is, there is less error around the model) when the spread of the distribution is small than when it is large.

Calculating Sums of Squares: Empty Model (Review)

In the previous chapter, we quantified error using the sum of the squared deviations (SS, or sum of squares) around the mean, a measure that is minimized precisely at the mean. Under the empty model, all of the variation is unexplained; that’s why it is called “empty.” But it does show us clearly how much variation there is left to explain, measured in sums of squares.
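As a quick check on this definition, here is a minimal sketch in base R that computes the sum of squared deviations by hand for the six TinyFingers thumb lengths. The helper name `ss_around` is an assumption introduced only for this illustration; it is not part of coursekata.

```r
# TinyFingers thumb lengths, as used throughout this chapter
Thumb <- c(56, 60, 61, 63, 64, 68)

# hypothetical helper: sum of squared deviations of x around a candidate value
ss_around <- function(x, value) {
  sum((x - value)^2)
}

# SS around the mean is SS Total for the empty model
ss_total <- ss_around(Thumb, mean(Thumb))
ss_total  # 82

# SS around any other candidate value is larger, because the mean minimizes SS
ss_around(Thumb, 61)  # 88
ss_around(Thumb, 63)  # 88
```

The first value should agree with the Total row that supernova() reports for the empty model.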

Remind yourself how to use the supernova() function to get the SS left over after fitting the empty model (SS Total) for our TinyFingers thumb length data.

library(coursekata)

# set up the TinyFingers data
TinyFingers <- data.frame(
  Sex = as.factor(rep(c("female", "male"), each = 3)),
  Thumb = c(56, 60, 61, 63, 64, 68)
)

# fit the empty model and the Sex model
Tiny_empty_model <- lm(Thumb ~ NULL, data = TinyFingers)
Tiny_Sex_model <- lm(Thumb ~ Sex, data = TinyFingers)

# save predictions and residuals back into the data frame
TinyFingers <- TinyFingers %>%
  mutate(
    Sex_predicted = predict(Tiny_Sex_model),
    Sex_resid = Thumb - Sex_predicted,
    Sex_resid2 = resid(Tiny_Sex_model),
    empty_pred = predict(Tiny_empty_model)
  )

# get the SS left over from Tiny_empty_model
supernova(Tiny_empty_model)
Analysis of Variance Table (Type III SS)
Model: Thumb ~ NULL

                            SS  df     MS   F PRE   p
----- ----------------- ------ --- ------ --- --- ---
Model (error reduced) |    --- ---    --- --- --- ---
Error (from model)    |    --- ---    --- --- --- ---
----- ----------------- ------ --- ------ --- --- ---
Total (empty model)   | 82.000   5 16.400            

Calculating Sums of Squares: Sex Model

How do we quantify the error around our new, more complex model, where Sex is used to predict thumb length?

We quantify error around the more complex model in the same way we did for the empty model. We simply generate the residuals based on predictions of the Sex model, square them, and then sum them to get the sum of squares error from the model.
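The three steps just described (generate residuals, square them, sum them) can be sketched by hand in base R. This is only an illustration under the assumption that the TinyFingers data are as defined earlier in the chapter; the variable names `Sex_resid` and `ss_error` are chosen here for readability.

```r
# TinyFingers data, as defined earlier in this chapter
TinyFingers <- data.frame(
  Sex = as.factor(rep(c("female", "male"), each = 3)),
  Thumb = c(56, 60, 61, 63, 64, 68)
)
Tiny_Sex_model <- lm(Thumb ~ Sex, data = TinyFingers)

# residual = observed thumb length minus the Sex model's prediction
# (the prediction is the mean of each person's own Sex group)
Sex_resid <- TinyFingers$Thumb - predict(Tiny_Sex_model)

# square the residuals, then sum them to get SS Error
ss_error <- sum(Sex_resid^2)
ss_error  # 28
```

The supernova() function reports the same quantity in its Error row, so in practice there is no need to compute it by hand.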

Go ahead and modify this code to get the SS Error from the Tiny_Sex_model.

library(coursekata)

# set up the TinyFingers data and fit the Sex model
TinyFingers <- data.frame(
  Sex = as.factor(rep(c("female", "male"), each = 3)),
  Thumb = c(56, 60, 61, 63, 64, 68)
)
Tiny_Sex_model <- lm(Thumb ~ Sex, data = TinyFingers)

# starter code to modify: supernova(empty_model)
# modified to find the SS of Tiny_Sex_model
supernova(Tiny_Sex_model)
Analysis of Variance Table (Type III SS)
Model: Thumb ~ Sex

                            SS df     MS     F    PRE     p
----- ----------------- ------  - ------ ----- ------ -----
Model (error reduced) | 54.000  1 54.000 7.714 0.6585 .0499
Error (from model)    | 28.000  4  7.000                   
----- ----------------- ------  - ------ ----- ------ -----
Total (empty model)   | 82.000  5 16.400                   

We have now calculated two leftover (or residual) sums of squares: SS Total and SS Error. SS Total is the total error from the empty model (82); SS Error is the error left over from the Sex model (28).
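To see how these two numbers relate to the table, here is a small sketch that pulls both sums of squares directly from the fitted models and takes their difference. It assumes base R's deviance() function, which for a model fit with lm() returns the residual sum of squares; the variable names are chosen for this illustration.

```r
# TinyFingers data and the two models from this chapter
TinyFingers <- data.frame(
  Sex = as.factor(rep(c("female", "male"), each = 3)),
  Thumb = c(56, 60, 61, 63, 64, 68)
)
Tiny_empty_model <- lm(Thumb ~ NULL, data = TinyFingers)
Tiny_Sex_model <- lm(Thumb ~ Sex, data = TinyFingers)

# residual sum of squares for each model
ss_total <- deviance(Tiny_empty_model)  # error around the empty model
ss_error <- deviance(Tiny_Sex_model)    # error around the Sex model

# how much error the Sex model removed, relative to the empty model
ss_reduced <- ss_total - ss_error

c(SS_Total = ss_total, SS_Error = ss_error, SS_Reduced = ss_reduced)
```

The difference, 82 - 28 = 54, is the value shown in the Model (error reduced) row of the supernova table above.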

SS Total is the smallest SS we could have without adding an explanatory variable to the model. It represents the total variation in the outcome variable that we would want to explain. Taking that as our starting point, we can reduce the error by adding an explanatory variable into the model (in this case Sex).

Adding an explanatory variable to our model can never increase the sum of squares for error; it can only decrease it or, at worst, leave it unchanged. If the new model does not make better predictions than the empty model, then the sum of squares stays the same. But it’s rare for an explanatory variable to have no predictive value at all.
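One way to convince yourself of this is to add a predictor that has nothing to do with thumb length and compare the two sums of squares. The sketch below is only an illustration: the `Noise` variable, the seed, and the model name `Tiny_noise_model` are all made up for this example and are not part of the TinyFingers data.

```r
set.seed(1)  # arbitrary seed, only so the example is repeatable

# TinyFingers thumb lengths plus a made-up, unrelated predictor
TinyFingers <- data.frame(
  Thumb = c(56, 60, 61, 63, 64, 68)
)
TinyFingers$Noise <- rnorm(nrow(TinyFingers))  # hypothetical random values

Tiny_empty_model <- lm(Thumb ~ NULL, data = TinyFingers)
Tiny_noise_model <- lm(Thumb ~ Noise, data = TinyFingers)

# residual sums of squares for the two models
ss_total <- deviance(Tiny_empty_model)
ss_error_noise <- deviance(Tiny_noise_model)

# SS Error from the noise model should be no larger than SS Total
c(SS_Total = ss_total, SS_Error_Noise = ss_error_noise)
```

Even a meaningless predictor usually shaves a little off the error just by chance, which is why a reduction in SS on its own does not show that a variable is a useful predictor.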

Visualizing Sums of Squares

Let’s watch another video that explains where we are at this point. In her previous video in Chapter 6, Dr. Ji demonstrated the concept of sum of squares using our TinyFingers data set. We literally drew squares when we “squared the residuals.” She showed that the sum of squared deviations is minimized at the mean.

In this video, Dr. Ji shows us how we can visualize sum of squares from the Sex model, and also how we can compare the sum of squares from the Sex model against the empty model.


If you want to try out the app Dr. Ji uses in this video, you can use the sum of squares applet at http://www.rossmanchance.com/applets/RegShuffle.htm. Copy and paste the data below into the “sample data” box to reproduce Dr. Ji’s examples.

Sex Thumb
0 56
0 60
0 61
1 63
1 64
1 68
