Statistics and Data Science: A Modeling Approach

7.5 Quantifying Model Fit With Sums of Squares

In the empty model, you will recall, we used the mean as the model, that is, as the predicted score for every observation. We developed the intuition that the mean is a better-fitting model (that there is less error around the model) when the spread of the distribution is small than when it is large.

Calculating Sums of Squares: Empty Model (Review)

In the previous chapter, we quantified error using the sum of the squared deviations (SS, or sum of squares) around the mean, a measure that is minimized precisely at the mean. Under the empty model, all of the variation is unexplained—that’s why it is called “empty.” But it does show us clearly how much variation there is left to explain, measured in sum of squares.
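As a quick check on the claim that the sum of squares is smallest at the mean, here is a minimal sketch in base R using the six TinyFingers thumb lengths. The helper name `ss_around` is introduced here for illustration only, and the comparison values 61 and 63 are arbitrary choices on either side of the mean.

```r
# TinyFingers thumb lengths, as defined in this chapter
Thumb <- c(56, 60, 61, 63, 64, 68)

# hypothetical helper: sum of squared deviations around a candidate value
ss_around <- function(y, center) sum((y - center)^2)

ss_mean <- ss_around(Thumb, mean(Thumb))  # deviations taken from the mean (62)
ss_61   <- ss_around(Thumb, 61)           # deviations taken from 61 instead
ss_63   <- ss_around(Thumb, 63)           # deviations taken from 63 instead

c(at_mean = ss_mean, at_61 = ss_61, at_63 = ss_63)
```

The value around the mean is 82, while moving one unit away in either direction raises it to 88.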

Remind yourself how to use the anova() function to get the SS leftover after fitting the empty model for our TinyFingers thumb length data.

require(tidyverse)
require(mosaic)
require(Lock5Data)
require(supernova)

TinyFingers <- data.frame(
  Sex = as.factor(rep(c("female", "male"), each = 3)),
  Thumb = c(56, 60, 61, 63, 64, 68)
)

# here is the code you wrote before
TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers)

# get the SS leftover from TinyEmpty.model
anova(TinyEmpty.model)

Analysis of Variance Table

Response: Thumb
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals  5     82    16.4

Calculating Sums of Squares: Sex Model

How do we quantify the error around our new—more complex—model, where sex is used to predict thumb length?

We quantify error around the more complex model in the same way we did for the empty model. We simply generate the residuals, square them, and then sum them to get the sum of squares left after fitting our model.
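The three steps just described (generate the residuals, square them, sum them) can be sketched by hand in base R. This is only an illustration under the chapter's own definitions of `TinyFingers` and `TinySex.model`; the names `sex_resid` and `ss_sex` are introduced here.

```r
TinyFingers <- data.frame(
  Sex = as.factor(rep(c("female", "male"), each = 3)),
  Thumb = c(56, 60, 61, 63, 64, 68)
)
TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers)

# step 1: residual = observed thumb length minus the model's prediction
sex_resid <- TinyFingers$Thumb - predict(TinySex.model)

# steps 2 and 3: square each residual, then add them up
ss_sex <- sum(sex_resid^2)
ss_sex
```

The same residuals are also available directly from `resid(TinySex.model)`.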

Go ahead and modify the code below so that anova() reports the SS leftover for the TinySex.model.

require(tidyverse)
require(mosaic)
require(Lock5Data)
require(supernova)

TinyFingers <- data.frame(
  Sex = as.factor(rep(c("female", "male"), each = 3)),
  Thumb = c(56, 60, 61, 63, 64, 68)
)

TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers)

# the starting code called anova(Empty.model);
# it is changed here to find the SS of TinySex.model
anova(TinySex.model)

Analysis of Variance Table

Response: Thumb
          Df Sum Sq Mean Sq F value  Pr(>F)  
Sex        1     54      54  7.7143 0.04995 *
Residuals  4     28       7                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We have now calculated two leftover (or residual) sums of squares. The first, 82, is for the empty model. The second, 28, is for the Sex model.

With the empty model, the sum of squares is already as small as a single-number model can make it. We can now take that SS as our starting point: this is how much total error we have to explain. As soon as we add an explanatory variable (in this case Sex) to the model, the sum of squares for error can only decrease or stay the same; it cannot increase. If the new variable has no predictive value, the sum of squares stays the same. But it’s rare for a variable to have no predictive value at all.
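Both cases can be checked on the TinyFingers data. The following is a minimal sketch in base R, not part of the course exercises: `ss_error` is a helper name introduced here, and `Group` is a made-up label (it does not exist in the course data) chosen so that both groups have the same mean thumb length, which gives it no predictive value.

```r
TinyFingers <- data.frame(
  Sex = as.factor(rep(c("female", "male"), each = 3)),
  Thumb = c(56, 60, 61, 63, 64, 68)
)

# hypothetical helper: leftover SS for a fitted lm (sum of squared residuals)
ss_error <- function(model) sum(resid(model)^2)

ss_empty <- ss_error(lm(Thumb ~ NULL, data = TinyFingers))
ss_sex   <- ss_error(lm(Thumb ~ Sex, data = TinyFingers))
ss_reduction <- ss_empty - ss_sex  # error removed by adding Sex

# made-up label for illustration: group "a" = {60, 64}, group "b" = {56, 61, 63, 68},
# so both group means equal the overall mean of 62
TinyFingers$Group <- factor(c("b", "a", "b", "b", "a", "b"))
ss_group <- ss_error(lm(Thumb ~ Group, data = TinyFingers))

c(empty = ss_empty, sex = ss_sex, reduction = ss_reduction, group = ss_group)
```

Adding Sex reduces the leftover SS from 82 to 28, a reduction of 54, which matches the Sum Sq entry in the Sex row of the ANOVA table above. The made-up Group label leaves the leftover SS at 82.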

Visualizing Sums of Squares

Let’s watch another video that explains where we are at this point. In her previous video in Chapter 6, Dr. Ji demonstrated the concept of sum of squares using our TinyFingers data set. We literally drew squares when we “squared the residuals.” She showed that the sum of squared deviations is minimized at the mean.

In this video, Dr. Ji shows us how we can visualize sum of squares from the Sex model, and also how we can compare the sum of squares from the Sex model against the empty model.

If you want to try out the app Dr. Ji uses in this video, you can open the applet at http://www.rossmanchance.com/applets/RegShuffle.htm. Copy and paste the data below into the little “sample data” box to reproduce Dr. Ji’s examples.

Sex Thumb
0 56
0 60
0 61
1 63
1 64
1 68