list

# Statistics and Data Science: A Modeling Approach

## 7.6 Comparing Two Models: Proportional Reduction in Error

We have seen intuitively how we might compare two models (the empty model and the Sex model) using sums of squares. Now we are going to develop a table for summarizing this comparison, and a new statistic—Proportional Reduction in Error (PRE)—that tells us how much better the complex model fits compared with the simple (or in this case, empty) model.

### A New R Function: supernova()

Using the anova() function, we had to generate two separate tables to get all our sums of squares, one for the empty model and one for the Sex model. But what we really want to do is compare these two models. We want to see which one fits better, and how much better. We’ve created a new R function to help us do that, which we call supernova().

Let’s run supernova() on the TinyEmpty.model and see what the results look like. Press Run in the DataCamp window to run this code.

 require(tidyverse) require(mosaic) require(Lock5Data) require(supernova) TinyFingers <- data.frame( Sex = as.factor(rep(c("female", "male"), each = 3)), Thumb = c(56, 60, 61, 63, 64, 68) ) TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers) TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers) TinyFingers <- TinyFingers %>% mutate( Sex.predicted = predict(TinySex.model), Sex.resid = Thumb - Sex.predicted, Sex.resid2 = resid(TinySex.model), Empty.pred = predict(TinyEmpty.model) )   # run this code supernova(TinyEmpty.model)   supernova(TinyEmpty.model)   ex() %>% check_output_expr("supernova(TinyEmpty.model)") 
Just press Run
DataCamp: ch7-12

Analysis of Variance Table
Outcome variable: Thumb
Model: lm(formula = Thumb ~ NULL, data = TinyFingers)

SS  df     MS   F PRE   p
----- ----------------- ------ --- ------ --- --- ---
Model (error reduced) |    --- ---    --- --- --- ---
Error (from model)    |    --- ---    --- --- --- ---
----- ----------------- ------ --- ------ --- --- ---
Total (empty model)   |     82   5   16.4

This output table has a lot of columns, most of which we haven’t discussed. So, don’t worry at this point that you don’t really know what they are. As we proceed through the course, we will be filling in the rest of this table, which will provide you a really useful way of comparing models.

For now, notice that the only row of the table with any numbers is the last row. You actually know what these numbers mean. First, under SS we see 82, which you probably recognize as the sum of squares for the empty model. They call it Total Sum of Squares.

The next column, labeled df, shows us degrees of freedom. So, for our tiny sample of 6, degrees of freedom is 5, or n-1.

The third column is labeled MS, which stands for mean square error. MS is calculated by dividing SS (82) by degrees of freedom (5), which results in 16.4.

SS divided by df is the formula for variance, which you learned in the previous chapter. Mean square error is like saying the average squared error from the mean. That idea is called variance. It’s the same thing.

### Introduction to Degrees of Freedom

Technically, degrees of freedom of an estimate is the number of independent pieces of information that went into calculating the estimate. We find it’s helpful to think about degrees of freedom as a budget. The more data (represented by n) you have, the more degrees of freedom you have. Whenever you estimate a parameter, you spend (or lose) a degree of freedom because a parameter will limit the freedom of those data points.

What does it mean to “limit the freedom of data points”? Let’s say we have a distribution with three scores and the mean is 5. If we know the mean is 5, do we really have three data points that could be any value? Not really. If we know the mean is 5, then if we know any two of the three numbers, we can figure out the third. So even though two of the data points can be any value, the third is restricted because the mean has to be 5. So, we say there are only two (or n-1) degrees of freedom.

In the supernova table, you see that so far, there is only one row filled in: the bottom row labeled “Total (empty model).” The empty model estimates one parameter, $$b_0$$, which is like having spent one degree of freedom. Our initial budget was 6 because we started off with 6 data points, $$n$$. Given that we spent one, the degrees of freedom is now 5, $$n - 1$$.

We will delve deeper into the concept of degrees of freedom as we go.

### Using supernova() to Examine the Sex Model

Now modify the code in the DataCamp window below to run supernova() on the TinySex.model, and see what you get.

 require(tidyverse) require(mosaic) require(Lock5Data) require(supernova) TinyFingers <- data.frame( Sex = rep(c("female", "male"), each = 3), Thumb = c(56, 60, 61, 63, 64, 68) ) TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers) TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers)   # modify this code for TinySex.model supernova()   supernova(TinySex.model)   ex() %>% check_output_expr("supernova(TinySex.model)") 
Make sure to specify the correct model
DataCamp: ch7-13

Analysis of Variance Table
Outcome variable: Thumb
Model: lm(formula = Thumb ~ Sex, data = TinyFingers)

SS df   MS      F    PRE   p
----- ----------------- --  - ---- ------ ------ -----
Model (error reduced) | 54  1 54.0 7.7143 0.6585 .0499
Error (from model)    | 28  4  7.0
----- ----------------- --  - ---- ------ ------ -----
Total (empty model)   | 82  5 16.4

Looking at this table, we can see a combined view of all the information that we previously got from two anova() tables, one for the empty model and one for the Sex model. The bottom row, labeled “Total”, still shows us the variation around the empty model—we see the SS of 82.

The first two rows show us the result of fitting the Sex model. The row labeled “Error” shows us the SS (28) left after taking out the effect of Sex. We saw before that the Sex model fit better than the empty model, but now we can clearly see what that difference is in a single table. The SS left in the empty model was 82, but now that we have added Sex as an explanatory variable, the SS is only 28.

### Partitioning Sums of Squares

Something new comes out in this table (reproduced again, below) that wasn’t so obvious before. Look just at the column labeled SS (in blue). If it looks like an addition problem to you, you are right. The two rows associated with the Sex model (Model and Error) add up to the Total SS (the empty model). So, 54 + 28 = 82.

Analysis of Variance Table
Outcome variable: Thumb
Model: lm(formula = Thumb ~ Sex, data = TinyFingers)

SS df   MS      F    PRE   p
----- ----------------- --  - ---- ------ ------ -----
Model (error reduced) | 54  1 54.0 7.7143 0.6585 .0499
Error (from model)    | 28  4  7.0
----- ----------------- --  - ---- ------ ------ -----
Total (empty model)   | 82  5 16.4


It turns out this is true about sums of squares. The total sum of squares in the outcome variable—which is the sum of squared deviations around the empty model, or mean—can be equally partitioned into the part that is due to the model itself (we will come back to that in just a moment), and the part that is still left after the model is subtracted out.

We want to take a moment to appreciate why the ANOVA tables (printed by the functions anova() and supernova()) are called ANalysis Of VAriance tables. “Analyze” often means to break up into parts. Analysis of variance is when we break apart, or partition, variation. SS is one measure of variation that we are analyzing here.

So how do we interpret the SS Model (the 54, in this case)? We are going to think of it as the reduction in error (measured in sums of squares) due to the Sex model when compared to the empty model. Let’s use this diagram to help explain this idea. In the diagram, the blue bar represents SS Total, or the total amount of variation in the outcome variable. SS Total, recall, is the sum of the squared distances of each score from the empty model prediction (i.e., the mean). (We saved these distances earlier in a variable called Empty.resid.)

The orange bar (SS Model) represents the part of this total error that is explained by the Sex model.

The area not covered by the orange bar (i.e., the grey bar) represents the amount of error left unexplained from the Sex model. This leftover SS, called SS Error, is based on the distance of each score from the mean of its group (male or female) — i.e., from its model prediction under the two-group model. (We saved these distances earlier in a variable called Sex.resid.)

The orange bar, therefore, also represents the reduction in error that has been achieved by the Sex model in comparison to the empty model.

Just as DATA = MODEL + ERROR, SS Total = SS Model + SS Error. To calculate SS Model (i.e., the sum of squares that is explained by the Sex model), we can simply subtract the SS Error (error around the Sex model) from SS Total (error around the mean, or the empty model).

If SS Total is based on Empty.resid, and SS Error is based on Sex.resid, what distances are used to calculate the SS Model? This is the sum of squared distances between the predictions of the Sex model and the predictions of the empty model. You can think of SS Total = SS Model + SS Error in this more detailed way:

SS(Thumb to Empty.model) = SS(Sex.model to Empty.model) + SS(Thumb to Sex.model)

To directly calculate SS Model, we need to figure out how much error has been reduced by the Sex model in comparison to the empty model. This reduced error is represented by the distance of each person’s predicted score under the Sex.model to their predicted score under the empty model.

To make this concrete, let’s use R to calculate this distance for each person in the TinyFingers data set. We can add a new variable called ErrorReduced in TinyFingers.

TinyFingers$ErrorReduced <- TinyFingers$Sex.predicted - TinyFingers$Empty.pred To calculate the sum of squares for error reduced, we square each person’s value for ErrorReduced, and sum these squares up across people. sum(TinyFingers$ErrorReduced^2)

Let’s try this to see if we get 54 (the SS Model, error reduced, from our supernova table). Try putting all this together in the DataCamp window below.

 require(tidyverse) require(mosaic) require(Lock5Data) require(supernova) TinyFingers <- data.frame( Sex = as.factor(rep(c("female", "male"), each = 3)), Thumb = c(56, 60, 61, 63, 64, 68) ) TinyEmpty.model <- lm(Thumb ~ NULL, data = TinyFingers) TinySex.model <- lm(Thumb ~ Sex, data = TinyFingers) TinyFingers <- TinyFingers %>% mutate( Sex.predicted = predict(TinySex.model), Sex.resid = Thumb - Sex.predicted, Sex.resid2 = resid(TinySex.model), Empty.pred = predict(TinyEmpty.model) )   # this creates a column for error reduced by the Sex.model TinyFingers$ErrorReduced <- TinyFingers$Sex.predicted - TinyFingers$Empty.pred # add code to sum up the squared ErrorReduced   TinyFingers$ErrorReduced <- TinyFingers$Sex.predicted - TinyFingers$Empty.pred sum(TinyFingers\$ErrorReduced^2)   ex() %>% { check_object(., "TinyFingers") %>% check_equal() check_function(., "sum") %>% check_result() %>% check_equal() } 
Look at the example code above if you need help
DataCamp: ch7-13a

 54

### Proportional Reduction in Error (PRE)

We have now quantified how much variation has been explained by our model: 54 square millimeters. Is that good? Is that a lot of explained variation? It would be easier to understand if we knew the proportion of error that has been reduced rather than the raw amount of error measured in $$mm^2$$.

If you take another look at the supernova() table (reproduced below) for the TinySex.model, you will see a column labelled PRE. PRE stands for Proportional Reduction in Error.

Analysis of Variance Table
Outcome variable: Thumb
Model: lm(formula = Thumb ~ Sex, data = TinyFingers)

SS df   MS      F    PRE  p
----- ----------------- --  - ---- ------ ------ -----
Model (error reduced) | 54  1 54.0 7.7143 0.6585 .0499
Error (from model)    | 28  4  7.0
----- ----------------- --  - ---- ------ ------ -----
Total (empty model)   | 82  5 16.4


PRE is calculated using the sums of squares. It is simply the sum of squares for Model (i.e., the sum of squares reduced by the model) divided by the sum of squares Total (or, the total sum of squares in the outcome variable under the empty model). We can represent this in a formula:

$PRE=\frac{SS_{Model}}{SS_{Total}}$

In this course, we always will calculate PRE this way, dividing SS Model by SS Total. Based on this formula, PRE can be interpreted as the proportion of total variation in the outcome variable that is explained by the explanatory variable. It tells us something about the overall strength of our statistical model. For example, in our tiny data set (TinyFingers), the effect of Sex on Thumb is fairly strong, with variation in sex accounting for .66 of the variation in thumb length.

It is important to remember that SS Model in the numerator of the formula above represents the reduction in error when going from the empty model to the more complex model, which includes an explanatory variable. To make this clearer we can re-write the above formula like this:

$PRE=\frac{SS_{Total}-SS_{Error}}{SS_{Total}}$

The numerator of this formula starts with the error from the simple (empty) model (SS Total), and then subtracts the error from the complex model (SS Error) to get the error reduced by the complex model. Dividing this reduction in error by the SS Total yields the proportion of total error in the empty model that has been reduced by the complex model.

Although in the current course we always use PRE to compare a complex model to the empty model, the comparison doesn’t need to be to the empty model. In fact, PRE can be used to compare any two models of an outcome variable as long as one is simpler than the other.

It’s good to start thinking now about how to use PRE for comparing more complicated models to each other. Toward this end, we will add one more version of the same formula that is easier to apply to all model comparisons in the future:

$PRE=\frac{SS_{Error from Simple Model}-SS_{Error from Complex Model}}{SS_{Error from Simple Model}}$

Just as a note: when, as in the current course, PRE is used to compare a complex model to the empty model, it goes by other names as well. In the ANOVA tradition (Analysis of Variance) it is referred to as $$\eta^2$$, or eta squared. In Chapter 8 we will introduce the same concept in the context of regression, where it is called $$R^2$$. For now all you need to know is: these are different terms used to refer to the same thing, in case anyone asks you.

### Fitting the Sex Model to the Full Fingers Data Set

Before moving on, use the DataCamp window below to fit the Sex model to the full Fingers data set. Save the model in an R object called Sex.model. Print out the model (this will show you the values of $$b_{0}$$ and $$b_{1}$$). Then run supernova() to produce the ANOVA table we have been discussing.

 require(tidyverse) require(mosaic) require(Lock5Data) require(supernova)   # fit the model Thumb ~ Sex with the Fingers data # save it to Sex.model Sex.model <- lm(formula = , data = ) # print out the model # print the supernova table for Sex.model   # fit the model Thumb ~ Sex with the Fingers data # save it to Sex.model Sex.model <- lm(formula = Thumb ~ Sex, data = Fingers) # print out the model Sex.model # print the supernova table for Sex.model supernova(Sex.model)   ex() %>% { check_object(., "Sex.model") %>% check_equal() check_output_expr(., "Sex.model") check_output_expr(., "supernova(Sex.model)") } 
Look at previous exercises to see how to fit Sex.model with the Fingers data
DataCamp: ch7-14

Call:
lm(formula = Thumb ~ Sex, data = Fingers)

Coefficients:
(Intercept)      Sexmale
58.256        6.447
Analysis of Variance Table
Outcome variable: Thumb
Model: lm(formula = Thumb ~ Sex, data = Fingers)

SS  df       MS      F    PRE     p
----- ----------------- --------- --- -------- ------ ------ -----
Model (error reduced) |  1334.203   1 1334.203 19.609 0.1123 .0000
Error (from model)    | 10546.008 155   68.039
----- ----------------- --------- --- -------- ------ ------ -----
Total (empty model)   | 11880.211 156   76.155