3.13 Comparing Models with F

There is another way of measuring the reduction in error. It may seem overly complicated right now, but it will come in handy as we build more complex models.

Let’s start by discussing the columns between SS and F in the ANOVA table: df and MS. Below is the ANOVA table for our regression model using home size to predict price. The column next to SS, labeled df, shows us degrees of freedom.


Analysis of Variance Table (Type III SS)
 Model: PriceK ~ HomeSizeK
 
                                 SS  df         MS       F    PRE     p
 ----- --------------- | ---------- --- ---------- ------- ------ -----
 Model (error reduced) | 374764.505   1 374764.505 264.843 0.5914 .0000
 Error (from model)    | 258952.711 183   1415.042                     
 ----- --------------- | ---------- --- ---------- ------- ------ -----
 Total (empty model)   | 633717.215 184   3444.115    
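
If you are curious where a table like this comes from, it is the kind of output you get in R by fitting the model with lm() and then handing the result to the supernova() function. Here is a minimal sketch, assuming (just for illustration) that the data live in a data frame called Ames with the variables PriceK and HomeSizeK:

 # Load the package that prints ANOVA tables in this format
 library(supernova)

 # Fit the regression model predicting price (in $1000s) from home size;
 # the data frame name "Ames" is assumed here for illustration
 home_size_model <- lm(PriceK ~ HomeSizeK, data = Ames)

 # Print the ANOVA table: SS, df, MS, F, PRE, and p
 supernova(home_size_model)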

Degrees of Freedom (\(df\))

Technically, the degrees of freedom is the number of independent pieces of information that went into calculating a parameter estimate (e.g., \(b_0\) or \(b_1\)). But we find it helpful to think about degrees of freedom (also called \(df\)) as a budget. The more data (represented by \(n\)) you have, the more degrees of freedom you have, which you can use to estimate more parameters (i.e., build more complex models).

In the Ames data, there are 185 homes. When we estimated the single parameter for the empty model (\(b_0\)), we used 1 \(df\), leaving a balance of 184 \(df\) to spend (called \(df_{total}\)). The home size model required us to estimate one additional parameter (\(b_1\)), which cost us one additional \(df\). This is why, in the ANOVA table, \(df_{model}\) is 1. After fitting the home size model, we are left with 183 \(df\) (also called \(df_{error}\)).
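
Putting numbers to this budget, using the values that appear in the df column of the ANOVA table above:

\[df_{total} = n - 1 = 185 - 1 = 184\]

\[df_{error} = df_{total} - df_{model} = 184 - 1 = 183\]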

Mean Square Error (\(MS\))

The column labeled MS stands for mean square error (also referred to as variance). MS is calculated by dividing SS by degrees of freedom. This gives us a way of looking at how much error there is per degree of freedom.


Analysis of Variance Table (Type III SS)
 Model: PriceK ~ HomeSizeK
 
                                 SS  df         MS       F    PRE     p
 ----- --------------- | ---------- --- ---------- ------- ------ -----
 Model (error reduced) | 374764.505   1 374764.505 264.843 0.5914 .0000
 Error (from model)    | 258952.711 183   1415.042                     
 ----- --------------- | ---------- --- ---------- ------- ------ -----
 Total (empty model)   | 633717.215 184   3444.115    
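
Plugging the SS and df values from the table into this calculation gives the numbers in the MS column:

\[MS_{Model} = \frac{SS_{Model}}{df_{Model}} = \frac{374764.505}{1} = 374764.505\]

\[MS_{Error} = \frac{SS_{Error}}{df_{Error}} = \frac{258952.711}{183} \approx 1415.042\]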

The F Ratio

Finally, we arrive at the F column. This column represents the F Ratio, which compares the amount of error reduced per degree of freedom spent (MS Model) to the error we could reduce per degree of freedom if we spent all our remaining degrees of freedom (MS Error) on new parameters.

\[F = \frac{M\!S_{Model}}{M\!S_{Error}} = \frac{S\!S_{Model}/d\!f_{Model}}{S\!S_{Error}/d\!f_{Error}}\]
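
With the values from the home size model’s ANOVA table, this works out to:

\[F = \frac{374764.505/1}{258952.711/183} = \frac{374764.505}{1415.042} \approx 264.843\]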

The F ratio, like PRE, gives us an indication of how much variation is explained by our model. But unlike PRE, it penalizes us for adding too many useless parameters to the model. If we just keep adding more and more explanatory variables to a model, PRE will keep going up. But F will start to come down, indicating that we don’t really have enough degrees of freedom to warrant estimating so many parameters.

An F ratio of 1 means that the amount of error reduced by your model per degree of freedom spent is about the same as the amount of error left unexplained per remaining degree of freedom. If F is 1, the amount of variation explained by your model could just be due to random chance.

As F gets higher, it indicates that the variation explained by your model is less likely due to luck, and more likely due to an actual effect of your explanatory variable on the outcome. It’s a harder statistic to understand, but one you will find more and more useful as you gain experience.

Comparing Models and Causal Language

The Fs and PREs for these models are quite high. They are so high, in fact, that you may be tempted to conclude that neighborhoods and home sizes cause prices to rise.

But explaining variation is not the same as explaining why things happen. Just because a model explains a lot of variation, it does not imply that the explanatory variable caused a change in the outcome variable.

For example, if we lifted a home from Old Town and plopped it down in College Creek, the home’s value may not increase. In fact, it might decrease, since the home’s historic style might stick out like a sore thumb in the swankier college neighborhood.

By the same token, if we reduce the size of a home’s den to make heating more efficient (thereby reducing the home’s square footage), the home’s value may not decrease. In fact, it might increase, since new home buyers would value the savings that come from a more energy-efficient home.

Our models can tell us whether there are predictable patterns in our data. Neighborhood and square footage are associated with home prices, but this does not necessarily mean that they cause home prices to be higher or lower.

Confounding variables (e.g., neighborhood style and home energy efficiency) prevent us from making true causal claims. We would need other tools (such as the ability to control for other factors or the opportunity to conduct a random assignment experiment) in order to make causal claims.

Having large PREs and Fs means that the model explains a lot of variation, but that doesn’t prove causation. Still, explaining variation is a worthy pursuit: it can help us make better predictions and can provide evidence of a causal connection, even if it doesn’t prove one.

Modeling the DGP

Periodically, we try to remind you that the models we fit are models of data. If the data are biased, or have a lot of random error (and many data sets do), our models could be far off from the truth about the Data Generating Process.

If our models are off, then the predictions we make and the conclusions we draw would also be off. Unfortunately, we have no way of knowing for sure what the true DGP is. Even though our data might be off, they are all we have to go on! But we can always hope for more and better data in the future.

To keep this hope alive, and to help us figure out by how much our models might be off, we use different notation to distinguish the models we estimate from data from the models we hope to learn about in the DGP.

In this chapter we have learned to specify group models and regression models using the notation of the General Linear Model:

\[Y_i=b_0+b_1X_i+e_i\]

The terms in this model actually add up in our data: the price of a particular home (\(Y_i\)) can be partitioned exactly into the model prediction (\(b_0+b_1X_i\)) and the error from that prediction (\(e_i\)).
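
For example (with made-up numbers, purely to illustrate the arithmetic): suppose a model had \(b_0=10\) and \(b_1=100\), and a home with \(X_i=2\) (thousand square feet) actually sold for 250 (thousand dollars). Then:

\[Y_i = b_0 + b_1X_i + e_i \quad\rightarrow\quad 250 = (10 + 100 \times 2) + 40\]

The model prediction is 210, so the error from that prediction is \(e_i = 250 - 210 = 40\).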

To represent the same model in the DGP we will use Greek letters:

\[Y_i=\beta_0+\beta_1X_i+\epsilon_i\]

Although theoretically the terms in this model add up, we can’t actually do the math (like we did when modeling data) because we don’t know the true values of \(\beta_0\) (pronounced “beta sub 0”) and \(\beta_1\) in the DGP.

And because we don’t know the true values of the \(\beta\)s, we can’t generate a prediction and therefore can’t know how far off the prediction is. This is why we also use a Greek letter (\(\epsilon\), pronounced “epsilon”) to represent error.

Although we will never know the true \(\beta\)s (we call these “parameters”), we will continue to estimate them by finding the \(b\)s (parameter estimates) that best fit whatever data we have. We should always keep in mind, though, that these are merely estimates.
