Statistics and Data Science: A Modeling Approach

13.4 Inference for Targeted Model Comparisons

By using targeted model comparisons, we can compare a complex model (with two predictors) to a simpler one with just a single predictor. This allows us to see how one variable (e.g., HomeSizeK) in the multivariate model uniquely improves the fit of the model to the data, even after controlling for the effect of other predictors (e.g., Neighborhood).

But the fact that adding HomeSizeK reduces error in our data, compared with a model that doesn’t include it, does not by itself show that the model with HomeSizeK is a better model of the DGP. For that, we need to rule out the possibility that the simple model of the DGP could have produced an F (or PRE) for the HomeSizeK effect as large as the one we observed in the data.

For the HomeSizeK effect, we are comparing these two models of the DGP (expressed in both R code and GLM notation):

Model   | R Code                            | GLM Notation
Complex | PriceK ~ Neighborhood + HomeSizeK | \(PriceK_i= \beta_0 + \beta_1NeighborhoodEastside_{i} + \beta_2HomeSizeK_{i} + \epsilon_i\)
Simple  | PriceK ~ Neighborhood             | \(PriceK_i= \beta_0 + \beta_1NeighborhoodEastside_{i} + \colorbox{yellow}{(0)}HomeSizeK_{i} + \epsilon_i\)

We have highlighted a different way of describing the simple Neighborhood model. It is a model where the additional effect of HomeSizeK is 0. Could this simpler DGP produce an F as large as the one we observed in our data?

F and p-value in the ANOVA Table

The answer to this question is summarized by the p-values in the ANOVA table below. The supernova() function uses a mathematical model of the F distribution, assuming that the simpler of the two models being compared is a true model of the DGP. It then calculates how likely an F at least as large as the observed one would be in a world in which the simpler model is true and any apparent effect of the additional predictor is due only to randomness.
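To make the idea of "a world in which the simpler model is true" concrete, here is a minimal simulation sketch in base R. It does not use the housing data from this page: the data are synthetic, and the neighborhood labels, sample size split, coefficients, and error standard deviation are all made-up assumptions chosen only for illustration. In each simulated sample the true coefficient on home size is 0, yet the F for home size still varies from sample to sample.

```r
# Simulate the F for home size when the simple (Neighborhood-only) model
# is the true DGP. Everything here is synthetic and hypothetical.
set.seed(10)
n <- 32  # chosen to match the table's total df of 31
neighborhood <- factor(rep(c("Other", "Eastside"), each = n / 2))
home_size <- runif(n, min = 0.8, max = 2.5)  # made-up sizes

simulate_f <- function() {
  # Price depends on neighborhood only: the home_size coefficient is 0
  price <- 300 - 60 * (neighborhood == "Eastside") + rnorm(n, sd = 60)
  simple_model <- lm(price ~ neighborhood)
  complex_model <- lm(price ~ neighborhood + home_size)
  # Row 2 of the comparison table holds the F for the added predictor
  anova(simple_model, complex_model)$F[2]
}

sim_f <- replicate(1000, simulate_f())

# Cutoff that the F distribution with 1 and n - 3 df places at the top 5%
critical_f <- qf(0.95, df1 = 1, df2 = n - 3)

# Share of simulated Fs beyond that cutoff; should land near 0.05
mean(sim_f > critical_f)
```

If the mathematical F distribution is a good description of this simulated world, roughly 5% of the simulated Fs should exceed the cutoff, which is the sense in which the p-value reports how unusual an observed F would be under the simple model.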

Analysis of Variance Table (Type III SS)
 Model: PriceK ~ Neighborhood + HomeSizeK

                                        SS df        MS      F    PRE     p
 ------------ --------------- | ---------- -- --------- ------ ------ -----
        Model (error reduced) | 124402.900  2 62201.450 17.216 0.5428 .0000
 Neighborhood                 |  27758.138  1 27758.138  7.683 0.2094 .0096
    HomeSizeK                 |  42003.739  1 42003.739 11.626 0.2862 .0019
        Error (from model)    | 104774.201 29  3612.903                    
 ------------ --------------- | ---------- -- --------- ------ ------ -----
        Total (empty model)   | 229177.101 31  7392.810 
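The F and PRE columns can be recomputed from the SS and df columns. The sketch below uses only numbers printed in the table above; the variable names are ours, and the PRE formula for a single predictor (that row's SS divided by the sum of that SS and SS Error) is an assumption that happens to reproduce the printed values, not a quotation of how supernova() computes them internally.

```r
# Values copied from the ANOVA table
ss_model <- 124402.900
ss_home  <- 42003.739
ss_error <- 104774.201
ss_total <- 229177.101
df_error <- 29

# Mean square error: the denominator of every F in the table
ms_error <- ss_error / df_error

# F = MS for the row divided by MS Error
f_model <- (ss_model / 2) / ms_error
f_home  <- (ss_home / 1) / ms_error

# PRE = proportion of the comparison model's error that is removed
pre_model <- ss_model / ss_total
pre_home  <- ss_home / (ss_home + ss_error)

round(c(F_model = f_model, F_home = f_home), 3)
round(c(PRE_model = pre_model, PRE_home = pre_home), 4)
```

Note that the SS for Neighborhood and HomeSizeK (27758.138 + 42003.739) add up to less than SS Model (124402.900). Each targeted row reflects only the error uniquely reduced by that predictor, so the rows need not sum to the overall model SS.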

The p-value on the Model row (.0000) means that there is less than a .0001 chance that an F as large as the overall F (17.216) could be generated by the simple model (which, for this row, is the empty model) if it were the true model of the DGP. This small p-value indicates that we should reject the empty model.

The p-value for HomeSizeK (0.0019) is also very small, so here too we should reject the simpler model.

This p-value means that the probability of getting an F as large as 11.626 for HomeSizeK in the multivariate model, if HomeSizeK adds no predictive value in the DGP, is very low (0.0019). Based on this, we would reject the simple model that only includes Neighborhood, and go with the complex model that includes both Neighborhood and HomeSizeK.
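The p-values in the table can be checked against the F distribution directly with base R's pf() function. This is a small sketch using the F values and degrees of freedom printed in the ANOVA table; it is meant as a cross-check, not as a description of the internals of supernova(). The variable names are ours.

```r
df_error <- 29  # df on the Error row of the table

# Upper-tail probability: the chance of an F at least this large
# if the simpler model in each comparison were true
p_model <- pf(17.216, df1 = 2, df2 = df_error, lower.tail = FALSE)
p_neigh <- pf(7.683,  df1 = 1, df2 = df_error, lower.tail = FALSE)
p_home  <- pf(11.626, df1 = 1, df2 = df_error, lower.tail = FALSE)

round(c(Model = p_model, Neighborhood = p_neigh, HomeSizeK = p_home), 4)
```

The rounded results should agree with the p column of the table (.0000, .0096, and .0019), up to rounding of the F values used as input.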
