8.4 Inference for Targeted Model Comparisons
By using targeted model comparisons, we can compare a complex model (with two predictors) to a simpler one with just a single predictor. This allows us to see how one variable (e.g., HomeSizeK) in the multivariate model uniquely improves the fit of the model to the data, even after controlling for the effect of other predictors (e.g., Neighborhood).

But the fact that HomeSizeK reduces error in our data, compared with a model that doesn't include it, does not show that it is a better model of the DGP. For that, we need to rule out the possibility that the simple model of the DGP could have produced an F (or PRE) for the HomeSizeK effect as large as the one we observed in the data.
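As a reminder, the data-level comparison described above can be carried out in R along these lines. This is only a minimal sketch: the data frame name (housing_data) is a placeholder, not the course's actual data set.

```r
# Sketch: fit the complex and simple models of the data
# (housing_data is a hypothetical data frame with PriceK, Neighborhood, and HomeSizeK)
complex_model <- lm(PriceK ~ Neighborhood + HomeSizeK, data = housing_data)
simple_model  <- lm(PriceK ~ Neighborhood, data = housing_data)

# Error left over from each model, and the proportional reduction in error (PRE)
# achieved by adding HomeSizeK to the simple model
ss_complex <- sum(resid(complex_model)^2)
ss_simple  <- sum(resid(simple_model)^2)
(ss_simple - ss_complex) / ss_simple
```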
For the HomeSizeK effect, we are comparing these two models of the DGP (expressed in both R code and GLM notation):
| Model | R Code | GLM Notation |
|---|---|---|
| Complex | PriceK ~ Neighborhood + HomeSizeK | \(PriceK_i = \beta_0 + \beta_1NeighborhoodEastside_{i} + \beta_2HomeSizeK_{i} + \epsilon_i\) |
| Simple | PriceK ~ Neighborhood | \(PriceK_i = \beta_0 + \beta_1NeighborhoodEastside_{i} + \colorbox{yellow}{(0)}HomeSizeK_{i} + \epsilon_i\) |
We have highlighted a different way of describing the simple Neighborhood model: it is a model where the additional effect of HomeSizeK is 0. Could this simpler DGP produce an F as large as the one we observed in our data?
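One way to make this question concrete is with a simulation sketch: break any real connection between HomeSizeK and PriceK by shuffling HomeSizeK, refit the complex model, and see what Fs turn up. This is only an illustration of the logic, again using the hypothetical housing_data frame; it is not part of the course code. (Because HomeSizeK is the last predictor entered, the sequential F from anova() matches the targeted F for this effect.)

```r
# Sketch: simulate a world in which HomeSizeK adds nothing to the DGP
# (housing_data is the same hypothetical data frame as above)
observed_f <- anova(lm(PriceK ~ Neighborhood + HomeSizeK,
                       data = housing_data))["HomeSizeK", "F value"]

set.seed(42)
shuffled_fs <- replicate(1000, {
  shuffled <- housing_data
  shuffled$HomeSizeK <- sample(shuffled$HomeSizeK)   # break any real relationship
  fit <- lm(PriceK ~ Neighborhood + HomeSizeK, data = shuffled)
  anova(fit)["HomeSizeK", "F value"]                 # F for HomeSizeK each time
})

# Proportion of shuffled Fs at least as large as the observed F
mean(shuffled_fs >= observed_f)
```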
F and p-value in the ANOVA Table
The answer to this question is summarized by the p-values in the ANOVA table below. The supernova() function uses a mathematical model of the F distribution, assuming that the simpler of the two models being compared is the true model of the DGP. It then asks how likely an F as large as the one observed would be in a world in which the simpler model is true and any apparent effect of the additional predictor is due only to randomness.
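The table below is the kind of output supernova() produces for the complex model. A minimal sketch of the call, again using the hypothetical housing_data name, would look like this:

```r
# Sketch: generate the ANOVA table below (supernova reports Type III SS)
library(supernova)
complex_model <- lm(PriceK ~ Neighborhood + HomeSizeK, data = housing_data)
supernova(complex_model)
```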
```
Analysis of Variance Table (Type III SS)
Model: PriceK ~ Neighborhood + HomeSizeK

                                SS df        MS      F    PRE     p
----- --------------- | ---------- -- --------- ------ ------ -----
Model (error reduced) | 124402.900  2 62201.450 17.216 0.5428 .0000
         Neighborhood |  27758.138  1 27758.138  7.683 0.2094 .0096
            HomeSizeK |  42003.739  1 42003.739 11.626 0.2862 .0019
Error (from model)    | 104774.201 29  3612.903
----- --------------- | ---------- -- --------- ------ ------ -----
Total (empty model)   | 229177.101 31  7392.810
```
The p-value on the Model row (.0000) means that there is less than a .0001 chance that an F as large as the overall F (17.216) could have been generated by the simple model (which, for this row, is the empty model). This small p-value indicates that we should reject the simple model.
The p-value for HomeSizeK (.0019) is also very small, so we should reject the simpler model here as well. This p-value means that if HomeSizeK adds no predictive value in the DGP, the probability of getting an F of 11.626 for HomeSizeK in the multivariate model is very low (.0019). Based on this, we would reject the simple model that only includes Neighborhood, and go with the complex model that includes both Neighborhood and HomeSizeK.
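If you want to see where these tail probabilities come from, they can be checked directly against the F distribution in R with pf(), using the degrees of freedom from the table (1 for each predictor row, 2 for the Model row, and 29 for error). This is a sketch of the same calculation supernova() performs, not part of the course code:

```r
# Tail probabilities of the F distribution, matching the p column of the table
pf(11.626, df1 = 1, df2 = 29, lower.tail = FALSE)  # HomeSizeK row: about .0019
pf(7.683,  df1 = 1, df2 = 29, lower.tail = FALSE)  # Neighborhood row: about .0096
pf(17.216, df1 = 2, df2 = 29, lower.tail = FALSE)  # Model row: far below .0001
```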