
13.11 Error and Inference from Models with Multiple Quantitative Predictors

Unpacking the ANOVA Table for FEV ~ HEIGHT + AGE

As with all statistical models, this one produces a predicted value on the outcome variable for every data point. By subtracting each predicted value from the actual value in the data we get residuals, and from there we get sums of squares, PRE, and F. Everything works the same way here as with previous models.

Add some code to the window below to generate the ANOVA table for the FEV ~ HEIGHT + AGE model.

require(coursekata)

# load the FEV data and label the columns
fevdata <- read.table('http://jse.amstat.org/datasets/fev.dat.txt')
colnames(fevdata) <- c("AGE", "FEV", "HEIGHT", "SEX", "SMOKE")

# save the multivariate model
multi_model <- lm(FEV ~ HEIGHT + AGE, data = fevdata)

# write code to produce the ANOVA table
supernova(multi_model)
Analysis of Variance Table (Type III SS)
 Model: FEV ~ HEIGHT + AGE

                               SS  df      MS        F    PRE     p
 ------ --------------- | ------- --- ------- -------- ------ -----
  Model (error reduced) | 376.245   2 188.122 1067.956 0.7664 .0000
 HEIGHT                 |  95.326   1  95.326  541.157 0.4539 .0000
    AGE                 |   6.259   1   6.259   35.532 0.0518 .0000
  Error (from model)    | 114.675 651   0.176                      
 ------ --------------- | ------- --- ------- -------- ------ -----
  Total (empty model)   | 490.920 653   0.752   

There are many things you could have observed. We notice that the PRE for the whole model is .77 (rounded), so this model explains a lot of error. We also notice that HEIGHT uniquely reduces more error than AGE. Finally, we notice huge Fs in every row (Fs larger than 4 are generally worth talking about, and these are far bigger than that): for the degrees of freedom we spent, we have reduced a lot of error.
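The PRE and F in the Model row can be reproduced directly from the sums of squares and degrees of freedom printed in the table. A quick arithmetic check (the variable names here are just for illustration):

```r
# values read off the supernova() table above
ss_model <- 376.245   # SS for the whole model (error reduced)
ss_total <- 490.920   # SS for the empty model
df_model <- 2
ss_error <- 114.675
df_error <- 651

# PRE: proportion of total error reduced by the model
pre <- ss_model / ss_total

# F: mean square for the model over mean square for error
f <- (ss_model / df_model) / (ss_error / df_error)

round(pre, 4)  # 0.7664, matching the table
round(f, 2)    # 1067.96, matching the table
```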

Comparing Models of the DGP

We’ve been able to explain a lot of the variation in the data with this model. But is this a good model of the DGP? We need to engage in some model comparison to decide which model we will select as our best model of the DGP.

Just because the p-values are below our .05 cutoff for rejecting the simpler models, however, doesn’t necessarily mean we should adopt the multivariate model as our preferred model of the DGP. In this case, it’s also smart to look at the single-predictor models for HEIGHT and AGE, especially since there is apparently a lot of overlap between these predictors.

Below we have put the ANOVA tables for three models: the multivariate model, the height model, and the age model.


Model: FEV ~ HEIGHT + AGE

                               SS  df      MS        F    PRE     p
 ------ --------------- | ------- --- ------- -------- ------ -----
  Model (error reduced) | 376.245   2 188.122 1067.956 0.7664 .0000
 HEIGHT                 |  95.326   1  95.326  541.157 0.4539 .0000
    AGE                 |   6.259   1   6.259   35.532 0.0518 .0000
  Error (from model)    | 114.675 651   0.176                      
 ------ --------------- | ------- --- ------- -------- ------ -----
  Total (empty model)   | 490.920 653   0.752   

Model: FEV ~ HEIGHT

                              SS  df      MS        F    PRE     p
 ----- --------------- | ------- --- ------- -------- ------ -----
 Model (error reduced) | 369.986   1 369.986 1994.731 0.7537 .0000
 Error (from model)    | 120.934 652   0.185                      
 ----- --------------- | ------- --- ------- -------- ------ -----
 Total (empty model)   | 490.920 653   0.752                      

Model: FEV ~ AGE

                              SS  df      MS       F    PRE     p
 ----- --------------- | ------- --- ------- ------- ------ -----
 Model (error reduced) | 280.919   1 280.919 872.184 0.5722 .0000
 Error (from model)    | 210.001 652   0.322                     
 ----- --------------- | ------- --- ------- ------- ------ -----
 Total (empty model)   | 490.920 653   0.752 
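The three tables above can be generated by fitting each model and passing it to supernova(). A sketch (assuming fevdata has been loaded as in the earlier code window); the final line uses R's anova() to compare the height model directly against the multivariate model:

```r
require(coursekata)

# load the FEV data as before
fevdata <- read.table('http://jse.amstat.org/datasets/fev.dat.txt')
colnames(fevdata) <- c("AGE", "FEV", "HEIGHT", "SEX", "SMOKE")

# fit the three models
multi_model  <- lm(FEV ~ HEIGHT + AGE, data = fevdata)
height_model <- lm(FEV ~ HEIGHT, data = fevdata)
age_model    <- lm(FEV ~ AGE, data = fevdata)

supernova(multi_model)
supernova(height_model)
supernova(age_model)

# compare the simpler height model against the multivariate model:
# this F (about 35.5) matches the AGE row of the multivariate table
anova(height_model, multi_model)
```

The nested comparison gives the same F as the AGE row because, with two predictors, the Type III sum of squares for AGE is exactly the error reduced by adding AGE to a model that already includes HEIGHT.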

Same Model, Different Names

We have now learned how to fit models with quantitative outcome variables and various types and numbers of predictor variables (categorical, quantitative, or both). As we have seen, all of these models can be understood through the common framework of the General Linear Model.

Out in the world, however, people will often use specialized terms to refer to models with different numbers and types of variables. Here is a table with some of the examples we have looked at and the special names people give to those models.

 Example                              Description                                                    Common Name
 ------------------------------------ -------------------------------------------------------------- --------------------------------------
 PriceK ~ Neighborhood                a model with a single two-group predictor variable             t-test
   (with 2 possible neighborhoods)
 PriceK ~ Neighborhood                a model with a single more-than-two-group predictor variable   one-way ANOVA (Analysis of Variance)
   (with 3+ possible neighborhoods)
 PriceK ~ HomeSizeK                   a model with a single quantitative predictor                   simple regression
 PriceK ~ Neighborhood + HomeSizeK    a model with at least one categorical and one quantitative     ANCOVA (Analysis of Covariance)
                                      predictor
 tip_percent ~ condition + gender     a model with two categorical predictors                        two-way ANOVA
 FEV ~ HEIGHT + AGE                   a model with multiple quantitative predictors                  multiple regression


It’s good for you to become familiar with some of these names. But the understanding you have is much more powerful: you see that all of these are variations of one super useful idea – the General Linear Model. These different names arose in the first place because each technique was developed historically to solve a specific problem in statistics and data analysis. Only later did people discover how they were all connected.

Although some people prefer the specialized names, even experts have a hard time keeping them all straight. There are well-known “cheatsheets” (such as one called Common Statistical Tests Are Linear Models) that help people remember what all these different models can be called. But you know the truth: they are all just variants of the General Linear Model.
