10.2 Fitting and Visualizing an Interaction Model with Two Quantitative Predictors
In the interaction model, we allow not only the intercepts of the regression lines but also their slopes to differ for different years built. To the extent that the slopes do, in fact, differ for different values of YearBuilt, the relationship between price and home size depends on when the home was built, at least in the data if not in the DGP. Allowing the slopes to differ costs us an additional degree of freedom, but may lead to a better-fitting model than the additive model.
The GLM notation for the interaction model with two quantitative predictors is the same as it was for the model with one categorical and one quantitative predictor. So is the R code! But because both predictor variables are quantitative, the interpretation of the model is a little different.
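For reference, here is a sketch of that notation. The labeling is an assumption made for illustration: $X_1$ stands for YearBuilt, $X_2$ for HomeSizeK, and $Y$ for PriceK. The second line simply groups the terms of the first.

```latex
Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{1i} X_{2i} + e_i

Y_i = (b_0 + b_1 X_{1i}) + (b_2 + b_3 X_{1i}) X_{2i} + e_i
```

Read this way, both the intercept $(b_0 + b_1 X_{1i})$ and the slope on home size $(b_2 + b_3 X_{1i})$ change with the year built. Because year built is quantitative, there is a different line for every possible year rather than one line per group.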
Add some code to the code window below to fit the interaction model and save it as interaction_model. Also, add gf_model() to the gf_point() call to visualize the predictions of the interaction model on the scatterplot.
require(coursekata)

# fit and save the interaction model
interaction_model <- lm(PriceK ~ YearBuilt*HomeSizeK, data = Ames)

# add the model to this scatterplot
gf_point(PriceK ~ HomeSizeK, data = Ames, color = ~YearBuilt) %>%
  gf_model(interaction_model)
Earlier, when we had one categorical predictor and one quantitative predictor, the gf_model() function overlaid two regression lines, one for each level of the categorical variable. When both predictors are quantitative, however, a different approach is required.
gf_model() can’t overlay a regression line for each possible value of YearBuilt because there are too many possible values! Instead, it selects three representative values of YearBuilt and overlays the regression lines for those values. The values it chooses are the mean of YearBuilt, one standard deviation above the mean, and one standard deviation below the mean.
In the graph, the middle (greenish) line shows the model predictions for the average YearBuilt (1978), while the two flanking lines represent one standard deviation above the mean (2014, yellowish) and below the mean (1942, bluish).
Just because we graphed three lines doesn’t mean there are only three possible lines. Theoretically there could be an infinite number of lines. The gf_model() function just shows a few representative examples to help us see what the interaction pattern looks like.
The important thing to notice is that the slope of the line is steeper for newer homes than for older homes. One way to describe this pattern of increasing steepness is that the effect of home size on price gets larger as houses get newer. In other words, there is an interaction between HomeSizeK and YearBuilt.
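To see this pattern in numbers rather than in the graph, the sketch below computes the intercept and slope of the line for each of the three representative years. It follows the description above (mean and plus or minus one standard deviation) rather than the internals of gf_model(), and the variable names (years, intercepts, slopes) are made up for illustration. It assumes the coursekata package and its Ames data are available, as in the code window above.

```r
library(coursekata)  # assumed available; provides the Ames data used on this page

interaction_model <- lm(PriceK ~ YearBuilt * HomeSizeK, data = Ames)
b <- coef(interaction_model)

# Three representative years: mean - 1 SD, mean, mean + 1 SD
year_mean <- mean(Ames$YearBuilt, na.rm = TRUE)
year_sd <- sd(Ames$YearBuilt, na.rm = TRUE)
years <- year_mean + c(-1, 0, 1) * year_sd

# For a fixed year, the model is a straight line in HomeSizeK:
# its intercept and its slope both depend on the year
intercepts <- b[["(Intercept)"]] + b[["YearBuilt"]] * years
slopes <- b[["HomeSizeK"]] + b[["YearBuilt:HomeSizeK"]] * years

data.frame(YearBuilt = years, intercept = intercepts, slope = slopes)
```

If the interaction estimate (the YearBuilt:HomeSizeK coefficient) is positive, the slope column increases from the oldest to the newest year, which is the increasing steepness described above.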
Different Graphs Can Highlight Different Interpretations
You might wonder why we chose to represent HomeSizeK on the x-axis, and YearBuilt with the different lines. Actually, there is no reason you couldn’t present the same model in a different way, as in the graph below.
interaction_model <- lm(PriceK ~ YearBuilt*HomeSizeK, data = Ames)
gf_point(PriceK ~ YearBuilt, data = Ames, color = ~HomeSizeK) %>%
gf_model(interaction_model)
Now each line represents a particular value of HomeSizeK (the mean, +1 SD, and -1 SD). Although the model and the data are the same as in the previous graph, plotting them in a different way may lead to a different way of describing the pattern of results.
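One way to convince yourself that the two graphs show the same model is to compute a single prediction both ways. The sketch below is only an illustration: the home (built in 2000, with HomeSizeK of 1.5) is hypothetical, and the names view_by_size and view_by_year are made up. It assumes the coursekata package and its Ames data are available.

```r
library(coursekata)  # assumed available; provides the Ames data used on this page

interaction_model <- lm(PriceK ~ YearBuilt * HomeSizeK, data = Ames)
b <- coef(interaction_model)

# A hypothetical home, chosen only for illustration
year <- 2000
size <- 1.5

# View 1: a line in HomeSizeK for homes built in `year`
view_by_size <- (b[["(Intercept)"]] + b[["YearBuilt"]] * year) +
  (b[["HomeSizeK"]] + b[["YearBuilt:HomeSizeK"]] * year) * size

# View 2: a line in YearBuilt for homes of size `size`
view_by_year <- (b[["(Intercept)"]] + b[["HomeSizeK"]] * size) +
  (b[["YearBuilt"]] + b[["YearBuilt:HomeSizeK"]] * size) * year

# Both views regroup the same four terms, so they should agree up to rounding
all.equal(view_by_size, view_by_year)
```

The same four estimates produce both families of lines. What changes between the graphs is only which predictor is treated as the x-axis and which one is held at a few fixed values.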