10.2 Fitting and Visualizing an Interaction Model with Two Quantitative Predictors

In the interaction model, we allow not only the intercepts of the regression lines to differ for different years built, but also the slopes. To the extent that the slopes do, in fact, differ across values of YearBuilt, the relationship between price and home size depends on when the home was built, at least in the data if not in the DGP (the data generating process). Allowing the slopes to differ costs us an additional degree of freedom, but may lead to a better-fitting model than the additive model.

The GLM notation for the interaction model with two quantitative predictors is the same as it was for the model with one categorical and one quantitative predictor. So is the R code! But because both predictor variables are quantitative, the interpretation of the model is a little different.
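
Written out with this example's variables, the interaction model might look like the following. This is a sketch that assumes the b-coefficient notation for fitted models and the predictor order used in the R formula (YearBuilt first, then HomeSizeK); the labels b_1, b_2, and b_3 are chosen here for illustration.

```latex
\[
\text{PriceK}_i = b_0 + b_1\,\text{YearBuilt}_i + b_2\,\text{HomeSizeK}_i
  + b_3\,\text{YearBuilt}_i \cdot \text{HomeSizeK}_i + e_i
\]
```

The product term with coefficient b_3 is what distinguishes this model from the additive model, and it accounts for the additional degree of freedom.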

Add code to the code window below to fit the interaction model and save it as interaction_model. Then chain gf_model() onto gf_point() to overlay the interaction model's predictions on the scatterplot.

require(coursekata)

# fit and save the interaction model
interaction_model <- lm(PriceK ~ YearBuilt*HomeSizeK, data = Ames)

# add the model to this scatterplot
gf_point(PriceK ~ HomeSizeK, data = Ames, color = ~YearBuilt) %>%
  gf_model(interaction_model)

Earlier, when we had one categorical predictor and one quantitative predictor, the gf_model() function overlaid two regression lines – one for each level of the categorical variable. When both predictors are quantitative, however, a different approach is required.

gf_model() can’t overlay a regression line for every possible value of YearBuilt because there are too many. Instead, it selects three representative values of YearBuilt and overlays a regression line for each: the mean of YearBuilt, one standard deviation above the mean, and one standard deviation below the mean.
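
To see where three such lines could come from, the sketch below computes the three representative values of YearBuilt by hand and then works out the intercept and slope of the PriceK-by-HomeSizeK line at each one. It is not the internal code of gf_model(); it assumes the sample standard deviation from sd() is the one intended, and the names year_values and lines_by_year are made up for this illustration.

```r
library(coursekata)  # provides the Ames data used in this chapter

interaction_model <- lm(PriceK ~ YearBuilt * HomeSizeK, data = Ames)

# Representative values of YearBuilt: the mean and one SD on either side
# (assumption: sample SD, as returned by sd())
year_mean <- mean(Ames$YearBuilt, na.rm = TRUE)
year_sd <- sd(Ames$YearBuilt, na.rm = TRUE)
year_values <- c(year_mean - year_sd, year_mean, year_mean + year_sd)

# Fitted coefficients; [[ ]] extracts each one as a plain number
b <- coef(interaction_model)

# At a fixed YearBuilt, the model reduces to a straight line in HomeSizeK:
#   intercept = b0 + b_YearBuilt * YearBuilt
#   slope     = b_HomeSizeK + b_interaction * YearBuilt
lines_by_year <- data.frame(
  line = c("-1 SD", "mean", "+1 SD"),
  YearBuilt = year_values,
  intercept = b[["(Intercept)"]] + b[["YearBuilt"]] * year_values,
  slope = b[["HomeSizeK"]] + b[["YearBuilt:HomeSizeK"]] * year_values
)
lines_by_year
```

Each row of lines_by_year describes one line. Comparing the slope column across rows shows how much the slope for HomeSizeK changes from one value of YearBuilt to the next.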

Scatterplot of PriceK predicted by HomeSizeK. The points are colored based on YearBuilt. Three separate regression lines are overlaid on the plot. A yellow line runs through the newer homes and appears above a green line that runs through the middle-aged homes, and that line appears above a blue-green line that runs through the older homes. The middle line is labeled as average YearBuilt, the top line is labeled as plus one standard deviation, and the bottom line is labeled as minus one standard deviation.

In the graph, the middle (greenish) line shows the model predictions for the average YearBuilt (1978), while the two flanking lines represent one standard deviation above the mean (2014, yellowish) and below the mean (1942, bluish).

Just because we graphed three lines doesn’t mean there are only three possible lines. In principle, the model implies a different line for every value of YearBuilt, so there are infinitely many. The gf_model() function just shows a few representative examples to help us see what the interaction pattern looks like.

The important thing to notice is that the line is steeper for newer homes than for older homes. One way to describe this pattern of increasing steepness is that the effect of home size on price gets larger as houses get newer. In other words, there is an interaction between HomeSizeK and YearBuilt.
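
The changing slope can be read directly off the model by regrouping its terms around HomeSizeK. In the sketch below, b_0 is the intercept, b_1 and b_2 are the coefficients for YearBuilt and HomeSizeK, and b_3 is the coefficient for their product; these labels are assumed here for illustration.

```latex
\[
\widehat{\text{PriceK}}_i
  = \underbrace{(b_0 + b_1\,\text{YearBuilt}_i)}_{\text{intercept for a given year}}
  + \underbrace{(b_2 + b_3\,\text{YearBuilt}_i)}_{\text{slope for a given year}}\,\text{HomeSizeK}_i
\]
```

For any fixed value of YearBuilt this is a straight line in HomeSizeK, and its slope changes by b_3 for each one-year increase in YearBuilt. A positive b_3 would produce the steeper lines for newer homes seen in the graph.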

Different Graphs Can Highlight Different Interpretations

You might wonder why we chose to represent HomeSizeK on the x-axis, and YearBuilt with the different lines. Actually, there is no reason you couldn’t present the same model in a different way, as in the graph below.

require(coursekata)

interaction_model <- lm(PriceK ~ YearBuilt*HomeSizeK, data = Ames)

gf_point(PriceK ~ YearBuilt, data = Ames, color = ~HomeSizeK) %>%
  gf_model(interaction_model)

Scatterplot of PriceK predicted by YearBuilt. The points are colored based on HomeSizeK. Three separate regression lines are drawn on the plot. The lines show a positive trend.

Now each line represents a particular value of HomeSizeK (the mean, +1 SD, and -1 SD). Although the model and the data are the same as in the previous graph, plotting them differently may lead us to describe the pattern of results differently.
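
One way to check that the two graphs show the same model is to ask it directly for predictions along these new lines. The sketch below uses predict() to get the predicted PriceK at the oldest and newest YearBuilt in the data, for each of three HomeSizeK values. The name grid and the choice of endpoints are assumptions made for illustration, as is the use of the sample standard deviation from sd().

```r
library(coursekata)  # provides the Ames data used in this chapter

interaction_model <- lm(PriceK ~ YearBuilt * HomeSizeK, data = Ames)

# Representative values of HomeSizeK: the mean and one SD on either side
size_mean <- mean(Ames$HomeSizeK, na.rm = TRUE)
size_sd <- sd(Ames$HomeSizeK, na.rm = TRUE)

# Two endpoints per line (oldest and newest year) for each home size
grid <- expand.grid(
  YearBuilt = range(Ames$YearBuilt, na.rm = TRUE),
  HomeSizeK = c(size_mean - size_sd, size_mean, size_mean + size_sd)
)

# Predicted price at each endpoint, from the same fitted model
grid$predicted_PriceK <- predict(interaction_model, newdata = grid)
grid
```

Each pair of rows that shares a HomeSizeK value gives the two ends of one line. The difference in predicted_PriceK within a pair, divided by the difference in YearBuilt, is that line's slope.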
