7.2 Visualizing Price = Home Size + Neighborhood

Let’s explore this idea with some visualizations. We will start with a graph of the home size model, plotting PriceK by HomeSizeK, with this code: gf_point(PriceK ~ HomeSizeK, data = Smallville). We will then explore some ways we could visualize the effect of Neighborhood above and beyond that of HomeSizeK.

Using Facet Grids

Here’s a scatterplot of PriceK by HomeSizeK for the 32 homes in Smallville.

Scatterplot of PriceK predicted by HomeSizeK from the Smallville data frame.

One way to integrate Neighborhood into the same visualization is to make a grid of scatterplots, one for each neighborhood. We can do this by chaining gf_facet_grid(Neighborhood ~ .) onto the scatterplot.

Because we put Neighborhood before the tilde (Neighborhood ~ .), the two graphs will be stacked vertically (i.e., along the y-axis). To put the graphs side by side (i.e., in a grid along the x-axis), we would put the variable after the tilde: . ~ Neighborhood. This mirrors the Y ~ X form we usually follow in R, as in GLM notation: what comes before the tilde goes along the y-axis, and what comes after it goes along the x-axis.
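As a minimal sketch of the vertical arrangement, the code below chains gf_facet_grid(Neighborhood ~ .) onto the scatterplot. It assumes the coursekata package is installed and supplies both the Smallville data frame and the gf_ functions, as in the other code on this page. The name stacked_plot is used only for illustration.

```r
library(coursekata)  # assumed to supply Smallville and the gf_ plotting functions

# Scatterplot of price by home size, with one panel per neighborhood.
# Neighborhood comes before the tilde, so the panels are stacked vertically.
stacked_plot <- gf_point(PriceK ~ HomeSizeK, data = Smallville) %>%
  gf_facet_grid(Neighborhood ~ .)

stacked_plot  # printing the object draws the plot
```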

In the code block below, try putting the two scatterplots, one for each Neighborhood, side by side in a horizontal grid.

require(coursekata)

# Make a horizontal grid of scatterplots using Neighborhood
gf_point(PriceK ~ HomeSizeK, data = Smallville) %>%
  gf_facet_grid(. ~ Neighborhood)

Scatterplot of PriceK predicted by HomeSizeK and horizontally faceted by Neighborhood. The data points for both plots have an upward trend, but the points for the Downtown group are generally higher on both HomeSizeK and PriceK than the Eastside group.

Based on these plots, you can see that knowing both neighborhood and home size would improve your predictions. One way to see this is to look, within each neighborhood, at the prices of homes that are between 1000 and 1500 square feet (i.e., HomeSizeK between 1.0 and 1.5). We have colored them differently in the faceted plot below. You can see that even among homes of similar size, prices are higher in Downtown than in Eastside.

The same scatterplot of PriceK predicted by HomeSizeK and horizontally faceted by Neighborhood as above, but with the two data points for Downtown that are between 1.0 and 1.5 on HomeSizeK colored in red, and the three data points for Eastside that are between 1.0 and 1.5 colored in red as well. The two red points for Downtown are higher on PriceK than the three red points for Eastside.
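The page does not show the code behind this highlighted plot. One way to get something similar, sketched here under the assumption that coursekata supplies Smallville and the gf_ functions, is to add a TRUE/FALSE variable marking the homes in the size range and map color to it. The names homes, MidSize, and highlight_plot are made up for this sketch, and the default colors will differ from the red used in the figure.

```r
library(coursekata)  # assumed to supply Smallville and the gf_ plotting functions

homes <- Smallville  # work on a copy so the original data frame is unchanged

# MidSize (hypothetical name) is TRUE for homes with HomeSizeK from 1.0 to 1.5
homes$MidSize <- homes$HomeSizeK >= 1.0 & homes$HomeSizeK <= 1.5

# Color the points by MidSize, then split into side-by-side panels
highlight_plot <- gf_point(PriceK ~ HomeSizeK, data = homes, color = ~MidSize) %>%
  gf_facet_grid(. ~ Neighborhood)

highlight_plot
```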

Using Color

Another approach to adding neighborhood to the scatterplot of PriceK by HomeSizeK is to assign different colors to points representing homes from the different neighborhoods. You can do this by adding color = ~Neighborhood to the scatterplot. (The ~ tilde tells R that Neighborhood is a variable.) Try it in the code block below.

require(coursekata)

# Add in the color argument
gf_point(PriceK ~ HomeSizeK, data = Smallville, color = ~Neighborhood)

We used this code (also overlaying the HomeSizeK regression line on the scatterplot) to get the graph below.

HomeSizeK_model <- lm(PriceK ~ HomeSizeK, data = Smallville)
gf_point(PriceK ~ HomeSizeK, data = Smallville, color = ~Neighborhood) %>%
  gf_model(HomeSizeK_model, color = "black")

Scatterplot of PriceK predicted by HomeSizeK with the points colored by Neighborhood. Most of the Eastside points are lower on HomeSizeK and PriceK. The plot is overlaid with the regression line for the HomeSizeK model. The line has an upward trend. Most of the points for Downtown appear above the regression line.

Adding the regression line makes it easier to see the error (or residuals) left over from the HomeSizeK model. Notice that the teal dots (homes from Downtown) are mostly above the regression line (i.e., with positive residuals from the HomeSizeK model) while the purple dots (from Eastside) are mostly below the line (negative residuals).

This indicates that Downtown homes are generally more expensive than the home size model would predict, while Eastside homes are generally less expensive. This pattern is a clue that adding Neighborhood to the HomeSizeK model will explain additional variation in PriceK, above and beyond what home size alone explains.
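The same pattern can be checked numerically by averaging the residuals from a home-size-only model within each neighborhood. The sketch below uses base R and a small made-up data frame (toy_homes, which is not the Smallville data) so that it runs without any packages. The names size_model, SizeResidual, and mean_residuals are likewise invented for illustration.

```r
# Made-up illustration data (NOT the Smallville values)
toy_homes <- data.frame(
  HomeSizeK    = c(1.0, 1.5, 2.0, 2.5, 0.8, 1.2, 1.6, 2.0),
  PriceK       = c(250, 300, 350, 400, 130, 170, 210, 250),
  Neighborhood = rep(c("Downtown", "Eastside"), each = 4)
)

# Fit the home-size-only model, ignoring Neighborhood
size_model <- lm(PriceK ~ HomeSizeK, data = toy_homes)

# Residual = actual price minus the price predicted from home size
toy_homes$SizeResidual <- resid(size_model)

# Average residual within each neighborhood
mean_residuals <- tapply(toy_homes$SizeResidual, toy_homes$Neighborhood, mean)
mean_residuals
```

In this toy data the Downtown average residual is positive and the Eastside average is negative, which is the numeric counterpart of points sitting mostly above or mostly below the regression line. With coursekata loaded, the same steps could be tried on Smallville in place of toy_homes.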
