Course Outline

list High School / Statistics and Data Science II (XCD)

Book
  • College / Advanced Statistics and Data Science (ABCD)
  • College / Statistics and Data Science (ABC)
  • High School / Advanced Statistics and Data Science I (ABC)
  • High School / Statistics and Data Science I (AB)
  • High School / Statistics and Data Science II (XCD)

2.2 Visualizing Two-Variable Relationships

One variable that might explain some of the variation in sale price is Neighborhood. In the world of real estate, home prices often vary across neighborhoods, meaning that home prices in one neighborhood are higher, on average, than those in a different neighborhood. But then again, home prices vary a lot within the same neighborhood, too.

Using Color to Visualize Relationships

Unfortunately, the variable Neighborhood is not included in our previous histogram, so we can’t see how PriceK varies by Neighborhood. But we can visualize the relationship between PriceK and Neighborhood in a few ways. One way is by coloring or filling in the data in the histogram by Neighborhood, assigning College Creek one color and Old Town another.

To do this we use the fill = argument, but instead of putting in a color we put a tilde (~) and then the name of a variable: fill = ~Neighborhood.

A histogram of the distribution of PriceK in the Ames data frame. The observations have been filled in different colors depending on which Neighborhood they are in. Most of the cases for CollegeCreek appear to the right of the distribution, and they make up more area of the distribution than the OldTown observations.

Faceted Histograms

Another way to examine the effect of Neighborhood on PriceK is to split the histogram we made into two—one for College Creek and another for Old Town. Using %>%, we can chain on the command gf_facet_grid() after gf_histogram(). The following code will put the histograms of PriceK for College Creek and Old Town in a grid, one above the other.

gf_histogram(~ PriceK, data = Ames) %>% 
  gf_facet_grid(Neighborhood ~ .)

A histogram of PriceK vertically faceted by Neighborhood. The distribution for CollegeCreek appears to have more observations than the OldTown group. The distribution for CollegeCreek also has more cases that are observed along the middle and right hand side of the x-axis, whereas, all of the cases for OldTown are clumped near the left hand side of the x-axis.

Another way of thinking about Neighborhood explaining variation in PriceK is to see the distribution of PriceK as made up of two different distributions, one from College Creek and one from Old Town. Although these distributions overlap, we can see that the whole College Creek distribution is shifted higher (to the right) along the x-axis, which indicates that the average price of a home in College Creek is higher than in Old Town.

The variation we see by neighborhood can be called between-group variation. This variation among observations in the same neighborhood is an example of within-group variation.

A histogram of PriceK vertically faceted by Neighborhood, with an additional histogram in the facet labeled as All Together that combines both neighborhoods but fills them in different colors.

Just for comparison, we’ve put the original histogram (showing all the homes all together in one histogram) along with the separate histograms for each neighborhood.

Notice that each neighborhood’s histogram appears to have less variation than the “all together” histogram. It’s as if some of the variation in PriceK has been accounted for or reduced by Neighborhood. Because we can only see the within-group variation after we separate the distribution according to Neighborhood, another name for within-group variation is leftover variation.

Even though there is still a lot of variation in home prices left over after accounting forNeighborhood, it is still true that if we know a home’s neighborhood, we can be a little better at predicting its value. A little better may not be great, but it is better than nothing.

Using the Tilde (~) to Rearrange Plots

Let’s take a closer look at the R code for making these facet grids of histograms.


gf_histogram(~ PriceK, data = Ames) %>% 
  gf_facet_grid(Neighborhood ~ .)

Putting PriceK after the ~ in the gf_histogram() function above tells R to put the values of PriceK on the x-axis (think y ~ x).

gf_facet_grid() works the same way. Putting the variable Neighborhood before the ~ stacks the graphs for each neighborhood vertically, along the y-axis. In this case, it’s easier to compare the graphs when we arrange them vertically, one above the other..

Notice also that gf_facet_grid() has a . (period) after the ~ (tilde). The period is a placeholder, which you could replace with another variable. Try replacing the . with the variable Floors (whether the home has 1 or 2 floors) in the code below. This will create a faceted grid of homes in different neighborhoods split up by whether they have 1 or 2 floors.

require(coursekata) # replace the . with Floors gf_histogram(~ PriceK, data = Ames) %>% gf_facet_grid(Neighborhood ~ .) # replace the . with Floors gf_histogram(~ PriceK, data = Ames) %>% gf_facet_grid(Neighborhood ~ Floors) ex() %>% check_function(., "gf_histogram") %>% { check_arg(., "object") %>% check_equal() check_arg(., "data") %>% check_equal() }
CK Code: X2_Code_Visualizing_01

Faceted histograms of PriceK by Neighborhood and by Number of Floors. The distribution of CollegeCreek homes with 1 floor is roughly normal with a range of values from about 100 to 300. The distribution of CollegeCreek homes with 2 floors is roughly normal with a range of values from about 150 to 300, with a few outliers that appear near 400. The distribution of OldTown homes with 1 floor is slightly skewed left with a range of values from about 50 to 150. The distribution of OldTown homes with 2 floors is roughly uniform with a range of values from about 100 to 200.

Responses