Statistics and Data Science: A Modeling Approach

13.6 Using shuffle() for Targeted Model Comparisons (Part 2)

Having saved the residuals from the Neighborhood model, let’s now see how we can use them, along with the shuffle() function, to create a sampling distribution for the unique effect of HomeSizeK.

Step Two: Create the Sampling Distribution of F for Home Size

A sampling distribution of Fs gives us a way to calculate how likely it would be for the simple model of the DGP (i.e., the one with no unique effect of HomeSizeK) to generate an F for HomeSizeK as large as or larger than the one found in the data (11.626).

Before we create the sampling distribution of F for the HomeSizeK effect, we will show you how to get the sample F for HomeSizeK. Our previous method, using the f() function, won’t work; it only gives us the overall F for the full model. To get the F for HomeSizeK you can run this code:

f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)

The first part of this code (the model formula and the data argument) creates a supernova table for the multivariate model using PriceK_N_resids as the outcome. The predictor = ~HomeSizeK argument then reads the sample F for HomeSizeK out of the table (without ever printing the table). We’ve put this code in the window below, so you have this F available.
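To make the idea of "the F for HomeSizeK" concrete, here is a minimal base R sketch of the same kind of quantity: the F for adding HomeSizeK to a model that already contains Neighborhood. It is not how f() is implemented. The data frame toy is simulated and hypothetical (it is not the Smallville data), and the names toy, reduced, full, and toy_f are assumptions made for illustration.

```r
# Simulated stand-in data (hypothetical, NOT the Smallville data)
set.seed(1)
n <- 32
toy <- data.frame(
  Neighborhood = factor(rep(c("A", "B"), each = n / 2)),
  HomeSizeK = runif(n, 0.8, 3)
)
toy$PriceK <- 150 + 60 * (toy$Neighborhood == "B") + 40 * toy$HomeSizeK +
  rnorm(n, sd = 30)

# Save the residuals from the neighborhood-only model
toy$PriceK_N_resids <- resid(lm(PriceK ~ Neighborhood, data = toy))

# Compare the model without HomeSizeK to the model with it
reduced <- lm(PriceK_N_resids ~ Neighborhood, data = toy)
full <- lm(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = toy)
comparison <- anova(reduced, full)

# The F in the second row is the F for HomeSizeK over and above Neighborhood
toy_f <- comparison$F[2]
toy_f
```

Because HomeSizeK is the last term in the formula, the same F also appears on the HomeSizeK row of anova(full).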

In the window below, modify the code where indicated, using the shuffle() function, to produce a single F for HomeSizeK that assumes a DGP with 0 effect of home size. Run the code a few times just to see what it does.

require(coursekata)

# delete when coursekata-r updated
Smallville <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTUey0jLO87REoQRRGJeG43iN1lkds_lmcnke1fuvS7BTb62jLucJ4WeIt7RW4mfRpk8n5iYvNmgf5l/pub?gid=1024959265&single=true&output=csv")
Smallville$Neighborhood <- factor(Smallville$Neighborhood)
Smallville$HasFireplace <- factor(Smallville$HasFireplace)

# don't delete this part
# code to fit neighborhood model and save residuals
Neighborhood_model <- lm(PriceK ~ Neighborhood, data = Smallville)
Smallville$PriceK_N_resids <- resid(Neighborhood_model)

# this code prints sample F for HomeSizeK
f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)

# code modified to produce the F when residuals are shuffled
f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)

Now let’s add some code to create a sampling distribution of 1000 Fs for HomeSizeK assuming no effect of home size in the DGP. Save these Fs into a data frame called HomeSizeK_sdof.

require(coursekata)

# delete when coursekata-r updated
Smallville <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTUey0jLO87REoQRRGJeG43iN1lkds_lmcnke1fuvS7BTb62jLucJ4WeIt7RW4mfRpk8n5iYvNmgf5l/pub?gid=1024959265&single=true&output=csv")
Smallville$Neighborhood <- factor(Smallville$Neighborhood)
Smallville$HasFireplace <- factor(Smallville$HasFireplace)

# don't delete
# code to fit neighborhood model and save residuals
Neighborhood_model <- lm(PriceK ~ Neighborhood, data = Smallville)
Smallville$PriceK_N_resids <- resid(Neighborhood_model)

# This code generates a sampling distribution of 1000 shuffled HomeSizeK Fs
# and saves them in a data frame called HomeSizeK_sdof
HomeSizeK_sdof <- do(1000) * f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)

# This code will put these Fs into a histogram
gf_histogram(~ f, data = HomeSizeK_sdof) %>%
  gf_labs(title = "shuffled HomeSizeK Fs")
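For readers curious about what the do(1000) * f(shuffle(...)) pattern is doing, the following base R sketch mimics it with replicate() and sample(). It is an illustration under stated assumptions, not the coursekata implementation: the data frame toy is simulated (not the Smallville data), and partial_f() and toy_sdof are hypothetical names introduced here.

```r
# Simulated stand-in data (hypothetical, NOT the Smallville data)
set.seed(2)
n <- 32
toy <- data.frame(
  Neighborhood = factor(rep(c("A", "B"), each = n / 2)),
  HomeSizeK = runif(n, 0.8, 3)
)
toy$PriceK <- 150 + 60 * (toy$Neighborhood == "B") + 40 * toy$HomeSizeK +
  rnorm(n, sd = 30)
toy$PriceK_N_resids <- resid(lm(PriceK ~ Neighborhood, data = toy))

# Hypothetical helper: F for HomeSizeK over and above Neighborhood,
# computed for whatever outcome vector is passed in
partial_f <- function(outcome) {
  reduced <- lm(outcome ~ Neighborhood, data = toy)
  full <- lm(outcome ~ Neighborhood + HomeSizeK, data = toy)
  anova(reduced, full)$F[2]
}

# sample() with no other arguments randomly reorders the residuals,
# which breaks any link between them and HomeSizeK
shuffled_fs <- replicate(1000, partial_f(sample(toy$PriceK_N_resids)))

# Store the 1000 shuffled Fs in a data frame with a column named f
toy_sdof <- data.frame(f = shuffled_fs)
summary(toy_sdof$f)
```

Each shuffle keeps the same set of residual values but assigns them to homes at random, so the resulting Fs show what a DGP with no HomeSizeK effect tends to produce.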

Below we have graphed the sampling distribution of 1000 shuffled Fs for the HomeSizeK effect. We have also added to the plot, as a black dot, the sample F from the HomeSizeK row of the ANOVA table (11.63). We’ll save this value as HomeSizeK_f. As you can see, the sample F is far out in the tail of the sampling distribution.

To calculate the p-value for the HomeSizeK F from this sampling distribution, we can use tally().

Try adding the tally() code to the code block below. Also generate an ANOVA table to check whether the p-value obtained from tally() is similar to the p-value for HomeSizeK in the ANOVA table.

require(coursekata)

# delete when coursekata-r updated
Smallville <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTUey0jLO87REoQRRGJeG43iN1lkds_lmcnke1fuvS7BTb62jLucJ4WeIt7RW4mfRpk8n5iYvNmgf5l/pub?gid=1024959265&single=true&output=csv")
Smallville$Neighborhood <- factor(Smallville$Neighborhood)
Smallville$HasFireplace <- factor(Smallville$HasFireplace)

# don't delete
# code to fit neighborhood model and save residuals
Neighborhood_model <- lm(PriceK ~ Neighborhood, data = Smallville)
Smallville$PriceK_N_resids <- resid(Neighborhood_model)

# This saves the sample HomeSizeK F
HomeSizeK_f <- f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)

# This code generates a sampling distribution of shuffled HomeSizeK Fs
HomeSizeK_sdof <- do(1000) * f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)

# This code tallies the p-value for HomeSizeK
tally(~ f > HomeSizeK_f, data = HomeSizeK_sdof, format = "proportion")

# This code generates an ANOVA table from the multivariate model
supernova(lm(PriceK ~ Neighborhood + HomeSizeK, data = Smallville))

The p-value we got from tally() is close to the p-value reported on the HomeSizeK row of the multivariate ANOVA table: 0.0019.
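The comparison between a shuffle-based p-value and an ANOVA-table p-value can also be sketched in base R. This is only an illustration on simulated, hypothetical data (not the Smallville data), so its numbers will not match the 0.0019 reported above. The names toy, partial_f(), shuffle_p, and param_p are assumptions introduced for this example.

```r
# Simulated stand-in data (hypothetical, NOT the Smallville data)
set.seed(3)
n <- 32
toy <- data.frame(
  Neighborhood = factor(rep(c("A", "B"), each = n / 2)),
  HomeSizeK = runif(n, 0.8, 3)
)
toy$PriceK <- 150 + 60 * (toy$Neighborhood == "B") + 40 * toy$HomeSizeK +
  rnorm(n, sd = 30)
toy$resids <- resid(lm(PriceK ~ Neighborhood, data = toy))

# Hypothetical helper: F for HomeSizeK over and above Neighborhood
partial_f <- function(outcome) {
  reduced <- lm(outcome ~ Neighborhood, data = toy)
  full <- lm(outcome ~ Neighborhood + HomeSizeK, data = toy)
  anova(reduced, full)$F[2]
}

# Sample F and 1000 shuffled Fs
sample_f <- partial_f(toy$resids)
shuffled_fs <- replicate(1000, partial_f(sample(toy$resids)))

# Shuffle-based p-value: proportion of shuffled Fs larger than the sample F
shuffle_p <- mean(shuffled_fs > sample_f)

# p-value from the F distribution, read off the HomeSizeK row
# of the ANOVA table for the model of PriceK
anova_table <- anova(lm(PriceK ~ Neighborhood + HomeSizeK, data = toy))
param_p <- anova_table["HomeSizeK", "Pr(>F)"]

c(shuffle = shuffle_p, anova = param_p)
```

The two values usually land close together but are not identical: the shuffle-based proportion depends on the particular 1000 shuffles drawn, while the ANOVA-table value comes from the mathematical F distribution.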