8.6 Using shuffle() for Targeted Model Comparisons (Part 2)
Having saved the residuals from the Neighborhood model, let’s now see how we can use them, along with the shuffle() function, to create a sampling distribution for the unique effect of HomeSizeK.
Step Two: Create the Sampling Distribution of F for Home Size
A sampling distribution of Fs gives us a way to calculate how likely it would be for the simple model of the DGP (i.e., the one with no unique effect of HomeSizeK) to generate an F for HomeSizeK as large as or larger than the one found in the data (11.626).
Before we create the sampling distribution of F for the HomeSizeK effect, we will show you how to get the sample F for HomeSizeK. Our previous method, calling the f() function with just the model, won’t work; it only gives us the overall F for the full model. To get the F for HomeSizeK, you can run this code:
f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
The first part of this code creates a supernova table for the multivariate model using PriceK_N_resids as the outcome. The predictor = ~HomeSizeK argument then reads the sample F for HomeSizeK out of the table (without ever printing the table). We’ve put this code in the window below, so you have this F available.
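To make the idea of an F for a single predictor concrete, here is a minimal base-R sketch. It is not the coursekata implementation of f(); it assumes simulated toy data in place of Smallville, and the names toy and home_size_f are hypothetical. It gets the F for HomeSizeK by comparing the model without HomeSizeK to the model with it, which should correspond to the HomeSizeK row of an ANOVA table for a model like this one.

```r
# Simulated toy data (an assumption for illustration; not the Smallville data)
set.seed(10)
n <- 32
toy <- data.frame(
  Neighborhood = factor(rep(c("A", "B"), each = n / 2)),
  HomeSizeK = runif(n, min = 1, max = 3)
)
toy$PriceK <- 200 + 60 * (toy$Neighborhood == "A") +
  40 * toy$HomeSizeK + rnorm(n, sd = 30)

# Save the residuals from the neighborhood-only model
toy$PriceK_N_resids <- resid(lm(PriceK ~ Neighborhood, data = toy))

# Hypothetical helper: F for HomeSizeK from comparing two nested models
home_size_f <- function(outcome, data) {
  simple  <- lm(outcome ~ Neighborhood, data = data)
  complex <- lm(outcome ~ Neighborhood + HomeSizeK, data = data)
  # Row 2 of the comparison table holds the F for the added predictor
  anova(simple, complex)$F[2]
}

sample_f <- home_size_f(toy$PriceK_N_resids, toy)
sample_f
```

Because the comparison adds only HomeSizeK, the resulting F reflects what home size explains over and above neighborhood.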
In the window below, modify the code where indicated, using the shuffle() function, to produce a single F for HomeSizeK that assumes a DGP with 0 effect of home size. Run the code a few times just to see what it does.
require(coursekata)
# delete when coursekata-r updated
Smallville <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTUey0jLO87REoQRRGJeG43iN1lkds_lmcnke1fuvS7BTb62jLucJ4WeIt7RW4mfRpk8n5iYvNmgf5l/pub?gid=1024959265&single=true&output=csv")
Smallville$Neighborhood <- factor(Smallville$Neighborhood)
Smallville$HasFireplace <- factor(Smallville$HasFireplace)
# don't delete this part
# code to fit neighborhood model and save residuals
Neighborhood_model <- lm(PriceK~ Neighborhood, data = Smallville)
Smallville$PriceK_N_resids <- resid(Neighborhood_model)
# this code prints sample F for HomeSizeK
f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
# modify the code below to produce the F when residuals are shuffled
f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
# solution: wrap the outcome in shuffle() to break any link between the residuals and HomeSizeK
f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
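The shuffling step can also be sketched in base R, where sample() plays the role of shuffle(). This is only an illustration under stated assumptions: the data are simulated rather than Smallville, and toy and f_home_size are hypothetical names.

```r
# Simulated toy data (an assumption for illustration; not the Smallville data)
set.seed(20)
n <- 32
toy <- data.frame(
  Neighborhood = factor(rep(c("A", "B"), each = n / 2)),
  HomeSizeK = runif(n, min = 1, max = 3)
)
toy$PriceK <- 200 + 60 * (toy$Neighborhood == "A") +
  40 * toy$HomeSizeK + rnorm(n, sd = 30)
toy$PriceK_N_resids <- resid(lm(PriceK ~ Neighborhood, data = toy))

# Hypothetical helper: F for HomeSizeK given some outcome vector
f_home_size <- function(outcome) {
  simple  <- lm(outcome ~ Neighborhood, data = toy)
  complex <- lm(outcome ~ Neighborhood + HomeSizeK, data = toy)
  anova(simple, complex)$F[2]
}

# F from the residuals in their original order
sample_f <- f_home_size(toy$PriceK_N_resids)

# F after randomly reordering the residuals, so that any relationship
# with HomeSizeK can only be due to chance
shuffled_f <- f_home_size(sample(toy$PriceK_N_resids))

c(sample = sample_f, shuffled = shuffled_f)
```

Rerunning the last two statements without resetting the seed gives a different shuffled F each time, which is the behavior the exercise asks you to observe.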
Now let’s add some code to create a sampling distribution of 1000 Fs for HomeSizeK assuming no effect of home size in the DGP. Save these Fs into a data frame called HomeSizeK_sdof.
require(coursekata)
# delete when coursekata-r updated
Smallville <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTUey0jLO87REoQRRGJeG43iN1lkds_lmcnke1fuvS7BTb62jLucJ4WeIt7RW4mfRpk8n5iYvNmgf5l/pub?gid=1024959265&single=true&output=csv")
Smallville$Neighborhood <- factor(Smallville$Neighborhood)
Smallville$HasFireplace <- factor(Smallville$HasFireplace)
# don't delete
# code to fit neighborhood model and save residuals
Neighborhood_model <- lm(PriceK~ Neighborhood, data = Smallville)
Smallville$PriceK_N_resids <- resid(Neighborhood_model)
# This code generates one shuffled HomeSizeK F
# Modify it to make a sampling distribution of 1000 shuffled Fs
# Save them in a data frame called HomeSizeK_sdof
f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
# Solution: use do(1000) to repeat the shuffle and save the Fs
HomeSizeK_sdof <- do(1000) * f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
# This code will put these Fs into a histogram
gf_histogram(~ f, data = HomeSizeK_sdof) %>%
  gf_labs(title = "shuffled HomeSizeK Fs")
Below we have graphed the sampling distribution of 1000 shuffled Fs for the HomeSizeK effect. We have also added to the plot, as a black dot, the sample F from the HomeSizeK row of the ANOVA table (11.63). We’ll save this value as HomeSizeK_f. As you can see, the sample F is far out in the tail of the sampling distribution.
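Building the sampling distribution amounts to repeating the shuffle many times and keeping each F. A rough base-R sketch follows, with replicate() standing in for do(1000). The data are simulated rather than Smallville, and toy, f_home_size, and sdof are hypothetical names.

```r
# Simulated toy data (an assumption for illustration; not the Smallville data)
set.seed(30)
n <- 32
toy <- data.frame(
  Neighborhood = factor(rep(c("A", "B"), each = n / 2)),
  HomeSizeK = runif(n, min = 1, max = 3)
)
toy$PriceK <- 200 + 60 * (toy$Neighborhood == "A") +
  40 * toy$HomeSizeK + rnorm(n, sd = 30)
toy$PriceK_N_resids <- resid(lm(PriceK ~ Neighborhood, data = toy))

# Hypothetical helper: F for HomeSizeK given some outcome vector
f_home_size <- function(outcome) {
  simple  <- lm(outcome ~ Neighborhood, data = toy)
  complex <- lm(outcome ~ Neighborhood + HomeSizeK, data = toy)
  anova(simple, complex)$F[2]
}

# Shuffle the residuals 1000 times and keep the F from each shuffle
sdof <- data.frame(
  f = replicate(1000, f_home_size(sample(toy$PriceK_N_resids)))
)

# Compare the sample F with the spread of the shuffled Fs
sample_f <- f_home_size(toy$PriceK_N_resids)
summary(sdof$f)
sample_f
```

If home size has a real effect in the toy data, the sample F should sit well above most of the shuffled Fs, mirroring the black dot in the tail of the histogram.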
To calculate the p-value for the HomeSizeK F from this sampling distribution, we can use tally().
Try copying and pasting the appropriate code into the code block below. Also generate an ANOVA table to check whether the p-value obtained from tally() is similar to the p-value for HomeSizeK in the ANOVA table.
require(coursekata)
# delete when coursekata-r updated
Smallville <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTUey0jLO87REoQRRGJeG43iN1lkds_lmcnke1fuvS7BTb62jLucJ4WeIt7RW4mfRpk8n5iYvNmgf5l/pub?gid=1024959265&single=true&output=csv")
Smallville$Neighborhood <- factor(Smallville$Neighborhood)
Smallville$HasFireplace <- factor(Smallville$HasFireplace)
# don't delete
# code to fit neighborhood model and save residuals
Neighborhood_model <- lm(PriceK~ Neighborhood, data = Smallville)
Smallville$PriceK_N_resids <- resid(Neighborhood_model)
# This saves the sample HomeSizeK F
HomeSizeK_f <- f(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
# This code generates a sampling distribution of shuffled HomeSizeK Fs
HomeSizeK_sdof <- do(1000) * f(shuffle(PriceK_N_resids) ~ Neighborhood + HomeSizeK, data = Smallville, predictor = ~HomeSizeK)
# Paste in the code for tallying the p-value for HomeSizeK
# Solution: the proportion of shuffled Fs greater than the sample F
tally(~ f > HomeSizeK_f, data = HomeSizeK_sdof, format = "proportion")
# Modify the code below to generate an ANOVA table from the multivariate model
lm(PriceK ~ Neighborhood + HomeSizeK, data = Smallville)
# Solution: wrap the model in supernova()
supernova(lm(PriceK ~ Neighborhood + HomeSizeK, data = Smallville))
The p-value we got from tally() is close to the p-value reported on the HomeSizeK row of the multivariate ANOVA table: 0.0019.
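The comparison between the shuffled p-value and the ANOVA p-value can be sketched in base R as well. This is a hedged illustration, not the course's code: the data are simulated, the names toy, shuffled_f, p_shuffle, and p_anova are hypothetical, and mean() of a logical vector stands in for tally() with format = "proportion".

```r
# Simulated toy data (an assumption for illustration; not the Smallville data)
set.seed(40)
n <- 32
toy <- data.frame(
  Neighborhood = factor(rep(c("A", "B"), each = n / 2)),
  HomeSizeK = runif(n, min = 1, max = 3)
)
toy$PriceK <- 200 + 60 * (toy$Neighborhood == "A") +
  40 * toy$HomeSizeK + rnorm(n, sd = 30)
toy$PriceK_N_resids <- resid(lm(PriceK ~ Neighborhood, data = toy))

# Sample F and its p-value from the F distribution
comparison <- anova(
  lm(PriceK_N_resids ~ Neighborhood, data = toy),
  lm(PriceK_N_resids ~ Neighborhood + HomeSizeK, data = toy)
)
sample_f <- comparison$F[2]
p_anova  <- comparison$`Pr(>F)`[2]

# Hypothetical helper: one F for HomeSizeK from shuffled residuals
shuffled_f <- function() {
  y <- sample(toy$PriceK_N_resids)
  anova(
    lm(y ~ Neighborhood, data = toy),
    lm(y ~ Neighborhood + HomeSizeK, data = toy)
  )$F[2]
}
shuffled_fs <- replicate(1000, shuffled_f())

# p-value: proportion of shuffled Fs greater than the sample F
p_shuffle <- mean(shuffled_fs > sample_f)

c(shuffle = p_shuffle, anova = p_anova)
```

The two p-values will not match exactly, because the shuffled one is based on 1000 random shuffles, but they should usually land close together.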