Course Outline
-
segmentGetting Started (Don't Skip This Part)
-
segmentStatistics and Data Science II
-
segmentPART I: EXPLORING AND MODELING VARIATION
-
segmentChapter 1 - Exploring Data with R
-
segmentChapter 2 - From Exploring to Modeling Variation
-
segmentChapter 3 - Modeling Relationships in Data
-
segmentPART II: COMPARING MODELS TO MAKE INFERENCES
-
segmentChapter 4 - The Logic of Inference
-
segmentChapter 5 - Model Comparison with F
-
segmentChapter 6 - Parameter Estimation and Confidence Intervals
-
6.5 Using the Bootstrapped Sampling Distribution to Find the Confidence Interval
-
segmentPART III: MULTIVARIATE MODELS
-
segmentChapter 7 - Introduction to Multivariate Models
-
segmentChapter 8 - Multivariate Model Comparisons
-
segmentChapter 9 - Models with Interactions
-
segmentChapter 10 - More Models with Interactions
-
segmentFinishing Up (Don't Skip This Part!)
-
segmentResources
list High School / Statistics and Data Science II (XCD)
6.5 Using the Bootstrapped Sampling Distribution to Find the Confidence Interval
Using resample()
to Bootstrap a Sampling Distribution
Now that we have reviewed how the resample()
function works, let’s use it to create a sampling distribution of 1000 \(b_1\)s.
Modify the code in the window below to create a sampling distribution of 1000 \(b_1\)s, each based on resampled data, and save it into a new data frame called sdob1_boot
. Add some code to produce a histogram of the sampling distribution.
require(coursekata)
# modify this to make 1000 bootstrapped b1s
sdob1_boot <- do( ) * b1(Tip ~ Condition, data = resample(TipExperiment))
# visualize sdob1_boot with a histogram
# modify this to make 1000 bootstrapped b1s
sdob1_boot <- do(1000) * b1(Tip ~ Condition, data = resample(TipExperiment))
# visualize sdob1_boot with a histogram
gf_histogram(~b1, data = sdob1_boot)
ex() %>% {
check_function(., "do") %>%
check_arg("object") %>%
check_equal()
check_or(.,
check_function(., "gf_histogram") %>% {
check_arg(., "object") %>% check_equal()
check_arg(., "data") %>% check_equal(eval = FALSE)
},
override_solution(., '{
sdob1_boot <- do(1000) * b1(Tip ~ Condition, data = resample(TipExperiment))
gf_histogram(sdob1_boot, ~b1)
}') %>%
check_function("gf_histogram") %>% {
check_arg(., "object") %>% check_equal(eval = FALSE)
check_arg(., "gformula") %>% check_equal()
},
override_solution(., '{
sdob1_boot <- do(1000) * b1(Tip ~ Condition, data = resample(TipExperiment))
gf_histogram(~sdob1_boot$b1)
}') %>%
check_function("gf_histogram") %>%
check_arg("object") %>%
check_equal(eval = FALSE)
)
}
Use favstats()
in the code window below to see what the average of the \(b_1\)s is in sdob1_boot
.
require(coursekata)
# we have created the sampling distribution for you
sdob1_boot <- do(1000) * b1(Tip ~ Condition, data = resample(TipExperiment))
# run favstats to check out the mean of the sampling distribution
# we have created the sampling distribution for you
sdob1_boot <- do(1000) * b1(Tip ~ Condition, data = resample(TipExperiment))
# run favstats to check out the mean of the sampling distribution
favstats(~b1, data = sdob1_boot)
ex() %>%
check_function("favstats") %>%
check_result() %>%
check_equal()
min Q1 median Q3 max mean sd n missing
-3.219048 3.772727 5.921166 8.480083 15.96154 6.110566 3.381418 1000 0
The mean is fairly close to 6.05, the estimate of \(\beta_1\) from the tipping study. Because the resampled sampling distribution is roughly centered at the sample \(b_1\), it provides us what we were looking for in order to calculate the 95% confidence interval for \(\beta_1\): a sampling distribution centered at the sample \(b_1\).
Using the Bootstrapped Sampling Distribution to Find the Confidence Interval
We have now succeeded in creating a bootstrapped sampling distribution of 1000 \(b_1\)s centered at the sample \(b_1\) (roughly 6.05) using the resample()
function. To find the lower and upper bounds of the confidence interval, we will use our sampling distribution of \(b_1\)s as a probability distribution, interpreting proportions of \(b_1\)s falling in a certain range as a probability that future \(b_1\)s would fall into the same range.
We want to find the cutoffs that separate the middle .95 of the resampled sampling distribution from the lower and upper .025 tails because these cutoffs will correspond perfectly with the lower and upper bound of the confidence interval.
To do this, we start by putting the 1000 \(b_1\)s in order. Then we can find the cutoffs that separate the top 25 and the bottom 25 \(b_1\)s from the middle 950 \(b_1\)s.
We can visualize this task by shading in the middle .95 differently from the tails (.025 in each tail) as shown in the histogram below. (The histogram will show all the values of resampled \(b_1\) in order from smallest to largest on the x-axis.)
gf_histogram(~b1, data = sdob1_boot, fill = ~middle(b1, .95), bins = 80)
As illustrated below, the cutoff for the lowest .025 of \(b_1\)s is at the 26th \(b_1\). The cutoff for the highest .025 of \(b_1\)s is at the 975th \(b_1\). These two cutoffs correspond to the lower and upper bound of the confidence interval.
Here’s some code that will arrange the \(b_1\)s in order (from lowest to highest) and save the re-arranged data back into sdob1_boot
.
sdob1_boot <- arrange(sdob1_boot, b1)
To identify the 26th b1
in the arranged data frame (26th from the lowest), we can use these brackets (e.g., [26]
).
sdob1_boot$b1[26]
Use the code block below to print out both the 26th and 975th b1
s.
require(coursekata)
sdob1_boot <- do(1000) * b1(Tip ~ Condition, data = resample(TipExperiment))
sdob1_boot <- arrange(sdob1_boot, b1)
# we’ve written code to print the 26th b1
sdob1_boot$b1[26]
# write code to print the 975th b1
# we’ve written code to print the 26th b1
sdob1_boot$b1[26]
# write code to print the 975th b1
sdob1_boot$b1[975]
ex() %>% {
check_output_expr(., "sdob1_boot$b1[26]")
check_output_expr(., "sdob1_boot$b1[975]")
}
[1] -0.02484472
[1] 13.3
Based on our bootstrapped sampling distribution of \(b_1\), the 95% confidence interval runs from around 0 to 13 (give or take). Your numbers will be slightly different from ours, of course, because they are generated randomly. Based on this analysis, we can be 95% confident that the true value of \(\beta_1\) in the DGP lies in this range.