10.7 A Mathematical Model of the Sampling Distribution of b1
The early statisticians who developed the ideas behind sampling
distributions and p-values didn’t have computers. They could only
imagine what it might be like to shuffle() their data to
imitate a random DGP. What we have been able to do with R would seem
like a miracle to them! Instead of using computational techniques to
create sampling distributions, the early statisticians had to develop
mathematical models of what the sampling distributions should look like,
and then calculate probabilities based on these mathematical
distributions.
In fact, the p-value you see in the ANOVA table generated by the
supernova() function (as well as most other statistical
software) is calculated from a mathematical model of the sampling
distribution.
The code in the window below fits the Condition model to the
TipExperiment data and saves the model as Condition_model. Use
supernova() to generate the ANOVA table for this model, and look at
the p-value (in the right-most column of the table).
Analysis of Variance Table (Type III SS)
Model: Tip ~ Condition

                              SS df      MS     F    PRE     p
----- --------------- | -------- -- ------- ----- ------ -----
Model (error reduced) |  402.023  1 402.023 3.305 0.0729 .0762
Error (from model)    | 5108.955 42 121.642
----- --------------- | -------- -- ------- ----- ------ -----
Total (empty model)   | 5510.977 43 128.162
The p-value from supernova(), rounded to the nearest hundredth, is
about .08, which is very close to what we calculated using our sampling
distribution of 1000 shuffled b1s. (Although supernova() is faster,
some people find the concept of a sampling distribution easier to
understand when they generate the sampling distribution of b1 using
shuffle().)
The t-Distribution
The mathematical function that supernova() uses to model the sampling
distribution of b1 is called the t-distribution.
In the figure below we have overlaid the t-distribution (depicted as
a red line) on top of the sampling distribution we constructed using
shuffle(). You can see that it looks very much like the
normal distribution you learned about previously.
Whereas the sampling distribution we created using the shuffle()
function looks jagged (because it was made up of just 1000 separate
b1s), the t-distribution is a smooth, continuous curve.
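Whereas the shape of the normal distribution is completely determined
by its mean and standard deviation, the t-distribution changes shape
slightly depending on how many data points are included in the samples
that make up the sampling distribution. (Actually, its shape is
determined by the degrees of freedom, which for this model is the error
df of 42 shown in the ANOVA table.) With few degrees of freedom the
t-distribution has heavier tails than the normal distribution; as the
degrees of freedom increase, it becomes almost indistinguishable from
the normal distribution.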
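As a rough illustration of how the t-distribution's shape depends on the degrees of freedom, the sketch below compares the cutoff that leaves 2.5% in the upper tail for several t-distributions with the same cutoff for the normal distribution. The particular df values are chosen only for illustration (42 matches the error df in the ANOVA table above); qt() and qnorm() are base R functions.

```r
# Degrees of freedom to compare (illustrative choices; 42 is the
# error df from the ANOVA table for Tip ~ Condition).
dfs <- c(2, 5, 42, 1000)

# Cutoff that leaves 2.5% in the upper tail of each t-distribution.
t_cutoffs <- qt(0.975, df = dfs)

# The same cutoff for the standard normal distribution.
z_cutoff <- qnorm(0.975)

# Heavier tails push the cutoff farther from 0, so the t cutoffs
# shrink toward the normal cutoff as df grows.
print(data.frame(df = dfs, t_cutoff = round(t_cutoffs, 3)))
print(round(z_cutoff, 3))
```

The t cutoffs should get smaller as df increases and approach, but stay above, the normal cutoff.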
Using the t-Distribution to Calculate Probabilities
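In the sampling distribution you created using shuffle(), you were
able to just count the number of b1s that were as extreme as or more
extreme than the sample b1. With the t-distribution, the same
probability is instead calculated mathematically, as the area in the
tails of the curve (this is what is done by the supernova() function).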
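To make the idea of a mathematically calculated tail probability concrete, here is a minimal sketch in base R. It assumes only the F and error df values printed in the ANOVA table above, and relies on the fact that for a two-group model the t statistic is, in magnitude, the square root of F. The variable names are ours, and this is not necessarily how supernova() computes the value internally.

```r
# Values read from the ANOVA table for Tip ~ Condition.
f_stat   <- 3.305  # F for the Condition model (rounded in the table)
df_error <- 42     # error degrees of freedom

# For a two-group model, |t| is the square root of F.
t_stat <- sqrt(f_stat)

# Two-tailed p-value: area below -t_stat plus area above +t_stat.
p_from_t <- 2 * pt(-abs(t_stat), df = df_error)

# The same probability from the F-distribution with 1 and 42 df.
p_from_f <- pf(f_stat, df1 = 1, df2 = df_error, lower.tail = FALSE)

print(round(c(t = t_stat, p_from_t = p_from_t, p_from_f = p_from_f), 4))
```

Both probabilities should agree with each other and land close to the .0762 reported in the table; any small difference comes from using the rounded F.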
The Two-Sample T-Test
If you’ve taken statistics before you probably learned about the
t-test. The t-test is used to calculate the p-value for the difference
between two independent groups. The tipping experiment is just such a
case: b1 is the difference in average tips between the two conditions.
You can use R to do a t-test on the tipping data:
t.test(Tip ~ Condition, data = TipExperiment, var.equal=TRUE)
If you run this code it will give you the p-value of .0762, which is
exactly what you saw in the ANOVA table produced by supernova(). Even
though the supernova()
output does not show you the t-statistic or other details of how it
calculates the p-value, behind the scenes it uses the t-distribution for
calculating p-values.
Although we want you to know what a t-test is, we don’t recommend
using it. The technique you have learned, of creating a two-group model
and comparing it with the empty model, is far more powerful and
generalizable than the t-test. But if someone asks if you learned the
t-test, you can say yes. (The test you did using shuffle() is
sometimes called a randomization test or permutation test.)