10.5 A Mathematical Model of the Sampling Distribution of \(b_1\)
The early statisticians who developed the ideas behind sampling distributions and p-values didn’t have computers. They could only imagine what it might be like to shuffle() their data to imitate a random DGP. What we have been able to do with R would seem like a miracle to them! Instead of using computational techniques to create sampling distributions, the early statisticians had to develop mathematical models of what the sampling distributions should look like, and then calculate probabilities based on these mathematical distributions.
In fact, the p-value you see in the ANOVA table generated by the supernova() function (as well as by most other statistical software) is calculated from a mathematical model of the sampling distribution.
The code in the window below fits the Condition model to the TipExperiment data and saves the model as Condition_model. Use supernova() to generate the ANOVA table for this model, and look at the p-value (in the rightmost column of the table).
require(coursekata)

# This code finds the best-fitting Condition model
Condition_model <- lm(Tip ~ Condition, data = TipExperiment)

# Generate the ANOVA table for this model
supernova(Condition_model)
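Analysis of Variance Table (Type III SS)
Model: Tip ~ Condition

                               SS df      MS     F    PRE     p
----- --------------- | -------- -- ------- ----- ------ -----
Model (error reduced) |  402.023  1 402.023 3.305 0.0729 .0762
Error (from model)    | 5108.955 42 121.642
----- --------------- | -------- -- ------- ----- ------ -----
Total (empty model)   | 5510.977 43 128.162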
The p-value from supernova(), rounded to the nearest hundredth, is about .08, which is very close to what we calculated using our sampling distribution of 1000 shuffled \(b_1\)s. The approach that uses the mathematical model is not necessarily better than the shuffling approach; the point is that both methods yield a similar result. (Although running supernova() is faster, some people find the concept of sampling distribution easier to understand when they generate the sampling distribution of \(b_1\)s using shuffle().)
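If you want to see the two approaches side by side outside of the course tools, here is a minimal base-R sketch. It makes several assumptions: the data are simulated stand-ins (not the TipExperiment data), the helper b1() is a hypothetical function written just for this example, and base R's sample() stands in for shuffle(). The exact numbers will differ from the tipping study; the point is only that the two p-values should land close to each other.

```r
# Base-R sketch: shuffled sampling distribution vs. the mathematical model.
# The data are simulated stand-ins (NOT the TipExperiment data).
set.seed(10)
n_per_group <- 22
sim_data <- data.frame(
  group = rep(c("Control", "Smiley Face"), each = n_per_group),
  tip   = c(rnorm(n_per_group, mean = 27, sd = 11),
            rnorm(n_per_group, mean = 33, sd = 11))
)

# Hypothetical helper: b1 is the difference between the two group means
b1 <- function(tip, group) {
  means <- tapply(tip, group, mean)
  unname(means[2] - means[1])
}

sample_b1 <- b1(sim_data$tip, sim_data$group)

# Shuffle the group labels 1000 times to imitate a random DGP
shuffled_b1s <- replicate(1000, b1(sim_data$tip, sample(sim_data$group)))

# Proportion of shuffled b1s at least as extreme as the sample b1
p_shuffle <- mean(abs(shuffled_b1s) >= abs(sample_b1))

# p-value from the mathematical model (F test for the fitted two-group model)
p_math <- anova(lm(tip ~ group, data = sim_data))[["Pr(>F)"]][1]

print(c(shuffle = p_shuffle, math = p_math))
```

Because the shuffled p-value is based on only 1000 random shuffles, it will wobble a little from run to run, while the mathematical p-value stays the same.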
The t-Distribution
The mathematical function that supernova() uses to model the sampling distribution of \(b_1\) (as well as the sampling distributions of many other parameter estimates) is known as the t-distribution. The t-distribution is closely related to the normal distribution, and in fact it looks very much like the normal distribution.
In the figure below we have overlaid the t-distribution (depicted as a red line) on top of the sampling distribution we constructed using shuffle(). You can see that it looks very much like the normal distribution you learned about previously.
Whereas the sampling distribution we created using the shuffle() function looks jagged (because it was made up of just 1000 separate \(b_1\)s), the t-distribution is a smooth continuous mathematical function. If you want to see the fancy equation that describes this shape, you can see it here.
Whereas the shape of the normal distribution is completely determined by its mean and standard deviation, the t-distribution changes shape slightly depending on how many data points are included in the samples that make up the sampling distribution. (Actually, \(t\) is based on degrees of freedom, or \(\text{df}\), within each group, which you’ve learned is \(n - 1\). For the tipping study, the \(\text{df}\) is 42: 21 for each group of 22 tables.)
You can see how \(\text{df}\) affects the shape of the t-distribution in the figure below. Once the degrees of freedom reach 30, however, the t-distribution looks very similar to the normal distribution.
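One way to check this claim numerically is to compare cutoffs. The sketch below uses base R's qt() and qnorm() to find the value that cuts off the upper 2.5% of each distribution. The particular df values are just ones we picked for illustration (42 matches the tipping study).

```r
# Compare t-distribution cutoffs with the normal cutoff as df grows
dfs <- c(2, 5, 10, 30, 42)

t_cutoffs <- qt(0.975, df = dfs)  # upper 2.5% cutoff for each df
z_cutoff  <- qnorm(0.975)         # same cutoff for the standard normal

cutoff_table <- data.frame(
  df            = dfs,
  t_cutoff      = round(t_cutoffs, 3),
  normal_cutoff = round(z_cutoff, 3),
  gap           = round(t_cutoffs - z_cutoff, 3)
)
print(cutoff_table)
```

The gap column shrinks as df increases: with small df the t-distribution has heavier tails, so you have to go farther out to cut off the same 2.5%.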
Using the t-Distribution to Calculate Probabilities
In the sampling distribution you created using shuffle() you were able to just count the number of \(b_1\)s more extreme than the sample \(b_1\) in order to calculate the p-value. The t-distribution works the same way, except that it takes some complicated math to calculate the probabilities in the upper and lower tails. Fortunately, you don’t have to do this math; R will do it for you (e.g., when you tell it to use the supernova() function).
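If you are curious what that tail calculation looks like, here is a rough base-R sketch. It takes the F and df values from the ANOVA table for the Condition model and relies on the fact that, in a two-group model, the t statistic is the square root of F (so this gives the size of t but not its sign). The variable names are our own. This is not a look inside supernova() itself, just the same kind of calculation done by hand with pt() and pf().

```r
# Values taken from the ANOVA table for the Condition model
F_stat   <- 3.305
df_error <- 42

# In a two-group model, the size of the t statistic is the square root of F
t_stat <- sqrt(F_stat)

# Two-tailed p-value: area in both tails beyond +/- t_stat
p_from_t <- 2 * pt(-abs(t_stat), df = df_error)

# The same probability from the F distribution with 1 and 42 df
p_from_F <- pf(F_stat, df1 = 1, df2 = df_error, lower.tail = FALSE)

print(round(c(t = p_from_t, F = p_from_F), 4))
```

Both values should come out very close to the .0762 in the table; any tiny difference comes from starting with an F that was already rounded to three decimal places.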
The Two-Sample T-Test
If you’ve taken statistics before you probably learned about the t-test. The t-test is used to calculate the p-value for the difference between two independent groups. The tipping experiment is just such a case: the \(b_1\) we’ve been working with is the difference between two groups of tables, those that got the smiley face and those that did not.
You can use R to do a t-test on the tipping data:
t.test(Tip ~ Condition, data = TipExperiment, var.equal=TRUE)
If you run this code it will give you the p-value of .0762, which is exactly what you saw in the ANOVA table produced by supernova(). Even though the supernova() output does not show you the t-statistic or other details of how it calculates the p-value, behind the scenes it uses the t-distribution for calculating p-values.
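To convince yourself that the t-test and the model comparison agree, you can try both on any two-group data set. The sketch below uses only base R and simulated stand-in data (not TipExperiment), with anova() playing the role of supernova(); the variable names are our own.

```r
# Simulated two-group data (stand-in values, NOT the TipExperiment data)
set.seed(42)
sim_data <- data.frame(
  group   = rep(c("A", "B"), each = 22),
  outcome = c(rnorm(22, mean = 27, sd = 11),
              rnorm(22, mean = 33, sd = 11))
)

# Two-sample t-test assuming equal variances in the two groups
t_result <- t.test(outcome ~ group, data = sim_data, var.equal = TRUE)

# Two-group model compared with the empty model (base-R ANOVA table)
anova_table <- anova(lm(outcome ~ group, data = sim_data))

p_t       <- t_result$p.value
p_F       <- anova_table[["Pr(>F)"]][1]
t_squared <- unname(t_result$statistic)^2
F_value   <- anova_table[["F value"]][1]

print(c(p_from_t_test = p_t, p_from_model = p_F))
print(c(t_squared = t_squared, F = F_value))
```

The two p-values match, and squaring the t statistic gives the F from the ANOVA table. Note that this only holds when var.equal = TRUE; without it, t.test() uses an unequal-variance correction and the p-value will differ slightly.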
Although we want you to know what a t-test is, we don’t recommend using it. The technique you have learned, of creating a two-group model and comparing it with the empty model, is far more powerful and generalizable than the t-test. But if someone asks if you learned the t-test, you can say yes. (The test you did using shuffle() is sometimes called a randomization test or permutation test.)