10.2 Constructing a Sampling Distribution
The Tipping Study: Again
We have introduced two ideas that probably sound quite abstract at this point: sampling distribution and rejecting the empty model. To make these ideas more concrete, let’s revisit the tipping study that we explored previously.
In the tipping study (the original article is available for download as a PDF, 419KB), you may recall, researchers examined whether putting hand-drawn smiley faces on the back of a restaurant check would cause customers to give higher tips to their server. Each table was randomly assigned to receive its check either with or without a smiley face. The outcome variable was the amount of tip left by each table.
Here’s a random sample of six observations from the data frame TipExperiment:
sample(TipExperiment, 6)
   TableID Tip   Condition
20      20  20     Control
26      26  44 Smiley Face
19      19  21     Control
15      15  25     Control
25      25  47 Smiley Face
18      18  21     Control
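The researchers want to explore the hypothesis that Tips = Condition + Other Stuff. The GLM notation for this two-group model would be:
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$
in which $Y_i$ is the tip left by each table and $X_i$ indicates the table's condition (coded 0 for Control tables and 1 for Smiley Face tables). The parameter estimate the researchers are most interested in is $b_1$, the estimate of $\beta_1$.
Before we remind ourselves of what the results of the study look like, let’s imagine what we would expect to see if a particular model of the DGP were true. For example, if there is a benefit of drawing smiley faces in the DGP (i.e., if $\beta_1$ is greater than 0), we would expect the $b_1$ estimated from a sample to tend to be positive as well.
Although we couldn’t predict any single $b_1$ exactly, we can reason about the range of $b_1$s that a given DGP would be likely to produce.
The empty model is a special case in which $\beta_1 = 0$.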
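To see why the Condition parameter is the quantity of interest, it can help to write out what the two-group model predicts for each group. The following is a sketch that assumes the standard two-group form of the GLM, with $X_i$ coded 0 for Control and 1 for Smiley Face:

```latex
\begin{align*}
Y_i &= \beta_0 + \beta_1 X_i + \epsilon_i \\
X_i = 0 \ (\text{Control}): \quad Y_i &= \beta_0 + \epsilon_i \\
X_i = 1 \ (\text{Smiley Face}): \quad Y_i &= \beta_0 + \beta_1 + \epsilon_i \\
\beta_1 = 0 \ (\text{empty model}): \quad Y_i &= \beta_0 + \epsilon_i \ \text{for every table}
\end{align*}
```

Under this coding, $\beta_1$ is the difference between the two group means in the DGP, so setting it to 0 leaves a single mean for all tables.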
Constructing a Sampling Distribution Assuming the Empty Model
Let’s engage in some hypothetical thinking. If there were no effect of drawing smiley faces, these tables would have tipped the same amount whether they were randomly assigned to one group or the other. (We discussed this in a previous chapter.)
One great thing about modern statistics and data science is that we are not limited to simply imagining what the $b_1$s would look like if the empty model were true; we can simulate them.
We can use the shuffle() function to simulate this hypothetical situation, randomly assigning each Tip (representing each table) in the data frame to be in either the Smiley Face or Control condition.
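To make the idea of shuffling concrete, here is a minimal sketch in base R. It does not use the course's shuffle() function or the real TipExperiment data: the toy data frame and the helper group_diff() are hypothetical, and base R's sample() stands in for the shuffle step.

```r
# Hypothetical toy data (NOT the real TipExperiment values)
toy <- data.frame(
  Tip = c(18, 22, 25, 21, 31, 27, 40, 33, 35, 29),
  Condition = factor(rep(c("Control", "Smiley Face"), each = 5))
)

# Hypothetical helper: for a two-group model, b1 is the
# Smiley Face group mean minus the Control group mean
group_diff <- function(tip, condition) {
  means <- tapply(tip, condition, mean)
  unname(means["Smiley Face"] - means["Control"])
}

# b1 for the toy data as originally paired
observed_b1 <- group_diff(toy$Tip, toy$Condition)

# sample() with no size argument returns a random permutation,
# so each tip is re-paired with a condition purely by chance
set.seed(10)
shuffled_tip <- sample(toy$Tip)
shuffled_b1 <- group_diff(shuffled_tip, toy$Condition)

print(observed_b1)
print(shuffled_b1)
```

The shuffled tips are the same ten numbers in a different order, so any difference between the groups after shuffling can only be due to the random re-pairing.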
The figure below shows the real sample data (the green jitter plot in the upper left) along with 8 different random pairings of tips (for each table) with conditions. For each randomization, we have plotted the average tip (the black lines) for each of the newly-randomized groups.
The code below will generate the best-fitting $b_1$ for a single shuffle of the tips, then repeat the process (using the do() function) to simulate 1000 shuffled $b_1$s.
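require(coursekata)
# One shuffle: the b1 estimate after randomly re-pairing tips with conditions
b1(shuffle(Tip) ~ Condition, data = TipExperiment)
# Repeat the shuffle 1000 times to collect 1000 b1 estimates
do(1000) * b1(shuffle(Tip) ~ Condition, data = TipExperiment)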
Whoa, that’s a lot of numbers! We can notice a few things, though, just by looking at the first few numbers on the list. We can see, for example, that the $b_1$s vary from one shuffle to the next.
Even though the 1000 numbers generated by R seem similar to a distribution of sample data, they are different in two important ways. First, they are not based on measurement of a variable but on a random generation process; the numbers are generated by R. Second, and most critical, each number (i.e., each $b_1$) is a parameter estimate computed from an entire shuffled sample, not a single observation.
Distributions that share these features are called sampling distributions. They aren’t data, although, as in this case, they can be constructed using data. Whereas you have only one sample of data for a given study, sampling distributions are simulations of what it might look like if you had done the same study multiple times.
A sampling distribution is a distribution of parameter estimates (or sample statistics) computed on randomly generated samples of a given size.
Sampling distributions let us see what the sampling variation across multiple studies might look like if you were to repeat the same data collection process (i.e., selecting a random sample, or randomly assigning cases to conditions) a very large number of times.
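The definition above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the course's own code: the toy data and the helper group_diff() are hypothetical, replicate() plays the role of do(), and sample() plays the role of shuffle().

```r
# Hypothetical toy data (NOT the real TipExperiment values)
toy <- data.frame(
  Tip = c(18, 22, 25, 21, 31, 27, 40, 33, 35, 29),
  Condition = factor(rep(c("Control", "Smiley Face"), each = 5))
)

# Hypothetical helper: Smiley Face mean minus Control mean (the b1 estimate)
group_diff <- function(tip, condition) {
  means <- tapply(tip, condition, mean)
  unname(means["Smiley Face"] - means["Control"])
}

# Build a sampling distribution of b1 assuming the empty model:
# shuffle the tips 1000 times and save the b1 from each shuffle
set.seed(42)
shuffled_b1s <- replicate(1000, group_diff(sample(toy$Tip), toy$Condition))

# Each element is an estimate from a whole shuffled sample, not one table's tip
print(length(shuffled_b1s))
print(head(shuffled_b1s))

# Summarize where the shuffled b1s fall
print(summary(shuffled_b1s))
```

Because shuffling removes any real link between tips and condition, the shuffled estimates should scatter on both sides of 0, which is what the empty model implies for $b_1$.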