9.3 Exploring the Sampling Distribution of b1
It’s hard to look at a long list of b1s and get a sense of how they are distributed. The code below will save the simulated b1s in a data frame called sdob1, which is short for sampling distribution of b1s. (We made up this name for the data frame just to help us remember what it is. You could make up your own name if you prefer.) Add some code to this window to take a look at the first 6 rows of the data frame, and then run the code.
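The simulation code itself does not appear on this page, but a sketch of the shuffle-based approach used in this book might look like the following. (This is an assumption about the exact call: the data frame name TipExperiment and the variables Tip and Condition are taken from the tipping study discussed below, and the functions do(), b1(), and shuffle() are assumed to be available, as in the coursekata/mosaic packages.)

```r
# Simulate a DGP in which the empty model is true: shuffling Tip breaks
# any real connection between Tip and Condition, so the true beta-1 is 0.
# b1() returns the sample b1 from one shuffled dataset; do(1000) repeats
# this 1000 times and stacks the results into a data frame.
sdob1 <- do(1000) * b1(shuffle(Tip) ~ Condition, data = TipExperiment)
```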
head(sdob1)
          b1
1 -0.1363636
2  6.7727273
3  0.6818182
4 -0.5909091
5 -5.7727273
6  7.5000000
In the window below, write an additional line of code to display the variation in b1 in a histogram.
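One way to write that line (assuming the ggformula function gf_histogram() used elsewhere in this book is loaded):

```r
# Histogram of the 1000 simulated b1s in the sampling distribution
gf_histogram(~b1, data = sdob1)
```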
Although this looks similar to other histograms you have seen in this book, it is not the same! This histogram visualizes the sampling distribution of b1, not a distribution of data. Because the sampling distribution is based on the empty model, in which the true β1 is 0, the simulated b1s are centered at 0.
You can see from the histogram that while it’s not impossible to generate a b1 that is far from 0 in this way, most of the b1s cluster fairly close to 0.
Just eyeballing the histogram can give us a rough idea of the probability of getting a particular sample b1, or one more extreme, from a DGP in which the empty model is true.
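To go beyond eyeballing, you could count the simulated b1s beyond some cutoff. (The cutoff of 6 here is just an illustration, not a value from the text, and tally() is assumed to be available as in the mosaic/coursekata packages.)

```r
# Proportion of simulated b1s more extreme than 6 in either direction
tally(~(b1 > 6 | b1 < -6), data = sdob1, format = "proportion")
```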
Using the Sampling Distribution to Evaluate the Empty Model
We used R to simulate a world where the empty model is true in order to construct a sampling distribution. Now let’s return to our original goal: to see how this sampling distribution can be used to evaluate whether the empty model might explain the data we collected, or whether it should be rejected.
The basic idea is this: using the sampling distribution of possible sample b1s generated from the empty model, we can see where our actual sample b1 falls.
If we judge the actual sample b1 to be unlikely to have been generated by the empty model, we can reject the empty model; if not, we cannot rule it out.
Let’s see how this works in the context of the tipping study, where b1 is the difference in average tips between the smiley face group and the control group.
Samples that are extreme in either a positive direction (e.g., average tips that are $8 higher in the smiley face group) or a negative direction (e.g., -$8, representing much lower average tips in the smiley face group) are unlikely to be generated if the true β1 in the DGP is 0.
Put another way: if we had a sample that fell in either the extreme upper tail or extreme lower tail of the sampling distribution (see figure below), we might reject the empty model as the true model of the DGP.
In statistics, this is commonly referred to as a two-tailed test because whether our actual sample falls in the extreme upper tail or the extreme lower tail of this sampling distribution, we would have reason to reject the empty model as the true model of the DGP. By rejecting the model in which β1 = 0, we are saying, in effect, that the smiley face probably does make some difference in the DGP.
Of course, even if we observe a b1 that falls in one of these extreme tails, the empty model could still be true; unlikely is not the same as impossible.
What Counts as Unlikely?
All of this, however, raises the question of how extreme a sample b1 would have to be before we judge it unlikely to have been generated by the empty model.
One common standard used in the social sciences is that a sample counts as unlikely if there is less than a .05 chance of generating one that extreme (in either the negative or positive direction) from a particular DGP. We notate this numerical definition of “unlikely” with the Greek letter alpha (α), so here alpha = .05.
Let’s try applying an alpha level of .05 to the sampling distribution of b1s we created.
In a two-tailed test, we will reject the empty model of the DGP if the sample b1 does not fall in the middle .95 of the randomly generated b1s. We can use the function middle() to fill the middle .95 of the distribution with a different color in the histogram:
gf_histogram(~b1, data = sdob1, fill = ~middle(b1, .95))
The fill = part tells R that we want the bars of the histogram to be filled with particular colors. The ~ tells R that the fill color should be conditioned on whether each b1 falls in the middle .95 of the distribution.
Here’s what the histogram of the sampling distribution looks like when you add fill = ~middle(b1, .95) to gf_histogram().
You might be wondering why some of the bars of the histogram include both red and blue. This is because the data in a histogram is grouped into bins. The value 6.59, for example, is grouped into the same bin as the value 6.68, but while 6.59 falls within the middle .95 (thus colored blue), 6.68 falls just outside the .025 cutoff for the upper tail (and thus is colored red).
If you would like to see a sharper delineation, you could try making your bins smaller, or to put it another way, making more bins. Doing so would increase the chances of having just one color in each bin.
We re-made the histogram, but this time added the argument bins = 100 to the code (the default number of bins is 30). We also added show.legend = FALSE to get rid of the legend, and thus provide more space for the plot.
gf_histogram(~b1, data = sdob1, fill = ~middle(b1, .95), bins = 100, show.legend = FALSE)
Increasing the number of bins resulted in each bin being represented by only one color. But it also created some holes in the histogram, i.e., empty bins into which none of the simulated b1s fell.
Remember, this histogram represents a sampling distribution. All of these b1s were generated from a DGP in which the empty model is true.
In the actual experiment, of course, we only have one sample. If our actual sample b1 fell in one of the red tails, we would judge it unlikely to have been generated by the empty model, and we would reject the empty model of the DGP.
But it might be the wrong decision. Even if the empty model is true, .05 of the b1s it generates would still fall in the extreme tails, leading us to reject the empty model when it is, in fact, true.
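You can check this directly in the simulation: roughly .05 of the simulated b1s should fall outside the middle .95, even though every one of them was generated by the empty model. (As before, tally() and middle() are assumed to be available as in the coursekata/mosaic packages; the exact labels in the output depend on those packages.)

```r
# Tabulate how many simulated b1s fall inside vs. outside the middle .95.
# About .05 should fall outside even though the empty model generated them all.
tally(~middle(b1, .95), data = sdob1, format = "proportion")
```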
What is the Opposite of Unlikely?
We’re going to be interested in whether our sample b1 is unlikely, given the empty model. But what should we say if it is not unlikely?
To be precise, if the sample b1 falls in the middle .95 of the sampling distribution, it means that the sample is not unlikely. But saying that it is likely is a little bit sloppy, and possibly misleading.
In statistics, even if an event has a probability of .06, we will say it is not unlikely, because our definition of unlikely is a probability of .05 or lower. But a regular person would not call something with a probability of .06 “likely.”
It gets tiring to say not unlikely all the time, and sentences sometimes read a little easier if we just say likely. Just remember that when we say likely here, we usually mean not unlikely, which is not what people ordinarily mean by the word likely.