10.9 Hypothesis Testing for Regression Models
We have gone through the logic of hypothesis testing for group
models. We have used shuffle() to create a sampling distribution
assuming that the empty model is true in the DGP.
Now let’s apply the same ideas to regression models. As you will see,
the strategy is exactly the same. We still want to create a sampling
distribution of b1s, but this time each b1 is the slope of a regression
line.
Tips = Food Quality + Other Stuff
We have explored the effect of a smiley face on how much people tip
at a restaurant. But there surely are other factors that can help us
explain the variation in tip percentage. One of these might be the
perceived quality of the food. We can explore this hypothesis by looking
at another variable available in the TipExperiment data frame:
FoodQuality.
Each adult diner at each table was asked to rate the quality of the
food on a 100-point scale. They were told to consider 50 (the middle of
the scale) as “about average for this type of restaurant,” and then to
go up or down the scale from there, where 100 would be the best food
they’ve ever tasted in their life, and 0 would be the worst.
FoodQuality is the average rating for each table of diners.
  TableID Tip Condition FoodQuality
1       1  39   Control        54.9
2       2  36   Control        51.7
3       3  34   Control        60.5
4       4  34   Control        56.7
5       5  33   Control        51.0
6       6  31   Control        43.3
We created a scatter plot to explore the hypothesis that FoodQuality
might explain some of the variation in Tip.
gf_point(Tip ~ FoodQuality, data = TipExperiment)
Modeling Variation in Tips as a Function of Food Quality
Use the code window below to fit a regression model in which
FoodQuality is used to explain Tip.
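One way to fit this model is sketched here. It assumes the coursekata package is installed and supplies the TipExperiment data frame; the name food_model is ours, chosen for illustration.

```r
library(coursekata)  # assumed to provide the TipExperiment data frame

# Fit a regression model that uses FoodQuality to explain Tip
food_model <- lm(Tip ~ FoodQuality, data = TipExperiment)

# Printing the model shows the intercept (b0) and the slope (b1)
food_model
```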
Call:
lm(formula = Tip ~ FoodQuality, data = TipExperiment)
Coefficients:
(Intercept) FoodQuality
10.1076 0.3776
A .38 percentage point increase in tip for every one-point increase
in FoodQuality does not seem like very much. In fact, it seems pretty
close to 0. Is it possible that FoodQuality does not affect Tip at all,
and that a slope of this size arose just by chance?
Evaluating the Empty Model of the DGP
Just as we did with the Condition model, we can use shuffle() to
simulate the case where the empty model is true (i.e., where the true
value of the slope in the DGP is 0), create a sampling distribution of
b1s, and then use the sampling distribution to calculate the likelihood
of a sample b1 as extreme as .38.
In the code block below we have written code to create a scatter plot
of the data. Add shuffle() around the outcome (Tip) to generate a sample
of shuffled data from the empty model of the DGP and plot the data with
the best-fitting regression line. Run it a few times just to see what
kinds of slopes (b1s) the shuffled data produce.
The actual data from the tipping study is shown in blue (the panel in the upper left) along with the best-fitting regression line (the slope is .38). The 5 other plots (with red dots) are shuffled data, along with their best-fitting regression lines.
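The shuffling step can be sketched as follows, assuming the coursekata package (which loads gf_point(), gf_lm(), and shuffle()). The name shuffled_plot is ours. Because shuffle() reorders Tip at random, each run gives a different plot and a different slope.

```r
library(coursekata)  # assumed to provide TipExperiment, gf_point(), gf_lm(), shuffle()

# Shuffling Tip breaks any real link between Tip and FoodQuality,
# so the fitted line reflects only random variation.
shuffled_plot <- gf_point(shuffle(Tip) ~ FoodQuality, data = TipExperiment) %>%
  gf_lm()

shuffled_plot
```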
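From the shuffled data, we saw that many of the regression lines are
flatter than the line for the actual data. This makes sense given that
we are simulating a DGP in which the true slope is 0. Instead of reading
each slope off a plot, we can calculate it directly with the b1()
function.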
Complete the first line of code below to generate a sampling
distribution of 1000 b1s (saved as sdob1) from the FoodQuality model fit
to shuffled data. We have added some additional code to generate a
histogram of the sampling distribution of b1.
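A minimal sketch of this step, assuming the coursekata package (which loads do(), b1(), shuffle(), and gf_histogram()). The do() function stores the results in a data frame with a column named b1. Results vary from run to run because the shuffles are random.

```r
library(coursekata)  # assumed to provide TipExperiment, do(), b1(), shuffle()

# Fit the FoodQuality model to 1000 shuffled versions of the data,
# saving the slope (b1) from each fit
sdob1 <- do(1000) * b1(shuffle(Tip) ~ FoodQuality, data = TipExperiment)

# Histogram of the 1000 simulated slopes
gf_histogram(~ b1, data = sdob1)
```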
From this sampling distribution we can see that a value as extreme as
.38 falls just outside the region of the sampling distribution we are
considering likely. We might have thought a .38 percentage point
increase per one-point increase in food quality was close to 0, but it
is not one of the likely b1s under the empty model.
To make sure, let’s take a look at the p-value from the ANOVA table.
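The table can be produced along these lines, assuming the coursekata package (which loads supernova()). The name food_model is ours.

```r
library(coursekata)  # assumed to provide TipExperiment and supernova()

# Fit the FoodQuality model, then print its ANOVA table
food_model <- lm(Tip ~ FoodQuality, data = TipExperiment)
supernova(food_model)
```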
Analysis of Variance Table (Type III SS)
Model: Tip ~ FoodQuality

                              SS df      MS     F   PRE     p
----- --------------- | -------- -- ------- ----- ----- -----
Model (error reduced) |  525.576  1 525.576 4.428 .0954 .0414
Error (from model)    | 4985.401 42 118.700
----- --------------- | -------- -- ------- ----- ----- -----
Total (empty model)   | 5510.977 43 128.162
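The PRE, F, and p values in the table follow from its SS and df columns. The sketch below recomputes them in base R as a check; the variable names are ours, and pf() is base R's F-distribution function.

```r
# Values copied from the ANOVA table for Tip ~ FoodQuality
ss_model <- 525.576
df_model <- 1
ss_error <- 4985.401
df_error <- 42
ss_total <- 5510.977

# PRE: proportion of the empty model's error that the model removes
pre <- ss_model / ss_total

# F: mean square for the model divided by mean square for error
f_ratio <- (ss_model / df_model) / (ss_error / df_error)

# p: probability of an F this large or larger if the empty model is true
p_value <- pf(f_ratio, df_model, df_error, lower.tail = FALSE)

round(c(PRE = pre, F = f_ratio, p = p_value), 4)
```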
The p-value is .04. There is only a 4% chance that the observed b1,
or one more extreme, would be generated if the empty model were true.
This sampling distribution of