11.8 Using F to Test a Regression Model

We have seen how to use sampling distributions of \(b_1\) or \(F\) to compare a two-group model against the empty model. It turns out we can do the same for regression models. In the previous chapter we used a sampling distribution of \(b_1\) to test the \(b_1\) in a regression model. Now let’s use the F-distribution to see how the F ratio from a regression model compares to the Fs generated by the empty model of the DGP.

Recap: Using Food Quality to Predict Tips

In the previous chapter we created a regression model that used each table’s average rating of food quality (FoodQuality) to predict the outcome Tip. We were interested in exploring the model: Tip = Food Quality + Other Stuff. To remind us of this example, here is a scatter plot of the data as well as the best-fitting regression model depicted as a blue line.

gf_point(Tip ~ FoodQuality, data = TipExperiment) %>%
  gf_lm(color = "blue")

A scatter plot of Tip predicted by FoodQuality. There is a regression line depicting an upward trend in the data.

Using lm(), we fit the regression model and found the best-fitting parameter estimates:

lm(Tip ~ FoodQuality, data = TipExperiment)
Call:
lm(formula = Tip ~ FoodQuality, data = TipExperiment)

Coefficients:
 (Intercept)    FoodQuality
     10.1076        0.3776 
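
If we just want the slope by itself, the b1() function from the coursekata package (the same helper used in the previous chapter to build sampling distributions of \(b_1\)) pulls out that single estimate. Here is a minimal sketch, assuming the same TipExperiment data:

require(coursekata)

# extract just the best-fitting slope estimate from the FoodQuality model
b1(Tip ~ FoodQuality, data = TipExperiment)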

The best-fitting \(b_1\) estimate was 0.38, meaning that for every one point increase in FoodQuality, the Tip increased, on average, by 0.38 percentage points. This is the slope of the regression line. But while this is the best-fitting estimate of the slope based on the data, is it possible that this \(b_1\) could have been generated by the empty model, one in which the true slope in the DGP is 0?

In the previous chapter we asked this question using the sampling distribution of \(b_1\). But we can ask the same question using the sampling distribution of F. Let’s start by finding the F for the FoodQuality model. You can do this in the code window below using the f() function.

require(coursekata)
TipExperiment <- select(TipExperiment, -Check)

# calculate the sample F for the FoodQuality model
f(Tip ~ FoodQuality, data = TipExperiment)
4.42776650396444

The F ratio for the FoodQuality model is 4.43, which tells us that the variance explained by the model is 4.43 times the variance left unexplained by it. Clearly, the FoodQuality model explains more variation in the data than the empty model does.
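
To see where this number comes from, here is a rough sketch of the calculation that f() performs, assuming the same TipExperiment data: the sum of squares explained by the model and the sum of squares left unexplained are each divided by their degrees of freedom, and F is the ratio of the two resulting mean squares.

require(coursekata)

# fit the FoodQuality model and the empty model
model <- lm(Tip ~ FoodQuality, data = TipExperiment)
empty_model <- lm(Tip ~ NULL, data = TipExperiment)

# sums of squares: total (empty model), error (left over), model (error reduced)
ss_total <- sum(resid(empty_model)^2)
ss_error <- sum(resid(model)^2)
ss_model <- ss_total - ss_error

# degrees of freedom: one extra parameter in the model; n minus 2 for error
df_model <- 1
df_error <- nrow(TipExperiment) - 2

# F = MS model / MS error; should come out to about 4.43
(ss_model / df_model) / (ss_error / df_error)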

But is it possible that the sample F could have been generated by a DGP in which there is no effect of FoodQuality on Tip? To answer this question we will need the sampling distribution of F generated under the empty model in which the true slope of the regression line in the DGP is 0.

Creating a Sampling Distribution of F

Following the same approach we used for group models, we can use shuffle() to simulate many samples from a DGP in which there is no effect of FoodQuality on Tip, fit the FoodQuality model to each shuffled sample, and then calculate the corresponding value of F for each shuffled sample.

Write code to create a randomized sampling distribution of 1000 Fs (called sdof) under the assumption that FoodQuality has no relationship with Tip. Print the first six lines of sdof.

require(coursekata)
TipExperiment <- select(TipExperiment, -Check)

# create a sampling distribution of 1000 Fs generated from the empty model
sdof <- do(1000) * f(shuffle(Tip) ~ FoodQuality, data = TipExperiment)

# print the first 6 rows
head(sdof)
          f
1 0.5689833
2 3.7782664
3 0.3390408
4 1.4000646
5 1.8874035
6 0.7876602

Use the code block below to make a histogram of the randomized F ratios.

require(coursekata)
TipExperiment <- select(TipExperiment, -Check)

# this creates a sampling distribution of 1000 Fs generated from the empty model
sdof <- do(1000) * f(shuffle(Tip) ~ FoodQuality, data = TipExperiment)

# make a histogram depicting the sdof
gf_histogram(~ f, data = sdof)

A histogram of the sampling distribution of F. It is skewed right, with most Fs between 0 and 2.5, and a tail extending from about 2.5 to about 12.

As you can see, the sampling distribution of F created based on the assumption that the empty model is true is still shaped like the F distribution, even though the model is a regression model and not a group model.
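
If you want to see this resemblance directly, one option (a sketch, not part of the exercise) is to draw the randomized Fs as a density histogram and overlay the mathematical F distribution with 1 and 42 degrees of freedom, the degrees of freedom for this model and data:

require(coursekata)
TipExperiment <- select(TipExperiment, -Check)

# randomized Fs generated from the empty model (as above)
sdof <- do(1000) * f(shuffle(Tip) ~ FoodQuality, data = TipExperiment)

# density histogram of the randomized Fs with the mathematical F distribution on top
gf_dhistogram(~ f, data = sdof) %>%
  gf_dist("f", df1 = 1, df2 = 42, color = "blue")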

Use the code window and the tally() function to find out how many (or what proportion) of the 1000 Fs were greater than the sample F of 4.43.

require(coursekata)
TipExperiment <- select(TipExperiment, -Check)

# we’ve saved the f value (4.43) as sample_f
sample_f <- f(Tip ~ FoodQuality, data = TipExperiment)

# we have also generated 1000 Fs from the empty model of the DGP
sdof <- do(1000) * f(shuffle(Tip) ~ FoodQuality, data = TipExperiment)

# how many Fs are larger than sample_f?
tally(~ f > sample_f, data = sdof)
f > sample_f
 TRUE FALSE 
   41   959 
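
As a side note, tally() can also report the result as a proportion instead of a count, which gives the estimated p-value in one step. Here is a minimal sketch, assuming the sample_f and sdof objects created above:

# proportion of randomized Fs larger than the sample F (the estimated p-value)
tally(~ f > sample_f, data = sdof, format = "proportion")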

Only 41 of the 1000 randomized Fs (in our shuffled distribution; yours might be a little different) were larger than our sample F of 4.43. Based on this, we would estimate the p-value to be 0.041. And indeed, this is very close to the p-value we get in the ANOVA table for the regression model (reproduced below), which is calculated using the mathematical F distribution.

supernova(lm(Tip ~ FoodQuality, data = TipExperiment))
Analysis of Variance Table (Type III SS)
Model: Tip ~ FoodQuality

                              SS df      MS     F   PRE     p
----- --------------- | -------- -- ------- ----- ----- -----
Model (error reduced) |  525.576  1 525.576 4.428 .0954 .0414
Error (from model)    | 4985.401 42 118.700                  
----- --------------- | -------- -- ------- ----- ----- -----
Total (empty model)   | 5510.977 43 128.162 
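
For reference, the p-value in the table can be reproduced from the mathematical F distribution with base R’s pf() function, which returns the probability of getting an F as large as or larger than 4.428 if the empty model were true (a sketch using the values shown in the table above):

# probability of an F of 4.428 or larger with 1 and 42 degrees of freedom;
# should come out to about .041
pf(4.428, df1 = 1, df2 = 42, lower.tail = FALSE)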

Thus, it is unlikely (about a 4% chance), but not impossible, that an F ratio as large as the one produced by the FoodQuality model could have been generated if the empty model were true.

Does FoodQuality Cause an Increase in Tip?

There appears to be a relationship between the average rating of food quality and the amount that a table tips. And the p-value of .041 indicates that the observed relationship would be unlikely to have appeared in the data if the empty model were true. But should the low p-value give us confidence that higher food quality actually caused the increase in Tip?

Causality cannot be inferred purely based on the results of statistical analysis; we must also take into account the research design.

When we investigated the effect of a smiley face on Tip, we had the advantage of random assignment: each table at the restaurant was randomly assigned to Condition, to either get a smiley face or not. This meant that any differences in tips could be attributed to either the Condition (the one thing the researchers varied across conditions) or the result of randomness. A low p-value would have supported the inference that the smiley faces caused higher tips, because it would rule out randomness.

Analyzing the relationship between FoodQuality and Tip is different because this study used a correlational design where researchers measured FoodQuality without randomly assigning tables to levels of it. Without random assignment, other factors not measured by the researchers might have influenced both FoodQuality and Tip. For example, if tables with higher food quality ratings were celebrating special occasions, the celebration might have influenced both the tips and the perceived food quality.

Even in a correlational study, a low p-value can help rule out randomness as the only cause of variation in tips. But it cannot tell us that the variable FoodQuality is therefore a cause of variation in Tip. The observed effect could be due to FoodQuality or other unmeasured variables, so a low p-value does not necessarily indicate a causal relationship. A low p-value in the context of a correlational design can rule out the empty model of the DGP, but leaves many other models that could have caused this pattern of variation in tips.
