College / Advanced Statistics with R (ABCD)
13.9 Using the Sampling Distribution of F
We look at the x-axis (which represents values of F) to figure out where the sample F would fall. A sample F of 17 would be pretty far away from the bulk of the Fs in the sampling distribution, which mostly fall between 0 and 5.
Let’s calculate the p-value from the shuffled sampling distribution of F using tally(). It should come out similar to the p-value in the model row of the ANOVA table.
require(coursekata)

# calculate the sample F for the multivariate model
sample_f <- f(PriceK ~ Neighborhood + HomeSizeK, data = Smallville)

# generate a sampling distribution of 1000 Fs by shuffling PriceK
sdof <- do(1000) * f(shuffle(PriceK) ~ Neighborhood + HomeSizeK, data = Smallville)

# use tally() to calculate the p-value from the sdof,
# with the format set as proportion
tally(~ f > sample_f, data = sdof, format = "proportion")
f > sample_f
 TRUE FALSE
    0     1
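The proportion of TRUE is 0 here because none of the 1000 shuffled Fs exceeded the sample F of about 17. The same tail-proportion logic can be sketched in base R; since shuffle() comes from coursekata, this sketch uses rf() as a stand-in for the shuffled sampling distribution:

```r
set.seed(42)

# stand-in for the shuffled sampling distribution: 1000 random Fs
# drawn from the F-distribution with 2 and 29 degrees of freedom
null_fs <- rf(1000, df1 = 2, df2 = 29)

sample_f <- 17.216  # the sample F from the ANOVA table

# p-value: proportion of null Fs at least as extreme as the sample F
p <- mean(null_fs > sample_f)
p
```

With an F this extreme, the proportion will almost always be 0: an F greater than 17 is so rare under the empty model that 1000 draws will rarely produce even one.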
Finding the p-value in the ANOVA Table
If we check our ANOVA table (printed below), the value we got from tally() (0) corresponds to the first row of the p column.
Model: PriceK ~ Neighborhood + HomeSizeK

                                SS df         MS      F    PRE     p
 ----- --------------- | ---------- -- ---------- ------ ------ -----
 Model (error reduced) | 124403.028  2  62201.514 17.216 0.5428 .0000
          Neighborhood |  27758.259  1  27758.259  7.683 0.2094 .0096
             HomeSizeK |  42003.677  1  42003.677 11.626 0.2862 .0019
 Error (from model)    | 104774.465 29   3612.913
 ----- --------------- | ---------- -- ---------- ------ ------ -----
 Total (empty model)   | 229177.493 31   7392.822
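As a quick sanity check on the table, F and PRE can be recomputed directly from the sums of squares and mean squares it reports (a base-R sketch using the numbers printed above):

```r
# values copied from the ANOVA table above
ss_model <- 124403.028   # SS for the Model (error reduced) row
ss_total <- 229177.493   # SS for the Total (empty model) row
ms_model <- 62201.514    # MS = SS / df for the Model row
ms_error <- 3612.913     # MS for the Error row

# PRE is the proportion of total variation explained by the model
pre <- ss_model / ss_total

# F is variance explained per model df relative to leftover variance per error df
f_stat <- ms_model / ms_error

round(pre, 4)     # 0.5428
round(f_stat, 3)  # 17.216
```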
You might have noticed there are a few different p-values in this ANOVA table – how do we know to check the first row for the p-value? The Model row corresponds to the model comparison between the multivariate model and the empty model. The other two p-values (in the Neighborhood and HomeSizeK rows) represent comparisons between the model with and without that variable. We’ll delve into those model comparisons in the next pages.
From the Model p-value, we see that our sample F is very unlikely to have been generated by the empty model of the DGP. The p-value is so small that we would say p < .001.
This small p-value suggests that the empty model is unlikely to generate an F as extreme as the one fit from our multivariate model. Thus we reject the empty model (in which both \(\beta_1\) for Neighborhood and \(\beta_2\) for HomeSizeK are equal to 0) in favor of the multivariate model (in which \(\beta_1\) and \(\beta_2\) are not 0).
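If you want the mathematical p-value instead of the shuffled one, base R’s pf() returns the upper-tail probability of the F-distribution. The degrees of freedom (2 for the model, 29 for error) come from the ANOVA table above:

```r
# upper-tail probability of F = 17.216 on 2 and 29 degrees of freedom
p_value <- pf(17.216, df1 = 2, df2 = 29, lower.tail = FALSE)

p_value < .001   # TRUE: consistent with the .0000 printed in the table
```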
Confidence Intervals with the Multivariate Model
With mere sample data and the power of simulations, we have been able to rule out the empty model as a model of the DGP. After concluding that \(\beta_1\) and \(\beta_2\) are not both equal to 0, we might wonder: what are they?
This is where confidence intervals (also called parameter estimation) come in. Just as before, we can use confint() to estimate the lowest and highest \(\beta\)s that could still reasonably have produced our sample. Try it in the code block below.
require(coursekata)

# save the multivariate model
multi_model <- lm(PriceK ~ Neighborhood + HomeSizeK, data = Smallville)

# calculate confidence intervals for all parameters
confint(multi_model)
                          2.5 %     97.5 %
(Intercept)            87.95358  266.54808
NeighborhoodEastside -115.07739  -17.35826
HomeSizeK              27.15159  108.54819
We can interpret these confidence intervals the same way we did before: we imagine many different plausible alternative values of \(\beta_1\), and the interval contains those values that could reasonably have produced the sample \(b_1\) (with 95% confidence). We can do the same for \(\beta_2\) and any other \(\beta\)s that might be in our complex model.
We have previously specified our multivariate model as \(Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \epsilon_i\). The confidence intervals tell us a range of plausible values for \(\beta_0\), \(\beta_1\), and \(\beta_2\).
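Under the hood, each interval is just the parameter estimate plus or minus a t critical value times the estimate’s standard error. Here is a base-R sketch of that arithmetic for the HomeSizeK parameter; the estimate and standard error below are illustrative values back-calculated from the interval printed above, not taken from the model output:

```r
# illustrative values for the HomeSizeK parameter:
# estimate is the midpoint of the interval above; SE is back-calculated
b2 <- 67.85    # approximate estimate for HomeSizeK
se <- 19.90    # approximate standard error

# t critical value for a 95% interval with 29 error df
t_crit <- qt(.975, df = 29)

# lower and upper bounds: estimate +/- t * SE
c(b2 - t_crit * se, b2 + t_crit * se)
# close to the 27.15 and 108.55 bounds printed above
```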
Because these confidence intervals do not include 0, we are 95% confident that there is some effect of Neighborhood and HomeSizeK on PriceK in the DGP. The fact that 0 is not included in the intervals tells us that there is some contribution, and the confidence intervals themselves suggest a range for how big that contribution might be.