## 12.9 Using the Sampling Distribution of F

We look at the x-axis (which represents values of F) to see where the sample F would fall. The sample F of 21.364 (the F in the Model row of the ANOVA table for this model) would be far from the bulk of the Fs in the sampling distribution, which are mostly between 0 and 5.

Let’s calculate the p-value from the shuffled sampling distribution of F using tally(). The result should be similar to the p-value in the Model row of the ANOVA table.

    require(coursekata)

    Smallville <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTUey0jLO87REoQRRGJeG43iN1lkds_lmcnke1fuvS7BTb62jLucJ4WeIt7RW4mfRpk8n5iYvNmgf5l/pub?gid=1024959265&single=true&output=csv")
    Smallville$Neighborhood <- factor(Smallville$Neighborhood)
    Smallville$HasFireplace <- factor(Smallville$HasFireplace)

    # this calculates sample_f
    sample_f <- f(PriceK ~ Neighborhood + HomeSizeK, data = Smallville)

    # this generates a sampling distribution of Fs
    sdof <- do(1000) * f(shuffle(PriceK) ~ Neighborhood + HomeSizeK, data = Smallville)

    # use tally to calculate the p-value from the sdof
    # remember to set the format as proportion
    tally(~ f > sample_f, data = sdof, format = "proportion")
    f > sample_f
     TRUE FALSE
        0     1
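To make the shuffling logic concrete, here is a minimal base-R sketch of the same idea. It does not use the coursekata functions or the Smallville data: R's built-in `mtcars` data set stands in, with `mpg ~ am + wt` as an assumed stand-in model, and the helper `model_f()` is a hypothetical name introduced only for this illustration.

```r
# Stand-in data: mtcars ships with base R (this is NOT the Smallville data)
cars <- mtcars
cars$am <- factor(cars$am)

# Hypothetical helper: overall F for the two-predictor model,
# i.e. the F that would appear in the Model row of an ANOVA table
model_f <- function(data) {
  unname(summary(lm(mpg ~ am + wt, data = data))$fstatistic["value"])
}

sample_f <- model_f(cars)

# Shuffle the outcome 1000 times to mimic a DGP in which the empty model is true
set.seed(1)
shuffled_fs <- replicate(1000, {
  shuffled <- cars
  shuffled$mpg <- sample(shuffled$mpg)
  model_f(shuffled)
})

# p-value: proportion of shuffled Fs more extreme than the sample F
p_value <- mean(shuffled_fs > sample_f)
p_value
```

A proportion of 0 out of 1000 shuffles does not mean the probability is exactly zero; it means no shuffled F exceeded the sample F, so the p-value is below roughly 1 in 1000.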

### Finding the p-value in the ANOVA Table

If we check our ANOVA table (printed below), the value we got from tally() (0) matches the p-value in the Model row, the first row of the p column (.0000).

    Analysis of Variance Table (Type III SS)
    Model: PriceK ~ Neighborhood + HomeSizeK

                                          SS df        MS      F    PRE     p
    ------------ --------------- | --------- -- --------- ------ ------ -----
           Model (error reduced) | 22254.020  2 11127.010 21.364 0.5957 .0000
    Neighborhood                 | 16832.423  1 16832.423 32.319 0.5271 .0000
       HomeSizeK                 | 10471.705  1 10471.705 20.106 0.4094 .0001
           Error (from model)    | 15103.892 29   520.824
    ------------ --------------- | --------- -- --------- ------ ------ -----
           Total (empty model)   | 37357.912 31  1205.094


You might have noticed there are a few different p-values in this ANOVA table – how do we know to check the first row for the p-value? The Model row corresponds to the model comparison between the multivariate model and the empty model. The other two p-values (in the Neighborhood and HomeSizeK rows) represent comparisons between the model with and without that variable. We’ll delve into those model comparisons in the next pages.
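The three comparisons described above can be sketched with base R's `anova()` on nested `lm()` fits. This is only an illustration under stated assumptions: it uses the built-in `mtcars` data (not Smallville) with `mpg ~ am + wt` as a stand-in model, and the object names are made up for the example.

```r
# Stand-in data: mtcars ships with base R (this is NOT the Smallville data)
cars <- mtcars
cars$am <- factor(cars$am)

empty_model <- lm(mpg ~ 1, data = cars)        # no explanatory variables
multi_model <- lm(mpg ~ am + wt, data = cars)  # both explanatory variables
without_am  <- lm(mpg ~ wt, data = cars)       # drops am only
without_wt  <- lm(mpg ~ am, data = cars)       # drops wt only

# "Model" row: multivariate model versus the empty model
model_row <- anova(empty_model, multi_model)

# Variable rows: multivariate model versus the model without that variable
am_row <- anova(without_am, multi_model)
wt_row <- anova(without_wt, multi_model)

# F and p for each comparison (second row of each anova table)
model_row[2, c("F", "Pr(>F)")]
am_row[2, c("F", "Pr(>F)")]
wt_row[2, c("F", "Pr(>F)")]
```

Each call compares a simpler model with a more complex one, so the three F values answer three different questions even though they come from the same data.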

From the Model p-value, we see that our sample F is very unlikely to have been generated by the empty model of the DGP. The p-value is so small that we would report it as p < .001.
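As a quick arithmetic check, the Model row can be reproduced from the sums of squares and degrees of freedom printed in the ANOVA table. The sketch below copies those numbers from the table; the variable names are made up, and it assumes the table's p comes from the F distribution with 2 and 29 degrees of freedom, which is why `pf()` is used here rather than shuffling.

```r
# Values copied from the ANOVA table for PriceK ~ Neighborhood + HomeSizeK
ss_model <- 22254.020; df_model <- 2
ss_error <- 15103.892; df_error <- 29
ss_total <- 37357.912

ms_model <- ss_model / df_model   # mean square for the model
ms_error <- ss_error / df_error   # mean square for error

f_model   <- ms_model / ms_error  # F = MS model / MS error
pre_model <- ss_model / ss_total  # PRE = SS model / SS total

# Upper-tail probability of an F this large under the empty model
p_model <- pf(f_model, df_model, df_error, lower.tail = FALSE)

round(f_model, 3)    # 21.364, as in the table
round(pre_model, 4)  # 0.5957, as in the table
p_model < .001       # TRUE
```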

This small p-value suggests that the empty model is unlikely to generate an F as extreme as the one we got from fitting our multivariate model. Thus we reject the empty model (in which both $$\beta_1$$ for Neighborhood and $$\beta_2$$ for HomeSizeK are equal to 0) in favor of the multivariate model (in which $$\beta_1$$ and $$\beta_2$$ are not both 0).

### Confidence Intervals with the Multivariate Model

With mere sample data and the power of simulations, we have been able to rule out the empty model as a model of the DGP. After concluding that $$\beta_1$$ and $$\beta_2$$ are not both equal to 0, we might wonder: what are they?

This is where confidence intervals, a form of parameter estimation, come in. Just as before, we can use confint() to estimate the lowest and highest $$\beta$$s that could still reasonably produce our sample. Try it in the code block below.

    require(coursekata)

    Smallville <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTUey0jLO87REoQRRGJeG43iN1lkds_lmcnke1fuvS7BTb62jLucJ4WeIt7RW4mfRpk8n5iYvNmgf5l/pub?gid=1024959265&single=true&output=csv")
    Smallville$Neighborhood <- factor(Smallville$Neighborhood)
    Smallville$HasFireplace <- factor(Smallville$HasFireplace)

    # this saves the multivariate model
    multi_model <- lm(PriceK ~ Neighborhood + HomeSizeK, data = Smallville)

    # calculate confidence intervals for all parameters
    confint(multi_model)
                              2.5 %    97.5 %
    (Intercept)            87.95358 266.54808
    NeighborhoodEastside -115.07739 -17.35826
    HomeSizeK              27.15159 108.54819


We can interpret these confidence intervals in the same way we did before. Each interval is the range of plausible alternative values of a parameter: for $$\beta_1$$, it contains the values for which our sample $$b_1$$ would fall within the middle 95% of the sampling distribution and so would not be judged unlikely. The same reasoning applies to $$\beta_2$$ and to any other $$\beta$$s that might be in a more complex model.

We have previously specified our multivariate model as $$Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \epsilon_i$$. The confidence intervals tell us a range of plausible values for $$\beta_0$$, $$\beta_1$$, and $$\beta_2$$.

Because the confidence intervals for $$\beta_1$$ and $$\beta_2$$ do not include 0, we are 95% confident that there is some effect of Neighborhood and of HomeSizeK on PriceK in the DGP. The fact that 0 is not included in these intervals tells us that each variable makes some contribution, and the intervals themselves suggest a range for how big that contribution might be.
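For readers curious about what confint() does with an lm() fit, the sketch below rebuilds the intervals by hand as estimate plus or minus a critical t value times the standard error, then checks each interval for 0. It is an illustration only: it uses the built-in `mtcars` data with `mpg ~ am + wt` as a stand-in for the Smallville model, and the object names are made up for the example.

```r
# Stand-in data: mtcars ships with base R (this is NOT the Smallville data)
cars <- mtcars
cars$am <- factor(cars$am)
multi_model <- lm(mpg ~ am + wt, data = cars)

# Parameter estimates and their standard errors
coef_table <- coef(summary(multi_model))
estimates  <- coef_table[, "Estimate"]
std_errors <- coef_table[, "Std. Error"]

# Critical t value for a 95% interval, using the error degrees of freedom
t_crit <- qt(0.975, df = df.residual(multi_model))

by_hand <- cbind(lower = estimates - t_crit * std_errors,
                 upper = estimates + t_crit * std_errors)
built_in <- confint(multi_model)

# TRUE where an interval lies entirely above or entirely below 0
excludes_zero <- built_in[, 1] > 0 | built_in[, 2] < 0

by_hand
built_in
excludes_zero
```

An interval that excludes 0 supports keeping that parameter in the model; an interval that includes 0 means a value of 0 is still among the plausible values for that parameter.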