4.6 Calculating the P-Value for a Sample

To calculate the probability of getting a b1 in a particular region (e.g., greater than 6.05 or less than -6.05), we can simply calculate the proportion of b1s in the sampling distribution that fall in that region. In this way, we are using the simulated sampling distribution of 1000 b1s as a probability distribution.
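
As a reminder, sdob1 and sample_b1 were created on previous pages. A minimal sketch of that code, assuming the tipping-experiment data are in a data frame called TipExperiment and that the functions b1(), shuffle(), and do() are loaded as in earlier sections, might look like this:

# slope estimate from the actual sample (about 6.05 in the tipping experiment)
sample_b1 <- b1(Tip ~ Condition, data = TipExperiment)

# 1000 b1s generated from the empty model by shuffling the outcome
sdob1 <- do(1000) * b1(shuffle(Tip) ~ Condition, data = TipExperiment)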

We can use tally() to figure out how many simulated samples are more extreme than our sample b1. The first line of code tells us how many b1s are more extreme than our sample_b1 (6.05) on the positive side; the second line, how many are more extreme on the negative side (i.e., less than -6.05).

tally(~ b1 > sample_b1, data = sdob1)
tally(~ b1 < -sample_b1, data = sdob1)

(Coding note: R thinks of <- with no space between the two characters as an assignment operator; it’s supposed to look like an arrow. For the second line of code, you need to be sure to put a space between the < and - so R interprets it to mean less than the negative of sample_b1.)
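
To see the difference outside of tally(), here is a tiny illustration (x is just a throwaway variable introduced here):

x <- 10    # assignment: x gets the value 10 (the arrow made of < and -)
x < -5     # comparison: is x less than negative 5? Returns FALSE
x <-5      # looks like a comparison with -5, but R reads it as x <- 5, an assignment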

The two lines of tally() code will produce:

b1 > sample_b1
 TRUE FALSE
   38   962
b1 < -sample_b1
 TRUE FALSE
   41   959

When we add up the two tails (the extreme positive and extreme negative b1s), there are 79 b1s, roughly 80, that are more extreme than our sample b1.

Since there are about 80 randomly generated b1s (out of 1000) that are more extreme than our sample, we would say there is roughly a .08 likelihood of the empty model generating a b1 as extreme as 6.05. This probability is the p-value.
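
To spell out that arithmetic in code (upper and lower are just names introduced here for illustration; sum() counts the TRUEs in each logical comparison):

upper <- sum(sdob1$b1 > sample_b1)    # 38 b1s in the upper tail
lower <- sum(sdob1$b1 < -sample_b1)   # 41 b1s in the lower tail
(upper + lower) / 1000                # 0.079, roughly .08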

Instead of using two lines of tally() code (one to find the number of b1s at the upper extreme, the other at the lower extreme), we can use a single line of code like this:

tally(sdob1$b1 > sample_b1 | sdob1$b1 < -sample_b1)

Note the use of the | operator, which means or, to put the two criteria together: this code tallies up the total number of b1s that are either greater than positive 6.05 or less than negative 6.05. You can run the code in the code window below. Try adding the argument format = "proportion" to get the proportion or p-value directly.
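
For example, with the counts shown above (38 + 41 = 79 out of 1000), the TRUE proportion should come out to about .079:

# proportion of simulated b1s more extreme than the sample b1 (the p-value)
tally(sdob1$b1 > sample_b1 | sdob1$b1 < -sample_b1, format = "proportion")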

The p-value for the b1 in the tipping experiment was .08, which is greater than our alpha of .05. Therefore, we would say our sample is not unlikely to have been generated by this DGP. Thus, we would consider the empty model a plausible model of the DGP and would not reject it. Even a DGP in which there is no effect of smiley face can produce a b1 as extreme as our sample (or more extreme) about .08 of the time.

If our p-value had been less than .05, we might have declared our sample unlikely to have been generated by the empty model of the DGP, and thus rejected the empty model.

What It Means to Reject – or Not – the Empty Model (or Null Hypothesis)

The concept of p-value, and using it to decide whether or not to reject the empty model in favor of the more complex model we have fit to the data, comes from a tradition known as Null Hypothesis Significance Testing (NHST). The null hypothesis is, in fact, the same as what we call the empty model. It refers to a world in which β1=0.
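
In the modeling notation used in earlier chapters, with Tip as the outcome and Condition as the explanatory variable, the comparison can be written roughly like this:

Empty (null) model: Tip = β0 + error
Complex model: Tip = β0 + β1*Condition + error

The null hypothesis β1 = 0 simply says that the complex model reduces to the empty model.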

While we want you to understand the logic of NHST, we also want you to be thoughtful in your interpretation of the p-value. The NHST tradition has been criticized in recent years because it is often applied thoughtlessly, in a highly ritualized manner (see Gigerenzer, Krauss, & Vitouch, 2004). People who don't really understand what the p-value means may draw erroneous conclusions.

For example, we just decided, based on a p-value of .08, not to reject the empty model of Tip. But does this mean that β1 in the true DGP is actually equal to 0? No. It means only that β1 could be 0; the data are consistent with it being 0. But it could be something else instead.

It could, for example, be 6.05, which was the best-fitting estimate of β1 based on the sample data. If the true β1 were 6.05, then our sample b1 of 6.05 would certainly be among the b1s considered likely.

If both the empty model and the complex “best-fitting” model are possible true models of the DGP, how should we decide which model to use?

Some people from the null hypothesis testing tradition would say that if you cannot reject the empty model, then you should use the empty model. From this perspective, we should avoid Type I error at all costs: we don't want to say there is an effect of smiley face when there is not one in the DGP. In this tradition, making a Type I error is considered worse than making a Type II error (saying there is no effect when there is, in fact, an effect in the DGP).

But this might not be the best course of action in some situations. For example, if you simply want to make better predictions, you might decide to use the complex model, even if you cannot rule out the empty model. On the other hand, if your goal is to better understand the DGP, there is some value in having the simplest theory that is consistent with your data. Scientists call this preference for simplicity “parsimony.”

Judd, McClelland, and Ryan (some statisticians we greatly admire) once said that you just have to decide whether a model is “better enough to adopt.” Much of statistical inference involves imagining a variety of models that are consistent with your data and looking to see which ones will help you to achieve your purpose.

We prefer to think in terms of model comparison instead of null hypothesis testing. With too much emphasis on null hypothesis testing, you might think your job is done when you either reject or fail to reject the empty model. But in the modeling tradition, we are always seeking a better model: one that helps us understand the DGP, or one that makes better predictions about future events.
