Statistics and Data Science: A Modeling Approach
11.7 The Problem of Simultaneous Comparisons
When we move from the overall F test of the three-group model to the individual F tests of the three two-group models, we create a new problem: the problem of simultaneous inference.
It works like this: when we do an F test, we decide that we will reject the empty model if the probability of getting an F as large as the one we observed is less than .05, assuming that the empty model is true. As we have discussed, setting our alpha level at .05 holds the probability of a Type I error to .05. Notice that if we keep living this way, doing lots of F tests in situations where the empty model is true, we will be wrong, on average, one out of every 20 times (that is, .05 of the time), rejecting the empty model when, in fact, the empty model is true.
This is okay if all we care about is a single F test. But if we do three F tests simultaneously (as we do when we try to test the simple effects of each pair of means), we really want to know not that there is a .05 chance of each one being wrong, but that the overall chance of any of them being wrong is less than .05.
Probability theory would tell us that if each of the three simple effects has a .05 chance of being wrong, the probability that at least one of the three tests is wrong is actually quite a bit higher. (This is a bit of an aside, but if we treat the three tests as independent, it’s 1 minus the probability that all three tests are not wrong: \(1-.95^3=.14\).) Our probability of rejecting the empty model when we shouldn’t (Type I error) shoots up to .14 (yikes!) when all three of the individual tests are taken as a whole.
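The arithmetic in the aside can be checked with a few lines of Python. This is only a minimal sketch: the function name `familywise_error_rate` is ours, not something from the text, and the formula assumes the tests are independent, which is only an approximation for pairwise comparisons made on the same data.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one Type I error across n_tests tests,
    assuming (for illustration) that the tests are independent."""
    return 1 - (1 - alpha) ** n_tests

# One test, the three pairwise tests from the text, and a few larger families.
for n_tests in (1, 3, 6, 10):
    rate = familywise_error_rate(0.05, n_tests)
    print(f"{n_tests:2d} tests at alpha = .05 -> {rate:.3f}")
```

With three tests this gives about .14, and the rate keeps climbing as more comparisons are added.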
There are a number of ways to correct for this problem. The simplest is called the Bonferroni adjustment, named after the gentleman who proposed it. If you want to make sure that your overall error rate is held to .05, you simply divide your acceptable alpha (.05) by the number of simultaneous inferences you want to make (in this case, 3). This would mean that in order to be 95% confident that all three of your simultaneous model comparisons are correct, you would need to set your alpha level at .05 divided by 3, or .0167.
If our alpha level were .0167, then the probability that at least one of the three tests is wrong returns to about .05 (if you want to see the math, 1 minus the probability that all three tests are not wrong becomes \(1 - (1 - .0167)^3 \approx .049\), which is pretty close to .05).
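To see the adjustment at work, here is a small simulation sketch. It rests on a simplifying assumption that is ours, not the text's: the empty model is true for every test and the tests are independent, so each p-value can be stood in for by a uniform random draw between 0 and 1. The function names are hypothetical.

```python
import random

def bonferroni_alpha(family_alpha, n_tests):
    """Per-test alpha under the Bonferroni adjustment."""
    return family_alpha / n_tests

def simulate_family_error(per_test_alpha, n_tests, n_families=100_000, seed=1):
    """Proportion of simulated families with at least one false rejection.

    Assumption: the empty model is true for every test and the tests are
    independent, so each p-value is a uniform draw between 0 and 1.
    """
    rng = random.Random(seed)
    families_with_error = 0
    for _ in range(n_families):
        if any(rng.random() < per_test_alpha for _ in range(n_tests)):
            families_with_error += 1
    return families_with_error / n_families

adjusted = bonferroni_alpha(0.05, 3)
print(f"Bonferroni per-test alpha:    {adjusted:.4f}")
print(f"Unadjusted family error rate: {simulate_family_error(0.05, 3):.3f}")
print(f"Adjusted family error rate:   {simulate_family_error(adjusted, 3):.3f}")
```

The unadjusted rate should land near .14 and the adjusted rate just under .05. The Bonferroni adjustment does not actually require independence: it keeps the overall error rate at or below the target even when the tests are related, though it can then be more conservative than necessary.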
What is p-hacking?
Related to the problem of simultaneous inference is the problem referred to recently as p-hacking. Here’s the basic idea: Because of selection bias in the way studies are published, the probability that any particular published finding is just a Type I error could well be higher than .05.
If you are a scientist, you probably run many experiments, day after day, week after week. Each of these experiments results in a model comparison, an F test, and a decision as to whether the complex model (generally the one that supports your theory) should be accepted and the empty model rejected.
Every scientist wants to publish their findings. But in general, it’s hard to publish null results. A null result means you tried an experiment, manipulating some variable you hypothesized to affect the outcome, but found in the end that it didn’t have a strong enough effect to warrant rejection of the empty model. Most people aren’t that interested in null results. “We didn’t find anything” is not a very exciting storyline, and so null results are rarely published. Usually the scientist just puts them in a drawer and doesn’t even submit them for publication. Thus, only a biased selection of studies is published in journals (i.e., those with significant findings).
Here’s the problem: we know that if we set the alpha level at .05 we will erroneously reject the empty model in about one out of every 20 studies in which the empty model is true. An industrious scientist may do 20 studies in three or six months. That means that even if the empty model is true in every one of them, the scientist will, on average, produce a study with significant results every three to six months just by chance. If this study is submitted for publication and it gets published, it means that a study with erroneous results got published, misleading the field into thinking that the complex model is true.
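The scenario can be sketched as a simulation. As an illustrative assumption of ours, the empty model is true in every study, so each study's p-value is treated as a uniform random draw between 0 and 1; the function name and the number of simulated batches are hypothetical choices.

```python
import random

def count_false_positives(n_studies=20, alpha=0.05, rng=random):
    """Number of 'significant' studies out of n_studies, assuming the
    empty model is true in every study (so p-values are uniform)."""
    return sum(rng.random() < alpha for _ in range(n_studies))

rng = random.Random(2024)
n_batches = 50_000  # each batch stands for one scientist's 20 studies
counts = [count_false_positives(rng=rng) for _ in range(n_batches)]

mean_false_positives = sum(counts) / n_batches
share_with_any = sum(c > 0 for c in counts) / n_batches

print(f"Average false positives per 20 studies: {mean_false_positives:.2f}")
print(f"Batches with at least one:              {share_with_any:.3f}")
print(f"Theory, 1 - .95^20:                     {1 - 0.95 ** 20:.3f}")
```

Under these assumptions the average is about one false positive per batch of 20 studies, and roughly 64% of batches contain at least one. If only the significant studies are submitted, every submitted study in this simulated world is a Type I error.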
The consequence of this situation is that many studies we love because they support our favorite theory turn out, in the end, not to be true. When people try to replicate the study, it doesn’t replicate. The experimenter is embarrassed. Science is set back, having to come up with new theories that will replicate over time and across different laboratories.
It will be hard to convince journals to publish studies with null findings. But it would be great if scientists would find a way to describe all the studies they did that didn’t work out. It’s also critical that if you like a theory, believe in it, and want to extend it, you start out by making sure you can replicate the effect in your own lab. This is what good scientists do. Replication is a critical part of the work, and we should all appreciate it when someone replicates our own favorite research finding.