Statistics and Data Science: A Modeling Approach

4.10 Research Design

We now have a sense of what it means to explain variation in an outcome variable with an explanatory variable. We have also briefly seen that adding more explanatory variables can reduce the amount of variation that is left unexplained. But let’s now dig a little deeper into the meaning of the word “explained.”

The Problem of Inferring Causation

Thus far, we have used the word “explain” to mean that variation in one variable (the explanatory variable) can account for variation in another variable (the outcome variable). Put another way, the unexplained variation in the outcome variable is reduced when the explanatory variable is included in the model.
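To make this sense of “explain” concrete, here is a minimal sketch in Python. The six data values and the group labels are made up for illustration, and the helper name sum_of_squares is our own. It compares the variation left over when we predict every case with the overall mean against the variation left over when we predict each case with its own group mean.

```python
from statistics import fmean

# Hypothetical data: six cases, each with an outcome value and a
# two-level explanatory variable (none of these numbers come from the text).
outcome = [4.0, 5.0, 6.0, 9.0, 10.0, 11.0]
group = ["a", "a", "a", "b", "b", "b"]


def sum_of_squares(values, predictions):
    """Sum of squared differences between observed and predicted values."""
    return sum((v - p) ** 2 for v, p in zip(values, predictions))


# Without the explanatory variable: predict the grand mean for every case.
grand_mean = fmean(outcome)
ss_total = sum_of_squares(outcome, [grand_mean] * len(outcome))

# With the explanatory variable: predict each case's own group mean.
group_means = {
    g: fmean([v for v, label in zip(outcome, group) if label == g])
    for g in set(group)
}
ss_unexplained = sum_of_squares(outcome, [group_means[g] for g in group])

# Share of the total variation accounted for by the explanatory variable.
proportion_explained = (ss_total - ss_unexplained) / ss_total
print(ss_total, ss_unexplained, round(proportion_explained, 3))
```

For these made-up numbers, the sum of squares falls from 41.5 to 4.0 once group is taken into account. That drop is all we have meant so far by “explained”; nothing in the calculation says anything about cause.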

The problem with this definition of “explain” is that it sometimes does not result in an explanation that makes sense. In our attempt to make sense of the world, we usually are looking for a causal explanation for variation in an outcome variable. A variable may be related to another variable, but that doesn’t necessarily mean that the relationship is causal.

The examples above illustrate two specific problems we can run into as we try to understand the relationships among variables. First, the direction of the causal relationship may not be what we assume. Researcher B erroneously concluded that skimpy clothes cause temperatures to rise, based on a relationship between the two variables.

In one sense he is correct: people do tend to wear skimpier clothes in hotter weather. But he is undoubtedly wrong in his causal inference: it is not the wearing of skimpy clothing that causes hot weather, but instead hot weather that causes people to wear skimpy clothing. The fact that two variables are related does not in and of itself tell us what the direction of causality might be. We’ll call this the directionality problem.
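One way to see the directionality problem is to notice that a measure of relationship such as the correlation coefficient is symmetric: it gives the same number whichever variable we treat as the cause. The sketch below simulates data under an assumed causal story in which temperature drives clothing coverage. The variable names, the numbers, and the helper pearson_r are all hypothetical choices of ours.

```python
import random
from statistics import fmean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)

# Assumed causal story: temperature affects clothing coverage, never the reverse.
temperature = [rng.uniform(10, 35) for _ in range(500)]
coverage = [100 - 2 * t + rng.gauss(0, 8) for t in temperature]

r_temp_coverage = pearson_r(temperature, coverage)
r_coverage_temp = pearson_r(coverage, temperature)
print(round(r_temp_coverage, 3), round(r_coverage_temp, 3))
```

The two printed values are the same. Even though we built the data so that causation runs in only one direction, the relationship itself carries no record of that direction.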

A second problem is the problem of confounding. Researcher A has found a pattern: variation in shoe size is related to variation in academic achievement. But it is possible that neither of these two variables causes the other. Instead, there may be a confound (sometimes called a “lurking variable” or a third variable) that, though not measured in our study, nevertheless causes both shoe size and academic achievement.
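A small simulation can show how a confound produces a relationship between two variables that have no effect on each other. In this sketch we assume, purely for illustration, that the confound is a child’s age, and that age separately drives both shoe size and test score. The coefficients, sample size, and the helper pearson_r are hypothetical.

```python
import random
from statistics import fmean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)

# Assumed confound: age (in years) causes both shoe size and test score.
# Shoe size never appears in the formula for score, or vice versa.
ages = [rng.randint(6, 12) for _ in range(2000)]
shoe_size = [0.5 * a + rng.gauss(0, 1) for a in ages]
score = [10 * a + rng.gauss(0, 10) for a in ages]

# Across all children, shoe size and score are clearly related.
r_all = pearson_r(shoe_size, score)

# Among children of a single age, the relationship largely disappears.
nine_year_olds = [(s, t) for a, s, t in zip(ages, shoe_size, score) if a == 9]
r_within = pearson_r([s for s, _ in nine_year_olds], [t for _, t in nine_year_olds])

print(round(r_all, 2), round(r_within, 2))
```

The first correlation should come out clearly positive and the second close to zero. In a real observational study we might not have measured age at all, so we would see only the first number.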

Sometimes we may not care about causality. For example, if our goal is merely to use an explanatory variable to predict an outcome in a future sample, it doesn’t really matter if the relationship is causal; just based on the relationship itself, we can bet on the same pattern emerging in a future study.

But more often than not, we do want to know what the causal relationships are. If we want to understand why things turn out the way they do, we usually won’t be satisfied unless we have identified the causes of the outcome we are trying to understand.

We also will need to identify the variables that cause an outcome if our goal is to change the outcome (e.g., help children score higher, or impact climate change). No matter how good shoe size might be at predicting test scores, giving kids bigger shoes won’t help them score higher.

Indeed, being able to change one thing and see an effect on something else is our common-sense notion of what causality means. It’s one of the main ways we know when we truly understand a system.

Research Designs and Causality

The example cases portrayed above are intentionally silly. But in reality, we often do not know when there is a directionality problem (i.e., when you have two variables and you don’t know which is the cause and which is the effect) or a confounding problem. It’s something that researchers always need to worry about. Research design is our best tool for differentiating real causal relationships from spurious relationships (i.e., relationships that are not causal but that instead result from some unmeasured third variable).

The simplest research design is just taking a random sample from a population, and then measuring some variables. This kind of design is referred to as a correlational study, or an observational study. We don’t control any of the variables in this type of study; we just measure them. Sometimes it’s the best we can do, but it can make it difficult to interpret the results of our analyses, as we have seen.

If we really want to be sure that a relationship is causal, we generally will need to make a change in one variable, then observe the outcome in another. In fact, this is the way we judge causality in our everyday life. However, this is not as straightforward as it sounds, as pointed out long ago by R.A. Fisher, the father of experimental design.

For example, consider a server at a restaurant who thinks that customers in general should tip more generously. She has a hunch that if a table of customers gets a smiley face on the back of their check at the end of the meal, they will tip more. So she decides to give it a try. She draws a smiley face on a check and gets a huge tip! Has she proven a causal relationship?

Unfortunately, no. The particular table she chose for her experiment may have had generous tippers or the food might have been exceptionally good that day. The server might have gotten a larger tip anyway, even without drawing a smiley face on the check. And also: maybe it wasn’t the smiley face that caused the bigger tip, but just something about this particular server, excited to be running her first experiment!

To figure this out, we need to use an experimental design. Because we want our findings to generalize to lots of tables and lots of customers, we decide to study a sample of tables. Once we have a sample, we randomly assign each table to the experimental group or the comparison group. Tables in the experimental group get a smiley face on the back of their check, while those in the comparison group get a plain check.

Having launched our experiment, we simply measure the average tip per customer for the tables in the two groups. Because we manipulated the explanatory variable (drawing a smiley face or not), and because we randomly assigned tables to the two groups, we assume that if we do see a difference in tips between the two groups, it must be due to the explanatory variable.
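The design just described can be sketched as a short simulation. Everything numeric here is an assumption made for illustration: the number of tables, the distribution of tips (expressed as a percentage of the bill), and a built-in smiley face effect of 2 percentage points. In a real study we would not know the true effect; we simulate one only so we can check whether the design recovers it.

```python
import random
from statistics import fmean

rng = random.Random(42)

# Hypothetical sample of tables, each with the tip (as a percent of the
# bill) it would leave on a plain check.
n_tables = 1000
baseline_tip = [rng.gauss(15.0, 4.0) for _ in range(n_tables)]

# Random assignment: shuffle the table numbers and put the first half
# in the experimental (smiley face) group.
table_ids = list(range(n_tables))
rng.shuffle(table_ids)
smiley_ids = set(table_ids[: n_tables // 2])

# Assumed true effect of the smiley face, in percentage points.
TRUE_EFFECT = 2.0

smiley_tips, plain_tips = [], []
for table_id, tip in enumerate(baseline_tip):
    if table_id in smiley_ids:
        smiley_tips.append(tip + TRUE_EFFECT)
    else:
        plain_tips.append(tip)

# Compare the average tip in the two groups.
difference = fmean(smiley_tips) - fmean(plain_tips)
print(round(difference, 2))
```

The printed difference should land near the assumed effect of 2, though not exactly on it, because the two groups also differ by chance in how generous their tables happen to be.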

The Beauty of Random Assignment

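This might be a good place just to pause and extol the beauty of random assignment! First of all, random assignment is not the same as random sampling. Random sampling determines which cases from the population end up in our study; random assignment determines which group each case in the study ends up in. Both are random, but random assignment can accomplish so much more than simple random sampling.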

The beauty of random assignment lies in the fact that by randomly assigning cases in a study to one group or the other, we are in essence constructing two groups that are comparable to each other (although the cases within each group could be very diverse). Any difference between the two groups before the intervention is due to chance alone.

Because the two groups are assumed to be equivalent, except for differences due to random chance, we can infer that if some intervention (e.g., drawing a smiley face) leads to a difference in distributions across the groups (e.g., in the tips), the difference must be due to the smiley face, and not to other factors. (Later we will learn how to take random variation across the groups into account when making this inference.)

Provided we assign tables randomly to groups, generous tables are no more likely to be assigned to one group than to the other, and the same would be true for stingy tables. Random assignment helps us rule out confounding variables by ensuring that any variables that affect our outcome variable, whether positively or negatively, will balance each other out across groups.

This does not mean that the level of a confounding variable will be exactly the same in two groups, even if they are randomly assigned. But it does mean that the various confounding factors should balance each other out on average, so that any imbalance in a particular study is due only to chance.
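We can check this claim with a simulation. The sketch below gives 40 hypothetical tables a made-up “generosity” score, then randomly assigns them to two groups many times over, each time recording how far apart the two groups are in average generosity. The scores, the number of tables, the number of repetitions, and the function name are all our own assumptions.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(7)

# Hypothetical confound: a generosity score for each of 40 tables.
generosity = [rng.gauss(0.0, 1.0) for _ in range(40)]


def imbalance_after_random_assignment(scores, rng):
    """Randomly split scores into two equal groups; return the gap in group means."""
    shuffled = scores[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return fmean(shuffled[:half]) - fmean(shuffled[half:])


# Repeat the random assignment many times on the same 40 tables.
imbalances = [
    imbalance_after_random_assignment(generosity, rng) for _ in range(2000)
]

average_imbalance = fmean(imbalances)   # should be close to zero
typical_imbalance = pstdev(imbalances)  # size of a typical chance imbalance
print(round(average_imbalance, 3), round(typical_imbalance, 3))
```

Any single assignment leaves one group somewhat more generous than the other, which is why the second number is not zero. But the imbalances favor each group equally often, so their average across many assignments sits very close to zero.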

There will inevitably be variation in the outcome variable within each group. After all, a bunch of tables might give different sizes of tips depending on many factors. Later we will learn more about ways to model this within-group variation as a random process.