
Statistics and Data Science: A Modeling Approach

3.3 The Data Generating Process

We can learn a lot by examining distributions of data. But our interest usually goes beyond the data, to the Data Generating Process (or DGP). We are generally looking at data because we want to find out something about the way the world works—something that is hard to see because there is so much variation in the world.

Most statistics textbooks distinguish between the sample and the population. In fact, these are the first two types of distributions included in what we refer to as the Distribution Triad (we will introduce the third distribution much later in this course). Our data come from a sample: the units we actually selected and on which we collected our measures. We can look at sample distributions by putting our data in visualizations such as histograms and boxplots. Because actual data are always from a sample, we will use the phrases “sample distribution” and “distribution of data” interchangeably.

But our interest is generally not in the distribution of data itself, but in the population from which the data were drawn. We study a sample because we want to generalize to a population. In this course we dig a little deeper into the population. Not only do we want to generalize from our data to the population, but our real interest is in understanding the processes that produced the variation in the population itself, and then in the data. This is what we refer to as the Data Generating Process (DGP).

If our answer to the question, “Why does our sample distribution look the way it does?” is just “Because that’s the way the population distribution looks,” it’s not very satisfying. What we really want to know is: Why does the population distribution look like that? The answer to this question gets at the DGP. We want you to develop a mental habit of always asking yourself: what might the process be that could have generated a distribution of data that looks like this?

Whether we are examining the distribution of a single variable (like we are in this chapter), or the relationships among variables (like in the next chapter), we always want to be digging deeper, trying to understand what could have produced the variation we see in our data.

Here’s a simple example. The histogram below shows the distribution of 60,000 waiting times at a bus stop on the corner of Fifth Avenue and 97th Street in New York City [source].

Answering a question like “What process could have generated this distribution?” requires going far beyond just the information in the histogram. You need to imagine yourself waiting at a bus stop, and think about why you got there when you did. You need to bring to bear your knowledge about bus systems and how they work. What causes a bus to arrive when it does?

From the histogram you can see that most people wait just a short time for the bus, while some people end up waiting longer times. This makes sense. Buses have schedules, and because many of the passengers are regulars, they roughly know when the bus will come and try to get to the bus stop just before it comes.

Consider again the passengers who know the bus schedule well. If they just miss the bus, arriving right after it leaves, they will end up waiting the longest, until the next bus comes.
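One way to check whether a story like this could produce such a distribution is to simulate it. The sketch below is in Python and rests entirely on made-up assumptions, not on the actual bus data: a bus every 10 minutes (`HEADWAY_MIN`), 70% of riders being regulars (`SHARE_REGULARS`), and regulars aiming to arrive about 2 minutes before the bus, give or take. All names and numbers are hypothetical. It generates 60,000 waiting times and prints a rough text histogram.

```python
import random

HEADWAY_MIN = 10.0     # assumed minutes between buses (hypothetical)
SHARE_REGULARS = 0.7   # assumed share of riders who know the schedule (hypothetical)


def simulate_wait(rng):
    """Return one simulated waiting time in minutes under the assumed DGP."""
    if rng.random() < SHARE_REGULARS:
        # Regulars aim to show up about 2 minutes before the bus, with some noise.
        arrival = HEADWAY_MIN - 2.0 + rng.gauss(0.0, 1.5)
    else:
        # Everyone else shows up at a random moment between two buses.
        arrival = rng.uniform(0.0, HEADWAY_MIN)
    # Buses leave at multiples of the headway. A rider who arrives just after
    # a bus leaves wraps around and waits almost a full headway.
    return (HEADWAY_MIN - arrival) % HEADWAY_MIN


def simulate_waits(n, seed=1):
    """Simulate n waiting times with a fixed seed so the run is repeatable."""
    rng = random.Random(seed)
    return [simulate_wait(rng) for _ in range(n)]


def bin_counts(waits, width=1.0):
    """Count waiting times in bins of the given width (in minutes)."""
    n_bins = int(HEADWAY_MIN / width)
    counts = [0] * n_bins
    for w in waits:
        idx = min(int(w // width), n_bins - 1)
        counts[idx] += 1
    return counts


waits = simulate_waits(60_000)
counts = bin_counts(waits)

# Rough text histogram: one '#' per 500 simulated riders.
for i, c in enumerate(counts):
    print(f"{i:2d}-{i + 1:2d} min | {'#' * (c // 500)}")
```

If the simulated histogram has roughly the same shape as the real one (many short waits, plus a smaller group of long waits from riders who just missed the bus), then the story is at least a plausible DGP. It does not show that the story is the true one; a different set of assumptions might fit just as well.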

Population, the Result of the DGP Over a Long Period of Time

The term “population” has some limitations. If you are taking a sample of likely voters in order to predict an election result, you can imagine the complete population being “out there,” just waiting to be sampled (or not). But for people waiting at a bus stop, the population is constantly shifting.

In cases like this, it makes a lot more sense to think of the population as the result of a process or of many processes—what we refer to in aggregate as the Data Generating Process (DGP). You could think of the DGP as a lot of causal factors, each with some attached probability of occurrence, that produce the population distribution as they play out over time.

Data are concrete, in hand. The DGP, on the other hand, is unknown; we can’t see it directly. We can get clues from data as to what the population produced by the DGP might eventually look like, but we can never get a perfect understanding of the DGP.
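The idea of a population as the long-run result of a DGP can also be sketched in code. In the Python example below, the causal factors, their probabilities, and the minutes they add are all invented for illustration (`FACTORS`, `BASE_WAIT`, and `generate_wait` are hypothetical names). We let the process run for a long time to stand in for the population, then draw a few small samples of 50 waits each, which play the role of the data we would actually have in hand.

```python
import random
import statistics

# Hypothetical causal factors: (probability of occurring, minutes added to the wait).
FACTORS = [
    (0.30, 4.0),  # heavy traffic delays the bus
    (0.10, 8.0),  # the rider just misses the previous bus
    (0.05, 6.0),  # a full bus skips the stop
]
BASE_WAIT = 2.0   # assumed wait in minutes when nothing goes wrong


def generate_wait(rng):
    """Generate one waiting time by letting each causal factor occur or not."""
    wait = BASE_WAIT + rng.expovariate(1.0)  # small everyday variation
    for prob, minutes in FACTORS:
        if rng.random() < prob:
            wait += minutes
    return wait


rng = random.Random(42)

# Stand-in for the population: the DGP playing out over a long period of time.
long_run = [generate_wait(rng) for _ in range(200_000)]
long_run_mean = statistics.fmean(long_run)

# Stand-in for data: five small samples, each produced by the same DGP.
sample_means = [
    statistics.fmean([generate_wait(rng) for _ in range(50)])
    for _ in range(5)
]

print(f"long-run mean wait: {long_run_mean:.2f} minutes")
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean wait: {m:.2f} minutes")
```

Each sample mean lands near the long-run mean but not exactly on it, and the samples differ from one another. In a simulation we can read the DGP directly from the code; with real data we only ever see the samples, which is why they give us clues rather than a perfect picture.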