Chapter 10: The Logic of Inference

10.1 The Problem of Inference

In previous chapters, you learned how to specify and fit statistical models to data, and to use GLM notation to represent those models (e.g., \(Y_i=b_0+b_1X_i+e_i\)). Such a model is the best-fitting model of the data, but data don’t always accurately represent the Data Generating Process (DGP).

What we really care about is the best model of the DGP (e.g., \(Y_i=\beta_{0}+\beta_{1}X_i+\epsilon_i\)). The complex model always fits the sample data at least as well as the empty model, but is it a better model of the DGP than the empty model? What are the true values of \(\beta_{0}\) and \(\beta_{1}\)?

Unfortunately, we can’t directly calculate the parameters in the DGP. We can estimate the parameters with \(b_0\) and \(b_1\), but we don’t know how accurate our estimates are. In the next few chapters, we will discuss how to infer what’s true about the DGP based on models we have estimated from a sample of data.
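To make the gap between estimates and parameters concrete, here is a minimal sketch in base R. It assumes a made-up DGP in which we get to choose \(\beta_0\), \(\beta_1\), and the spread of the errors; the specific values, the sample size, and the variable names are all illustrative assumptions, not values from this book. With real data we would never know the parameters, which is exactly the problem.

```r
# A hypothetical DGP: Y_i = beta_0 + beta_1 * X_i + epsilon_i
# All numbers below are assumed for illustration only.
set.seed(10)
beta_0 <- 60   # assumed true intercept
beta_1 <- 5    # assumed true difference between the two groups
sigma  <- 8    # assumed standard deviation of the errors
n      <- 40   # sample size (20 per group)

x <- rep(c(0, 1), each = n / 2)                      # two groups, coded 0 and 1
y <- beta_0 + beta_1 * x + rnorm(n, mean = 0, sd = sigma)

fit <- lm(y ~ x)         # best-fitting model of this one sample
estimates <- coef(fit)   # b_0 (intercept) and b_1 (coefficient on x)
print(estimates)
```

The printed \(b_0\) and \(b_1\) will usually land near, but not exactly on, the values we chose for \(\beta_0\) and \(\beta_1\).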

How to close the gap between our data and the DGP is often referred to as the problem of statistical inference. We’ve explored this problem of inference informally before. We know that the same DGP can produce a lot of different samples. Similarly, it’s hard to look at a sample and know exactly what DGP it came from. In these chapters, we explore solutions to this problem, outlining the logic of statistical inference, and what we have to gain from it.

A New Concept: Sampling Distribution

Key to solving the problem of inference is a new and important concept, one that lets us see how much different samples from the same DGP could vary, and how much the parameter estimates calculated from those samples could vary. You can think of these many parameter estimates as forming a new kind of distribution called a sampling distribution.

Up to this point in the book, we have explored two kinds of distributions: the distribution of data (the sample), and the distribution of the DGP (also called the population). The sampling distribution is the third leg of what we call the Distribution Triad: the distribution of an estimate across many possible samples, of equal size, from a given DGP.

Samples and populations are made up of objects you can measure (for example, thumb lengths or heights of students). Sampling distributions, in contrast, are made up of parameter estimates that you could calculate based on different samples of data from the same DGP (for example, a bunch of means or a whole distribution of \(b_1\)s). In this chapter we will focus on the sampling distribution of \(b_1\), the estimate of \(\beta_{1}\).
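One way to picture a sampling distribution of \(b_1\) is to simulate it. The sketch below again assumes a made-up DGP with parameter values we chose ourselves; the helper function `simulate_b1()` and the name `sdo_b1` are hypothetical names introduced only for this illustration. Each call draws one sample of the same size, fits the model, and keeps only the estimate \(b_1\).

```r
# Sketch: build a sampling distribution of b_1 from an assumed DGP.
set.seed(20)

# Hypothetical helper: draw one sample of size n and return its b_1.
simulate_b1 <- function(n, beta_0, beta_1, sigma) {
  x <- rep(c(0, 1), each = n / 2)                 # two equal-sized groups
  y <- beta_0 + beta_1 * x + rnorm(n, sd = sigma) # data produced by the DGP
  unname(coef(lm(y ~ x))[2])                      # keep only the estimate b_1
}

# 1000 samples of equal size from the same DGP give 1000 values of b_1.
sdo_b1 <- replicate(1000, simulate_b1(n = 40, beta_0 = 60, beta_1 = 5, sigma = 8))

print(mean(sdo_b1))   # center of the simulated sampling distribution
print(sd(sdo_b1))     # how much b_1 varies from sample to sample
```

Note that each element of `sdo_b1` is an estimate, not a measurement of an individual object; a call such as `hist(sdo_b1)` would plot the distribution.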

Rejecting the Empty Model: The Basic Idea

When we see a difference between two groups in the data we may be tempted to conclude that there is also a difference between the two groups in the DGP. That is, when \(b_1\) is some number that is not zero, we might be fooled into thinking that \(\beta_1\) is also not zero. The problem with that thinking is that even an empty model of the DGP, in which \(\beta_1=0\), can produce samples that have a difference.

The basic idea, which will unfold in this chapter, will require you to use your hypothetical thinking skills. We must ask: if the empty model is true, how likely would we be to observe the sample \(b_1\) that we found in our data? We’ll do this by using R to create a DGP in which \(\beta_1=0\), and letting it produce multiple samples of data. We’ll take a look at the \(b_1\)s calculated from these multiple samples of simulated data and examine whether the \(b_1\) from our real data looks similar to the simulated \(b_1\)s or not.
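As a rough preview of this idea, the sketch below simulates many samples from an assumed DGP in which \(\beta_1=0\) and compares a sample \(b_1\) against them. Everything here is an assumption made for illustration: the value of `observed_b1`, the other DGP settings, and the helper `simulate_null_b1()` are hypothetical, and this may not be the method used later in the chapter.

```r
# Sketch: what b_1s does a DGP with beta_1 = 0 produce?
set.seed(30)

observed_b1 <- 6   # hypothetical sample estimate, not from real data

# Hypothetical helper: one sample from a DGP where the groups do not differ.
simulate_null_b1 <- function(n, beta_0, sigma) {
  x <- rep(c(0, 1), each = n / 2)
  y <- beta_0 + 0 * x + rnorm(n, sd = sigma)   # beta_1 is fixed at 0
  unname(coef(lm(y ~ x))[2])
}

null_b1s <- replicate(1000, simulate_null_b1(n = 40, beta_0 = 60, sigma = 8))

# Even though beta_1 = 0, the simulated b_1s are rarely exactly 0.
print(range(null_b1s))

# Proportion of simulated b_1s at least as far from 0 as the observed one.
prop_as_extreme <- mean(abs(null_b1s) >= abs(observed_b1))
print(prop_as_extreme)
```

If only a small proportion of the simulated \(b_1\)s are as far from zero as the observed one, the observed sample does not look much like the samples the empty model tends to produce.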
