Statistics and Data Science: A Modeling Approach
5.1 Modeling a Distribution with a Single Number
Building on this concept of model, let’s now develop what we mean by a statistical model. Whereas in the previous section we were building a model to help us estimate the area of the state of California, we now want to build a model we can use to characterize a distribution.
As you will see in subsequent chapters, statistical models are very useful. We use them to summarize distributions. We use them to make predictions about what the next observation added to a sample distribution might be. And we use them to explain variation in one variable with another. But we will start with the simplest model, which uses a single number to characterize a distribution.
At its most basic level, a statistical model can be thought of as a function that produces a predicted score for each observation in a distribution. By “function” we don’t mean an R function; we mean a mathematical procedure for generating a number based on the data. The simplest models we consider generate the same predicted score for every observation in a distribution—a single number to characterize a whole distribution.
If you had to pick one number to represent an entire distribution, how would you pick it? And what would it be? Thought of in a different way: if you wanted to predict what the value of the next randomly chosen observation would be, what would be your best prediction?
Depending on how a variable is measured (e.g., quantitative or categorical), and on the shape of the distribution, we will use different procedures (or different functions) for choosing one number as a model.
For a quantitative variable whose distribution is roughly symmetrical and bell shaped, a number right in the middle might be the best-fitting model. (Remember, we aren’t saying that such a simple model is a good model—just better than nothing!) If a distribution is skewed left or right, the best model might be a number toward where the middle would be if you ignored the long tail on one side or the other. For a categorical variable, the best model is generally the category that is most frequent.
Model and Error
Let’s zero in on just distributions of quantitative variables. Take a look at the two distributions below for variables 1 and 2.
A single number—even a well-chosen number—is not a very good model. It may be a better model for variable 1 than variable 2 above, but it’s still not very good. Most scores are not the same as the number we choose as the model.
Which brings us to another important concept: once we choose a number to model a distribution (and we’ll talk soon about how we do that), we can think of the variation around that number as error, just as we considered the parts of California not covered by geometric shapes as error.
If we use a single number to model the distribution of a quantitative variable, error from the model can be seen as deviations of the observed scores from that predicted number. As we just saw, a one-number model for a distribution with less spread seems to have less error, and thus a better fit, than a one-number model for a distribution with more spread. The reason for this is that the error around the model is greater for the distribution with more spread.
The idea of modeling a distribution with a single number gives us a more concrete and detailed way of thinking about our models. Whereas we thought about the California example like this:
AREA OF CALIFORNIA = AREA OF GEOMETRIC FIGURES + ERROR
We can think of a statistical model like this:
DATA = MODEL + ERROR
Each data point in a distribution can be decomposed into two parts: the model (i.e., the number we are using to represent the whole distribution), and the data point’s deviation from the model (the error).