2.5 The Mean as a Model

Earlier, we wrote the following informal model for home prices in the Ames data frame:

PriceK = Neighborhood + Other Stuff

How good is this model? Is it better than a model that doesn’t include Neighborhood? If so, how much better? (And what, exactly, is a model, anyway?)

To get a sense of how good a model is, we need a point of comparison: a simple model that we’d like to improve upon. We also need a way to quantify how good a model is, and how much better it is than a different, simpler model. Informal word equations won’t be good enough for this.

Our First Statistical Model

One very simple model that we can use for comparison is the mean. You may be thinking to yourself: “Wait, the mean is a model?” It turns out, yes it can be! We can write a word equation to represent this simple model like this:

PriceK = Mean + Other Stuff

This is our first statistical model. A statistical model is a function that generates a predicted value on an outcome variable for each observation in a data set. This particular model uses the mean to predict the PriceK for every home in the Ames data frame.

Models like the mean have no explanatory variables (like neighborhood or home size) so they are called empty models. Let’s show you how this simple model would work with a very small data set.

Imagine that you wanted to predict the price of a house in the little town of… Bames. You have the sale price data for 5 houses that have recently sold (in thousands of dollars): 50, 50, 50, 100, 200. Without any other information, what would you predict the price of the next house sold in Bames will be?

A reasonable prediction just might be the mean of the 5 prices you know. In this case, if you add up the 5 prices and divide by 5, you’d find the mean. Try it in the code block below.

    require(coursekata)

    # use R as a calculator to find the mean of these 5 houses
    # we have started it out for you by summing the prices together
    (50 + 50 + 50 + 100 + 200) / 5

The mean is 90, so under the empty model we would predict that the next house will sell for $90K. Will we be exactly right? Almost certainly not. Our guess will probably be “off” because it is hard to predict the future. In statistics, we have a word for how far off our prediction turns out to be: error.
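The same calculation can be sketched with R's built-in mean() function instead of calculator-style arithmetic. The vector name bames_prices is our own label for the five Bames sale prices, not something defined earlier in the text.

```r
# Hypothetical vector name; the values are the five Bames sale prices
# from the text, in thousands of dollars.
bames_prices <- c(50, 50, 50, 100, 200)

# The empty model makes one prediction: the mean of the observed prices.
empty_model_prediction <- mean(bames_prices)
empty_model_prediction  # 90

# The model generates a prediction for every observation;
# under the empty model it is the same value for each home.
predictions <- rep(empty_model_prediction, length(bames_prices))
predictions
```

Every home gets the same prediction because the empty model has no explanatory variable that could distinguish one home from another.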

If we predict a really high price and the house sells for a low price, the error will be large. The error will also be large if we predict a really low price and the house then sells for a lot. The mean is special because if we predict the mean, the errors will, over time, balance out. The mean, by virtue of being neither too high nor too low, is a balancing point.

The mean does not balance the values above and below the mean. What it balances is the amount of error above and below the mean, i.e., how “off” each observation is from the mean.

Residuals as a Measure of Prediction Error

Statisticians measure error from a model prediction with something called a residual. As shown below, the observed value (200 for the home in the figure represented by the labeled dot farthest to the right) minus the value predicted by the model (in this case the mean, which is 90) is the residual (+110). If the observed value is less than the model prediction, then the residual would be negative.

Diagram to demonstrate the concept of residuals. A horizontal number line from 0 to 200 represents an x-axis, and it is plotted with 5 data points. Three data points are stacked at 50, 1 data point appears at 100, and the last data point is observed at 200. A vertical line runs through the number line at the mean of 90. Each data point has a dashed line that runs horizontally from the point to the mean line. The longest of these is the line from 200 to 90 and is labeled as: residual equals plus 110.

Now that we have defined prediction errors as residuals, we can rewrite our word equation one more time, replacing the term Other Stuff with Error:

PriceK = Mean + Error

We started with a word equation but now we have a statistical model in which each of these terms (PriceK, Mean, and Error) is a quantity. And importantly, the terms add up: the actual price of any home in the data frame can be expressed as the mean (the model prediction) plus the error (the home’s residual from the mean).

We can rearrange this word equation to calculate the residual for each observation like this:

Error = PriceK - Mean

For the Bames home that sold for 200K, the error (or residual) is 200 - 90 = +110K: the house sold for 110K more than the empty model predicted.
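The rearranged word equation can be applied to all five Bames homes at once. The following is a minimal sketch; the names bames_prices, prediction, and bames_residuals are our own and do not come from the Ames data frame.

```r
# Hypothetical names; values are the five Bames prices in thousands of dollars.
bames_prices <- c(50, 50, 50, 100, 200)

# Model prediction under the empty model
prediction <- mean(bames_prices)

# Error = PriceK - Mean, computed for every home in one step
bames_residuals <- bames_prices - prediction
bames_residuals  # -40 -40 -40 10 110
```

Homes that sold for less than the mean have negative residuals, and homes that sold for more have positive ones.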

The Residuals Sum to 0

If we tried predicting the 5 houses in our small data set with our simple model, we would have been wrong on all of them. Yet, as you can see in the figure below, the sum of all the residuals below the mean [(-40) + (-40) + (-40) = -120] balances out the sum of residuals above the mean [10 + 110 = 120]. This is what we mean when we say the mean is the balancing point for a distribution.

The same diagram to demonstrate the concept of residuals featured previously, with additional information. The residuals are labeled with their positive or negative distance from the mean to the data point. The residuals from 90 to the three data points at 50 are labeled as negative 40. The residual from 90 to the data point at 100 is labeled as positive 10, and the residual from 90 to 200 is labeled as positive 110. The sum of all those positive and negative residuals equals zero.
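The balancing shown in the figure can also be checked in R. This sketch again uses our own variable names and adds up the residuals below the mean, the residuals above the mean, and all of them together.

```r
# Hypothetical names; values are the five Bames prices in thousands of dollars.
bames_prices <- c(50, 50, 50, 100, 200)
bames_residuals <- bames_prices - mean(bames_prices)

# Total error below the mean and total error above the mean
sum_below <- sum(bames_residuals[bames_residuals < 0])
sum_above <- sum(bames_residuals[bames_residuals > 0])
sum_below  # -120
sum_above  # 120

# Together they cancel out
sum(bames_residuals)  # 0
```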

No number other than the mean (not 80, not 85, not 91!) has this property. There will always be exactly the same amount of error above the mean as below it, which is kind of an amazing result.
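One way to convince yourself is to try a few other predictions. For the Bames prices, the residuals from any guess add up to 450 minus 5 times the guess, which is zero only when the guess is 90. The sketch below, using our own variable names, computes the sum of residuals for each candidate mentioned above along with the mean itself.

```r
# Hypothetical names; values are the five Bames prices in thousands of dollars.
bames_prices <- c(50, 50, 50, 100, 200)

# Candidate predictions: three near misses and the mean
candidates <- c(80, 85, 90, 91)

# Sum of residuals (price minus guess) for each candidate prediction
residual_sums <- sapply(candidates, function(guess) sum(bames_prices - guess))
names(residual_sums) <- candidates
residual_sums  # 80 gives 50, 85 gives 25, 90 gives 0, 91 gives -5
```

Only the mean leaves the residuals summing to 0; a guess below the mean leaves a positive total, and a guess above it leaves a negative total.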

This is why it’s reasonable to use the mean as your prediction for a future data point. Although your predictions will undoubtedly be wrong a lot of the time, your under-predictions will, over time, balance out your over-predictions. It’s a way to be “least wrong.”

DATA = MODEL + ERROR

The empty model serves an important purpose: it provides a comparison point for other, more complex models. For example, if we want to know to what extent Neighborhood explains variation in home prices, we will compare a model that includes Neighborhood to one that does not – the empty model:

PriceK = Neighborhood + Error

PriceK = Mean + Error

Word equations provided a way for us to think about these models. But turning them into statistical models gives us the ability to use the models for predicting future cases and to assess, based on quantitative comparisons, which models are better than others.

The empty model provides a starting place, by allowing us to quantify a model prediction, which then lets us quantify the error in the model’s predictions. Later, in the next chapter, we will turn the Neighborhood model into a statistical model, which will let us quantify the extent to which adding Neighborhood as an explanatory variable might reduce error over the empty model.

We have shown, with the empty model, that each data point can be decomposed into two parts: the model prediction, and the residual from that model prediction. We want to point out now that this basic idea can be generalized like this:

DATA = MODEL + ERROR

All models will follow this same structure, so this is an idea we want you to take with you to the next chapter and beyond. For the empty model, the MODEL part is simply the mean. But as our models get more complex, they still will result in a single prediction for each case, and this prediction will have a residual.
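For the empty model, this structure can be sketched with the small Bames data set. The data frame bames and its column names (price, model, error, recomposed) are hypothetical labels chosen for this illustration.

```r
# Hypothetical data frame and column names; prices are in thousands of dollars.
bames <- data.frame(price = c(50, 50, 50, 100, 200))

# MODEL: the empty model predicts the mean for every home
bames$model <- mean(bames$price)

# ERROR: each home's residual from the model prediction
bames$error <- bames$price - bames$model

# DATA = MODEL + ERROR: adding the two parts recovers each observed price
bames$recomposed <- bames$model + bames$error
bames

all(bames$recomposed == bames$price)  # TRUE
```

A more complex model would change only the model column, giving different predictions for different homes; the error column would still be the observed price minus that prediction.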
