2.5 The Mean as a Model
Previously, we made the following informal model for home prices in the Ames data frame:
PriceK = Neighborhood + Other Stuff
How good is this model? Is it better than a model that doesn’t include Neighborhood? If so, how much better? (And what, exactly, is a model, anyway?)
To get a sense of how good a model is, we need a point of comparison: a simple model that we’d like to improve upon. We also need a way to quantify how good a model is, and how much better it is than a different, simpler one. Informal word equations won’t be enough for this.
Our First Statistical Model
One very simple model that we can use for comparison is the mean. You may be thinking to yourself: “Wait, the mean is a model?” It turns out, yes it can be! We can write a word equation to represent this simple model like this:
PriceK = Mean + Other Stuff
This is our first statistical model. A statistical model is a function that generates a predicted value on an outcome variable for each observation in a data set. This particular model uses the mean to predict PriceK for every home in the Ames data frame.
Models like the mean have no explanatory variables (like neighborhood or home size), so they are called empty models. Let’s show you how this simple model would work with a very small data set.
Imagine that you wanted to predict the price of a house in the little town of… Bames. You have the sale price data for 5 houses that have recently sold (in thousands of dollars): 50, 50, 50, 100, 200. Without any other information, what would you predict the price of the next house sold in Bames will be?
A reasonable prediction might just be the mean of the 5 prices you know. To find the mean, add up the 5 prices and divide by 5. Try it in the code block below.
require(coursekata)

# use R as a calculator to find the mean of these 5 house prices
# we have started it out for you by summing the prices together
(50 + 50 + 50 + 100 + 200) / 5
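For what it’s worth, R also has a built-in mean() function that does the same arithmetic. A quick sketch (the prices vector below is just our made-up Bames data):

# the same calculation with R's built-in mean() function
prices <- c(50, 50, 50, 100, 200)
mean(prices)   # 90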
The mean is 90, so under the empty model we would predict that the next house will sell for $90K. Will we be exactly right? Almost certainly not. Our guess will probably be “off” because it is hard to predict the future. In statistics, we have a word for how far off our prediction turns out to be: error.
If we predict a really high price and the house sells for a low price, the error would be large. The error would also be large if we predict a really low price and then the house sells for a lot. The mean is special because if we predict the mean, the errors will, over time, balance out. The mean, being neither too high nor too low, is a balancing point.
Note that the mean does not balance the number of values above and below it (in Bames, three prices fall below the mean and only two above). What it balances is the amount of error above and below the mean, i.e., how “off” each observation is from the mean.
Residuals as a Measure of Prediction Error
Statisticians measure error from a model prediction with something called a residual: the observed value minus the value predicted by the model. For the Bames house that sold for 200, the residual is the observed value (200) minus the model prediction (in this case the mean, which is 90), or +110. If the observed value is less than the model prediction, the residual would be negative.
Now that we have defined prediction errors as residuals, we can rewrite our word equation one more time, replacing the term Other Stuff with Error:
PriceK = Mean + Error
We started with a word equation but now we have a statistical model in which each of these terms (PriceK, Mean, and Error) is a quantity. And importantly, the terms add up: the actual price of any home in the data frame can be expressed as the mean (the model prediction) plus the error (the home’s residual from the mean).
We can rearrange this word equation to calculate the residual for each observation like this:
Error = PriceK - Mean
For the house that sold for 200, the error (or residual) would have been +110: the house sold for $110K more than we initially predicted.
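Here is a small sketch of our own applying this equation to the Bames prices; R subtracts the mean from each price in one step:

# compute each home's residual from the empty model: Error = PriceK - Mean
prices <- c(50, 50, 50, 100, 200)
prices - mean(prices)   # -40 -40 -40  10 110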
The Residuals Sum to 0
If we tried predicting the 5 houses in our small data set with our simple model, we would have been wrong on all of them. Yet the sum of all the residuals below the mean [(-40) + (-40) + (-40) = -120] balances out the sum of the residuals above the mean [10 + 110 = 120]. This is what we mean when we say the mean is the balancing point for a distribution.
No number other than the mean (not 80, not 85, not 91!) has this property. There will always be exactly the same amount of error above the mean as below it, which is kind of an amazing result.
This is why it’s reasonable to use the mean as your prediction for a future data point. Although your predictions will undoubtedly be wrong a lot of the time, on average your under-predictions will balance out your over-predictions over time. It’s a way to be “least wrong.”
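You can verify this balancing property directly. A quick check of our own, comparing the mean against a couple of arbitrary other values:

# residuals from the mean sum to exactly 0...
prices <- c(50, 50, 50, 100, 200)
sum(prices - mean(prices))   # 0
# ...but residuals from any other number do not
sum(prices - 80)   # 50
sum(prices - 91)   # -5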
DATA = MODEL + ERROR
The empty model serves an important purpose: it provides a comparison point for other, more complex models. For example, if we want to know to what extent Neighborhood explains variation in home prices, we will compare a model that includes Neighborhood to one that does not – the empty model:
PriceK = Neighborhood + Error
PriceK = Mean + Error
Word equations provided a way for us to think about these models. But turning them into statistical models gives us the ability to use the models to predict future cases and to assess, based on quantitative comparisons, which models are better than others.
The empty model provides a starting place: it lets us quantify a model’s prediction, which in turn lets us quantify the error in that prediction. In the next chapter, we will turn the Neighborhood model into a statistical model, which will let us quantify the extent to which adding Neighborhood as an explanatory variable might reduce error over the empty model.
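As a preview of where this is headed (lm() and model notation are introduced later in the course, so treat this as a sketch, assuming the Ames data frame from earlier): both models can be expressed in R, with ~ NULL meaning “no explanatory variables”:

# a preview sketch: fitting both models with lm()
empty_model <- lm(PriceK ~ NULL, data = Ames)                  # PriceK = Mean + Error
neighborhood_model <- lm(PriceK ~ Neighborhood, data = Ames)   # PriceK = Neighborhood + Error
empty_model   # the single estimated parameter is just the mean of PriceK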
We have shown, with the empty model, that each data point can be decomposed into two parts: the model prediction, and the residual from that prediction. We want to point out now that this basic idea can be generalized like this:
DATA = MODEL + ERROR
All models will follow this same structure, so this is an idea we want you to take with you to the next chapter and beyond. For the empty model, the MODEL part is simply the mean. But as our models get more complex, they will still result in a single prediction for each case, and each prediction will have a residual.
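One last sketch of ours to make the decomposition concrete (again using functions covered later in the course): for any model fit with lm(), predict() returns the MODEL part for each case and resid() returns the ERROR part, and adding them back together reconstructs the DATA:

# DATA = MODEL + ERROR, checked for the empty model
empty_model <- lm(PriceK ~ NULL, data = Ames)
head(predict(empty_model) + resid(empty_model))   # MODEL + ERROR...
head(Ames$PriceK)                                 # ...matches the observed DATA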