Statistics and Data Science: A Modeling Approach

5.8 The Power of Aggregation

In a famous article, the late evolutionary biologist Stephen Jay Gould argued that means (and medians) are not real; they are just abstractions (Gould, The Median Isn’t the Message). Only the variation is real, because it is made up of the actual data points. Although he is right, the mean is an incredibly powerful tool for predicting the future. The reason has to do with the balancing of error.

Let’s say you run a pizza restaurant and you want to predict how many pizzas you are going to sell in the next week. This is important, because you want to make sure you have enough ingredients on hand to meet the demand, but not so much that you have to throw anything away.

As it happens, it would be practically impossible to predict which individual people are going to come into the restaurant and order a pizza during a given week; there is just too much variation. But if you know the average number of pizzas sold during a random sample of weeks, you can make a fairly confident prediction for next week: sales will probably be pretty close to that average.
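To make the pizza example concrete, here is a minimal simulation sketch. Everything numeric in it is a made-up assumption for illustration, not data from a real restaurant: 500 potential customers, each ordering a pizza in a given week with probability 0.3, and 20 past weeks of records. The function name `simulate_week` is hypothetical as well.

```python
import random
import statistics

def simulate_week(rng, n_customers=500, p_order=0.3):
    """Return the number of pizzas sold in one simulated week.

    Assumption: each customer independently orders with probability p_order.
    """
    return sum(1 for _ in range(n_customers) if rng.random() < p_order)

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Weekly totals for a sample of past weeks (hypothetical records).
past_weeks = [simulate_week(rng) for _ in range(20)]

# Use the mean of past weekly totals as the prediction for next week.
prediction = statistics.mean(past_weeks)

# Simulate the "next" week and see how far off the prediction was.
next_week = simulate_week(rng)
relative_error = abs(next_week - prediction) / next_week

print(f"Predicted pizzas: {prediction:.1f}")
print(f"Actual pizzas:    {next_week}")
print(f"Relative error:   {relative_error:.1%}")
```

Under these assumptions, whether any one customer orders is close to a coin flip, yet the weekly total should land near the average of past weeks.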

This is all due to the power of aggregation, that is, putting things together. Individuals are hard to predict, but the more things you add together, the more stable and predictable the resulting average becomes (and the sum as well, relative to its size). The reason for this is that the error variation balances out. Some scores pull the mean higher, and some lower. But when all’s said and done, the pulls in one direction are roughly balanced out by the pulls in the opposite direction, and you are left with something close to the average.
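The balancing of error can be seen in a small simulation. The sketch below assumes, purely for illustration, that individual scores come from a normal distribution with mean 100 and standard deviation 15; the helper `spread_of_means` is a hypothetical name. It draws many samples of a given size, takes the mean of each, and measures how much those means vary from sample to sample.

```python
import random
import statistics

def spread_of_means(sample_size, rng, n_repeats=2000, mu=100.0, sigma=15.0):
    """Standard deviation of the sample mean across many repeated samples.

    Assumption: scores are independent draws from Normal(mu, sigma).
    """
    means = []
    for _ in range(n_repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        means.append(statistics.mean(sample))
    return statistics.stdev(means)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Compare averages of 1 score, 10 scores, and 100 scores.
spreads = {n: spread_of_means(n, rng) for n in (1, 10, 100)}

for n, spread in spreads.items():
    print(f"scores averaged: {n:>3}   spread of the average: {spread:.2f}")
```

If the assumptions hold, the spread should shrink as more scores are averaged, which is the sense in which the average becomes more stable.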