Statistics and Data Science: A Modeling Approach

5.7 Statistics and Parameters

Now is a good time to remember that our goal in exploring distributions of data is to find out about the data generating process (DGP). Our goal in constructing statistical models is the same: we estimate models based on data in order to make inferences about the population and the DGP.

With our data, we can calculate the exact mean of the data distribution and the exact size of each error around that mean. When we do this, we are calculating a statistic. A statistic is anything you can compute to summarize something about your data; the mean is our first example of a statistic.

But we can’t calculate the mean of the population; the population distribution is unknown. Instead we use the mean we calculate from our data as an estimate of the mean of the population—the distribution from which our data were sampled.

The mean of the population is an example of a parameter. A parameter is a number that summarizes something about a population. Whereas statistics are computed, parameters are estimated. We use statistics as estimates because we don’t generally know what the true parameter is.
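To make the distinction concrete, here is a minimal sketch in Python. It assumes a made-up DGP (a normal distribution with mean 100 and standard deviation 15; these values and the names `MU`, `SIGMA`, and `draw_sample` are illustrative, not from the text). In real work the parameter is unknown; we can only compare the statistic to it here because we invented the population ourselves.

```python
import random
import statistics

# Hypothetical DGP, assumed for illustration only: in practice MU is unknown.
MU = 100.0     # the parameter (population mean)
SIGMA = 15.0   # population standard deviation

def draw_sample(n, seed):
    """Draw n observations from the assumed normal DGP."""
    rng = random.Random(seed)
    return [rng.gauss(MU, SIGMA) for _ in range(n)]

sample = draw_sample(n=50, seed=1)

# The statistic: computed exactly from the data we have.
y_bar = statistics.fmean(sample)

# The estimation error: knowable only because we made up the DGP.
estimation_error = y_bar - MU

print(f"statistic (sample mean): {y_bar:.2f}")
print(f"parameter (population mean): {MU:.2f}")
print(f"estimation error: {estimation_error:.2f}")
```

The sample mean is computed without any uncertainty, yet it will generally differ from the parameter it estimates, and a different sample (a different `seed`) would give a different statistic.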

Sometimes students think that the main goal of statistics is to calculate a correct answer. But statistics is not mainly about calculation. It is a way of thinking, so understanding what you are trying to calculate is just as important as the calculations themselves.

Notation is one way we keep our thinking straight about what we are trying to calculate, and what the results of our calculations mean. Because the distinction between statistics (or estimators) and parameters is so critical, we use different notation to distinguish them.

If we want to represent the mean calculated from data, we typically use the notation \(\bar{Y}\) (or, sometimes, \(\bar{X}\)). To represent the mean of the population, we typically use the Greek letter \(\mu\) (pronounced “mew”).

The same distinction shows up in the notation of the General Linear Model. The empty model we have discussed so far, which is calculated from data, is written like this (as you know):

\[Y_{i} = b_{0} + e_{i}\]

The model of the DGP that we are trying to estimate when we fit the empty model is represented like this:

\[Y_{i} = \beta_{0} + \epsilon_{i}\]

Note that in this model of the population we have replaced the estimators \(b_{0}\) and \(e_{i}\) with the Greek letters \(\beta_{0}\) (pronounced “beta sub 0”) and \(\epsilon_{i}\) (pronounced “epsilon sub i”). \(b_{0}\) is the estimator for \(\beta_{0}\), which is used to represent the mean of the population; and \(e_{i}\) is the estimator for \(\epsilon_{i}\).
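As a rough illustration of the data side of this notation, the sketch below fits the empty model \(Y_{i} = b_{0} + e_{i}\) to a tiny made-up data set. The function name `fit_empty_model` and the data values are hypothetical. The point is that \(b_{0}\) and every \(e_{i}\) can be computed exactly from data, whereas \(\beta_{0}\) and \(\epsilon_{i}\) cannot be computed at all.

```python
import statistics

def fit_empty_model(y):
    """Return (b0, residuals) for the empty model Y_i = b0 + e_i."""
    b0 = statistics.fmean(y)            # b0 is the sample mean
    residuals = [yi - b0 for yi in y]   # e_i = Y_i - b0
    return b0, residuals

# Made-up data, for illustration only.
y = [2.0, 4.0, 6.0, 8.0]
b0, e = fit_empty_model(y)

print(b0)       # 5.0
print(e)        # [-3.0, -1.0, 1.0, 3.0]
print(sum(e))   # 0.0, residuals around the mean sum to zero

# Each observation is recovered exactly as b0 + e_i.
reconstructed = [b0 + ei for ei in e]
print(reconstructed == y)   # True
```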

Whenever you see Greek letters you can be pretty sure we are talking about parameters of the population. Roman letters are generally used to represent estimators calculated from data.

As it turns out, in the absence of other information about the objects being studied, the mean of our sample is the best estimate we have of the actual mean of the population. It does not systematically run too high or too low; across many samples its errors average out to zero, which makes it an unbiased estimator of the parameter.
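A small simulation can make this plausible. The sketch below again assumes a made-up normal DGP (mean 100, standard deviation 15; the constants and the function `simulate_sample_means` are illustrative assumptions). It draws many samples, computes the mean of each, and then looks at how those sample means behave as a group.

```python
import random
import statistics

# Hypothetical DGP, assumed for illustration only.
MU = 100.0
SIGMA = 15.0

def simulate_sample_means(n_samples, n, seed):
    """Draw n_samples samples of size n and return the mean of each."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        sample = [rng.gauss(MU, SIGMA) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means

means = simulate_sample_means(n_samples=2000, n=25, seed=7)

# Individual sample means miss MU, but their average sits close to it.
mean_of_means = statistics.fmean(means)

# Share of sample means that landed above the parameter.
prop_too_high = sum(m > MU for m in means) / len(means)

print(f"average of sample means: {mean_of_means:.2f}")
print(f"proportion too high: {prop_too_high:.3f}")
```

Under these assumptions, the average of the sample means should land very near 100, and roughly half of the individual sample means should fall above it.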

Because it is our best guess of what the population parameter is, it is the best predictor we have of the value of a subsequent observation. While it will almost certainly miss any particular observation by some amount, the mean will do a better job overall than any other single number.
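One way to make "a better job" concrete is to measure total squared error, which is an assumption about the yardstick rather than something this section specifies. Under that yardstick, the sketch below compares the mean against a few other candidate numbers on a made-up data set (the data, the candidates, and the helper `sum_of_squared_errors` are all hypothetical).

```python
import statistics

def sum_of_squared_errors(y, guess):
    """Total squared error if we predict `guess` for every observation."""
    return sum((yi - guess) ** 2 for yi in y)

# Made-up data, for illustration only.
y = [2.0, 4.0, 6.0, 8.0]
mean = statistics.fmean(y)

# Candidate single-number predictions, including the mean itself.
candidates = [3.0, 4.0, 4.5, mean, 5.5, 6.0, 7.0]
sse_by_candidate = {c: sum_of_squared_errors(y, c) for c in candidates}

for guess, sse in sse_by_candidate.items():
    print(f"guess {guess:>4}: SSE = {sse}")

# The candidate with the smallest total squared error.
best_guess = min(sse_by_candidate, key=sse_by_candidate.get)
print(f"best guess: {best_guess}")
```

Among these candidates the mean gives the smallest total squared error, and moving the guess away from the mean in either direction makes the total larger.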