5.9 DATA = MODEL + ERROR: Notation

Now let’s see how mathematical notation is used to represent the simple (empty) model we introduced before.

Thumb = Mean + Error

There are some real advantages to rewriting this statement in mathematical notation. Here’s one form this notation might take:

\[Y_i=\bar{Y}+e_i\]

This equation literally represents what we showed above in our word equation. It tells us that each value of Thumb in our data set (\(Y_i\)) can be seen as the sum of two parts: the mean of all values of \(Y\) (\(\bar{Y}\), which is the empty model prediction), and its residual from that mean (\(e_i\), or error). If we add these two numbers together (Mean + Error) for a specific score, we will get the original score. Very simple, very concrete.
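
To make this concrete, here is a minimal Python sketch of the same decomposition. The five thumb lengths are made-up values chosen for illustration (they are not the course data set), and the variable names are our own.

```python
# Hypothetical thumb lengths in millimeters (made-up values, not the course data)
thumb = [56.0, 60.0, 61.0, 63.0, 65.0]

# Y-bar: the mean of all values, which is the empty model prediction
mean_thumb = sum(thumb) / len(thumb)

# e_i: each score's residual from the mean
residuals = [y - mean_thumb for y in thumb]

# Mean + Error gives back each original score
reconstructed = [mean_thumb + e for e in residuals]

print(mean_thumb)     # 61.0
print(residuals)      # [-5.0, -1.0, 0.0, 2.0, 4.0]
print(reconstructed)  # [56.0, 60.0, 61.0, 63.0, 65.0]
```

Notice that the residuals are negative for scores below the mean and positive for scores above it, and that adding each residual back to the mean recovers the original score.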

Notation for the General Linear Model

This notation for the mean (\(\bar{Y}\)) works well enough for the empty model. But it’s not going to help us build more complex models later. To prepare for that eventuality, we will introduce a more general notation, referred to as the General Linear Model (GLM). In GLM notation, the empty model is represented like this:

\[Y_i=b_0+e_i\]

This is a more general version of the equation above, in which we have substituted \(b_0\) (we read this as “b sub 0”) for the mean, \(\bar{Y}\). This won’t make much sense right now, but later it will help us add complexity to our model (with \(b_1\), \(b_2\), and so forth). The main thing to know for now is that \(b_0\) can represent the mean, as it does in the empty model, but it won’t always represent the mean.

\[\underbrace{Y_i}_{\text{Thumb}}=\underbrace{b_0}_{\text{Predict}}+\underbrace{e_i}_{\text{Resid}}\]

Indeed, this flexibility is what makes the General Linear Model general. Whenever you see a GLM model statement, you should think carefully about what, in the particular situation, each symbol represents.

Statistics and Parameters

Now is a good time to remember that our goal in exploring distributions of data is to find out about the data generating process (DGP). Our goal in constructing statistical models is the same: we estimate models based on data in order to make inferences about the population and the DGP.

With our data, we can calculate the exact mean of the sample distribution and the exact size of each error (residual). When we do this, we are calculating a statistic. A statistic is any number you can compute from your data to summarize something about it; the mean is our first example of a statistic.

But we can’t calculate the mean of the population, because the population distribution is unknown. Instead, we use the mean we calculate from our data as an estimate of the mean of the population, that is, of the distribution from which our data were sampled.

The mean of the population is an example of a parameter. A parameter is a number that summarizes something about a population. Whereas statistics are computed, parameters are estimated. We use statistics as estimates because we don’t generally know what the true parameter is.
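
The difference between a statistic and a parameter can be sketched with a small simulation. Everything here is an assumption made for illustration: the population is an invented list of 10,000 thumb lengths, built only so that we can peek at a parameter that would normally be hidden from us.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of thumb lengths in mm. In a real study we never
# see this list; we build it here only so the parameter is known.
population = [rng.gauss(60, 9) for _ in range(10_000)]
mu = mean(population)        # parameter: the mean of the population

# The data we actually get to work with: one random sample of 50
sample = rng.sample(population, 50)
y_bar = mean(sample)         # statistic: computed exactly from our data

print(f"parameter (normally unknown): {mu:.2f}")
print(f"statistic (our estimate):     {y_bar:.2f}")
```

The statistic is computed without any uncertainty, but it will rarely match the parameter exactly. It is our estimate of it.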

Sometimes students think that the main goal of statistics is to calculate a correct answer. But statistics isn’t mostly about calculation. It is a way of thinking, so that understanding what you are trying to calculate is just as important as the calculations themselves.

Notation is one way we keep our thinking straight about what we are trying to calculate, and what the results of our calculations mean. Because the distinction between statistics (or estimators) and parameters is so critical, we use different notation to distinguish them.

If we want to represent the mean calculated from data, we typically use the notation \(\bar{Y}\) (or, sometimes, \(\bar{X}\)). To represent the mean of the population, we typically use the Greek letter \(\mu\) (pronounced “mew”).

The same distinction shows up in the notation of the General Linear Model. The empty model we have discussed so far, which is calculated from data, is written like this (as you know):

\[Y_i=b_0+e_i\]

The model of the DGP that we are trying to estimate when we fit the empty model is represented like this:

\[Y_i=\beta_{0}+\epsilon_i\]

Note that in this model of the population we have replaced the estimators \(b_0\) and \(e_i\) with the Greek letters \(\beta_{0}\) (pronounced “beta sub 0”) and \(\epsilon_i\) (pronounced “epsilon sub i”). Here \(b_0\) is the estimator for \(\beta_{0}\), which in the empty model represents the mean of the population, and \(e_i\) is the estimator for \(\epsilon_i\).
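
The relationship between the two equations can be sketched in Python by writing a DGP ourselves. The values of `BETA_0`, the error spread, and the sample size are all hypothetical; in real data analysis \(\beta_{0}\) and \(\epsilon_i\) are never observed.

```python
import random
from statistics import mean

rng = random.Random(2)  # fixed seed so the sketch is repeatable

BETA_0 = 60.0  # hypothetical parameter, known only because we wrote the DGP
N = 20         # hypothetical sample size

# Model of the DGP: Y_i = beta_0 + epsilon_i
epsilon = [rng.gauss(0, 9) for _ in range(N)]  # true errors (normally unseen)
y = [BETA_0 + eps for eps in epsilon]          # the data we observe

# Model estimated from data: Y_i = b_0 + e_i
b_0 = mean(y)                  # estimate of beta_0
e = [y_i - b_0 for y_i in y]   # residuals, our estimates of epsilon_i

# Every residual misses its true error by the same amount: beta_0 - b_0
gaps = [e_i - eps_i for e_i, eps_i in zip(e, epsilon)]

print(f"b_0 = {b_0:.2f} (estimating beta_0 = {BETA_0})")
print(f"each e_i differs from epsilon_i by about {gaps[0]:.2f}")
```

Because \(b_0\) is only an estimate of \(\beta_{0}\), each \(e_i\) is only an estimate of \(\epsilon_i\): all of the residuals are shifted by however far \(b_0\) landed from \(\beta_{0}\).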

Whenever you see Greek letters you can be pretty sure we are talking about parameters of the population. Roman letters are generally used to represent estimators calculated from data.

As it turns out, in the absence of other information about the objects being studied, the mean of our sample is the best estimate we have of the actual mean of the population. Across many random samples, the sample mean does not systematically fall above or below the population mean: its overestimates and underestimates balance out on average, making it an unbiased estimator of the parameter.
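
One way to see what unbiased means is to simulate many samples from a DGP whose parameter we set ourselves. This is only a sketch under stated assumptions: a normal population with a hypothetical mean of 60 and standard deviation of 9, and 2,000 samples of 25 observations each.

```python
import random
from statistics import mean

rng = random.Random(3)  # fixed seed so the sketch is repeatable

MU, SIGMA = 60.0, 9.0                    # hypothetical population parameters
N_SAMPLES, SAMPLE_SIZE = 2000, 25        # hypothetical simulation settings

# Draw many samples and record the mean of each one
sample_means = [
    mean([rng.gauss(MU, SIGMA) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

mean_of_means = mean(sample_means)
share_too_high = sum(m > MU for m in sample_means) / N_SAMPLES

print(f"average of the sample means: {mean_of_means:.2f}")
print(f"share of samples that overestimated MU: {share_too_high:.2f}")
```

Individual sample means miss the parameter in both directions, but their average should land very close to it, which is what we would expect from an unbiased estimator.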

Because it is our best guess of what the population parameter is, it is the best predictor we have of the value of a subsequent observation. While it will almost certainly be wrong for any single observation, the mean will, on average, do a better job than any other single number.
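
To give "a better job" a concrete meaning, the sketch below scores several candidate predictions by their sum of squared errors. Both choices are assumptions of ours: squared error is only one possible way to measure how wrong a prediction is, and the candidates are scored against the made-up thumb lengths we already have, as a stand-in for future observations.

```python
# Hypothetical thumb lengths in millimeters (made-up values, not the course data)
thumb = [56.0, 60.0, 61.0, 63.0, 65.0]
mean_thumb = sum(thumb) / len(thumb)


def sum_of_squares(guess, values):
    """Total squared error if we predicted `guess` for every value."""
    return sum((y - guess) ** 2 for y in values)


# Compare the mean against a few nearby candidate predictions
candidates = [59.0, 60.0, mean_thumb, 62.0, 63.0]
scores = {c: sum_of_squares(c, thumb) for c in candidates}
best = min(scores, key=scores.get)

for guess, score in scores.items():
    print(f"predict {guess}: sum of squared errors = {score}")
print(f"best candidate: {best}")
```

For these values the mean (61.0) gives a sum of squared errors of 46.0, and the score gets worse the further a candidate moves from the mean in either direction.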
