*list*

# Statistics and Data Science: A Modeling Approach

## 5.6 DATA = MODEL + ERROR: Notation

Now let’s see how mathematical notation is used to represent the simple (empty) model we introduced before. We have introduced the overarching concept that *DATA = MODEL + ERROR*. In our simple model, we are using one number, the mean, to model the distribution of scores.

We could represent this model in a word equation like this:

*THUMB DATA = MEAN + ERROR*

But there are some real advantages to rewriting this statement in mathematical notation. Here’s one form this notation might take:

\[Y_{i}=\bar{Y}+e_{i}\]

This equation literally represents what we were doing with R before. It tells us that each value of Y in our data (\(Y_{i}\)) can be seen as the sum of two parts: the mean of all values of Y (\(\bar{Y}\), our MODEL), and its deviation from the mean (\(e_{i}\), or ERROR). If we add these two numbers together for a specific score, we will get the original score. Very simple, very concrete.

Going back to *DATA = MODEL + ERROR*, you might also see a version that looks like this:

\[Y_{i}=\hat{Y}_{i}+e_{i}\]

\(\hat{Y}\) (pronounced Y-hat) means “the predicted value of Y.” You could also think of it as the “model’s value for Y.” So, this equation simply states that each value of Y can be seen as the sum of its predicted value based on the model, and its deviation from that predicted value.

In our tiny data set, for example, student #1 had a thumb length of 56. So, \(Y_{1}=56\). Under our simple model we used the mean as the predicted value for all students, so \(\hat{Y}_{1}=\bar{Y}=62\). So, \(e_{1}\) would have to be -6 to make the equation true—the exact value of the residual for student #1.

As we develop more complex models we still will end up with a single predicted value of Y for each score based on our model. But we will predict this value using more than just the mean.

### Notation for the General Linear Model

Finally, we complicate things a little more, introducing one more form of our *DATA = MODEL + ERROR* formulation called the General Linear Model (GLM) notation:

\[Y_{i}=b_{0}+e_{i}\]

This is a more abstract version of the equation above; we have substituted \(b_{0}\) (we read this as “b sub 0”) for the mean, \(\bar{Y}\). Don’t be concerned if this doesn’t make complete sense—this is one of those things that will take time to understand. The main thing to know for now is that \(b_{0}\) can represent the mean, but it doesn’t have to.

For our simple model (the empty model) it represents the mean. But for other models, and other situations, it can represent other values. For example, if our outcome variable were categorical, the interpretation of \(b_{0}\) would need to be adjusted to be the mode, which is the best single predictor of the next observation’s value on a categorical variable.

Indeed, this flexibility is what makes the General Linear Model *general*. Whenever you see a GLM model statement, you should think carefully about what, in the particular situation, each symbol represents.