3.3 GLM Notation for the Neighborhood Model

Most of us are content to describe the Neighborhood model simply as two means. But as with the empty model, it is helpful to learn how a two-group model is represented in the notation of the General Linear Model, especially as we develop more complicated models.

The Neighborhood Model Using GLM Notation

The full GLM equation for the Neighborhood model incorporates both \(b_0\) and \(b_1\). There are a few ways you could write this model, but we will write it like this:

\[Y_{i}=b_{0}+b_{1}X_{i}+e_{i}\]

We can also write it in a way more specific to the Neighborhood model of PriceK like this:

\[PriceK_{i}=b_{0}+b_{1}Neighborhood_{i}+e_{i}\]

It’s important to notice, first, that both the empty model and the two-group Neighborhood model start with \(Y_i\) to the left of the equals sign and end with \(e_i\). In both models, \(Y_{i}\) represents the price for home i, and \(e_{i}\) represents the error or residual between the predicted home price and the actual price for home i.

For the two-group model, the MODEL part of DATA = MODEL + ERROR is now more complicated: \(b_{0}+b_{1}X_{i}\) instead of simply \(b_0\). In both cases, though, the model can be thought of as a function that produces a predicted value on the outcome variable for each observation (in this case, home).

Note that the \(b_0\) parameter estimate has a different meaning than it does in the empty model. It is the first parameter in both models. But for the empty model, which only has one parameter, it represents the mean of PriceK for the whole sample of data, whereas for the two-group model (with two parameters), it represents the mean of the first group (in this case, CollegeCreek).

You might find it confusing to use the same symbol to represent two different ideas. But this flexibility is what makes the General Linear Model so powerful and so… general.

Unlike the empty model, this more complicated model (\(b_0 + b_1X_i\)) is able to generate two different predictions depending on whether a house is in College Creek or Old Town.
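To see how the two estimates line up with the two group means, here is a minimal sketch in R. The data frame `homes` and its six prices are made up purely for illustration (they are not the course data); only `lm()`, `coef()`, and `tapply()` are standard R functions.

```r
# Hypothetical toy data: six made-up homes, prices in thousands of dollars.
homes <- data.frame(
  PriceK = c(300, 320, 340, 250, 270, 290),
  Neighborhood = factor(rep(c("CollegeCreek", "OldTown"), each = 3))
)

# Fit the two-group model PriceK_i = b0 + b1 * X_i + e_i.
neighborhood_model <- lm(PriceK ~ Neighborhood, data = homes)
estimates <- coef(neighborhood_model)
b0 <- estimates[["(Intercept)"]]          # labeled (Intercept) by lm()
b1 <- estimates[["NeighborhoodOldTown"]]  # labeled NeighborhoodOldTown by lm()

# Group means computed directly, for comparison with the estimates.
group_means <- tapply(homes$PriceK, homes$Neighborhood, mean)

print(c(b0 = b0, b0_plus_b1 = b0 + b1))
print(group_means)
```

With these invented numbers, \(b_0\) matches the CollegeCreek mean and \(b_0 + b_1\) matches the OldTown mean, so the model generates one prediction per neighborhood.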

Interpreting \(X_i\)

We have developed the idea that \(b_0\) is the mean of the first group, and \(b_0 + b_1\) is the mean of the second group. But the function that results in a predicted value for each observation under the two-group model is this: \(b_{0} + b_{1} X_{i}\). In this model, what does the \(X_i\) do?

It turns out we need the \(X_i\) in order for the model to actually compute two predicted scores. Here’s how it works. \(X_i\) represents the grouping variable – our explanatory variable, Neighborhood – but in a special way. It is called a dummy variable, which means that R creates it specifically to make the model work.

R takes the variable Neighborhood and recodes it into a new variable (\(X_i\)) that can only be assigned one of two values: 0 or 1. In the two-group model, \(X_i\) is coded 1 if the home is in the second group (OldTown), and it is coded 0 if the home is not in the second group (i.e., not in OldTown).

Although in this data set a home that is not in Old Town must be in College Creek, it’s important to think of \(X_i = 0\) as meaning the home is not in Old Town rather than that it is in College Creek (even though it is!). Keeping this subtle distinction in mind will help us understand how dummy variables work when we have models with more than 2 groups.
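You can look at the dummy variable directly. The sketch below uses `model.matrix()`, a base R function that returns the columns R builds from a model formula. The `homes` data frame is a made-up example, not the course data.

```r
# Hypothetical toy data: six made-up homes, prices in thousands of dollars.
homes <- data.frame(
  PriceK = c(300, 320, 340, 250, 270, 290),
  Neighborhood = factor(rep(c("CollegeCreek", "OldTown"), each = 3))
)

# Build the columns R would use for the model PriceK ~ Neighborhood.
X <- model.matrix(~ Neighborhood, data = homes)

# The column named NeighborhoodOldTown is the dummy variable X_i:
# 1 if the home is in OldTown, 0 if it is not.
dummy <- X[, "NeighborhoodOldTown"]
print(data.frame(Neighborhood = homes$Neighborhood, X_i = dummy))
```

The column's name asks the question "is this home in OldTown?", which is why it helps to read 0 as "not in Old Town".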

The \(b_0\) estimate is called Intercept in the lm() output because it is the predicted sale price when \(X_i\) is equal to 0, in other words, when the Neighborhood is not Old Town. The estimate that R called NeighborhoodOldTown (\(b_1\)), by this line of reasoning, is kind of like the slope of a line. It is the adjustment in predicted price for a 1-unit increase in \(X_i\), that is, for moving from a home that is not in Old Town (\(X_i = 0\)) to one that is (\(X_i = 1\)).

Armed with this understanding of how \(X_i\) works, you can now use the model to make predictions. If \(X_i=1\), then the home is in Old Town. In this case, the predicted home value would be \(b_0+b_1*1\). (A reminder: The asterisk represents multiplication in R so we’re using it as a way of notating multiplication in GLM.)

If \(X_i = 0\), meaning the home is not in Old Town, then the second parameter estimate won’t get added in, because \(b_1\) times 0 is equal to 0. And if the second parameter drops out, the prediction is simply \(b_0\), which is the mean of College Creek.
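The two cases can be checked by computing \(b_0 + b_1 X_i\) by hand and comparing the result with R's own `predict()` function. This is only a sketch: the data frame `homes` and its prices are invented for illustration.

```r
# Hypothetical toy data: six made-up homes, prices in thousands of dollars.
homes <- data.frame(
  PriceK = c(300, 320, 340, 250, 270, 290),
  Neighborhood = factor(rep(c("CollegeCreek", "OldTown"), each = 3))
)
model <- lm(PriceK ~ Neighborhood, data = homes)
b0 <- coef(model)[["(Intercept)"]]
b1 <- coef(model)[["NeighborhoodOldTown"]]

# Predictions by hand: X = 0 (not in Old Town), then X = 1 (in Old Town).
X <- c(0, 1)
by_hand <- b0 + b1 * X

# The same predictions from predict(), for one home in each neighborhood.
new_homes <- data.frame(
  Neighborhood = factor(c("CollegeCreek", "OldTown"),
                        levels = levels(homes$Neighborhood))
)
by_predict <- unname(predict(model, newdata = new_homes))

print(rbind(by_hand, by_predict))
```

When \(X_i = 0\) the \(b_1\) term contributes nothing, so the first prediction is just \(b_0\).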

\(X_i\) is a variable, meaning it can take a different value for different homes in the data frame. We show that this is a variable by putting a subscript \(i\) after the \(X\). Each home can either have the value of 0 or 1 for \(X_i\) because each home is or is not in Old Town.

How Does R Know Which Neighborhood to Represent with \(X_i\)?

The answer to this question is: R doesn’t know; it’s just taking whatever group comes first alphabetically (in this case, CollegeCreek) and making it the reference group. The mean of the reference group is the first parameter estimate (\(b_0\) or the Intercept in the lm() output).

R then takes the second group (in this case, OldTown) and represents it with the dummy variable \(X_i\). If \(X_i\) is coded 1, then the home is in OldTown. If it is coded 0, then the home is not in OldTown.
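A quick way to see which group R will treat as the reference group is to inspect the levels of the factor. The vector below is a made-up example; `factor()` and `levels()` are base R.

```r
# Hypothetical values, deliberately listed with OldTown first.
Neighborhood <- factor(c("OldTown", "CollegeCreek", "OldTown", "CollegeCreek"))

# By default, factor() sorts the levels alphabetically,
# regardless of the order in which the values appear in the data.
print(levels(Neighborhood))

# The first level is the reference group (its mean becomes the intercept).
reference_group <- levels(Neighborhood)[1]
print(reference_group)
```

Even though OldTown appears first in the data, CollegeCreek is the first level, so it becomes the reference group.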

Let’s say, just for fun, that you changed the code for OldTown to AnOldTown in the data frame. Because it now comes first in the alphabet, AnOldTown becomes the reference group, and its mean is now the estimate for the intercept (\(b_0\)).
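Here is a sketch of that renaming experiment. The data frame `homes` and its prices are invented for illustration; the point is only to watch which group ends up as the intercept.

```r
# Hypothetical toy data: six made-up homes, prices in thousands of dollars.
homes <- data.frame(
  PriceK = c(300, 320, 340, 250, 270, 290),
  Neighborhood = rep(c("CollegeCreek", "OldTown"), each = 3),
  stringsAsFactors = FALSE
)

# Rename OldTown to AnOldTown, then tell R the variable is categorical.
homes$Neighborhood[homes$Neighborhood == "OldTown"] <- "AnOldTown"
homes$Neighborhood <- factor(homes$Neighborhood)

# Refit the same two-group model.
renamed_model <- lm(PriceK ~ Neighborhood, data = homes)
print(coef(renamed_model))
```

The intercept is now the mean of the AnOldTown homes, and the second estimate is labeled NeighborhoodCollegeCreek because College Creek is the group now represented by the dummy variable.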

As long as R knows that a variable is categorical (e.g., a factor), it doesn’t really care how you code it. You can code the categories under Neighborhood with any characters you choose (e.g., AnOldTown or OldTown), or with any numbers you choose (e.g., 1 and 2, 0 and 1, or 1 and 500). However you code it, R will treat the category that comes first as the reference group, whose mean is the intercept (\(b_0\)), and the next one as the dummy explanatory variable, which R will code 0 or 1.
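The sketch below illustrates why it matters that R knows the variable is categorical. It codes the two neighborhoods as 1 and 500 in a made-up data frame (the names `Code` and `CodeFactor` are hypothetical) and fits the model both ways.

```r
# Hypothetical toy data: neighborhoods coded with the numbers 1 and 500.
homes <- data.frame(
  PriceK = c(300, 320, 340, 250, 270, 290),
  Code = rep(c(1, 500), each = 3)
)

# Tell R the codes are categories, not quantities.
homes$CodeFactor <- factor(homes$Code)

# The dummy variable R builds from the factor is still coded 0 or 1.
X <- model.matrix(~ CodeFactor, data = homes)
print(unique(X[, "CodeFactor500"]))

# Treated as a factor: intercept = mean of group 1, b1 = difference in means.
categorical_fit <- coef(lm(PriceK ~ CodeFactor, data = homes))

# Treated as a plain number: b1 is a slope per 1-unit change in Code instead.
numeric_fit <- coef(lm(PriceK ~ Code, data = homes))

print(categorical_fit)
print(numeric_fit)
```

The factor version gives the same two group means no matter which numbers are used as codes, while the numeric version spreads the difference between the groups across the 499 units that separate the codes.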
