9.2 Specifying the Height Model with GLM Notation

Here is how we specify a regression model in which we have a single quantitative explanatory variable (such as Height):

\[Y_i=b_0+b_1X_i+e_i\]

It might be useful to compare this notation to that used in the previous chapter to specify the two-group model (such as Height2Group):

\[Y_i=b_0+b_1X_i+e_i\]

The fact that the same notation represents both models is what is so beautiful about the General Linear Model. It is simple and elegant and can be applied across a wide variety of situations, including situations with categorical or quantitative explanatory variables. Although both models are specified using the same notation, the interpretation of the notation varies from situation to situation.

In the Height2Group model, \(X_i\) was dummy coded as either 0 or 1. The 1 did not represent a quantity, just whether the student was tall or not. In the Height model, \(X_i\) is literally the student's measured height in inches.

Coding \(X_i\) in these different ways leads to different, but related, interpretations of the \(b_1\) coefficient.

In the regression model, \(b_1\) still represents an adjustment to \(b_0\), but this time it is the amount of adjustment to make for each 1-unit increase in Height. This is the definition of the slope of a line: the amount of “rise” for each one unit of “run”, i.e., how much the predicted value of \(Y_i\) changes for each one-unit increase in \(X_i\). \(b_1\) is, in fact, the slope of the best-fitting regression line.
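To see why \(b_1\) is the change per unit, we can compare the model's predictions (the part of the equation without the error term) for two students whose heights differ by exactly one unit. This is just a short piece of algebra on the model above, not an additional assumption:

\[(b_0+b_1(X_i+1))-(b_0+b_1X_i)=b_1\]

The \(b_0\) terms cancel, so the prediction goes up by \(b_1\) no matter which starting height we pick.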

In both models, the \(b_0\) coefficient represents an intercept, i.e., the predicted value of \(Y_i\) when \(X_i = 0\). But in the Height2Group model, \(X_i = 0\) simply means that the student is short, which is the reference category for Height2Group. In the Height model, \(X_i = 0\) would literally mean that the student has a height of 0 inches! Zero is not a sensible value for \(X_i\) when \(X_i\) represents Height, but the model can still make a prediction for such a nonsensical student.
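One way to see the two interpretations side by side is to fit the same equation twice to a small data set, once with \(X_i\) dummy coded and once with \(X_i\) as height in inches. The sketch below does this in Python with NumPy. Everything specific in it is an assumption made for illustration: the data values are invented, the outcome is called `thumb` only to have a name, the 66-inch cutoff for "tall" is arbitrary, and `fit_line` is a hypothetical helper.

```python
import numpy as np

def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1*x."""
    b1, b0 = np.polyfit(x, y, deg=1)  # polyfit returns the highest power first
    return b0, b1

# Invented data: heights in inches and a made-up outcome variable.
height = np.array([60.0, 62.0, 64.0, 66.0, 68.0, 70.0, 72.0, 74.0])
thumb = np.array([55.0, 58.0, 57.0, 61.0, 60.0, 64.0, 63.0, 66.0])

# Two-group model: X is dummy coded, 0 = short (reference group), 1 = tall.
tall = (height >= 66).astype(float)  # assumed cutoff, for illustration only
b0_group, b1_group = fit_line(tall, thumb)

# Height model: X is the measured height itself.
b0_height, b1_height = fit_line(height, thumb)

short_mean = thumb[tall == 0].mean()
tall_mean = thumb[tall == 1].mean()

print(f"two-group: b0 = {b0_group:.2f} (short mean = {short_mean:.2f})")
print(f"two-group: b1 = {b1_group:.2f} (tall - short = {tall_mean - short_mean:.2f})")
print(f"height:    b0 = {b0_height:.2f} (prediction at 0 inches)")
print(f"height:    b1 = {b1_height:.2f} (change per extra inch)")
```

With dummy coding, \(b_0\) should match the mean of the reference group and \(b_1\) the difference between the two group means. With height in inches, \(b_0\) is the prediction for a 0-inch student and \(b_1\) is the slope.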

Connection to Algebra

In algebra, a straight line is often represented by the equation \(y = mx+ b\), where the \(m\) is called the slope and the \(b\) is called the y-intercept.

In statistics, we use that same linear equation, but we switch it around so the intercept comes first (\(y = b+mx\)), and we use different letters to represent the intercept and slope (\(b_0\) and \(b_1\), respectively).

Fitting a regression model is a matter of finding the particular line (i.e., slope and intercept) that best fits the data (i.e., that minimizes the sum of squared errors).
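We can check the idea of "best fitting" numerically. The sketch below uses invented data and a hypothetical helper, `sse`. It computes the slope and intercept from the standard closed-form least-squares formulas, then nudges each coefficient a little to confirm that the sum of squared errors gets larger.

```python
import numpy as np

def sse(b0, b1, x, y):
    """Sum of squared errors for the line y = b0 + b1*x."""
    residuals = y - (b0 + b1 * x)
    return float(np.sum(residuals ** 2))

# Invented data (not from the course data set).
x = np.array([60.0, 62.0, 64.0, 66.0, 68.0, 70.0, 72.0, 74.0])
y = np.array([55.0, 58.0, 57.0, 61.0, 60.0, 64.0, 63.0, 66.0])

# Closed-form least-squares estimates.
x_dev = x - x.mean()
y_dev = y - y.mean()
b1 = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)  # slope
b0 = y.mean() - b1 * x.mean()                    # intercept

best = sse(b0, b1, x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {best:.3f}")

# Move each coefficient slightly away from the estimate and recompute SSE.
for db0, db1 in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    other = sse(b0 + db0, b1 + db1, x, y)
    print(f"b0 {db0:+.2f}, b1 {db1:+.2f}: SSE = {other:.3f}")
```

Each perturbed line should report a larger SSE than the fitted one, which is what "minimizes the sum of squared errors" means in practice.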
