High School / Advanced Statistics and Data Science I (ABC)
9.2 Specifying the Height Model with GLM Notation
Here is how we specify a regression model in which we have a single quantitative explanatory variable (such as Height):
\[Y_i=b_0+b_1X_i+e_i\]
It might be useful to compare this notation to that used in the previous chapter to specify the two-group model (such as Height2Group):
\[Y_i=b_0+b_1X_i+e_i\]
The fact that the same notation represents both models is what is so beautiful about the General Linear Model. It is simple and elegant and can be applied across a wide variety of situations, including situations with categorical or quantitative explanatory variables. Although both models are specified using the same notation, the interpretation of the notation varies from situation to situation.
In the Height2Group model, \(X_i\) was dummy coded as either 0 or 1. The 1 did not represent a quantity, just whether the student was tall or not. In the Height model, \(X_i\) is literally the student's measured height in inches.
Coding \(X_i\) in these different ways leads to different, but related, interpretations of the \(b_1\) coefficient.
In the regression model, \(b_1\) still represents an adjustment to \(b_0\), but this time it is the amount of adjustment to make for every 1-unit change in Height. This is the definition of the slope of a line: the amount of “rise” for each one unit of “run”, i.e., how much \(Y_i\) changes for each one unit change in \(X_i\). \(b_1\) is, in fact, the slope of the best-fitting regression line.
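The two interpretations of \(b_1\) can be sketched in Python. This is only an illustration under stated assumptions: the data values, the 65-inch cutoff for "tall", and the names `fit_line`, `height`, and `y` are made up here and do not come from the course data set.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Made-up values for illustration only (not the course data).
height = [60, 62, 64, 66, 68, 70]  # X_i as measured height in inches
y = [55, 57, 60, 61, 64, 66]       # Y_i, a hypothetical outcome

# Dummy coding: 1 = tall, 0 = short. The 65-inch cutoff is an assumption.
height2group = [1 if h > 65 else 0 for h in height]

b0_group, b1_group = fit_line(height2group, y)
b0_height, b1_height = fit_line(height, y)

# In the two-group model, b1 is the difference between the group means.
short = [yi for yi, g in zip(y, height2group) if g == 0]
tall = [yi for yi, g in zip(y, height2group) if g == 1]
mean_difference = sum(tall) / len(tall) - sum(short) / len(short)

print("two-group b1:", b1_group, "difference in means:", mean_difference)
print("Height model b1 (change in Y per inch):", b1_height)
```

The same fitting function handles both codings of \(X_i\). Only the meaning of a "1-unit change" differs: moving from the short group to the tall group in one case, one additional inch in the other.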
In both models, the \(b_0\) coefficient represents an intercept, i.e., the value of \(Y_i\) when \(X_i = 0\). But in the Height2Group model, when \(X_i = 0\) it simply means that the student is short, which is the reference category for Height2Group. In the Height model, if \(X_i\) were equal to 0 it would literally mean that the student has a height of 0 inches! Zero is not a common-sense value for \(X_i\) when \(X_i\) is representing Height, but the model can still make a prediction for such a nonsensical student.
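As a small sketch of that last point, the prediction for \(X_i = 0\) is just \(b_0\), however implausible the value. The coefficients below are hypothetical numbers chosen for illustration, not estimates from the course data, and `predict` is a made-up helper name.

```python
# Hypothetical coefficients, chosen only for illustration.
b0 = -11.0  # intercept: predicted Y when Height is 0 inches
b1 = 1.1    # slope: change in predicted Y per additional inch


def predict(height_inches):
    """Model prediction b0 + b1 * X for a given height in inches."""
    return b0 + b1 * height_inches


at_zero = predict(0)      # equals b0: a prediction for a 0-inch student
at_typical = predict(65)  # a prediction within a plausible height range

print("prediction at Height = 0:", at_zero)
print("prediction at Height = 65:", at_typical)
```

With these assumed numbers the intercept is negative, which makes no sense as a predicted outcome for a real student. It still anchors the line, so that predictions within the observed range of heights come out reasonable.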
Connection to Algebra
In algebra, a straight line is often represented by the equation \(y = mx+ b\), where the \(m\) is called the slope and the \(b\) is called the y-intercept.
In statistics, we use that same linear equation, but we switch it around so the intercept comes first (\(y = b+mx\)), and we use different letters to represent the intercept and slope (\(b_0\) and \(b_1\), respectively).
Fitting a regression model is a matter of finding the particular line (i.e., slope and intercept) that best fits the data (i.e., that minimizes the sum of squared errors).
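One way to see what "best fits" means is to compute the sum of squared errors (SSE) for the least-squares line and for a few nearby lines. The sketch below uses made-up data and hypothetical names (`sse`, `least_squares`); it illustrates the idea and is not the course's own code.

```python
def sse(b0, b1, x, y):
    """Sum of squared errors for the line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


def least_squares(x, y):
    """Closed-form least-squares intercept and slope."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - b1 * mean_x, b1


# Made-up values for illustration only (not the course data).
height = [60, 62, 64, 66, 68, 70]
y = [55, 57, 60, 61, 64, 66]

b0, b1 = least_squares(height, y)
best_sse = sse(b0, b1, height, y)

# Nudge the intercept and slope; each nearby line should fit worse.
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.05), (0.0, -0.05), (0.5, -0.05)]
other_sses = [sse(b0 + d0, b1 + d1, height, y) for d0, d1 in nudges]

print("SSE of fitted line:", best_sse)
print("SSE of nearby lines:", other_sses)
```

Every nudged line has a larger SSE than the fitted one, which is what it means for the fitted slope and intercept to minimize the sum of squared errors.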