9.2 Specifying the Height Model with GLM Notation
Here is how we specify a regression model with a single quantitative explanatory variable (such as Height):
\[Y_i=b_0+b_1X_i+e_i\]
It might be useful to compare this notation to that used in the previous chapter to specify the two-group model (such as Height2Group):
\[Y_i=b_0+b_1X_i+e_i\]
The fact that the same notation represents both models is what is so beautiful about the General Linear Model. It is simple and elegant and can be applied across a wide variety of situations, including situations with categorical or quantitative explanatory variables. Although both models are specified using the same notation, the interpretation of the notation varies from situation to situation.
In the Height2Group model, \(X_i\) was dummy coded as either 0 or 1. The 1 did not represent a quantity, just whether the student was tall or not. In the Height model, \(X_i\) is coded literally as the student's measured height in inches.
Coding \(X_i\) in these different ways leads to different, but related, interpretations of the \(b_1\) coefficient.
In the Height2Group model, \(b_1\) was the adjustment added to \(b_0\) for students coded 1 (the tall group). In the regression model, \(b_1\) still represents an adjustment to \(b_0\), but this time it is the amount of adjustment to make for every 1-unit change in Height. This is the definition of the slope of a line: the amount of “rise” for each one unit of “run”, i.e., how much the predicted value of \(Y_i\) changes for each one-unit change in \(X_i\). \(b_1\) is, in fact, the slope of the best-fitting regression line.
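To make the two interpretations of \(b_1\) concrete, here is a minimal Python sketch that fits the same equation to both codings of \(X_i\). Everything in it is an assumption made for illustration: the six data points are invented, the outcome is left unspecified, the 66-inch cutoff for "tall" is arbitrary, and fit_line is a hypothetical helper that uses the standard least-squares formulas rather than the course's own software.

```python
from statistics import mean

def fit_line(x, y):
    """Return least-squares estimates (b0, b1) for Y_i = b0 + b1*X_i + e_i."""
    x_bar, y_bar = mean(x), mean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data, invented for illustration only.
height = [60, 62, 64, 66, 68, 70]    # X_i in inches
outcome = [55, 57, 60, 59, 62, 64]   # Y_i, an unspecified outcome

# Dummy coding with an assumed cutoff: 1 = tall (66 inches or more), 0 = short.
height2group = [1 if h >= 66 else 0 for h in height]

# Same notation, same fitting procedure, two different codings of X_i.
b0_group, b1_group = fit_line(height2group, outcome)
b0_reg, b1_reg = fit_line(height, outcome)

short_mean = mean(y for x, y in zip(height2group, outcome) if x == 0)
tall_mean = mean(y for x, y in zip(height2group, outcome) if x == 1)

# Two-group model: b1 is the difference between the two group means.
print(f"Height2Group: b1 = {b1_group:.3f}, "
      f"tall mean - short mean = {tall_mean - short_mean:.3f}")

# Regression model: b1 is the change in the prediction per additional inch.
one_inch_change = (b0_reg + b1_reg * 67) - (b0_reg + b1_reg * 66)
print(f"Height: b1 = {b1_reg:.3f}, change per inch = {one_inch_change:.3f}")
```

With dummy coding, a "1-unit change" in \(X_i\) is the jump from the short group to the tall group, which is why the two interpretations are related.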
In both models, the \(b_0\) coefficient represents an intercept, i.e., the predicted value of \(Y_i\) when \(X_i = 0\). But in the Height2Group model, \(X_i = 0\) simply means that the student is short, which is the reference category for Height2Group. In the Height model, if \(X_i\) were equal to 0 it would literally mean that the student has a height of 0 inches! Zero is not a sensible value for \(X_i\) when \(X_i\) represents Height, but the model can still make a prediction for such a nonsensical student.
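As a small illustration of the intercept in the Height model, the Python sketch below plugs \(X_i = 0\) into the equation. The coefficient values are hypothetical, chosen only to show the arithmetic; they are not estimates from any data in this course.

```python
# Hypothetical coefficients for a Height model, for illustration only.
b0 = 4.7    # intercept: the prediction when height is 0 inches
b1 = 0.84   # slope: the adjustment per additional inch of height

def predict(height_inches):
    """Model prediction b0 + b1 * X_i for a given height."""
    return b0 + b1 * height_inches

at_zero = predict(0)    # the "nonsensical student" with a height of 0 inches
at_65 = predict(65)     # a student with a realistic height

print(f"prediction at 0 inches:  {at_zero}")
print(f"prediction at 65 inches: {at_65:.2f}")
```

The prediction at 0 inches is simply \(b_0\): the model produces it even though no student could be that height.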
Connection to Algebra
In algebra, a straight line is often represented by the equation \(y = mx+ b\), where the \(m\) is called the slope and the \(b\) is called the y-intercept.
In statistics, we use that same linear equation, but we switch it around so the intercept comes first (\(y = b+mx\)), and we use different letters to represent the intercept and slope (\(b_0\) and \(b_1\), respectively).
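Written out step by step, the relabeling might look like this. It is only a restatement of the equations already given: reorder the terms, rename \(b\) as \(b_0\) and \(m\) as \(b_1\), and add the error term \(e_i\) from the model specification, which the algebra equation for a line does not include.
\[y = mx + b \quad\longrightarrow\quad y = b + mx \quad\longrightarrow\quad Y_i = b_0 + b_1X_i + e_i\]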
Fitting a regression model is a matter of finding the particular line (i.e., slope and intercept) that best fits the data (i.e., that minimizes the sum of squared errors).
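The idea that the best-fitting line minimizes the sum of squared errors can be checked with a short Python sketch. The data are invented for illustration, and fit_line and sse are hypothetical helper names. The estimates come from the standard least-squares formulas, which is an assumption about how the fitting is done rather than a description of the course's software.

```python
from statistics import mean

def fit_line(x, y):
    """Return least-squares estimates (b0, b1) for Y_i = b0 + b1*X_i + e_i."""
    x_bar, y_bar = mean(x), mean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

def sse(b0, b1, x, y):
    """Sum of squared errors between the data and the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, invented for illustration only.
height = [60, 62, 64, 66, 68, 70]    # X_i in inches
outcome = [55, 57, 60, 59, 62, 64]   # Y_i, an unspecified outcome

b0, b1 = fit_line(height, outcome)
best_sse = sse(b0, b1, height, outcome)

# Nudge the intercept or the slope slightly away from the fitted values.
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.05), (0.0, -0.05)]
nudged_sse = [sse(b0 + d0, b1 + d1, height, outcome) for d0, d1 in nudges]

print(f"best-fitting line: SSE = {best_sse:.3f}")
for (d0, d1), value in zip(nudges, nudged_sse):
    print(f"b0 {d0:+.2f}, b1 {d1:+.2f}: SSE = {value:.3f}")
```

Every nudged line has a larger sum of squared errors than the fitted one, which is what "best fitting" means here.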