3.7 Specifying the HomeSizeK Model with GLM Notation
Here is how we specify a regression model in which we have a single quantitative explanatory variable (such as HomeSizeK):
\[Y_i=b_0+b_1X_i+e_i\]
If this looks familiar, that’s because it is! The same GLM notation is used for the two-group model (e.g., the Neighborhood model). Both are two-parameter models, meaning that we estimate two parameters: \(b_0\) and \(b_1\).
The fact that the same notation represents both models is what is so beautiful about the General Linear Model. It is simple and elegant and can be applied across a wide variety of situations, including situations with categorical or quantitative explanatory variables. Although both models are specified using the same notation, the interpretation of the notation varies from situation to situation.
In the Neighborhood model, \(X_i\) was dummy coded as either 0 or 1. The 1 did not represent a quantity, just whether the home was in Old Town or not. In the HomeSizeK model, \(X_i\) is a genuine quantity: the home's living space, measured in thousands of square feet.
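To make the two codings concrete, here is a minimal sketch in Python. The three homes and the `SquareFeet` field are invented for illustration; they are not rows or column names from the course's data.

```python
# Hypothetical homes, invented for illustration (not rows from the course's data).
# The "SquareFeet" field name is also an assumption.
homes = [
    {"Neighborhood": "College Creek", "SquareFeet": 1500},
    {"Neighborhood": "Old Town", "SquareFeet": 2250},
    {"Neighborhood": "Old Town", "SquareFeet": 980},
]

# Neighborhood model: X is a dummy code, 1 for Old Town and 0 otherwise.
# The 1 is a label for group membership, not a quantity.
x_neighborhood = [1 if h["Neighborhood"] == "Old Town" else 0 for h in homes]

# HomeSizeK model: X is a quantity, living space in thousands of square feet.
x_home_size_k = [h["SquareFeet"] / 1000 for h in homes]

print(x_neighborhood)  # [0, 1, 1]
print(x_home_size_k)   # [1.5, 2.25, 0.98]
```

Both lists play the role of \(X_i\) in the same equation; only the meaning of the numbers differs.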
Coding \(X_i\) in these different ways leads to different, but related, interpretations of the \(b_1\) coefficient.
In the regression model, \(b_1\) still represents an adjustment to \(b_0\), but this time it is the adjustment to make for every 1-unit increase in HomeSizeK (that is, for every additional 1,000 square feet). This is the definition of the slope of a line: the amount of “rise” for each unit of “run”, i.e., how much the predicted value of \(Y_i\) changes for each one-unit change in \(X_i\). \(b_1\) is, in fact, the slope of the best-fitting regression line.
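The "rise over run" reading of \(b_1\) can be checked with a small Python sketch. The coefficient values below are hypothetical, chosen only to keep the arithmetic simple; they are not estimates from the course's data.

```python
# Hypothetical coefficient values, chosen only for illustration
# (they are not estimates from the course's data).
b0 = 100.0  # intercept: predicted Y when X is 0
b1 = 50.0   # slope: change in predicted Y per 1-unit change in X


def predict(x):
    """Return the model prediction b0 + b1 * x (the error term is left out)."""
    return b0 + b1 * x


# Moving from 1.0 to 2.0 (thousand square feet) changes the prediction by b1.
rise = predict(2.0) - predict(1.0)
print(rise)  # 50.0

# The same holds for any other 1-unit step, e.g. from 2.5 to 3.5.
print(predict(3.5) - predict(2.5))  # 50.0
```

Whatever the starting point, a one-unit step in `x` moves the prediction by exactly `b1`, which is what it means for the model to be a straight line.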
In both models, the \(b_0\) coefficient represents an intercept, i.e., the predicted value of \(Y_i\) when \(X_i = 0\). In the Neighborhood model, \(X_i = 0\) simply means that the home is in College Creek, which is the reference category for Neighborhood. In the HomeSizeK model, \(X_i = 0\) would literally mean that the home has 0 square feet of living space! Zero is not a sensible value for HomeSizeK, but the model can still make a prediction for such a nonsensical house.
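Substituting particular values of \(X_i\) into the equation shows what each coefficient does. The symbol \(\hat{Y}_i\) for the model's prediction (the equation without the error term \(e_i\)) is an assumption introduced for this sketch.

\[\hat{Y}_i = b_0 + b_1(0) = b_0 \quad \text{when } X_i = 0\]
\[\hat{Y}_i = b_0 + b_1(1) = b_0 + b_1 \quad \text{when } X_i = 1\]

In the Neighborhood model these are the only two predictions the model can make: \(b_0\) for a home in College Creek and \(b_0 + b_1\) for a home in Old Town. In the HomeSizeK model they are just two points on the line: the prediction for a home with 0 square feet and the prediction for a home with 1,000 square feet.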
Connection to Algebra
In algebra, a straight line is often represented by the equation \(y = mx+ b\), where the \(m\) is called the slope and the \(b\) is called the y-intercept.
In statistics, we use that same linear equation, but we switch it around so the intercept comes first (\(y = b+mx\)), and we use different letters to represent the intercept and slope (\(b_0\) and \(b_1\), respectively).
Fitting a regression model is a matter of finding the particular line (i.e., slope and intercept) that best fits the data (i.e., that minimizes the sum of squared errors).
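As a rough illustration of what "best fits" means, the Python sketch below computes the least-squares slope and intercept from the standard closed-form formulas and then checks that nudging either coefficient makes the sum of squared errors larger. The five data points are made up for the example, and the helper name `sse` is hypothetical; in the course itself this fitting is done in R.

```python
# Hypothetical toy data, invented for illustration (not the course's data):
# x = home size in thousands of square feet, y = price in thousands of dollars.
x = [1.0, 1.5, 2.0, 2.5, 3.0]
y = [150.0, 190.0, 210.0, 260.0, 290.0]


def sse(b0, b1):
    """Sum of squared errors for the line with intercept b0 and slope b1."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form least-squares estimates for a single explanatory variable.
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x

print(b0, b1)       # 80.0 70.0
print(sse(b0, b1))  # 150.0

# Any nearby line fits worse: changing either coefficient increases the SSE.
print(sse(b0 + 1, b1) > sse(b0, b1))  # True
print(sse(b0, b1 + 1) > sse(b0, b1))  # True
```

The comparison against nudged lines is only a spot check, not a proof, but it shows the idea: among all possible slopes and intercepts, the fitted pair is the one with the smallest sum of squared errors.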