9.4 Comparing Regression Models to Group Models

Comparing the Height2Group Model and the Height Model

We now know how to specify and fit two different kinds of models: group models (e.g., Height2Group_model) and regression models (e.g., Height_model). Let’s think for a bit about the similarities and differences between these models.

Group Model
  $Y_i = b_0 + b_1 X_i + e_i$
  $\text{Thumb}_i = b_0 + b_1 \text{Height2Grouptall}_i + e_i$

Regression Model
  $Y_i = b_0 + b_1 X_i + e_i$
  $\text{Thumb}_i = b_0 + b_1 \text{Height}_i + e_i$

Symbol $Y_i$
  Group model: Thumb length of a student i
  Regression model: Thumb length of a student i

Symbol $b_0$
  Group model: Predicted thumb length when $\text{Height2Group}_i = 0$ (mean thumb length for short group)
  Regression model: Predicted thumb length when $\text{Height}_i = 0$ (y-intercept for regression line)

Symbol $b_1$
  Group model: Adjustment to predicted thumb length for a tall student (the mean difference between the two group means)
  Regression model: Adjustment to predicted thumb length for a one-unit increase in height (the slope of the regression line)

Symbol $X_i$
  Group model: Height2Group of a student i, coded as 0 = not-tall, 1 = tall
  Regression model: Height of a student i in inches

Symbol $e_i$
  Group model: Error for student i
  Regression model: Error for student i

Visualization of the model
  Group model: A jitter plot of Thumb by Height2Group with the model predictions in red.
  Regression model: A scatter plot of Thumb by Height with the model predictions in red.
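The comparison above can be tried out in R. The sketch below is only an illustration: it does not use the real Fingers data but a small simulated stand-in (the data frame `sim` and all of its values are made up here), so the estimates it prints will not match the textbook's. It shows that the same lm() call produces a group model when the explanatory variable is categorical and a regression model when it is quantitative.

```r
# Simulated stand-in for the Fingers data (hypothetical values, NOT the real data set)
set.seed(1)
Height <- round(rnorm(40, mean = 66, sd = 3.5), 1)        # height in inches
Thumb  <- 60 + 0.9 * (Height - 66) + rnorm(40, sd = 7)    # thumb length in mm
Height2Group <- ifelse(Height > median(Height), "tall", "short")
sim <- data.frame(Thumb, Height, Height2Group)

# Group model: X is categorical, so lm() estimates a group mean and a mean difference
Height2Group_model <- lm(Thumb ~ Height2Group, data = sim)

# Regression model: X is quantitative, so lm() estimates a y-intercept and a slope
Height_model <- lm(Thumb ~ Height, data = sim)

# In the group model, b0 is the mean of the "short" group and
# b1 is the tall mean minus the short mean
mean_short <- mean(sim$Thumb[sim$Height2Group == "short"])
mean_tall  <- mean(sim$Thumb[sim$Height2Group == "tall"])
b_group <- coef(Height2Group_model)
print(b_group)
print(c(mean_short = mean_short, difference = mean_tall - mean_short))

# In the regression model, b0 is the prediction at Height = 0 and b1 is the slope
print(coef(Height_model))
```

The first two printed lines should agree with each other, which is the sense in which the group model's estimates are a group mean and a difference between group means.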


Fitting a Regression Model By Accident When You Don’t Want One

Although R is pretty smart about knowing which model to fit, it won’t always do the right thing. If you code the grouping variable with character strings such as “female” and “male” or “short” and “tall,” R will fit a group model because it treats a character variable as categorical. But if you code the same grouping variable as 1 and 2 (maybe you forgot to make it a factor), R will treat the numbers as quantities and fit the model as though the explanatory variable is quantitative.

For example, we’ve added a new variable to our Fingers data called GenderNum. Here is what the data look like.

  Thumb Gender GenderNum
1    66   male         2
2    64 female         1
3    56 female         1
4    70   male         2
5    52 female         1
6    62   male         2

The variables Gender and GenderNum carry the same information: students 2, 3, and 5 are in one group and students 1, 4, and 6 are in another. If we fit a model with Gender (and call it the Gender_model) or with GenderNum (and call it the GenderNum_model), we might expect the same estimates. Let’s try it.

Call:
lm(formula = Thumb ~ Gender, data = Fingers)

Coefficients:
(Intercept)   Gendermale
     58.256        6.447

Call:
lm(formula = Thumb ~ GenderNum, data = Fingers)

Coefficients:
(Intercept)    GenderNum
     51.809        6.447
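The same pattern can be reproduced on a small scale. The sketch below rebuilds only the six rows printed earlier (the data frame name `six_rows` is made up for this illustration) and fits both models to them. Because it uses six students rather than the full Fingers data, its estimates differ from the output above, but the relationship between the two models is the same.

```r
# Only the six rows shown earlier, not the full Fingers data
six_rows <- data.frame(
  Thumb     = c(66, 64, 56, 70, 52, 62),
  Gender    = c("male", "female", "female", "male", "female", "male"),
  GenderNum = c(2, 1, 1, 2, 1, 2)
)

# Character X: lm() treats it as categorical and fits a group model
Gender_model <- lm(Thumb ~ Gender, data = six_rows)

# Numeric X: lm() treats it as quantitative and fits a regression line
GenderNum_model <- lm(Thumb ~ GenderNum, data = six_rows)

print(coef(Gender_model))
print(coef(GenderNum_model))
```

On these six rows the group means are about 57.33 (female) and 66 (male). Both models therefore estimate the second coefficient as about 8.67, while the intercept is about 57.33 in the Gender model and about 48.67 in the GenderNum model.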

Because Gender is a factor (i.e., a categorical variable), lm() fits a group model. But for GenderNum, lm() thinks the 1 or 2 coding refers to a quantitative variable. Because we did not tell R to treat GenderNum as a factor, it fits a regression line instead of a two-group model. If it does that, the meaning of the estimates will not be what you expect for the group model.
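One way to avoid the accidental regression is to tell R explicitly that the numeric codes are categories, for example by wrapping the variable in factor() inside the model formula. The sketch below is an illustration on the six rows printed earlier, not on the full Fingers data; the names `six_rows` and `GenderNum_factor_model` are made up here.

```r
# Only the six rows shown earlier, not the full Fingers data
six_rows <- data.frame(
  Thumb     = c(66, 64, 56, 70, 52, 62),
  GenderNum = c(2, 1, 1, 2, 1, 2)
)

# GenderNum is stored as a number, which is why lm() would fit a regression line
print(is.numeric(six_rows$GenderNum))

# factor() marks the codes 1 and 2 as group labels, so lm() fits a two-group model
GenderNum_factor_model <- lm(Thumb ~ factor(GenderNum), data = six_rows)
print(coef(GenderNum_factor_model))

# The intercept is now the mean of the group coded 1, as in a group model
mean_group1 <- mean(six_rows$Thumb[six_rows$GenderNum == 1])
print(mean_group1)
```

With the factor() wrapper, the intercept matches the mean of the group coded 1, so the estimates can be read the same way as those of a group model fit to Gender.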

The $b_1$ estimate will be the same as in the two-group model because it represents the adjustment in predicted thumb length for a one-unit change in $X_i$. For Gender, a one-unit change goes from not male ($X_i = 0$) to male ($X_i = 1$). For GenderNum, a one-unit change similarly goes from not male ($X_i = 1$) to male ($X_i = 2$).
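This can be written out from the model equation. The notation $\hat{Y}$ for a predicted value is introduced here only for illustration, and $b_0$ stands for whichever intercept belongs to the model on that line. Under either coding the two groups sit one unit apart on $X$, so the difference between their predictions is $b_1$:

```latex
\begin{aligned}
\text{Gender (coded 0/1):}\quad
  & \hat{Y}_{X=1} - \hat{Y}_{X=0} = (b_0 + b_1 \cdot 1) - (b_0 + b_1 \cdot 0) = b_1 \\
\text{GenderNum (coded 1/2):}\quad
  & \hat{Y}_{X=2} - \hat{Y}_{X=1} = (b_0 + b_1 \cdot 2) - (b_0 + b_1 \cdot 1) = b_1
\end{aligned}
```

Because GenderNum takes only two distinct values, the fitted line passes through both group means, so this difference equals the difference between the group means (6.447 in the output above).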

Left panel: $b_1$ of the Gender model, a group model. A jitter plot of Thumb predicted by Gender (female and male), with the model overlaid as horizontal lines at the mean of each group. The horizontal distance between the groups is labeled as the one-unit change in Gender, and the vertical distance between the group means is labeled to say that we adjust predicted Thumb by 6.45.

Right panel: $b_1$ of the GenderNum model, an accidental regression model. A jitter plot of Thumb predicted by Gender (female and male), with the model overlaid as a regression line running through the mean of each group. The horizontal distance between the groups is labeled as the one-unit change in GenderNum, and the vertical distance between the group means is labeled to say that we adjust predicted Thumb by 6.45.


But the $b_0$ estimate will be different in the GenderNum model, where it represents the y-intercept of the regression line, or the predicted thumb length when $X_i$ equals 0. This prediction makes no sense, however, when there are only two groups and they are coded 1 and 2: no student has a GenderNum of 0. This is what makes it an accidental regression model.
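The two intercepts can be reconciled with a little arithmetic on the estimates printed earlier. The sketch relies on one fact: a line fit to an explanatory variable with only two distinct values passes through both group means. The notation $\hat{Y}$ for a predicted value is introduced here for illustration.

```latex
\begin{aligned}
\text{At GenderNum} = 1:\quad & b_0 + b_1 \cdot 1 = 58.256 \quad \text{(the female group mean)} \\
\text{so}\quad & b_0 = 58.256 - 6.447 = 51.809 \\
\text{At GenderNum} = 0:\quad & \hat{Y} = b_0 + b_1 \cdot 0 = 51.809
\end{aligned}
```

In other words, the GenderNum intercept is the female group mean extended one more unit to the left along the line, to a value of GenderNum that no student has.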

Left panel: $b_0$ of the Gender model. A jitter plot of Thumb predicted by Gender (female and male), with the model overlaid as horizontal lines at the mean of each group. The line for the mean of the female group is labeled to say that when Gender equals zero, predicted Thumb equals 58.26.

Right panel: $b_0$ of the GenderNum model. A jitter plot of Thumb predicted by Gender (female and male), with the model overlaid as a regression line running through the mean of each group. The point of the regression line nearest to the y-axis, where GenderNum equals zero, is labeled to say that when GenderNum equals zero, predicted Thumb equals 51.81.

