7.2 Using R to Fit the Group Model

Now that we know what the Gender model does – i.e., that it produces two different values for the predictions, one for students of each gender – let’s use R to fit the model to the data and see how the model makes these predictions.

In the code block below, we have filled in the code to fit the Gender model and save it as Gender_model. Use the pipe operator (%>%) to overlay model predictions of the Gender model on the jitter plot: gf_model(Gender_model).

require(coursekata)
# find best fitting model
Gender_model <- lm(Thumb ~ Gender, data = Fingers)
# add code to visualize the new model on the jitter plot
gf_jitter(Thumb ~ Gender, data = Fingers, width = .1) %>%
  gf_model(Gender_model)

A jitter plot of the distribution of Thumb by Gender in the Fingers data frame, overlaid with a red horizontal line in each group showing the group mean.

If you want to change the color of the model (as we did for the figure above), you can add in the argument color = "red" to the gf_model() function.
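As a minimal sketch of that option, the code below repeats the plot from the exercise with color = "red" added to gf_model(). It assumes the coursekata package is installed and that loading it makes gf_jitter(), gf_model(), the pipe, and the Fingers data frame available, as in the exercise above. The object name gender_plot is only an illustrative choice.

```r
library(coursekata)  # assumed to provide gf_jitter(), gf_model(), %>% and Fingers

# fit the two-group model, as in the exercise above
Gender_model <- lm(Thumb ~ Gender, data = Fingers)

# jitter plot with the model predictions drawn in red
# (gender_plot is an illustrative name, not something from the course)
gender_plot <- gf_jitter(Thumb ~ Gender, data = Fingers, width = .1) %>%
  gf_model(Gender_model, color = "red")

# printing the object draws the plot
gender_plot
```

Saving the plot in an object is optional; piping straight into gf_model() without an assignment draws the same figure.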

Generating Predictions from the Gender Model

The Gender model in the visualization is represented by two lines because it generates two different predictions depending on the Gender of the person. Let’s use the predict() function to see how this works.

require(coursekata)
# we have saved the Gender model for you
Gender_model <- lm(Thumb ~ Gender, data = Fingers)
# write code to generate predictions using this model
# no need to save the predictions
predict(Gender_model)

Notice that now, instead of generating a single number for everyone (like the empty model’s 60.1), the model generates one of two numbers (64.7 or 58.3) for each student. Using the code below, we have saved these predictions back into Fingers as a new variable called Gender_predict and printed out Gender, Thumb, and Gender_predict for the first 6 students.

Fingers$Gender_predict <- predict(Gender_model)
head(select(Fingers, Gender, Thumb, Gender_predict))
  Gender Thumb Gender_predict
1   male 66.00       64.70267
2 female 64.00       58.25585
3 female 56.00       58.25585
4   male 58.42       64.70267
5 female 74.00       58.25585
6 female 60.00       58.25585

The Gender model looks at the gender of each student before making its prediction. If the student is male, the model predicts the thumb length as 64.7 mm, and if female, it predicts 58.3 mm.
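One way to picture this rule is as a small function that checks the gender and returns one of two numbers. The sketch below is only an illustration: gender_prediction() is a hypothetical helper, not part of R or coursekata, and the two values are copied from the Gender_predict column printed above.

```r
# predictions copied from the Gender_predict column printed above
male_prediction   <- 64.70267
female_prediction <- 58.25585

# hypothetical helper that mimics the rule the Gender model follows
gender_prediction <- function(gender) {
  ifelse(gender == "male", male_prediction, female_prediction)
}

# one prediction per student, depending only on that student's gender
gender_prediction(c("male", "female", "female"))
```

In practice predict(Gender_model) does this work for us; the helper just makes the rule visible.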

We can run favstats() to confirm that these two model predictions are, in fact, the mean thumb lengths for students of each gender.

favstats(Thumb ~ Gender, data = Fingers)
  Gender min Q1 median     Q3   max     mean       sd   n missing
1 female  39 54     57 63.125 86.36 58.25585 8.034694 112       0
2   male  47 60     64 70.000 90.00 64.70267 8.764933  45       0

Yes, they are: the means of the two gender groups in the favstats() output are the same as the model predictions generated by the Gender_model.
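The same check can be done in code rather than by eye. The sketch below uses a tiny made-up data frame (toy, which is not the Fingers data) and base R only, so the names toy, toy_model, and expected are assumptions made for illustration. It compares each prediction with the mean of that row's group.

```r
# made-up illustration data (NOT the Fingers data frame)
toy <- data.frame(
  Gender = factor(c("female", "female", "female", "male", "male")),
  Thumb  = c(56, 60, 64, 62, 70)
)

# fit the two-group model and save its predictions
toy_model <- lm(Thumb ~ Gender, data = toy)
toy$Gender_predict <- as.numeric(predict(toy_model))

# group means computed directly from the data
female_mean <- mean(toy$Thumb[toy$Gender == "female"])
male_mean   <- mean(toy$Thumb[toy$Gender == "male"])

# what each row's prediction should be if the model predicts the group mean
expected <- ifelse(toy$Gender == "male", male_mean, female_mean)

print(toy)
print(all.equal(toy$Gender_predict, expected))
```

all.equal() is used instead of == because model predictions are floating-point numbers and can differ from the hand-computed means by tiny rounding errors.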

Interpreting the lm() Output for the Gender Model

We have fit the Gender model using lm() and then used this model to generate predictions. However, we have not yet looked at the best-fitting parameter estimates for the model.

Recall that for the empty model we estimated one parameter (\(b_0\)), the mean. For the two-group Gender model we are going to estimate two parameters (\(b_0\) and \(b_1\)). Based on the model predictions, we might expect these parameter estimates to be the mean for each gender. Let’s find out.

In the code block below we have fit and saved the Gender model in the R object Gender_model. Add some code to print out the model so we can look at the parameter estimates.

require(coursekata)
# we have saved the Gender model for you
Gender_model <- lm(Thumb ~ Gender, data = Fingers)
# print out the best fitting parameter estimates
Gender_model
Call:
lm(formula = Thumb ~ Gender, data = Fingers)

Coefficients:
(Intercept)   Gendermale
     58.256        6.447

As expected, we see two parameter estimates. But these are not the two estimates we might have expected from looking at the graph and predictions. We expected to get the two gender group means: around 58 and 65.

The estimate labeled Intercept in the output (58.256) seems like the mean thumb length of female students. But how should we interpret the second estimate (6.447), the one labeled Gendermale?

Jitter plot of Thumb predicted by Gender (female and male), with the mean of each group overlaid as a horizontal line. The line for the mean of female is labeled with the letter A, the line for the mean of male is labeled with the letter C, and the distance between the two group means is drawn with a vertical line and labeled with the letter B.


Call:
lm(formula = Thumb ~ Gender, data = Fingers)

Coefficients:
(Intercept)   Gendermale
     58.256        6.447

Looking more closely at the lm() output, the label Gendermale for the second parameter estimate is actually informative. R builds the label by joining the variable name (Gender) to the group name (male) with no punctuation in between. It tells us that the second estimate is the adjustment needed to get from the mean of the first group (female, labeled Intercept in the output) to the mean of the second group, male.

Sure enough, this works: 58.3 + 6.4 = 64.7. This is the Gender model’s predicted thumb length for a male student.
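The same arithmetic can be done in R with the unrounded estimates. This is just a sketch: the names b0, b1, and male_prediction are chosen here for illustration, and the two values are typed in from the lm() output above rather than extracted from the model.

```r
# parameter estimates copied from the lm() output above
b0 <- 58.256  # (Intercept): mean Thumb for female students
b1 <- 6.447   # Gendermale: adjustment to get from the female mean to the male mean

# the model's prediction for a male student
male_prediction <- b0 + b1
male_prediction  # 64.703, matching the male mean of 64.70267 up to rounding
```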

The Parameter Estimates \(b_0\) and \(b_1\)

Whereas the empty model was a one-parameter model (producing only one estimate, \(b_0\), for the grand mean), the Gender model is a two-parameter model (\(b_0\) and \(b_1\)). One of the parameters represents the mean for female students; the other is the amount that must be added to it to get the mean for male students (as seen in the picture below).

A jitter plot of the distribution of Thumb by Gender, overlaid with a red horizontal line in each group showing the group mean. The line for the female group is labeled as b-sub-zero, the line for the male group is not labeled, and there is a vertical line in between them to indicate the distance between the two group means and it is labeled as b-sub-1.

The output of lm() shows us the values of \(b_0\) and \(b_1\) that best fit the data.

Call:
lm(formula = Thumb ~ Gender, data = Fingers)

Coefficients:
(Intercept)   Gendermale
     58.256        6.447

The \(b_0\) parameter estimate represents the mean of the first group (female). The \(b_1\) parameter estimate represents the quantity that must be added to \(b_0\) to get the model prediction for the second group, which in this case is male.

A jitter plot of the distribution of Thumb by Gender in the Fingers data frame, overlaid with a red horizontal line in each group showing the group mean. The line for the female group is labeled as b-sub-zero, the line for the male group is labeled b-sub-zero plus b-sub-1, and there is a vertical line in between them to indicate the distance between the two group means and it is labeled as b-sub-1.
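To see how the two estimates combine into predictions, the sketch below rebuilds the predictions by hand. It uses a small made-up data frame (toy, not the Fingers data) and base R only, and the names toy, toy_model, X, and by_hand are assumptions made for illustration. It relies on two things that R does by default: the first level of a factor (here female) becomes the group whose mean is \(b_0\), and lm() codes the other group with a 0/1 indicator, which model.matrix() lets us inspect.

```r
# made-up illustration data (NOT the Fingers data frame)
toy <- data.frame(
  Gender = factor(c("female", "female", "female", "male", "male")),
  Thumb  = c(56, 60, 64, 62, 70)
)
toy_model <- lm(Thumb ~ Gender, data = toy)

# the first factor level is the group whose mean becomes b0
levels(toy$Gender)

# extract the two parameter estimates by name
b0 <- coef(toy_model)[["(Intercept)"]]
b1 <- coef(toy_model)[["Gendermale"]]

# the indicator lm() built for Gender: 0 for female rows, 1 for male rows
X <- model.matrix(toy_model)[, "Gendermale"]

# predictions rebuilt by hand: b0 for female rows, b0 + b1 for male rows
by_hand <- b0 + b1 * X
all.equal(unname(by_hand), unname(predict(toy_model)))
```

Because X is 0 for female students, their prediction is just \(b_0\); because X is 1 for male students, their prediction is \(b_0 + b_1\), which is the labeling shown in the figure.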
