7.3 Specifying and Fitting a Multivariate Model

We can see from visualizations of the data that a model that includes both Neighborhood and HomeSizeK might help us make better predictions of PriceK than would a model including only one of these variables. We can write this two-predictor model as a word equation: PriceK = Neighborhood + HomeSizeK + Error. Let’s now see how we would specify and fit such a model.
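If you want to reproduce that kind of visualization, here is a minimal sketch using gf_point() from the ggformula package (loaded automatically by coursekata); mapping Neighborhood to color is our choice here, and faceting would work just as well.

require(coursekata)

# scatter plot of PriceK against HomeSizeK, colored by Neighborhood,
# to see how both predictors relate to home prices
gf_point(PriceK ~ HomeSizeK, color = ~ Neighborhood, data = Smallville)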

Specifying a Multivariate Model in GLM Notation

Building on the notation we used for the one-predictor model, we will specify the two-predictor model like this:

\[Y_i = b_0+b_1X_{1i}+b_2X_{2i}+e_i\]

Although it may look more complicated, on closer examination you can see that it is similar to the single-predictor model in most ways. \(Y_i\) still represents the outcome variable PriceK, and \(e_i\), at the end, still represents each data point’s error from the model prediction. And, it still follows the basic structure: DATA = MODEL + ERROR.

Let’s unpack the MODEL part of the equation just a little. Whereas previously we had only one X in the model, we now have two (\(X_{1i}\) and \(X_{2i}\)). Each X represents a predictor variable. Because each one varies across observations, it carries the subscript i. To distinguish one X from the other, we label one with the subscript 1, the other with 2. The first of these will represent Neighborhood, the second, HomeSizeK, though which X we assign to which variable doesn’t really matter.

Notice, also, that with the additional \(X_{2i}\) we also add a new coefficient or parameter estimate: \(b_2\). We said before that the empty model is a one-parameter model because we are estimating only one parameter, \(b_0\). A single-predictor model (e.g., the home size model) is a two-parameter model: it has both a \(b_0\) and a \(b_1\). This multivariate model is a three-parameter model: \(b_0\), \(b_1\), and \(b_2\).

We can also write this model substituting the variable names for the Xs:

\[\text{PriceK}_i=b_0+b_1\text{Neighborhood}_i+b_2\text{HomeSizeK}_i+e_i\]

Fitting a Multivariate Model

Having specified the skeletal structure of the model, we next want to fit the model, which means finding the best-fitting parameter estimates (i.e., the values of \(b_0\), \(b_1\), and \(b_2\)). By “best-fitting” we mean the parameter estimates that reduce the error around the model’s predictions as much as possible.

Although there are several mathematical ways to do this, you can imagine the computer trying every possible combination of three numbers to find the set that results in the lowest Sum of Squares (SS) Error.
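To make that idea concrete, here is a toy sketch of such a search. This is not how lm() actually finds the estimates (it solves for them directly), the grid ranges and step sizes below are arbitrary assumptions, and the sketch uses the dummy coding of Neighborhood (1 for Eastside, 0 otherwise) that we explain later on this page.

# toy grid search: try many combinations of b0, b1, b2 and keep the one
# that produces the lowest Sum of Squares Error
sse <- function(b0, b1, b2) {
  eastside <- as.numeric(Smallville$Neighborhood == "Eastside")  # dummy code
  predictions <- b0 + b1 * eastside + b2 * Smallville$HomeSizeK
  sum((Smallville$PriceK - predictions)^2)                       # SS Error
}

grid <- expand.grid(b0 = seq(100, 250, 5),   # candidate values for b0
                    b1 = seq(-100, 0, 5),    # candidate values for b1
                    b2 = seq(0, 100, 5))     # candidate values for b2
grid$SSE <- mapply(sse, grid$b0, grid$b1, grid$b2)
grid[which.min(grid$SSE), ]  # the best combination in this coarse grid

A finer grid would get closer and closer to the estimates that lm() returns.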

It’s a bit like we are cooking up some model predictions and we’ll need to add a little of X1 (Neighborhood) and a little of X2 (HomeSizeK). The best-fitting estimates tell us how much of each to add (or subtract) in order to produce the best possible prediction of PriceK.

[Figure: clip-art cooking pot labeled “model predictions,” with two seasoning shakers labeled X1 and X2 tipped over it; a question mark on each pouring arrow asks how much of each seasoning to add.]

Now enter the lm() code into the window below and run it to get the best-fitting parameter estimates for the two-predictor model.

require(coursekata)

# use lm() to find the best-fitting coefficients for our multivariate model
lm(PriceK ~ Neighborhood + HomeSizeK, data = Smallville)
Call:
lm(formula = PriceK ~ Neighborhood + HomeSizeK, data = Smallville)
 
Coefficients:
         (Intercept)  NeighborhoodEastside             HomeSizeK  
              177.25                -66.22                 67.85 
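If you want to work with these estimates later, you can save the fitted model to an R object and pull them out with coef(); the name model_multi below is just our choice.

# save the fitted model, then extract the three estimates as a named vector
model_multi <- lm(PriceK ~ Neighborhood + HomeSizeK, data = Smallville)
coef(model_multi)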

In some ways, this output looks familiar to us. Let’s try to figure out what these parameter estimates mean.

Using the output of lm(PriceK ~ Neighborhood + HomeSizeK, data = Smallville), we can write our best-fitting model in GLM notation as:

\[Y_i = 177.25 + -66.22X_{1i} + 67.85X_{2i}\]

As with the single-predictor model, R re-codes Neighborhood, a categorical variable, as a dummy variable and gives it the name NeighborhoodEastside. R codes this dummy variable, represented in the equation as \(X_{1i}\), as 1 if the house is in Eastside, and 0 if it is not in Eastside.
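One way to see this dummy coding in action is to ask the fitted model for predictions. In this sketch we assume the non-Eastside houses in Smallville are in a neighborhood called Downtown (substitute whichever level actually appears in your data) and, just for illustration, compare two homes with a HomeSizeK of 2:

# predict PriceK for two same-sized homes, one in each neighborhood;
# the difference between the two predictions is the Eastside coefficient
model_multi <- lm(PriceK ~ Neighborhood + HomeSizeK, data = Smallville)
predict(model_multi, newdata = data.frame(
  Neighborhood = c("Downtown", "Eastside"),  # "Downtown" is an assumption
  HomeSizeK = c(2, 2)
))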

We also can write the best-fitting model like this, which will help us remember how Neighborhood is dummy coded:

\[\text{PriceK}_i = 177.25 + -66.22\text{NeighborhoodEastside}_i + 67.85\text{HomeSizeK}_i\]
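Plugging in the two possible values of the dummy variable shows why this way of writing the model is handy: it splits into two parallel prediction lines, one for each neighborhood.

\[\text{Eastside } (X_{1i}=1)\text{: } \text{PriceK}_i = 177.25 - 66.22 + 67.85\text{HomeSizeK}_i = 111.03 + 67.85\text{HomeSizeK}_i\]

\[\text{Not Eastside } (X_{1i}=0)\text{: } \text{PriceK}_i = 177.25 + 67.85\text{HomeSizeK}_i\]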
