2.4 Explaining Variation in an Outcome Variable
Word Equations
We’ve learned several ways to visualize the relationship between an outcome variable and an explanatory variable in a data frame, including gf_point(), gf_jitter(), and gf_boxplot().
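As a reminder, here is a minimal sketch of what those visualizations might look like in R. (We are assuming the home sales data frame is called Ames and contains the PriceK and Neighborhood variables; your data frame may have a different name.)

# ggformula provides the gf_ plotting functions (already loaded in the course environment)
library(ggformula)

# jittered points of home prices (in thousands) by neighborhood
gf_jitter(PriceK ~ Neighborhood, data = Ames, width = 0.1)

# boxplots make it easier to compare the middles of the two distributions
gf_boxplot(PriceK ~ Neighborhood, data = Ames)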
We saw from our visualizations that Neighborhood explained some of the variation in home prices, with prices in College Creek being generally higher than prices in Old Town. We could also say there is a relationship between neighborhood and home prices.
Another way to represent relationships is with word equations. Here is a word equation that represents the relationship between home prices and neighborhood:
PriceK = Neighborhood + Other Stuff
We can read this equation like this: “Variation in PriceK is explained by variation in Neighborhood plus other stuff.” By convention, the outcome variable, PriceK, is written on the left of the equal sign and the explanatory variable, Neighborhood, is written on the right. (Note the similarity in structure to the R code: PriceK ~ Neighborhood.)
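The same formula structure shows up throughout R. As a rough sketch (again assuming the data frame is called Ames, and that the mosaic package loaded by the course environment is available), we could use it to summarize PriceK separately for each neighborhood:

library(mosaic)  # provides favstats()

# summary statistics of PriceK computed separately for each neighborhood
favstats(PriceK ~ Neighborhood, data = Ames)

Comparing the means and medians across the two rows of output is one way to see, in numbers, the pattern the boxplots showed us in pictures.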
It’s important to note that this word equation can be used to convey two different, but related, ideas. On one hand, it can be used to describe a relationship we see in data, for example, a pattern we might see in the kinds of visualizations we have been making. But it also can be used to represent a hypothesis about what might be true in the Data Generating Process (or DGP), even if we don’t have the relevant data.
This kind of equation is not the same as a mathematical equation. It doesn’t mean, for example, that home prices and neighborhoods are the same thing or are “equal.” It is, instead, an informal way of representing the idea that some of the variation in home prices is explained by variation between neighborhoods.
We saw that some of the variation in home prices could be explained by neighborhood. But neighborhood doesn’t explain all the variation. There is plenty of variation even within each neighborhood, and there is overlap between the prices in College Creek and Old Town. That’s why the equation includes Other Stuff at the end. Neighborhood explains some, but not all, of the variation in PriceK.
In the next section, we will discuss what this Other Stuff might be. But for now, it’s useful to think of the total variation in our outcome variable as the sum of variation due to an explanatory variable, plus variation due to other stuff.
We sometimes refer to these word equations as informal models. We will talk more about why we call them models later, but it’s helpful to start thinking of them as models.
Sources of Variation
This is a good time to think a little more about where variation in data comes from. There are three important points we want to make about sources of variation.
(1) Variation Can Be Either Explained or Unexplained
Consider the word equation PriceK = Neighborhood + Other Stuff. Explained variation is the portion of the total variation in the outcome (prices) we can attribute to the explanatory variable (neighborhood). The rest of the variation (or remaining variation after accounting for the explanatory variable) is left unexplained. Other stuff represents this unexplained variation. It’s useful to think of total variation as the sum of explained and unexplained variation.
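Later chapters develop tools for actually measuring these two pieces. Purely as a preview (and still assuming a data frame called Ames), R can partition the total variation in PriceK into an explained part and an unexplained part:

# fit a simple model of PriceK by Neighborhood, then look at the sums of squares
price_model <- lm(PriceK ~ Neighborhood, data = Ames)

# the Neighborhood row is the explained variation; the Residuals row is the unexplained variation
anova(price_model)

Don’t worry about the details of lm() and anova() yet; the point is just that “explained + unexplained = total” is something we can eventually compute, not just imagine.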
(2) Some Unexplained Variation Can Be Explained
Some of the unexplained variation in an outcome variable can be explained, if we add the right variables to our model. For example, if we added home size to our model in addition to neighborhood, some of the remaining variation in home prices, beyond that explained by neighborhood alone, might be explained.
There also might be other variables that could explain some of the variation in home prices if only we had measured them. For example, type of construction, quality of plumbing fixtures, age of the windows, etc. might all explain some of the variation in home prices.
The variation that could have been explained by other variables (whether measured or unmeasured), had we included them in the model, is part of Other Stuff. If we add an explanatory variable to the model and thereby explain more of the variation, the amount left unexplained (i.e., Other Stuff) decreases by the same amount. The work of the data analyst can be thought of as increasing the proportion of variation that is explained while decreasing the proportion left unexplained.
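For instance, if the data frame included a measure of home size, adding it to the model would typically shift some variation from the unexplained column to the explained column. The sketch below is hypothetical: HomeSizeK is a placeholder variable name, not necessarily a column in your data.

# hypothetical two-predictor model; HomeSizeK is an assumed column name
bigger_model <- lm(PriceK ~ Neighborhood + HomeSizeK, data = Ames)

# typically less is left in the Residuals row than with Neighborhood alone
anova(bigger_model)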
Can we ever explain all of the variation in an outcome? Almost certainly not. Even if we measured a lot of variables and added them to our model, there would still be some variation in home prices that we could not explain. This doesn’t mean that it could never be explained but just that, for now, it is too hard to explain.
(3) Unexplained Variation Can Be Thought of as Random
Practically speaking, there will always be unexplained variation. Statisticians deal with this unexplained variation by thinking of it as random. Even though we can’t fully explain why one home is more expensive than another, we can assume that, over time, the unexplained variation will be distributed in a predictable way.
For example, many of the statistical models you will learn about in this class assume that unexplained variation is randomly distributed as a normal distribution. Based on this assumption, we can say that the predictions made by these models are as likely to be too high as they are to be too low, and that most of the prediction errors will be off by just a little, with only a few being off by a lot. We will say more about this later in the book!
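One way to get a feel for this assumption is to simulate it. Here is a small sketch, not tied to the Ames data; the standard deviation of 30 (thousand dollars) is just an arbitrary choice for illustration.

library(ggformula)

# simulate 1000 "prediction errors" from a normal distribution
# centered at 0 with a standard deviation of 30
errors <- rnorm(1000, mean = 0, sd = 30)

# roughly symmetric: about as many too-high as too-low, most errors near 0
gf_histogram(~ errors)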