4.9 Sources of Variation
We have discussed what it means for an explanatory variable to explain variation in an outcome variable, and we have learned some ways to explore this idea with data visualizations. Let’s now zoom out a little and think more broadly about where variation in data comes from. There are three important points we want to make about sources of variation.
(1) Variation Can Be Either Explained or Unexplained
Consider the word equation Thumb = Height + Other Stuff. Explained variation is the portion of the total variation in the outcome (i.e., thumb length) that we can attribute to the explanatory variable (height). The rest of the variation (the variation remaining after accounting for the explanatory variable) is left unexplained. Other Stuff represents this unexplained variation. It’s useful to think of total variation as the sum of explained and unexplained variation.
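To make the idea of "total = explained + unexplained" concrete, here is a minimal sketch. It uses made-up heights and thumb lengths (not the Fingers data) and assumes, purely for illustration, that we summarize the Height part of the word equation with a least-squares line and measure variation with sums of squared deviations. The names `fit_line`, `ss_explained`, and `ss_unexplained` are our own for this sketch.

```python
# Made-up heights (inches) and thumb lengths (mm); illustration only,
# these are NOT values from the Fingers data.
heights = [60, 62, 64, 66, 68, 70, 72, 74]
thumbs = [52, 58, 55, 61, 60, 66, 63, 69]


def mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Return the least-squares intercept and slope for y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_line(heights, thumbs)
predicted = [b0 + b1 * h for h in heights]  # the "Height" part
thumb_mean = mean(thumbs)

# Total variation: how far each thumb is from the overall mean.
ss_total = sum((t - thumb_mean) ** 2 for t in thumbs)
# Explained variation: how far each prediction is from the overall mean.
ss_explained = sum((p - thumb_mean) ** 2 for p in predicted)
# Unexplained variation ("Other Stuff"): how far each thumb is from its prediction.
ss_unexplained = sum((t - p) ** 2 for t, p in zip(thumbs, predicted))

print(f"total:       {ss_total:.2f}")
print(f"explained:   {ss_explained:.2f}")
print(f"unexplained: {ss_unexplained:.2f}")
print(f"explained + unexplained: {ss_explained + ss_unexplained:.2f}")
```

With a least-squares line, the last printed value matches the total, which is one way to see the total being split into an explained part and an Other Stuff part.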
(2) Some Unexplained Variation Can Be Explained
Some of the unexplained variation in an outcome variable can be explained, if we add the right variables to our model. For example, we have data on Gender in the Fingers dataframe. If we added gender to our model (in addition to height), some of the remaining variation in thumb lengths, beyond that explained by height alone, might be explained.
There also might be other variables that could explain some of the variation in thumb lengths if only we had measured them. For example, nutritional intake, toe lengths, mother’s thumb lengths, etc. might all explain some of the variation in thumb lengths.
The variation that could have been explained by other variables (whether measured or unmeasured), if only we had included them in the model, is part of Other Stuff. If we add an explanatory variable to the model and hence explain more of the variation, the amount left unexplained (i.e., Other Stuff) decreases by that same amount. The work of the data analyst can be thought of as increasing the proportion of variation that is explained, while decreasing the proportion left unexplained.
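The trade-off between explained and unexplained variation can be sketched with a small example. The records below are made up (they are not from the Fingers data), and to keep the arithmetic simple we assume height is recorded only as "short" or "tall" and that each model predicts a thumb length with the mean of its group. Explained variation is computed here as total variation minus what is left unexplained. The helper `unexplained_ss` is a name we invented for this sketch.

```python
from collections import defaultdict

# Made-up records: (height group, gender, thumb length in mm). Illustration only.
records = [
    ("short", "female", 54), ("short", "female", 56),
    ("short", "male", 59), ("short", "male", 61),
    ("tall", "female", 60), ("tall", "female", 62),
    ("tall", "male", 66), ("tall", "male", 68),
]


def unexplained_ss(records, key):
    """Sum of squared deviations of each thumb length from its group mean,
    where the groups are defined by key(record)."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec[2])
    total = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        total += sum((v - group_mean) ** 2 for v in values)
    return total


# No explanatory variable: everyone is compared to the overall mean.
ss_total = unexplained_ss(records, key=lambda rec: "everyone")
# Height only, then height and gender together.
ss_unexp_height = unexplained_ss(records, key=lambda rec: rec[0])
ss_unexp_both = unexplained_ss(records, key=lambda rec: (rec[0], rec[1]))

ss_exp_height = ss_total - ss_unexp_height
ss_exp_both = ss_total - ss_unexp_both

print(f"total variation:            {ss_total:.1f}")
print(f"height only:   explained {ss_exp_height:.1f}, unexplained {ss_unexp_height:.1f}")
print(f"height+gender: explained {ss_exp_both:.1f}, unexplained {ss_unexp_both:.1f}")
```

In this toy data, adding gender raises the explained amount and lowers the unexplained amount by the same quantity, while the total stays fixed. Some variation is still left over, which fits the point that Other Stuff rarely shrinks to zero.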
Can we ever explain all of the variation in an outcome? Almost certainly not. Even if we measured a lot of variables and added them to our model, there would still be some variation in thumb lengths that we could not explain. This doesn’t mean that it could never be explained but just that, for now, it is too hard to explain.
(3) Unexplained Variation Can Be Thought of as Random
Practically speaking, there will always be unexplained variation. Statisticians deal with this unexplained variation by thinking of it as random. Even though we can’t fully explain why one thumb is longer than another, we can assume that, across many observations, the unexplained variation will be distributed in a predictable way.
For example, many of the statistical models you will learn about in this class assume that unexplained variation follows a normal distribution. Based on this assumption, we can say that the predictions made by these models are as likely to be too high as they are to be too low, and that most of the prediction errors will be off by just a little, with only a few being off by a lot. We will say more about this later in the book!
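A quick simulation can show what this assumption implies. The sketch below draws pretend prediction errors (observed minus predicted) from a normal distribution centered on zero. The spread of 8 mm and the number of draws are arbitrary values we chose for illustration, not estimates from any real data.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

SD = 8.0       # hypothetical spread of prediction errors, in mm
N = 10_000     # number of simulated errors

# Each error is (observed - predicted); positive means the prediction was too low.
errors = [random.gauss(0, SD) for _ in range(N)]

share_too_low = sum(e > 0 for e in errors) / N            # prediction below the observed value
share_too_high = sum(e < 0 for e in errors) / N           # prediction above the observed value
share_small = sum(abs(e) <= SD for e in errors) / N       # off by at most one SD
share_large = sum(abs(e) > 2 * SD for e in errors) / N    # off by more than two SDs

print(f"predictions too low:   {share_too_low:.3f}")
print(f"predictions too high:  {share_too_high:.3f}")
print(f"errors within 1 SD:    {share_small:.3f}")
print(f"errors beyond 2 SDs:   {share_large:.3f}")
```

Under the normal assumption we would expect roughly half of the predictions to be too low and half too high, about two thirds of the errors to fall within one standard deviation of zero, and only a small share (around 5%) to be more than two standard deviations off.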