High School / Advanced Statistics and Data Science I (ABC)
Chapter 4 - Explaining Variation
4.1 Outcome and Explanatory Variables
Examining distributions of single variables is always an important starting place. But as data analysts, our interests usually go beyond exploring patterns of variation in a single variable. We want to explain the variation.
Let’s start with an informal definition of “explain variation”: if knowing someone’s score on one variable helps you make a slightly better guess about that person’s score on another variable, then we can say that the first variable explains some variation in the second variable.
For example, if we knew someone’s height, we could probably make a more accurate prediction of their thumb length, assuming that taller people tend to have longer thumbs. Our prediction would not be very accurate, only more accurate than it would be if we didn’t know their height.
Informal Definition of Explain Variation: If we know a case’s value on one variable, we can make a better prediction of its value on another variable.
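To make the informal definition concrete, here is a minimal Python sketch (not part of the course materials). It uses the six Height and Thumb values from the data rows shown later in this section and compares two ways of guessing thumb length: one that ignores height and one that uses it. The 68.0 cutoff between "shorter" and "taller", the use of group means as guesses, and squared error as the yardstick are all assumptions made for illustration.

```python
from statistics import mean

# Six rows copied from the data table in this section: (Height, Thumb).
rows = [
    (70.5, 66.00),
    (64.8, 64.00),
    (64.0, 56.00),
    (70.0, 58.42),
    (68.0, 74.00),
    (68.0, 60.00),
]

CUTOFF = 68.0  # hypothetical dividing line between "shorter" and "taller"


def squared_error(guesses, actual):
    """Total squared distance between the guesses and the actual values."""
    return sum((g - a) ** 2 for g, a in zip(guesses, actual))


thumbs = [thumb for _, thumb in rows]

# Guess 1: ignore height and guess the overall mean thumb length for everyone.
overall_mean = mean(thumbs)
guesses_without_height = [overall_mean] * len(rows)

# Guess 2: use height by guessing the mean thumb length of the person's height group.
taller_mean = mean(t for h, t in rows if h >= CUTOFF)
shorter_mean = mean(t for h, t in rows if h < CUTOFF)
guesses_with_height = [taller_mean if h >= CUTOFF else shorter_mean for h, _ in rows]

error_without_height = squared_error(guesses_without_height, thumbs)
error_with_height = squared_error(guesses_with_height, thumbs)

print(f"error ignoring height: {error_without_height:.2f}")
print(f"error using height:    {error_with_height:.2f}")
```

If the total error using height comes out smaller, then in these rows height explains some of the variation in thumb length. With only six people the comparison is an illustration of the idea, not evidence about thumbs and heights in general.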
In this chapter we will learn how to represent a hypothesis about the relationship between two variables as a word equation. We then will learn how to create data visualizations (such as scatter plots, jitter plots, box plots, and histograms) to explore the hypothesis. (In later chapters, we’ll turn these word equations into mathematical functions we can use to actually make predictions such as someone’s thumb length based on their height.)
Outcome versus Explanatory Variables
Up to this point, we have distinguished between categorical variables and quantitative variables. But our desire to explain variation in one variable with variation in another variable leads us to make another distinction, that is, between an outcome variable and an explanatory variable.
The outcome variable is the variable whose variation we are trying to explain.
The explanatory variable is the variable we use to make better predictions of the outcome.
For now, the tools and methods we use will focus on a single outcome variable and a single explanatory variable at a time. But we want to prepare you for the possibility of using multiple explanatory variables to explain variation in one outcome.
You may or may not have heard the terms “outcome variable” and “explanatory variable”. We will use these terms throughout, but if you’ve taken statistics before, or read any research reports, you will no doubt have encountered a number of different terms used to represent the same distinction.
Instead of outcome variable, some people will say dependent variable (or DV), response variable, or output variable. For explanatory variable, you may hear the terms independent variable (or IV), predictor variable, treatment variable, experimental variable, or factor.
Word Equations
We can represent relationships between outcome variables and explanatory variables with word equations. Here is a word equation that represents the relationship between thumb length and height:
thumb length = height + other stuff
The term other stuff at the end of the word equation represents an important idea: even if knowing someone’s height can help us make a better prediction of their thumb length, the prediction won’t be perfect. While some of the variation in thumb length can be explained by variation in height, there will still be some variation that is not explained. This remaining variation could, presumably, be explained by other stuff.
  Gender SSLast Year Thumb Pinkie Height
1   male     NA    3 66.00   57.0   70.5
2 female      7    2 64.00   62.0   64.8
3 female      2    2 56.00   54.0   64.0
4   male      9    2 58.42   63.5   70.0
5 female      8    3 74.00   64.0   68.0
6 female      7    3 60.00   58.0   68.0
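Using the six rows above, the idea of other stuff can be made concrete with a short Python sketch (not part of the course materials). It splits each person's thumb length into a part guessed from height and a leftover part. The straight line fitted by least squares is only an assumed stand-in for "the part explained by height"; the word equation itself stays informal, and the course builds its actual models in later chapters.

```python
from statistics import mean

# Six rows copied from the data table above: (Height, Thumb).
rows = [
    (70.5, 66.00),
    (64.8, 64.00),
    (64.0, 56.00),
    (70.0, 58.42),
    (68.0, 74.00),
    (68.0, 60.00),
]

heights = [h for h, _ in rows]
thumbs = [t for _, t in rows]

# Assumption: describe "the part from height" with a least-squares straight line.
mean_height = mean(heights)
mean_thumb = mean(thumbs)
slope = sum((h - mean_height) * (t - mean_thumb) for h, t in rows) / sum(
    (h - mean_height) ** 2 for h in heights
)
intercept = mean_thumb - slope * mean_height

from_height_parts = []
other_stuff_parts = []
for height, thumb in rows:
    from_height = intercept + slope * height  # guess based on height alone
    other_stuff = thumb - from_height         # whatever height did not account for
    from_height_parts.append(from_height)
    other_stuff_parts.append(other_stuff)
    print(f"Thumb {thumb:6.2f} = {from_height:6.2f} (from Height) + {other_stuff:6.2f} (other stuff)")
```

Each printed line adds back up to the observed thumb length. The leftover column is not zero, which is the point of the other stuff term: knowing height does not produce perfect guesses.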
Here’s how to read a word equation: “Variation in Thumb can be explained by variation in Height plus other stuff.” By convention, the outcome variable, Thumb, is written to the left of the equal sign and the explanatory variable, Height, is written to the right.
Word equations are not the same as mathematical equations. It isn’t the case, for example, that thumb length and height are the same thing or are “equal.” A word equation is just an informal way of representing the idea that some of the variation in thumb length is explained by variation in height (the rest being explained by other stuff).
More generally, we could say that some of the variation in the outcome variable is explained by variation in the explanatory variable:
outcome = explanatory + other stuff
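The general pattern can also be written down as a small data structure. The sketch below is one hypothetical way to do that in Python; the class name WordEquation and its read_aloud method are invented for illustration and do not come from the course or from any library. It keeps the convention of putting the outcome on the left and the explanatory variable on the right, and always adds other stuff at the end.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WordEquation:
    """An informal model of the form: outcome = explanatory + other stuff."""

    outcome: str
    explanatory: Tuple[str, ...]

    def __str__(self) -> str:
        # Outcome on the left; explanatory variable(s) and "other stuff" on the right.
        terms = " + ".join(self.explanatory + ("other stuff",))
        return f"{self.outcome} = {terms}"

    def read_aloud(self) -> str:
        # Spell the equation out the way it is read in words.
        explained_by = " and ".join(f"variation in {name}" for name in self.explanatory)
        return f"Variation in {self.outcome} can be explained by {explained_by} plus other stuff."


thumb_model = WordEquation(outcome="Thumb", explanatory=("Height",))
print(thumb_model)               # Thumb = Height + other stuff
print(thumb_model.read_aloud())
```

Storing the explanatory variables as a tuple is a design choice that leaves room for more than one of them, for example WordEquation("Thumb", ("Height", "Gender")), although for now each word equation uses a single explanatory variable.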
We will start to refer to these word equations as informal models. A model airplane isn’t the real thing, but it will give you a good idea of what a real plane looks like. Models give us a simplified representation of what the relationship between variables might look like. We will quantify these relationships as formal models later, but it’s helpful to start thinking of word equations as models.
Using variation in one variable (explanatory) to explain variation in another (outcome) variable is at the heart of statistical analysis. This is where you start learning how your theories about the world can be supported by data, or not. Although this chapter is longer than the prior chapters, take your time going through it. Because the concepts are important, the effort and hard work you put into this chapter will pay off later as you learn how to create and test statistical models of the world!