High School / Advanced Statistics and Data Science I (ABC)

Chapter 4 - Explaining Variation

4.1 Outcome and Explanatory Variables

Examining distributions of single variables is always an important starting place. But as data analysts, our interests usually go beyond exploring patterns of variation in a single variable. We want to explain the variation.

Let’s start with an informal definition of “explain variation”: if knowing someone’s score on one variable helps you make a slightly better guess about that person’s score on another variable, then we can say that the first variable explains some variation in the second variable.

For example, if we knew someone’s height, we could probably make a more accurate prediction of their thumb length, assuming that taller people tend to have longer thumbs. Our prediction would not be very accurate, only more accurate than it would be if we didn’t know their height.

Informal Definition of Explain Variation: If we know a case’s value on one variable, we can make a better prediction of its value on another variable.
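
To make the informal definition concrete, here is a minimal Python sketch that is not part of the course materials. It uses the six Thumb and Height values from the table later in this section and compares two ways of guessing thumb length: one that ignores height and one that uses it. The cutoff of 66, the use of group averages as guesses, and total squared error as the measure of accuracy are all assumptions made for illustration. Six rows are far too few to draw real conclusions from.

```python
# Six (Height, Thumb) pairs copied from the table later in this section.
heights = [70.5, 64.8, 64.0, 70.0, 68.0, 68.0]
thumbs = [66.00, 64.00, 56.00, 58.42, 74.00, 60.00]


def mean(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Guess 1: ignore height and guess the overall mean thumb length for everyone.
overall_guess = mean(thumbs)
error_without_height = sum((t - overall_guess) ** 2 for t in thumbs)

# Guess 2: use height. CUTOFF is a hypothetical dividing line chosen only
# for illustration; it does not come from the course.
CUTOFF = 66.0
guess_taller = mean([t for h, t in zip(heights, thumbs) if h >= CUTOFF])
guess_shorter = mean([t for h, t in zip(heights, thumbs) if h < CUTOFF])

error_with_height = sum(
    (t - (guess_taller if h >= CUTOFF else guess_shorter)) ** 2
    for h, t in zip(heights, thumbs)
)

print(f"Total squared error ignoring height: {error_without_height:.2f}")
print(f"Total squared error using height:    {error_with_height:.2f}")
```

On these six rows the total squared error drops from about 210 to about 182. The guess that uses height is better but still far from perfect, which is all the informal definition asks for.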

In this chapter we will learn how to represent a hypothesis about the relationship between two variables as a word equation. We will then learn how to create data visualizations (such as scatter plots, jitter plots, box plots, and histograms) to explore the hypothesis. (In later chapters, we’ll turn these word equations into mathematical functions we can use to actually make predictions, such as predicting someone’s thumb length from their height.)

Outcome versus Explanatory Variables

Up to this point, we have distinguished between categorical variables and quantitative variables. But our desire to explain variation in one variable with variation in another variable leads us to make another distinction, that is, between an outcome variable and an explanatory variable.

The outcome variable is the variable whose variation we are trying to explain.

The explanatory variable is the variable we use to make better predictions of the outcome.

For now, the tools and methods we use will focus on a single outcome variable and a single explanatory variable at a time. But we want to prepare you for the possibility of using multiple explanatory variables to explain variation in one outcome.

You may or may not have heard the terms “outcome variable” and “explanatory variable”. We will use these terms throughout, but if you’ve taken statistics before, or read any research reports, you will no doubt have encountered a number of different terms used to represent the same distinction.

Instead of outcome variable, some people will say dependent variable (or DV), response variable, or output variable. For explanatory variable, you may hear the terms independent variable (or IV), predictor variable, treatment variable, experimental variable, or factor.

Word Equations

We can represent relationships between outcome variables and explanatory variables with word equations. Here is a word equation that represents the relationship between thumb length and height:

thumb length = height + other stuff

The term other stuff at the end of the word equation represents an important idea: even if knowing someone’s height can help us make a better prediction of their thumb length, the prediction won’t be perfect. While some of the variation in thumb length can be explained by variation in height, there will still be some variation that is not explained. This remaining variation could, presumably, be explained by other stuff.
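
The idea of other stuff can be sketched numerically. The Python sketch below is an illustration under stated assumptions, not the method the course develops in later chapters. It splits each of the six thumb lengths from the table in this section into a part guessed from height and a leftover part. The cutoff of 66 and the group-average guess are hypothetical choices.

```python
# Six (Height, Thumb) pairs copied from the table in this section.
heights = [70.5, 64.8, 64.0, 70.0, 68.0, 68.0]
thumbs = [66.00, 64.00, 56.00, 58.42, 74.00, 60.00]

# Hypothetical cutoff for calling someone "taller"; chosen only for illustration.
CUTOFF = 66.0


def mean(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# The "height" part of each guess: the average thumb length of the people
# on the same side of the cutoff.
taller_mean = mean([t for h, t in zip(heights, thumbs) if h >= CUTOFF])
shorter_mean = mean([t for h, t in zip(heights, thumbs) if h < CUTOFF])

parts = []
for h, t in zip(heights, thumbs):
    from_height = taller_mean if h >= CUTOFF else shorter_mean
    other_stuff = t - from_height  # whatever height does not account for
    parts.append((t, from_height, other_stuff))
    print(f"thumb {t:6.2f} = height part {from_height:6.2f} "
          f"+ other stuff {other_stuff:+6.2f}")
```

Every row has a nonzero leftover, so height alone does not account for all of the variation in thumb length in these six rows.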

The first six rows of the data include the variables Thumb and Height:

     Sex SSLast Year Thumb Pinkie Height
1   male     NA    3 66.00   57.0   70.5
2 female      7    2 64.00   62.0   64.8
3 female      2    2 56.00   54.0   64.0
4   male      9    2 58.42   63.5   70.0
5 female      8    3 74.00   64.0   68.0
6 female      7    3 60.00   58.0   68.0

Here’s how to read this word equation: “Variation in Thumb can be explained by variation in Height plus other stuff.” By convention, the outcome variable, Thumb, is written to the left of the equal sign and the explanatory variable, Height, is written to the right.

Word equations are not the same as mathematical equations. It isn’t the case, for example, that thumb length and height are the same thing or are “equal.” A word equation is just an informal way of representing the idea that some of the variation in thumb length is explained by variation in height (the rest being explained by other stuff).

More generally, we could say that some of the variation in the outcome variable is explained by variation in the explanatory variable:

outcome = explanatory + other stuff

We will start to refer to these word equations as informal models. A model airplane isn’t the real thing, but it will give you a good idea of what a real plane looks like. Models give us a simplified representation of what the relationship between variables might look like. We will quantify these relationships as formal models later, but it’s helpful to start thinking of word equations as models.

Using variation in one variable (explanatory) to explain variation in another (outcome) variable is at the heart of statistical analysis. This is where you start learning how your theories about the world can be supported by data, or not. Although this chapter is longer than the prior chapters, take your time going through it. Because the concepts are important, the effort and hard work you put into this chapter will pay off later as you learn how to create and test statistical models of the world!
