High School / Advanced Statistics and Data Science I (ABC)
4.2 Visualizing Relationships with Scatter Plots
Now that we have a word equation, let’s make a data visualization to explore the relationship between the outcome variable and the explanatory variable.
Thumb = Height + other stuff
Here’s some code to make a scatter plot:
gf_point(Thumb ~ Height, data = Fingers)
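Let’s break down this line of code into parts. First, take a look at the first part of the code inside the parentheses, Thumb ~ Height:
gf_point(Thumb ~ Height, data = Fingers)
This part is a formula. The variable to the left of the ~ (Thumb) is the outcome variable, and the variable to the right (Height) is the explanatory variable.
Now look at the part that starts with data = (here, data = Fingers):
gf_point(Thumb ~ Height, data = Fingers)
This part tells R which data frame contains the two variables.
Finally, take a look at the beginning part of the code, gf_point:
gf_point(Thumb ~ Height, data = Fingers)
gf_point() is the name of an R function that will make a scatter plot of the relationship between height and thumb length.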
Alright, enough explanation. Let’s write some code to explore the hypothesis expressed in the word equation Thumb = Height + other stuff in a scatter plot.
require(coursekata)

# fill in the variables and data frame
gf_point( ~ , data = )

# completed code
gf_point(Thumb ~ Height, data = Fingers)
Interpreting the Scatter Plot
Here’s the graph produced by the gf_point() function.
By convention, we put the outcome variable (in this case Thumb) on the y-axis and the explanatory variable (Height) on the x-axis. Each point in the scatter plot represents a single student (a row) in the Fingers data frame.
As we look at this graph, let’s keep in mind the relationship we hypothesized between Thumb and Height:
Thumb = Height + other stuff
In general, students who are taller (i.e., farther to the right) also tend to have longer thumbs (i.e., they tend to be closer to the top of the graph). This pattern illustrates what we mean when we say that some of the variation in Thumb is explained by variation in Height. If we know someone’s height, we can make a better prediction of their thumb length than we could if we didn’t know their height.
Even though we can make a better prediction of thumb length if we know height, we can’t make a perfect prediction. That’s where the “other stuff” comes in. We can see in the scatter plot that there are a few students with heights of 70 inches (colored in purple). Even though they all have the same height, there is still variation in their thumb lengths, due presumably to other stuff.
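To make the “other stuff” idea concrete, here is a small sketch in base R. The data frame toy and its values are invented for illustration (they are not taken from Fingers); the point is only that students who share a height can still differ in thumb length.

```r
# Hypothetical mini data set: these values are made up for illustration
# and are NOT taken from the Fingers data frame.
toy <- data.frame(
  Height = c(62, 66, 70, 70, 70, 74),
  Thumb  = c(55, 58, 57, 63, 68, 66)
)

# Keep only the students who are 70 inches tall
same_height <- toy[toy$Height == 70, ]

# Same height, different thumb lengths: the spread between the smallest
# and largest value is variation that Height cannot explain
thumb_range <- range(same_height$Thumb)
print(thumb_range)
```

If height explained everything, the smallest and largest thumb lengths in this group would be identical.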
How R Knows Which Variable to Put on Which Axis
As pointed out earlier, it is customary to put the outcome variable on the y-axis and the explanatory variable on the x-axis. But how does R know which variable should be on the y-axis?
Here, again, is the code used to create the scatter plot above:
gf_point(Thumb ~ Height, data = Fingers)
Try modifying the code below to put Height on the y-axis and Thumb on the x-axis. (We also added some code to show you how to change the color of the data points: color = "purple".)
require(coursekata)

# modify this code
gf_point(Thumb ~ Height, data = Fingers, color = "purple")

# modified code: Height is now on the y-axis and Thumb on the x-axis
gf_point(Height ~ Thumb, data = Fingers, color = "purple")
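The answer is in the formula: R puts the variable to the left of the ~ on the y-axis and the variable to the right of the ~ on the x-axis. Even though you can put the outcome variable on either axis, it is more common to put it on the y-axis. We will follow that convention because it conforms with what people expect, and thus makes the plots easier to interpret.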
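If you are curious about what R sees in a formula, the sketch below uses base R to pull apart the two sides of Thumb ~ Height. It only inspects the formula itself; it does not show how gf_point() works internally, and the names f, lhs, and rhs are ours, chosen for illustration.

```r
# A formula can be stored in an object and inspected like any other value.
# No data frame is needed here because the formula is never evaluated.
f <- Thumb ~ Height

# Left side of the ~ : the variable gf_point() places on the y-axis
lhs <- f[[2]]

# Right side of the ~ : the variable gf_point() places on the x-axis
rhs <- f[[3]]

print(lhs)
print(rhs)
```

Swapping the formula to Height ~ Thumb swaps lhs and rhs, which is why the axes trade places in the plot.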