4.2 Visualizing Relationships with Scatter Plots
Now that we have a word equation, let’s make a data visualization to explore the relationship between the outcome variable and the explanatory variable.
Thumb = Height + other stuff
Here’s some code to make a scatter plot:
gf_point(Thumb ~ Height, data = Fingers)
Let’s break down this line of code into parts. First, take a look at the first part of the code inside the parentheses, Thumb ~ Height:
gf_point(Thumb ~ Height, data = Fingers)
Now look at the part that starts with data =, which names the data frame that holds the variables:
gf_point(Thumb ~ Height, data = Fingers)
Finally, take a look at the beginning part of the code, gf_point:
gf_point(Thumb ~ Height, data = Fingers)
gf_point() is the name of an R function that will make a scatter plot of the relationship between height and thumb length.
Alright, enough explanation. Let’s write some code to explore the hypothesis expressed in the word equation Thumb = Height + other stuff in a scatter plot.
require(coursekata)
# fill in the variables and data frame
gf_point( ~ , data = )
# completed code: Thumb is the outcome variable, Height is the explanatory variable
gf_point(Thumb ~ Height, data = Fingers)
Interpreting the Scatter Plot
Here’s the graph produced by the gf_point() function.
By convention, we put the outcome variable (in this case Thumb) on the y-axis and the explanatory variable (Height) on the x-axis. Each point in the scatter plot represents a single student (a row) in the Fingers data frame.
As we look at this graph, let’s keep in mind the relationship we hypothesized between Thumb and Height:
Thumb = Height + other stuff
In general, students who are taller (i.e., farther to the right) also tend to have longer thumbs (i.e., they tend to be closer to the top of the graph). This pattern illustrates what we mean when we say that some of the variation in Thumb is explained by variation in Height. If we know someone’s height, we can make a better prediction of their thumb length than we could if we didn’t know their height.
Even though we can make a better prediction of thumb length if we know height, we can’t make a perfect prediction. That’s where the “other stuff” comes in. We can see in the scatter plot that there are a few students with heights of 70 inches (colored in purple). Even though they all have the same height, there is still variation in their thumb lengths - due, presumably, to other stuff.
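To make the idea of “other stuff” concrete, here is a minimal sketch that picks out the rows sharing one height and looks at their thumb lengths. The data frame toy and its values are made up for illustration and are not the real Fingers data; same_height is likewise just a name chosen for this example.

```r
# Made-up values for illustration only; NOT the real Fingers data
toy <- data.frame(
  Height = c(62, 65, 70, 70, 70, 73),
  Thumb  = c(52, 58, 55, 61, 66, 68)
)

# Keep only the rows that share one height
same_height <- subset(toy, Height == 70)

# Same Height, different Thumb values: this spread is the "other stuff"
same_height$Thumb
range(same_height$Thumb)
```

With coursekata loaded, replacing toy with Fingers should run the same check on the real data, since Fingers has both a Height and a Thumb variable.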
How R Knows Which Variable to Put on Which Axis
As pointed out earlier, it is customary to put the outcome variable on the y-axis and the explanatory variable on the x-axis. But how does R know which variable should be on the y-axis?
Here, again, is the code used to create the scatter plot above:
gf_point(Thumb ~ Height, data = Fingers)
Try modifying the code below to put Height on the y-axis and Thumb on the x-axis. (We also added some code to show you how to change the color of the data points: color = "purple".)
require(coursekata)
# modify this code
gf_point(Thumb ~ Height, data = Fingers, color = "purple")
# modified code: Height is now on the y-axis and Thumb is on the x-axis
gf_point(Height ~ Thumb, data = Fingers, color = "purple")
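If you want to try a color other than "purple", one quick check is whether R recognizes the name. The sketch below uses the base R function colors(), which lists R’s built-in color names; the vector candidates is just a name chosen for this example, and we are assuming the plotting functions accept these same names.

```r
# colors() lists the color names that base R recognizes
"purple" %in% colors()

# A few other named colors that could be tried in place of "purple"
candidates <- c("purple", "orange", "steelblue")
candidates %in% colors()
```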
R puts the variable written before the tilde (~) on the y-axis and the variable written after it on the x-axis. Even though you can put the outcome variable on either axis, it is conventional to put it on the y-axis. We will follow that convention because it matches what people expect, and thus makes the plots easier to interpret.
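One way to see how R reads Thumb ~ Height is to take the formula apart. This is a small sketch in base R, not a look inside gf_point() itself; the names f, g, lhs, and rhs are chosen for this example. It shows only that a formula keeps track of which variable was written on which side of the tilde.

```r
# A formula is stored without being evaluated, so no data is needed here
f <- Thumb ~ Height

lhs <- deparse(f[[2]])  # variable written before the tilde
rhs <- deparse(f[[3]])  # variable written after the tilde

cat("before the tilde (y-axis in gf_point):", lhs, "\n")
cat("after the tilde (x-axis in gf_point):", rhs, "\n")

# Swapping the two sides swaps the roles
g <- Height ~ Thumb
deparse(g[[2]])
```

Because the order is carried by the formula, switching the axes of a scatter plot only requires switching the two sides of the tilde.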