High School / Advanced Statistics and Data Science I (ABC)

4.2 Visualizing Relationships with Scatter Plots

Now that we have a word equation, let’s make a data visualization to explore the relationship between the outcome variable and the explanatory variable.

Thumb = Height + other stuff

Here’s some code to make a scatter plot:

gf_point(Thumb ~ Height, data = Fingers)

Let’s break down this line of code into parts. First, take a look at the first part of the code inside the parentheses ( ):

Thumb ~ Height

The tilde (~) separates the outcome variable (Thumb) from the explanatory variable (Height), mirroring the word equation.

Now look at the part that starts data =:

data = Fingers

This tells R which data frame contains the two variables.

Finally, take a look at the beginning part of the code:

gf_point()

gf_point() is the name of an R function that will make a scatter plot of the relationship between height and thumb length.
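
To see how the parts fit together, here is a minimal sketch in base R. It assumes a tiny made-up data frame called toy (hypothetical, standing in for Fingers) and only checks that every variable named in Thumb ~ Height is a column of the data frame passed to data =. It does not draw a plot.

```r
# Hypothetical stand-in for Fingers: three invented rows, for illustration only.
toy <- data.frame(
  Height = c(62, 66, 70),
  Thumb  = c(55, 60, 64)
)

# The first argument names the variables to plot.
plot_formula <- Thumb ~ Height
vars_named <- all.vars(plot_formula)   # the variable names used in the formula

# The data = argument tells R where to look for those variables.
vars_found <- vars_named %in% names(toy)

print(vars_named)
print(vars_found)
```

A misspelled variable name or data frame name is a common reason for a plot to fail, so a check like this can help when the code does not run.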

Alright, enough explanation. Let’s write some code to explore the hypothesis expressed in the word equation Thumb = Height + other stuff in a scatter plot.

require(coursekata)

# fill in the variables and data frame
gf_point( ~ , data = )

# completed code
gf_point(Thumb ~ Height, data = Fingers)

Interpreting the Scatter Plot

Here’s the graph produced by the gf_point() function.

A scatter plot of the distribution of Thumb by Height in Fingers.

By convention, we put the outcome variable (in this case Thumb) on the y-axis and the explanatory variable (Height) on the x-axis. Each point in the scatter plot represents a single student (a row) in the Fingers data frame.

As we look at this graph, let’s keep in mind the relationship we hypothesized between Thumb and Height:

Thumb = Height + other stuff

In general, students who are taller (i.e., farther to the right) also tend to have longer thumbs (i.e., they tend to be closer to the top of the graph). This pattern illustrates what we mean when we say that some of the variation in Thumb is explained by variation in Height. If we know someone’s height, we can make a better prediction of their thumb length than we could if we didn’t know their height.
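
The idea of a "better prediction" can be sketched with a few lines of base R. The data frame toy below is invented for illustration (it is not the Fingers data), and splitting students into a shorter and a taller group is just one simple, assumed way to use height when guessing thumb length.

```r
# Hypothetical data, invented for illustration only (not the Fingers data).
toy <- data.frame(
  Height = c(60, 62, 64, 66, 68, 70, 72, 74),
  Thumb  = c(52, 56, 55, 60, 59, 64, 63, 67)
)

# Guess 1: ignore height and guess the overall mean thumb length for everyone.
guess_all <- mean(toy$Thumb)
error_all <- mean(abs(toy$Thumb - guess_all))

# Guess 2: use height by guessing the mean of the shorter or the taller group.
taller <- toy$Height > median(toy$Height)
guess_by_group <- ave(toy$Thumb, taller)   # each row gets its own group's mean
error_by_group <- mean(abs(toy$Thumb - guess_by_group))

# Average size of the miss for each kind of guess.
print(c(ignoring_height = error_all, using_height = error_by_group))
```

In this made-up example the guesses that use height miss by less on average, which is what it means for height to explain some of the variation in thumb length.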

Even though we can make a better prediction of thumb length if we know height, we can’t make a perfect prediction. That’s where the “other stuff” comes in. We can see in the scatter plot that there are a few students with heights of 70 inches (colored in purple). Even though they all have the same height, there is still variation in their thumb lengths, presumably due to other stuff.

A scatter plot of Thumb predicted by Height. A vertical dashed line runs through the plot at Height = 70. The data points that this line runs through are shaded in purple.
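
The same point can be checked numerically. The sketch below uses a small invented data frame (hypothetical values, not the Fingers data) and picks out the rows where Height is exactly 70 to show that their thumb lengths still differ.

```r
# Hypothetical data, invented for illustration only (not the Fingers data).
toy <- data.frame(
  Height = c(64, 67, 70, 70, 70, 73),
  Thumb  = c(55, 59, 57, 62, 68, 66)
)

# Keep only the students whose height is exactly 70 inches.
same_height <- toy[toy$Height == 70, ]

# Same height, different thumb lengths: this leftover variation is the "other stuff".
print(same_height)
print(range(same_height$Thumb))
```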

How R Knows Which Variable to Put on Which Axis

As pointed out earlier, it is customary to put the outcome variable on the y-axis and the explanatory variable on the x-axis. But how does R know which variable should be on the y-axis?

Here, again, is the code used to create the scatter plot above:

gf_point(Thumb ~ Height, data = Fingers)

Try modifying the code below to put Height on the y-axis and Thumb on the x-axis. (We also added an argument, color = "purple", to show you how to change the color of the data points.)

require(coursekata)

# modify this code
gf_point(Thumb ~ Height, data = Fingers, color = "purple")

# modified version, with Height on the y-axis
gf_point(Height ~ Thumb, data = Fingers, color = "purple")

A scatter plot of Height predicted by Thumb length. The data points are shaded in purple. The data points show a positive trend.

R puts whichever variable appears before the tilde (~) on the y-axis and the variable after it on the x-axis. Even though you can put the outcome variable on either axis, it is more common to put it on the y-axis. We will follow that convention because it conforms with what people expect, and thus makes the plots easier to interpret.
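
In gf_point(), the variable written before the ~ goes on the y-axis and the variable written after it goes on the x-axis. The sketch below illustrates that rule in base R with a hypothetical helper, axes_for(), which is not part of coursekata or any other package. It only reads the two sides of the code and does not draw anything.

```r
# Hypothetical helper (an assumption for illustration, not a package function):
# report which variable sits on each side of the ~.
axes_for <- function(formula) {
  c(
    y = deparse(formula[[2]]),   # left of ~  -> y-axis
    x = deparse(formula[[3]])    # right of ~ -> x-axis
  )
}

original <- axes_for(Thumb ~ Height)
swapped  <- axes_for(Height ~ Thumb)

print(original)
print(swapped)
```

Swapping the two names around the ~ is all it takes to swap the axes, and the data = argument stays the same.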
