Course Outline

segmentGetting Started (Don't Skip This Part)

segmentStatistics and Data Science: A Modeling Approach

segmentPART I: EXPLORING VARIATION

segmentChapter 1  Welcome to Statistics: A Modeling Approach

segmentChapter 2  Understanding Data

segmentChapter 3  Examining Distributions

segmentChapter 4  Explaining Variation

4.3 Categorical Explanatory Variables

segmentPART II: MODELING VARIATION

segmentChapter 5  A Simple Model

segmentChapter 6  Quantifying Error

segmentChapter 7  Adding an Explanatory Variable to the Model

segmentChapter 8  Digging Deeper into Group Models

segmentChapter 9  Models with a Quantitative Explanatory Variable

segmentPART III: EVALUATING MODELS

segmentChapter 10  The Logic of Inference

segmentChapter 11  Model Comparison with F

segmentChapter 12  Parameter Estimation and Confidence Intervals

segmentChapter 13  What You Have Learned

segmentFinishing Up (Don't Skip This Part!)

segmentResources
list High School / Advanced Statistics and Data Science I (ABC)
4.3 Categorical Explanatory Variables
Hypotheses with Categorical Variables
We have learned how to translate hypotheses into word equations and then into scatter plots with a quantitative outcome and a quantitative explanatory variable (Thumb
and Height
are both quantitative). Now let’s try swapping out the quantitative explanatory variable of Height
with a categorical explanatory variable: Sex
.
In the code block below, make a scatter plot to explore this hypothesis.
require(coursekata)
# make a scatter plot of the word equation
# make a scatter plot of the word equation
gf_point(Thumb ~ Sex, data = Fingers)
ex() %>% check_function(., "gf_point") %>% {
check_arg(., "data") %>% check_equal()
check_arg(., "object") %>% check_equal()
}
Although this graph might look a little weird at first, it is also a scatter plot! It’s just that Sex
is a different kind of variable than Height
. There are many values of Height
: 59, 60.5, 72, just to name a few (it would be a long list to name them all!). The variable Sex
in contrast has only two possible values: “female” or “male”. So the dots are all lined up in one of two places on the horizontal xaxis: above the “female” or above the “male”.
Notice that the variable Thumb
still varies quantitatively so the data points are spread out along the vertical yaxis even though they are clustered on the xaxis.
We can look at the graph again in light of the hypothesized relationship we were investigating: Thumb = Sex + other stuff. Does the variable Sex
explain some of the variation in Thumb
?
If Sex
explains some of the variation in Thumb
then knowing whether a student is female or male would help us make a better prediction of their thumb length than if we didn’t know what sex they reported. If we know that a student is male, we would predict a longer thumb (maybe 65 mm, the middle of the male distribution) than if the student were female, in which case we might predict something closer to 60 mm. These wouldn’t be very accurate predictions, but they’d be a little bit better than if we didn’t know anything about the sex of the student.
Jitter Plots
Although a scatter plot is a common way to show the relationship between an outcome variable and an explanatory variable, in the case of a categorical explanatory variable (like Sex
) we often can’t see all the points because so many of them are on top of other points. One solution to this is to jitter the points around a little so that you can better see the individual points.
We’ll use the function gf_jitter()
to create a jitter plot depicting Thumb
by Sex
. This function works just like gf_point()
, except that the points will be randomly jittered both vertically and horizontally. As you can see below, the default jitter plot has an awful lot of jitter!
gf_jitter(Thumb ~ Sex, data = Fingers)
We can use the arguments height
and width
to adjust the amount of jitter. In this situation, we might want the vertical jitter (height
) to be set to 0 so that a person with a thumb length of 60 mm shows up right at 60 on the yaxis. We might also want to reduce the horizontal jitter (width
), but not so much that the points overlap too much. Note that these two arguments can be set to values between 0 and 1.
Try running the code below. To submit, reduce the height of the jitter to 0 and the width to 0.1.
require(coursekata)
# adjust height and width
gf_jitter(Thumb ~ Sex, data = Fingers, height = .1, width = .2)
# adjust height and width
gf_jitter(Thumb ~ Sex, data = Fingers, height = 0, width = .1)
ex() %>% check_function(., "gf_jitter") %>% {
check_arg(., "data") %>% check_equal()
check_arg(., "object") %>% check_equal()
check_arg(., "height") %>% check_equal()
check_arg(., "width") %>% check_equal()
}
If a point is in the Female column, it’s a female’s thumb length. But being more to the left or right within the female column doesn’t mean anything  it’s just random jitter. The jitter is there just so the points do not overlap too much and obscure how many females have that same thumb length.
In the jitter plot, a dense row of points shows that there are a lot of people with that thumb length. For instance, look at all the Female points at 60 mm, shown in purple below. More points means more females with that particular thumb length.
Extra Features for Scatter and Jitter Plots
In both scatter and jitter plots, you can change the size, shape, and transparency of the points by including the arguments size
, shape
, and alpha
, respectively.
Here are some different shapes you can use. For example, putting in the argument shape = 15
will result in squares rather than circles on a scatter or jitter plot.
Play around with the code below and try changing the size (put in numbers between 05 to start), shape (put in whole numbers between 1 and 20), and transparency (put in numbers between 01, 0 being more transparent and 1 being more opaque).
require(coursekata)
# adjust size, shape, and alpha
gf_point(Thumb ~ Height, data = Fingers, size = 3, shape = 15, alpha = .8)
# adjust size, shape, and alpha
gf_point(Thumb ~ Height, data = Fingers, size = 3, shape = 15, alpha = .8)
ex() %>% check_function(., "gf_point") %>% {
check_arg(., "data") %>% check_equal()
check_arg(., "object") %>% check_equal()
}