4.6 Categorical Outcomes
We have learned to express hypotheses in word equations and make appropriate data visualizations to explore these hypotheses with real data. Thus far, we have focused solely on hypotheses about quantitative outcome variables – e.g., thumb length. We can extend those same ideas to categorical outcome variables.
Example Study: MindsetMatters
The MindsetMatters data frame contains the results of an experiment in which a sample of housekeepers were randomly assigned to one of two conditions (recorded in the variable Condition). In the Informed condition (N=41) the housekeepers were told that the work they do satisfies the Surgeon General’s recommendations for an active lifestyle (which is true), and they were given some examples to illustrate why their work is considered good exercise. Housekeepers assigned to the Uninformed condition (N=34) were told nothing.
The researchers hypothesized that being informed in this way would lead housekeepers to actually become more fit and perhaps even to lose weight. Four weeks after the start of the study, researchers recorded whether each housekeeper lost weight in a categorical variable called WtLost (either lost or not lost). Below, we show a sample of data from 10 housekeepers for the two variables (Condition and WtLost).
Condition WtLost
1 Informed not lost
2 Informed wt lost
3 Informed not lost
4 Informed wt lost
5 Informed wt lost
6 Informed wt lost
7 Uninformed not lost
8 Uninformed wt lost
9 Uninformed not lost
10 Uninformed not lost
Faceted Bar Graphs
Because WtLost is a categorical outcome, we can’t graph its distribution in a histogram. Instead, we can use a bar graph. Try running the code in the code block below. Then replace gf_histogram() with gf_bar() to make a bar graph.
require(coursekata)

# Setup: WtLost is "lost" when the second weight (Wt2) is lower than the first (Wt)
MindsetMatters <- MindsetMatters %>%
  mutate(WtLost = ifelse(Wt2 < Wt, "lost", "not lost"))

# Starter code: edit this to make a more appropriate visualization
# for this outcome variable
gf_histogram(~WtLost, data = MindsetMatters)

# Solution: a bar graph suits a categorical outcome
gf_bar(~WtLost, data = MindsetMatters)
This graph shows us the outcome, whether housekeepers lost weight or not, but it doesn’t break the outcome down by Condition. To see if Condition might explain some of the variation in WtLost, we can add on the function gf_facet_grid() (as we can with any gf_ plot). We can facet the bar graphs either vertically or horizontally.
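[Figures: two faceted bar graphs of WtLost by Condition, one faceted vertically (left) and one faceted side by side (right).]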
Try making both types of faceted bar graphs in the code block below. Submit code after you have created side-by-side faceted bar graphs (the graph on the right).
require(coursekata)

MindsetMatters <- MindsetMatters %>%
  mutate(WtLost = ifelse(Wt2 < Wt, "lost", "not lost"))

# Create a faceted bar graph of WtLost by Condition
# (. ~ Condition places the panels side by side)
gf_bar(~ WtLost, data = MindsetMatters) %>%
  gf_facet_grid(. ~ Condition)
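The two facet orientations differ only in which side of the tilde holds the faceting variable. The sketch below is an illustration rather than part of the exercise: it retypes the ten sample rows shown earlier into a small data frame with the hypothetical name sample_rows, so it does not reproduce the graphs for the full MindsetMatters data. It assumes the ggformula package (which coursekata loads) is installed.

```r
library(ggformula)  # provides gf_bar(), gf_facet_grid(), and the %>% pipe

# The ten sample rows shown earlier, retyped by hand for illustration
# (hypothetical name; this is NOT the full MindsetMatters data frame)
sample_rows <- data.frame(
  Condition = c(rep("Informed", 6), rep("Uninformed", 4)),
  WtLost = c("not lost", "wt lost", "not lost", "wt lost", "wt lost", "wt lost",
             "not lost", "wt lost", "not lost", "not lost")
)

# Faceting variable on the left of the tilde: panels stacked vertically
stacked <- gf_bar(~ WtLost, data = sample_rows) %>%
  gf_facet_grid(Condition ~ .)

# Faceting variable on the right of the tilde: panels side by side
side_by_side <- gf_bar(~ WtLost, data = sample_rows) %>%
  gf_facet_grid(. ~ Condition)

print(stacked)
print(side_by_side)
```

The dot in each formula is a placeholder meaning "no faceting variable in this direction."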
There is a limitation in this graph. Because the sample sizes are different between the two groups (41 in the Informed group, 34 in the Uninformed), you have to look at the relative difference in the number of housekeepers who lost weight between the two groups, mentally controlling for the difference in sample size.
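To make that mental adjustment concrete, here is a minimal sketch in base R. It uses only the ten sample rows shown earlier, retyped by hand into a data frame with the hypothetical name sample_rows, so its numbers will not match the full data set. The idea is simply to divide each group's counts by that group's size.

```r
# The ten sample rows shown earlier, retyped by hand for illustration
# (hypothetical name; this is NOT the full MindsetMatters data frame)
sample_rows <- data.frame(
  Condition = c(rep("Informed", 6), rep("Uninformed", 4)),
  WtLost = c("not lost", "wt lost", "not lost", "wt lost", "wt lost", "wt lost",
             "not lost", "wt lost", "not lost", "not lost")
)

# Counts of each outcome within each condition (rows = Condition)
counts <- table(sample_rows$Condition, sample_rows$WtLost)
print(counts)
# Informed:   2 not lost, 4 wt lost (6 housekeepers)
# Uninformed: 3 not lost, 1 wt lost (4 housekeepers)

# Dividing each row by its own total puts both groups on the same 0-to-1 scale
props <- prop.table(counts, margin = 1)
print(round(props, 2))
# Informed:   4 of 6 lost weight (about 0.67)
# Uninformed: 1 of 4 lost weight (0.25)
```

Comparing 4 with 1 overstates the gap between the groups in this small sample because the Informed group has more people in it; comparing 0.67 with 0.25 does not have that problem.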
A simpler approach is to use the gf_props() function instead of gf_bar(). gf_props() shows the proportion of housekeepers who lost weight instead of the number of housekeepers. In the code window below, use gf_props() instead of gf_bar() to create a bar graph depicting the proportion of each condition that lost weight.
require(coursekata)

MindsetMatters <- MindsetMatters %>%
  mutate(WtLost = ifelse(Wt2 < Wt, "lost", "not lost"))

# Starter code: edit this
gf_bar(~ WtLost, data = MindsetMatters, fill = "purple") %>%
  gf_facet_grid(. ~ Condition)

# Solution: gf_props() plots proportions instead of counts
gf_props(~ WtLost, data = MindsetMatters, fill = "purple") %>%
  gf_facet_grid(. ~ Condition)
[Figures: the faceted bar graph made with gf_props() (left) and the one made with gf_bar() (right).]
The sample sizes of the two groups aren’t that different, but because there are fewer housekeepers in the Uninformed group, proportions are a better basis on which to compare the two groups. Roughly .68 of the Informed group lost weight, while a little less than .60 of the Uninformed group lost weight.
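Proportions like these can also be read from a table rather than estimated from the height of the bars. The sketch below is one possible way to do that, assuming the coursekata package is installed and makes the tally() function (from the mosaic package) available along with the MindsetMatters data frame. The object names counts and props are chosen here for illustration, and no output is shown because it depends on the data as loaded.

```r
library(coursekata)  # assumed to load MindsetMatters, mutate(), and tally()

# Recreate the categorical outcome used throughout this section
MindsetMatters <- MindsetMatters %>%
  mutate(WtLost = ifelse(Wt2 < Wt, "lost", "not lost"))

# Number of housekeepers in each WtLost category, by Condition
counts <- tally(WtLost ~ Condition, data = MindsetMatters)
print(counts)

# The same table expressed as proportions within each Condition
props <- tally(WtLost ~ Condition, data = MindsetMatters, format = "proportion")
print(props)
```

Each Condition column of the proportion table should sum to 1, which is what makes the two groups directly comparable despite their different sizes.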