2.11 Creating and Recoding Variables
Creating Summary Variables
Often we use multiple measures of a single attribute because no single measure would be adequate. For instance, it would be difficult to measure school achievement with a measure of performance from just one course. However, if you do have multiple measures, you probably will want to combine them into a single variable. In the case of school achievement, a good summary measure might be the average grade earned across all of a student’s courses.
It is quite common to create new variables that summarize values from other variables. For example, in Fingers, we have a measurement for the length of each person’s fingers (Thumb, Index, Middle, Ring, Pinkie). By now, you should be able to picture this as a data frame in which each person is a row and each finger length is a column.
Although for some purposes you may want to examine these finger lengths separately, you also might want to create a new variable based on these finger lengths. For example, in most people the index finger (the second digit) is shorter than the ring finger (the fourth digit). We can create a new summary variable called RingLonger that tells us whether someone’s ring finger is longer than their index finger. We can add this new variable to our Fingers data frame as a new column.
Fingers$RingLonger <- Fingers$Ring > Fingers$Index
Tally up how many people have longer ring fingers (relative to their own index finger).
require(coursekata)
# This code creates a variable called RingLonger
Fingers$RingLonger <- Fingers$Ring > Fingers$Index
# Write code to tally up RingLonger in Fingers
tally(Fingers$RingLonger)
tally(~RingLonger, data = Fingers)
RingLonger
 TRUE FALSE
   89    68
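Because R treats TRUE as 1 and FALSE as 0 in arithmetic, a logical variable like RingLonger can also be counted with sum() and turned into a proportion with mean(). The sketch below is only an illustration: toy_fingers and its values are made up (they are not the real Fingers data), and it uses base R's table() in place of tally() so that it runs without extra packages.

```r
# Hypothetical toy data: finger lengths in mm for five people (made-up values)
toy_fingers <- data.frame(
  Index = c(70, 68, 75, 66, 72),
  Ring  = c(73, 68, 71, 69, 74)
)

# The comparison is done row by row and gives TRUE or FALSE for each person
toy_fingers$RingLonger <- toy_fingers$Ring > toy_fingers$Index

# table() counts how often each value occurs
ring_counts <- table(toy_fingers$RingLonger)

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUEs
n_longer <- sum(toy_fingers$RingLonger)

# mean() of a logical vector is the proportion of TRUEs
prop_longer <- mean(toy_fingers$RingLonger)

print(ring_counts)
print(n_longer)
print(prop_longer)
```

Note that the second person in the toy data has equal index and ring lengths. Because > is a strict comparison, a tie is recorded as FALSE.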
You can also use arithmetic operators to summarize variables. For example, it turns out that the ratio of Index to Ring finger (that is, Index divided by Ring) is often used in health research as a crude measure of prenatal testosterone exposure. Use the division operator, /, to create this summary variable.
require(coursekata)
# Write code to create this summary variable
Fingers$IndexRingRatio <- Fingers$Index / Fingers$Ring
# Will this print anything?
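An assignment with <- stores its result without printing it, so creating the ratio shows nothing until you ask to see it. One way to sanity-check a ratio like this is to compare it with the logical variable from earlier: a ratio below 1 should go with a longer ring finger. The following is a minimal sketch on made-up values (toy_fingers is hypothetical, not the real Fingers data).

```r
# Hypothetical toy data: finger lengths in mm (made-up values)
toy_fingers <- data.frame(
  Index = c(70, 68, 75, 66),
  Ring  = c(73, 68, 71, 69)
)

# Division is applied row by row, just like the > comparison
toy_fingers$IndexRingRatio <- toy_fingers$Index / toy_fingers$Ring
toy_fingers$RingLonger <- toy_fingers$Ring > toy_fingers$Index

# A ratio below 1 means the index finger is shorter than the ring finger,
# so the two columns should agree in every row
ratios_agree <- all((toy_fingers$IndexRingRatio < 1) == toy_fingers$RingLonger)

# Nothing has been printed so far; print the results explicitly
print(toy_fingers)
print(ratios_agree)
```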
Whenever you create a new variable, or do anything else in R, it’s a good idea to check that R did what you intended. You can use the head() function for this. Print out the first six rows of Fingers, using select() to look at just Index, Ring, and IndexRingRatio. By looking at the index and ring fingers of a few students, you can see whether the IndexRingRatio variable ended up meaning what you thought it did.
require(coursekata)
Fingers <- Fingers %>%
  mutate(IndexRingRatio = Index / Ring)
# Use head() and select() together to look at the first six rows of Ring, Index, and IndexRingRatio
head(select(Fingers, Ring, Index, IndexRingRatio))
# These also work:
# select(head(Fingers), Ring, Index, IndexRingRatio)
# select(Fingers, Ring, Index, IndexRingRatio) %>% head()
# Fingers %>% select(Ring, Index, IndexRingRatio) %>% head()
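The same kind of check can be done in base R by picking columns by name inside square brackets, and one value can be recomputed by hand to confirm the new column. This is a sketch under stated assumptions: toy_fingers and its values are made up, and the bracket indexing is a base R stand-in for select(), not the approach the exercise asks for.

```r
# Hypothetical toy data with eight rows (made-up values, in mm)
toy_fingers <- data.frame(
  Thumb = c(60, 58, 63, 55, 61, 59, 62, 57),
  Index = c(70, 68, 75, 66, 72, 69, 74, 67),
  Ring  = c(73, 68, 71, 69, 74, 70, 72, 71)
)
toy_fingers$IndexRingRatio <- toy_fingers$Index / toy_fingers$Ring

# Pick three columns by name, then keep only the first six rows
first_rows <- head(toy_fingers[, c("Index", "Ring", "IndexRingRatio")])
print(first_rows)

# Recompute the first person's ratio by hand and compare it with the column
by_hand <- 70 / 73
matches <- isTRUE(all.equal(first_rows$IndexRingRatio[1], by_hand))
print(matches)
```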
It might be helpful to get an average finger length by adding up all the values of Thumb, Index, Middle, Ring, and Pinkie and dividing by 5. Write code for adding the variable AvgFinger to Fingers that does this. Write code to look at the first few lines of the Fingers data frame as well, so you can check that your calculations look correct.
require(coursekata)
# The starter code averaged the lengths of the Thumb and Pinkie only:
# Fingers$AvgFinger <- (Fingers$Thumb + Fingers$Pinkie)/2
# Modified to find the average length of all five fingers
Fingers$AvgFinger <- (Fingers$Thumb + Fingers$Index + Fingers$Middle + Fingers$Ring + Fingers$Pinkie)/5
# Look at a few lines of the Fingers data frame
head(Fingers)
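Adding five columns and dividing by 5 works well, but it gets tedious with more columns. As an alternative (not the method the exercise asks for), base R's rowMeans() averages across a set of columns for every row. The sketch below uses a made-up toy_fingers data frame and checks that both approaches give the same answer.

```r
# Hypothetical toy data: five finger lengths in mm for three people (made-up values)
toy_fingers <- data.frame(
  Thumb  = c(60, 58, 63),
  Index  = c(70, 68, 75),
  Middle = c(78, 76, 82),
  Ring   = c(73, 68, 71),
  Pinkie = c(56, 54, 59)
)

finger_cols <- c("Thumb", "Index", "Middle", "Ring", "Pinkie")

# Add the five columns and divide by 5, as in the text
avg_by_hand <- (toy_fingers$Thumb + toy_fingers$Index + toy_fingers$Middle +
                toy_fingers$Ring + toy_fingers$Pinkie) / 5

# rowMeans() averages the chosen columns for each row;
# unname() drops the row names it attaches to the result
toy_fingers$AvgFinger <- unname(rowMeans(toy_fingers[, finger_cols]))

print(toy_fingers)

# Confirm that both approaches give the same averages
same_result <- isTRUE(all.equal(toy_fingers$AvgFinger, avg_by_hand))
print(same_result)
```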
Recoding Variables
Sometimes you may want to change the way a variable is coded. For instance, the variable Job is coded 1 for no job, 2 for part-time job, and 3 for full-time job. Perhaps you want to recode full-time job as 100 (because it’s 100% time) instead of 3, part-time as 50 instead of 2, and no job as 0 instead of 1. The function recode() can be used like this:
recode(Fingers$Job, "1" = 0, "2" = 50, "3" = 100)
[1] 0 0 50 50 50 50 50 50 0 0 0 0 50 0 50 0 50 0
[19] 50 50 0 0 50 50 0 0 50 0 50 0 50 0 50 50 0 50
[37] 0 50 50 0 50 50 50 0 0 50 0 50 0 0 0 0 0 50
[55] 0 0 0 0 0 0 50 50 0 50 0 50 0 0 0 0 0 50
[73] 50 0 50 50 0 50 50 50 0 50 0 50 0 0 0 50 0 0
[91] 50 0 50 50 0 0 0 0 50 0 0 0 0 0 0 0 0 0
[109] 0 50 50 50 50 50 50 50 0 50 0 50 0 0 50 0 0 0
[127] 0 50 50 0 0 0 0 0 0 0 0 0 50 0 0 0 100 0
[145] 0 50 50 50 50 50 50 0 0 50 0 50 0
Note that in the recode() function, you need to put the old value in quotes; the new value can be in quotes (if it is a character value) or not (if it is numerical).
As always, if we want to keep a result, we need to save it. Try saving the recoded version of Job as JobRecode, a new variable in Fingers. Print a few observations of Job and JobRecode to check that your recode worked.
require(coursekata)
# Setup: store Job as a number so the codes 1, 2, and 3 can be recoded
Fingers <- Fingers %>%
  mutate(Job = as.numeric(Job))
# Save the recoded version of `Job` to `JobRecode`
Fingers$JobRecode <- recode(Fingers$Job, "1" = 0, "2" = 50, "3" = 100)
# Write code to print the first 6 observations of `Job` and `JobRecode`
head(select(Fingers, Job, JobRecode))
  Job JobRecode
1   1         0
2   1         0
3   2        50
4   2        50
5   2        50
6   2        50
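The first six rows shown here happen to include only Job values of 1 and 2, so they cannot confirm that 3 became 100. One way to check every value at once is to cross-tabulate the old codes against the new ones. The sketch below is a rough illustration on made-up data: job and lookup are hypothetical names, and a base R named-vector lookup stands in for recode() so that the example runs without extra packages.

```r
# Hypothetical toy data: Job coded 1 = no job, 2 = part-time, 3 = full-time
job <- c(1, 1, 2, 2, 3, 2, 1, 3)

# Named-vector lookup: the names are the old codes, the values are the new codes
lookup <- c("1" = 0, "2" = 50, "3" = 100)
job_recode <- unname(lookup[as.character(job)])

# Cross-tabulate old codes (rows) against new codes (columns)
check_table <- table(Job = job, JobRecode = job_recode)
print(check_table)

# In a clean recode, each old code lands in exactly one new-code column
one_to_one <- all(rowSums(check_table > 0) == 1)
print(one_to_one)
```

If a row of the table had counts in more than one column, some observations with the same old code would have received different new codes, which would point to a mistake in the recode.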
Summary
We have started our journey with data: what we end up with after we turn variation in the world into numbers. The process of creating data starts with sampling, followed by measurement. We organize data into columns and rows, where the columns represent the variables (e.g., Thumb) that we have measured, and the rows represent the objects to which we applied our measurement (e.g., students). Each cell of the table holds a value, representing that row’s measurement for that variable (such as one student’s thumb length).
Before analyzing data, we often want to manipulate it in various ways. We may create summary variables, filter out missing data, and so on.
But let’s keep our eye on the prize: we care about variation in data because we are interested in variation in the world. There is some greater population that a sample comes from. And here we see the ultimate problem with data: it won’t always look like the thing it came from. Much of statistics is devoted to understanding and dealing with this problem.