2.9 Creating and Recoding Variables

Creating Summary Variables

Often we use multiple measures of a single attribute because no single measure would be adequate. For instance, it would be difficult to measure school achievement with a measure of performance from just one course. However, if you do have multiple measures, you probably will want to combine them into a single variable. In the case of school achievement, a good summary measure might be the average grade earned across all of a student’s courses.

It is quite common to create new variables that summarize values from other variables. For example, in Fingers, we have a measurement for the length of each person’s fingers (Thumb, Index, Middle, Ring, Pinkie). By now, you should be able to picture this as a data frame in which each person is a row and the length of each finger is a column.

Although for some purposes you may want to examine these finger lengths separately, you also might want to create a new variable based on these finger lengths. For example, in most people the index finger (the second digit) is shorter than the ring finger (the fourth digit). We can create a new summary variable called RingLonger that tells us whether someone’s ring finger is longer than their index finger. We can add this new variable to our Fingers data frame as a new column.

Fingers$RingLonger <- Fingers$Ring > Fingers$Index

Tally up how many people have longer ring fingers (relative to their own index finger).

require(coursekata)

# This code creates a variable called RingLonger
Fingers$RingLonger <- Fingers$Ring > Fingers$Index

# Tally up RingLonger in Fingers
tally(~RingLonger, data = Fingers)

# This also works:
# tally(Fingers$RingLonger)
RingLonger
 TRUE FALSE
   89    68
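
The same idea can be tried on a small made-up data frame. The sketch below uses only base R, so table() stands in for tally(). The data frame toy_fingers and its values are hypothetical and are not taken from Fingers.

```r
# Hypothetical finger lengths (mm) for five people; not real Fingers data
toy_fingers <- data.frame(
  Index = c(70, 66, 75, 68, 72),
  Ring  = c(73, 64, 78, 68, 75)
)

# Comparing two columns returns one TRUE/FALSE value per row
toy_fingers$RingLonger <- toy_fingers$Ring > toy_fingers$Index

# Count how many rows are TRUE and how many are FALSE
ring_counts <- table(toy_fingers$RingLonger)
print(ring_counts)
```

Because > is a strict comparison, a person whose ring and index fingers are the same length (the fourth row here) is counted as FALSE.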

You can also use arithmetic operators to summarize variables. For example, it turns out that the ratio of Index to Ring finger (that is, Index divided by Ring) is often used in health research as a crude measure of prenatal testosterone exposure. Use the division operator, /, to create this summary variable.

require(coursekata)

# Create the summary variable IndexRingRatio
Fingers$IndexRingRatio <- Fingers$Index / Fingers$Ring

# Will this print anything?

Whenever you create a new variable, or do anything else in R, it’s a good idea to check that R did what you intended. You can use the head() function for this. Go ahead and print out the first six rows of Fingers, using select() to look at just Index, Ring, and IndexRingRatio. By looking at the index and ring fingers of a few students, you can see whether the IndexRingRatio variable ended up meaning what you thought it did.

require(coursekata)
Fingers <- Fingers %>% mutate(IndexRingRatio = Index/Ring)

# Use head() and select() together to look at the first six rows of Ring, Index, and IndexRingRatio
head(select(Fingers, Ring, Index, IndexRingRatio))

# These also work:
# select(head(Fingers), Ring, Index, IndexRingRatio)
# select(Fingers, Ring, Index, IndexRingRatio) %>% head()
# Fingers %>% select(Ring, Index, IndexRingRatio) %>% head()
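
To see what this kind of check looks like, here is a minimal sketch on a made-up data frame. It uses base R bracket indexing in place of select(). The name toy_fingers and its values are hypothetical, chosen so the ratios are easy to verify by hand.

```r
# Hypothetical finger lengths (mm); not real Fingers data
toy_fingers <- data.frame(
  Index = c(70, 66, 75),
  Ring  = c(70, 60, 80)
)

# Division is applied row by row, so each person gets their own ratio
toy_fingers$IndexRingRatio <- toy_fingers$Index / toy_fingers$Ring

# Show the inputs next to the result to confirm the calculation by eye
head(toy_fingers[, c("Index", "Ring", "IndexRingRatio")])
```

A ratio of 1 means the two fingers are the same length, a ratio above 1 means the index finger is longer, and a ratio below 1 means the ring finger is longer.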

It might be helpful to get an average finger length by adding up all the values of Thumb, Index, Middle, Ring, and Pinkie and dividing by 5. Write code for adding the variable AvgFinger to Fingers that does this. Write code to look at the first few lines of the Fingers data frame as well, so you can check that your calculations look correct.

require(coursekata)

# Find the average length of all five fingers
Fingers$AvgFinger <- (Fingers$Thumb + Fingers$Index + Fingers$Middle + Fingers$Ring + Fingers$Pinkie)/5

# Look at a few lines of the Fingers data frame
head(Fingers)
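
As an aside, base R also has a rowMeans() function that averages across columns for each row. The sketch below compares it with the add-and-divide approach on a made-up data frame. The names toy_fingers and finger_cols are hypothetical, and this is offered as an alternative rather than as the method this course expects.

```r
# Hypothetical finger lengths (mm) for two people; not real Fingers data
toy_fingers <- data.frame(
  Thumb  = c(60, 65),
  Index  = c(70, 75),
  Middle = c(80, 85),
  Ring   = c(72, 77),
  Pinkie = c(58, 63)
)

finger_cols <- c("Thumb", "Index", "Middle", "Ring", "Pinkie")

# Average written out by hand: add the five columns, divide by 5
avg_by_hand <- (toy_fingers$Thumb + toy_fingers$Index + toy_fingers$Middle +
                toy_fingers$Ring + toy_fingers$Pinkie) / 5

# rowMeans() averages the chosen columns within each row;
# unname() drops the row labels it attaches to the result
toy_fingers$AvgFinger <- unname(rowMeans(toy_fingers[, finger_cols]))

# Check that the two approaches agree
print(all.equal(avg_by_hand, toy_fingers$AvgFinger))
```

Writing the sum out by hand makes the calculation explicit, while rowMeans() is less error-prone when many columns are involved.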

Recoding Variables

There are some instances where you may want to change the way a variable is coded. For instance, the variable Job is coded 1 for no job, 2 for part-time job, and 3 for full-time job. Perhaps you want to recode full-time job as 100 (because it’s 100% time) instead of 3, part-time as 50 instead of 2, and no job as 0 instead of 1. The function recode() can be used like this:

recode(Fingers$Job, "1" = 0, "2" = 50, "3" = 100)
  [1]   0   0  50  50  50  50  50  50   0   0   0   0  50   0  50   0  50   0
 [19]  50  50   0   0  50  50   0   0  50   0  50   0  50   0  50  50   0  50
 [37]   0  50  50   0  50  50  50   0   0  50   0  50   0   0   0   0   0  50
 [55]   0   0   0   0   0   0  50  50   0  50   0  50   0   0   0   0   0  50
 [73]  50   0  50  50   0  50  50  50   0  50   0  50   0   0   0  50   0   0
 [91]  50   0  50  50   0   0   0   0  50   0   0   0   0   0   0   0   0   0
[109]   0  50  50  50  50  50  50  50   0  50   0  50   0   0  50   0   0   0
[127]   0  50  50   0   0   0   0   0   0   0   0   0  50   0   0   0 100   0
[145]   0  50  50  50  50  50  50   0   0  50   0  50   0

Note that in the recode() function, you need to put the old value in quotes; the new value can be in quotes (if it is a character value) or not (if it is numerical).

As always, if we want to keep the result of an operation, we need to save it. Try saving the recoded version of Job as JobRecode, a new variable in Fingers. Print a few observations of Job and JobRecode to check that your recode worked.

require(coursekata)
Fingers <- Fingers %>% mutate(Job = as.numeric(Job))

# Save the recoded version of `Job` to `JobRecode`
Fingers$JobRecode <- recode(Fingers$Job, "1" = 0, "2" = 50, "3" = 100)

# Print the first 6 observations of `Job` and `JobRecode`
head(select(Fingers, Job, JobRecode))
  Job  JobRecode
1   1          0
2   1          0
3   2         50
4   2         50
5   2         50
6   2         50
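
To make the old-value-to-new-value mapping concrete, the sketch below writes the same recoding as a lookup in base R, without recode(). The vectors job, job_lookup, and job_recode are hypothetical names with made-up values. This is one possible way to express the idea, not a replacement for the recode() call above.

```r
# Hypothetical Job codes: 1 = no job, 2 = part-time, 3 = full-time
job <- c(1, 1, 2, 3, 2)

# Named vector: names are the old codes (as text), values are the new codes
job_lookup <- c("1" = 0, "2" = 50, "3" = 100)

# Index the lookup by the old codes, then drop the leftover names
job_recode <- unname(job_lookup[as.character(job)])

# Put old and new codes side by side to check the mapping
print(data.frame(Job = job, JobRecode = job_recode))
```

The old codes appear in quotes here for the same reason they do in recode(): they are being used as names to look up, not as numbers to compute with.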

Summary

We have started our journey with data—what we end up with after we turn variation in the world into numbers. The process of creating data starts with sampling, and then measurement. We organize data into columns and rows, where the columns represent the variables (e.g., Thumb) that we have measured; and the rows represent the objects to which we applied our measurement (e.g., students). Each cell of the table holds a value, representing that row’s measurement for that variable (such as one student’s thumb length).

Before analyzing data, we often want to manipulate it in various ways. We may create summary variables, filter out missing data, and so on.

But let’s keep our eye on the prize: we care about variation in data because we are interested in variation in the world. There is some greater population that a sample comes from. And here we see the ultimate problem with data: it won’t always look like the thing it came from. Much of statistics is devoted to understanding and dealing with this problem.
