list

Statistics and Data Science: A Modeling Approach

2.7 Manipulating Data

Once data are in a tidy format, we can use R commands to manipulate the data in various ways. Let’s look at a few common things you might need to do before analyzing your data:

  • Identifying Missing Data
  • Filtering Data
  • Creating Summary Variables
  • Recoding Variables

Identifying Missing Data

Sometimes (in fact, usually) we end up with some missing data in our data set. R represents missing data with the value NA (not available), and then also lets you decide how to handle missing data in subsequent analyses. If your data set represents missing data in some other way (e.g., some people put the value -999), you should recode the values as NA when working in R.

Let’s consider the last digit of students’ Social Security Numbers (SSLast) in the Fingers data frame. First, arrange the Fingers data frame so that rows are in in descending order by SSLast (remember to save it). Then print out just the variable SSLast from the Fingers data frame (remember to use $).

require(tidyverse) require(mosaic) require(supernova) Fingers <- supernova::Fingers # Arrange SSLast in descending order Fingers <- # Print out just the variable SSLast from the Fingers data frame Fingers <- arrange(Fingers, desc(SSLast)) Fingers$SSLast ex() %>% { check_function(.,"arrange") check_function(., "desc") check_object(., "Fingers") %>% check_equal() check_output_expr(., "Fingers$SSLast") }
Did you use arrange() with the desc() function to order SSLast in descending order? Remember to print SSLast by typing Fingers$SSLast
DataCamp: ch2-18

Remember: your output may not look EXACTLY like what we have here; DataCamp will wrap the lines to fit within the smaller space. What matters is that the content is basically the same!

  [1] 9397 8894 7700 7549 7037 6990 6346 6292 6138 5461 5112 4800 3530 3364 3362
 [16] 2354 2019 1821 1339 1058  791  789  760    9    9    9    9    9    9    9
 [31]    9    9    9    9    9    9    9    9    8    8    8    8    8    8    8
 [46]    8    8    7    7    7    7    7    7    7    7    7    7    7    7    7
 [61]    7    7    6    6    6    6    6    6    6    5    5    5    5    4    4
 [76]    4    4    4    4    4    4    4    4    4    3    3    3    3    3    3
 [91]    3    3    3    3    3    3    3    3    3    3    2    2    2    2    2
[106]    2    2    2    2    2    1    1    1    1    1    1    1    0    0    0
[121]    0    0    0    0    0    0    0    0   NA   NA   NA   NA   NA   NA   NA
[136]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
[151]   NA   NA   NA   NA   NA   NA   NA

In R, blanks are automatically given the label “NA” for not available. You can choose to remove observations with missing data from an individual analysis, or you can remove them from the data set entirely.

For example, if you wanted to create a new data frame without the missing data, you could use a comparison operator (such as >, <, ==, !=) to check whether the data is missing or not.

This is a situation where it is more useful to think about what SSLast does not equal because there are a lot of numbers that students could have entered in. The phrase SSLast != “NA” means that SSLast does not equal “NA”. This statement would be true for anyone who actually entered in some numbers. This statement returns false for anyone who did not answer this question.

Filtering Data

We can use the filter() function, introduced previously, to remove observations with missing data from a data frame. For example:

filter(Fingers, SSLast != "NA")

This code returns a data frame that includes only cases for which the variable SSLast is not equal to NA. Note that the filter() function filters in, not out.

As with anything in R, your filtered data frame is only temporary unless you save it to an R object. So save the data with no missing SSLast values in a new data frame called Fingers.subset.

require(tidyverse) require(mosaic) require(supernova) Fingers <- supernova::Fingers %>% arrange(desc(SSLast)) # Filter out the students who have missing data for SSLast Fingers.subset <- # Print out the variable SSLast from Fingers.subset Fingers.subset <- filter(Fingers, SSLast != "NA") Fingers.subset$SSLast ex() %>% { check_function(., "filter") %>% { check_arg(., ".data") %>% check_equal() check_arg(., "...") %>% check_equal() } check_object(., "Fingers.subset") %>% check_equal() check_output_expr(., "Fingers.subset$SSLast", missing_msg = "Make sure to print out SSLast from the Fingers.subset data frame.") }
Make sure to filter IN: we want the SSLast data that is NOT NA, i.e. != NA.
DataCamp: ch2-19

  [1] 9397 8894 7700 7549 7037 6990 6346 6292 6138 5461 5112 4800 3530 3364 3362
 [16] 2354 2019 1821 1339 1058  791  789  760    9    9    9    9    9    9    9
 [31]    9    9    9    9    9    9    9    9    8    8    8    8    8    8    8
 [46]    8    8    7    7    7    7    7    7    7    7    7    7    7    7    7
 [61]    7    7    6    6    6    6    6    6    6    5    5    5    5    4    4
 [76]    4    4    4    4    4    4    4    4    4    3    3    3    3    3    3
 [91]    3    3    3    3    3    3    3    3    3    3    2    2    2    2    2
[106]    2    2    2    2    2    1    1    1    1    1    1    1    0    0    0
[121]    0    0    0    0    0    0    0    0

Remember, however, that if you remove cases with missing data you may be introducing bias into your sample.

Creating Summary Variables

Often we use multiple measures of a single attribute because no single measure would be adequate. For instance, it would be difficult to measure school achievement with a measure of performance from just one course. However, if you do have multiple measures, you probably will want to aggregate them into a single variable. In the case of school achievement, a good summary measure might be the average grade earned across all of a student’s courses.

It is quite common to create new variables that summarize values from other variables. For example, in Fingers, we have a measurement for the length of each person’s fingers (Thumb, Index, Middle, Ring, Pinkie). By now, you should imagine this in the data frame where each person is a row and the length of each finger is in a column.

Although for some purposes you may want to examine these finger lengths separately, you also might want to create a new variable based on these finger lengths. For example, in most people the index finger (the second digit) is shorter than the ring finger (the fourth digit). We can create a new summary variable called RingLonger that tells us whether someone’s ring finger is longer than their index finger. We can add this new variable to our Fingers data frame as a new column.

Fingers$RingLonger <- Fingers$Ring > Fingers$Index

Tally up how many people have longer ring fingers (relative to their own index finger).

require(mosaic) require(tidyverse) require(supernova) Fingers <- supernova::Fingers Fingers$RingLonger <- Fingers$Ring > Fingers$Index # This code creates a variable called RingLonger Fingers$RingLonger <- Fingers$Ring > Fingers$Index # Write code to tally up RingLonger in Fingers. # Either of these: tally(Fingers$RingLonger) tally(~RingLonger, data = Fingers) ex() %>% check_correct( check_function(., "tally") %>% check_result() %>% check_equal(), { check_error(.) check_function(., "tally") %>% check_arg("x") %>% check_equal(incorrect_msg = "Make sure you are getting RingLonger from Fingers using the $.") } )
Remember that RingLonger is stored as a column in Fingers, so you need to access it like Fingers$RingLonger when using tally()
DataCamp: ch2-20

RingLonger
 TRUE FALSE 
   89    68

You can also use arithmetic operators to summarize variables. For example, it turns out that the ratio of Index to Ring finger (that is, Index divided by Ring) is often used in health research as a crude measure of prenatal testosterone exposure. Use the division operator, /, to create this summary variable.

require(tidyverse) require(mosaic) require(supernova) Fingers <- supernova::Fingers # Write code to create this summary variable Fingers$IndexRingRatio <- # Will this print anything? Fingers$IndexRingRatio <- Fingers$Index / Fingers$Ring ex() %>% check_object("Fingers") %>% check_column("IndexRingRatio") %>% check_equal()
Divide Fingers$Index by Fingers$Ring using the / operator
DataCamp: ch2-21

Whenever you make new variables, or even do anything else in R, it’s a good idea to check to make sure R did what you intended it to do. You can use the head() function for this. Go ahead and print out the first six rows of Fingers. Use select() to look at Index, Ring, and IndexRingRatio. By looking at the index and ring fingers of a few students, you can see whether the IndexRingRatio variable ended up meaning what you thought it did.

require(mosaic) require(tidyverse) require(supernova) Fingers <- supernova::Fingers %>% mutate(IndexRingRatio = Index/Ring) # Use head() and select() together to look at the first six rows of Ring, Index, and IndexRingRatio head(select(Fingers, Ring, Index, IndexRingRatio)) # These also work: # select(Fingers, Ring, Index, IndexRingRatio) %>% head() # Fingers %>% select(Ring, Index, IndexRingRatio) %>% head() ex() %>% check_correct( check_output_expr(., "head(select(Fingers, Ring, Index, IndexRingRatio))"), { check_error(.) check_function(., "select") %>% { check_arg(., ".data") %>% check_equal() check_arg(., "...") %>% check_equal() } check_function(., "head") %>% check_arg("x") %>% check_equal() } )
Have you used head and select?
DataCamp: ch2-21a

It might be helpful to get an average finger length by adding up all the values of Thumb, Index, Middle, Ring, and Pinkie and dividing by 5. Write code for adding the variable AvgFinger to Fingers that does this. Write code to look at the first few lines of the Fingers data frame as well, so you can check that your calculations look correct.

require(tidyverse) require(mosaic) require(supernova) Fingers <- supernova::Fingers # Write code to create this summary variable Fingers$AvgFinger <- # Write code to look at a few lines of Fingers Fingers$AvgFinger <- (Fingers$Thumb + Fingers$Index + Fingers$Middle + Fingers$Ring + Fingers$Pinkie)/5 head(Fingers) ex() %>% { check_object(., "Fingers") %>% check_column("AvgFinger") %>% check_equal() check_output_expr(., "head(Fingers)") }
Add up Fingers$Thumb, Fingers$Index, Fingers$Middle, Fingers$Ring, Fingers$Pinkie and divide by 5
DataCamp: ch2-22

Recoding Variables

There are some instances where you may want to change the way a variable is coded. For instance, the variable Job is coded 1 for no job, 2 for part-time job, and 3 for full-time job. Perhaps you want to recode full-time job as 100 (because it’s 100% time) instead of 3, part-time as 50 instead of 2, and no job as 0 instead of 1. The function recode() can be used like this:

recode(Fingers$Job, "1" = 0, "2" = 50, "3" = 100)
  [1]   0   0  50  50  50  50  50  50   0   0   0   0  50   0  50   0  50   0
 [19]  50  50   0   0  50  50   0   0  50   0  50   0  50   0  50  50   0  50
 [37]   0  50  50   0  50  50  50   0   0  50   0  50   0   0   0   0   0  50
 [55]   0   0   0   0   0   0  50  50   0  50   0  50   0   0   0   0   0  50
 [73]  50   0  50  50   0  50  50  50   0  50   0  50   0   0   0  50   0   0
 [91]  50   0  50  50   0   0   0   0  50   0   0   0   0   0   0   0   0   0
[109]   0  50  50  50  50  50  50  50   0  50   0  50   0   0  50   0   0   0
[127]   0  50  50   0   0   0   0   0   0   0   0   0  50   0   0   0 100   0
[145]   0  50  50  50  50  50  50   0   0  50   0  50   0

Note that in the recode() function, you need to put the old value in quotes; the new variable could be in quotes (if a character value) or not (if numerical).

As always, whenever we do anything, we might want to save it. Try saving the recoded version of Job as Job.recode, a new variable in Fingers. Print a few observations of Job and Job.recode to check that your recode worked.

require(tidyverse) require(mosaic) require(supernova) Fingers <- supernova::Fingers %>% mutate(Job = as.numeric(Job)) # Save the recoded version of `Job` to `Job.recode` Fingers$Job.recode <- recode() # Write code to print a few observations of `Job` and `Job.recode` Fingers$Job.recode <- recode(Fingers$Job, "1" = 0, "2" = 50, "3" = 100) head(select(Fingers, Job, Job.recode)) ex() %>% { check_object(., "Fingers") %>% check_column("Job.recode") %>% check_equal() . %>% check_or( check_output_expr(., "head(select(Fingers, Job, Job.recode))"), check_output_expr(., "head(select(Fingers, Job.recode, Job))") ) }
Use `recode(Fingers$Job, "1"=0, "2"=50, "3"=100)`
DataCamp: ch2-23

  Job Job.recode
1   1          0
2   1          0
3   2         50
4   2         50
5   2         50
6   2         50