Course Outline

list High School / Statistics and Data Science II (XCD)

Book
  • College / Advanced Statistics and Data Science (ABCD)
  • College / Statistics and Data Science (ABC)
  • High School / Advanced Statistics and Data Science I (ABC)
  • High School / Statistics and Data Science I (AB)
  • High School / Statistics and Data Science II (XCD)

1.8 Manipulating Data in R

We can use R commands to manipulate the data in various, helpful ways. Let’s look at a few common things you might need to do before analyzing your data:

  • Handling Missing Data
  • Recoding Variables
  • Creating Summary Variables

Handling Missing Data

Sometimes (in fact, usually) we end up with some missing data in our dataset. R represents missing data with the value NA (Not Available), and also lets you decide how to handle missing data in subsequent analyses. If your dataset represents missing data in some other way (e.g., some people use the value -999), you should recode the values as NA when working in R.

Let’s consider the variable GarageCars which describes the number of cars that can fit in each home’s garage (GarageCars). First, let’s arrange the Ames data frame so that rows are in descending order by GarageCars (remembering to save the arranged version back into Ames). Then let’s print out the values of the variable GarageCars from the Ames data frame (let’s use $ rather than select()).

require(coursekata) # Arrange Ames by GarageCars in descending order Ames <- # Use $ to print out the values of GarageCars from Ames # Arrange Ames by GarageCars in descending order Ames <- arrange(Ames, desc(GarageCars)) # Use $ to print out the values of GarageCars from Ames Ames$GarageCars ex() %>% { check_function(.,"arrange") check_function(., "desc") check_object(., "Ames") %>% check_equal() check_output_expr(., "Ames$GarageCars") }
CK Code: X1_Code_Manipulating_01
3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  2  2  2
2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1  1  1  1  1  1
1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
1  1  1  1  1  1 NA NA NA NA

We can see that we have four missing values for GarageCars. You can choose to remove these homes from an individual analysis, or you can remove them from the dataset entirely.

If you wanted to get the data from homes that do not have missing data for any variable, we could use na.omit(Ames). If we just want the homes that do not have missing data for GarageCars specifically, we could use filter() and the “does not equal” != symbol.

Let’s try it. Previously we used filter(Ames, PriceK > 300) to filter for homes where PriceK is greater than 300. Modify the code below to filter for homes where GarageCars does not equal "NA" (hint: we have to put quotation marks around "NA").

require(coursekata) Ames <- Ames %>% arrange(desc(GarageCars)) # Modify this to filter for homes where GarageCars does not equal "NA" Ames_subset <- filter(Ames, PriceK > 300) # To check your work, this prints out the variable GarageCars from Ames_subset # Do you see any NAs? Ames_subset$GarageCars # Modify this to filter for homes where GarageCars does not equal "NA" Ames_subset <- filter(Ames, GarageCars != "NA") # To check your work, this prints out the variable GarageCars from Ames_subset # Do you see any NAs? Ames_subset$GarageCars ex() %>% { check_function(., "filter") %>% { check_arg(., ".data") %>% check_equal() check_arg(., "...") %>% check_equal() } check_object(., "Ames_subset") %>% check_equal() check_output_expr(., "Ames_subset$GarageCars", missing_msg = "Make sure to print out GarageCars from the Ames_subset data frame.") }
CK Code: X1_Code_Manipulating_02
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

We succeeded in getting rid of the homes for which GarageCars is missing. But sometimes removing cases with missing data may introduce bias into your sample.

To see what kind of bias we might be introducing, it’s often helpful to take a closer look at the observations we intend to remove. Although we can say GarageCars != "NA", we can’t say == NA. This is just a weird thing about the value NA, which is a special. Instead we will use a new function – is.na() – together with filter()) to help us find the cases that have an “NA” value.

Run the code below to see what happens.

require(coursekata) Ames <- Ames %>% arrange(desc(GarageCars)) # try running this code filter(Ames, is.na(GarageCars)) filter(Ames, is.na(GarageCars)) ex() %>% { check_function(., "filter") %>% check_arg("...") %>% check_equal() check_output_expr(., "filter(Ames, is.na(GarageCars))") }
CK Code: X1_Code_Manipulating_03

If you scroll over and look at the variable GarageCars, you will see that these houses all have “NA”. But notice right next to that variable is another variable, GarageType. It turns out these four houses all have “None” for GarageType, meaning they don’t have garages. This may explain why GarageCars is coded as missing. You can’t measure how many cars will fit into a garage that doesn’t exist!

If we remove these observations, we could bias our analyses by underrepresenting homes without garages. We must be careful when making decisions about removing observations with missing data.

Recoding Variables

Sometimes you might want to recode one or more variables in a data frame. There are many ways to do this, and many reasons for doing it. To start, let’s take all the homes with no garage (i.e., GarageType == "None") and recode their GarageCars as 0 instead of NA. That way we can keep them in the analysis in a way that makes sense.

First we might run a tally() to make sure that the homes with NA for the number of cars are the ones without garages. If we run tally(GarageCars ~ GarageType, data=Ames) we get the following two-way table:

         GarageType
GarageCars Attached Detached None
      1          14       24    0
      2          95       26    0
      3          21        1    0
      <NA>        0        0    4

By looking at this table we can confirm that there are exactly 4 houses that have no garage (GarageType is None) and that these four houses are the same four that were coded as missing (NA) for GarageCars.

One very flexible way to change the NAs into 0 for these four homes is to use the indexing brackets in a new way:

Ames$GarageCars[Ames$GarageType == "None"] <- 0

To read this code, start with the stuff in the brackets: For all rows where GarageType == "None", assign the value of GarageCars to be 0. If we run this code, and then run tally() again, we can see that the 4 homes with no garages now show 0 for the number of cars that can be parked in garages.

         GarageType
GarageCars Attached Detached None
         0        0        0    4
         1       14       24    0
         2       95       26    0
         3       21        1    0

Our use of the tally() command here illustrates an important habit to get into as you develop your R skills: always think of ways to verify that R actually did what you wanted it to do.

Creating Summary Variables

In Ames, we have a variable that shows the year the home was built (YearBuilt). For some analysis purposes, you might want to create a new variable that combines multiple years into broader eras. For example, we might find different sale prices for homes built before or after the year 1900.

We can create a new summary variable called BuiltPre1900 that tells us whether the home was built before the year 1900, and then add this new variable as a new column to the Ames data frame.

Ames$BuiltPre1900 <- Ames$YearBuilt < 1900

Run glimpse(Ames) in the window below to check to see if your new variable is there.

require(coursekata) # try running this and read the error message # then fix this code BuiltPre1900 <- YearBuilt < 1900 # take a glimpse at Ames, is BuildPre1900 there? # try running this and read the error message # then fix this code Ames$BuiltPre1900 <- Ames$YearBuilt < 1900 # take a glimpse at Ames, is BuildPre1900 there? glimpse(Ames) # temporary SCT ex() %>% check_object("Ames") %>% check_column("BuiltPre1900") %>% check_equal()
CK Code: X1_Code_Manipulating_04

Notice that the variable BuiltPre1900 is listed with a new type: <lgl> (short for logical). Just as there are different types of quantitative variables, there also are different types of categorical variables. This particular type is logical, which is also sometimes called Boolean. Logical variables are special in that they can only take the values TRUE or FALSE.

Try running some tally() commands in the window below to make sure your new variable works the way you expected. These two tally() commands, for example, should yield similar results:

tally(~ BuiltPre1900, data=Ames)
tally(~ YearBuilt < 1900, data=Ames)

Try both of these lines of code below and see what happens.

require(coursekata) # This code creates a variable called BuiltPre1900 Ames$BuiltPre1900 <- Ames$YearBuilt < 1900 # Write code to tally up BuiltPre1900 in Ames in two different ways # This code creates a variable called BuiltPre1900 Ames$BuiltPre1900 <- Ames$YearBuilt < 1900 # Write code to tally up BuiltPre1900 in Ames in two tally(Ames$BuiltPre1900) tally(~BuiltPre1900, data = Ames) ex() %>% check_correct( check_function(., "tally") %>% check_result() %>% check_equal(), { check_error(.) check_function(., "tally") %>% check_arg("x") %>% check_equal(incorrect_msg = "Make sure you are getting `BuiltPre1900` from `Ames` using the `$`.") } )
CK Code: X1_Code_Manipulating_05

Here’s what we got:

BuiltPre1900
 TRUE FALSE 
         5      180

YearBuilt < 1900
 TRUE FALSE 
         5      180
 

You can also use arithmetic operators to create new summary variables. For example, you might want a variable to indicate how old a house is. For example, we can calculate the difference between the current year and the year the house was built.

require(coursekata) # Save the current year by editing the code below CurrentYear <- 1900 # Write code to create a variable that finds how old the house is # Hint: CurrentYear is not in the Ames data frame so it won’t need Ames$ in front of it Ames$HowOld <- # This will print HowOld from the Ames data frame Ames$HowOld CurrentYear <- as.numeric(format(Sys.Date()[1],'%Y')) Ames$HowOld <- CurrentYear - Ames$YearBuilt ex() %>% check_object("Ames") %>% check_column("HowOld") %>% check_equal()
CK Code: X1_Code_Manipulating_06

Responses