CourseKata - 1.8 Manipulating Data in R

High School / Statistics and Data Science II (XCD)

Book

1.8 Manipulating Data in R

We can use R commands to manipulate the data in various, helpful ways. Let’s look at a few common things you might need to do before analyzing your data:

Handling Missing Data
Recoding Variables
Creating Summary Variables

Handling Missing Data

Sometimes (in fact, usually) we end up with some missing data in our dataset. R represents missing data with the value NA (Not Available), and also lets you decide how to handle missing data in subsequent analyses. If your dataset represents missing data in some other way (e.g., some people use the value -999), you should recode the values as NA when working in R.

Let’s consider the variable GarageCars which describes the number of cars that can fit in each home’s garage (GarageCars). First, let’s arrange the Ames data frame so that rows are in descending order by GarageCars (remembering to save the arranged version back into Ames). Then let’s print out the values of the variable GarageCars from the Ames data frame (let’s use $ rather than select()).

require(coursekata)

# Arrange Ames by GarageCars in descending order
Ames <-

# Use $ to print out the values of GarageCars from Ames

# Arrange Ames by GarageCars in descending order
Ames <- arrange(Ames, desc(GarageCars))

# Use $ to print out the values of GarageCars from Ames
Ames$GarageCars

ex() %>% {
  check_function(.,"arrange")
  check_function(., "desc")
  check_object(., "Ames") %>% check_equal()
  check_output_expr(., "Ames$GarageCars")
}

CK Code: X1_Code_Manipulating_01

3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  2  2  2
2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1  1  1  1  1  1
1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
1  1  1  1  1  1 NA NA NA NA

We can see that we have four missing values for GarageCars. You can choose to remove these homes from an individual analysis, or you can remove them from the dataset entirely.

If you wanted to get the data from homes that do not have missing data for any variable, we could use na.omit(Ames). If we just want the homes that do not have missing data for GarageCars specifically, we could use filter() and the “does not equal” != symbol.

Let’s try it. Previously we used filter(Ames, PriceK > 300) to filter for homes where PriceK is greater than 300. Modify the code below to filter for homes where GarageCars does not equal "NA" (hint: we have to put quotation marks around "NA").

require(coursekata)
Ames <- Ames %>%
  arrange(desc(GarageCars))

# Modify this to filter for homes where GarageCars does not equal "NA"
Ames_subset <- filter(Ames, PriceK > 300)

# To check your work, this prints out the variable GarageCars from Ames_subset
# Do you see any NAs?
Ames_subset$GarageCars

# Modify this to filter for homes where GarageCars does not equal "NA"
Ames_subset <- filter(Ames, GarageCars != "NA")

# To check your work, this prints out the variable GarageCars from Ames_subset
# Do you see any NAs?
Ames_subset$GarageCars

ex() %>% {
  check_function(., "filter") %>% {
    check_arg(., ".data") %>% check_equal()
    check_arg(., "...") %>% check_equal()
  }
  check_object(., "Ames_subset") %>% check_equal()
  check_output_expr(., "Ames_subset$GarageCars", missing_msg = "Make sure to print out GarageCars from the Ames_subset data frame.")
}

CK Code: X1_Code_Manipulating_02

3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

We succeeded in getting rid of the homes for which GarageCars is missing. But sometimes removing cases with missing data may introduce bias into your sample.

To see what kind of bias we might be introducing, it’s often helpful to take a closer look at the observations we intend to remove. Although we can say GarageCars != "NA", we can’t say == NA. This is just a weird thing about the value NA, which is a special. Instead we will use a new function – is.na() – together with filter()) to help us find the cases that have an “NA” value.

Run the code below to see what happens.

require(coursekata)
Ames <- Ames %>%
  arrange(desc(GarageCars))

# try running this code
filter(Ames, is.na(GarageCars))

filter(Ames, is.na(GarageCars))

ex() %>% {
  check_function(., "filter") %>%
    check_arg("...") %>%
    check_equal()
  check_output_expr(., "filter(Ames, is.na(GarageCars))")
}

CK Code: X1_Code_Manipulating_03

If you scroll over and look at the variable GarageCars, you will see that these houses all have “NA”. But notice right next to that variable is another variable, GarageType. It turns out these four houses all have “None” for GarageType, meaning they don’t have garages. This may explain why GarageCars is coded as missing. You can’t measure how many cars will fit into a garage that doesn’t exist!

If we remove these observations, we could bias our analyses by underrepresenting homes without garages. We must be careful when making decisions about removing observations with missing data.

Recoding Variables

Sometimes you might want to recode one or more variables in a data frame. There are many ways to do this, and many reasons for doing it. To start, let’s take all the homes with no garage (i.e., GarageType == "None") and recode their GarageCars as 0 instead of NA. That way we can keep them in the analysis in a way that makes sense.

First we might run a tally() to make sure that the homes with NA for the number of cars are the ones without garages. If we run tally(GarageCars ~ GarageType, data=Ames) we get the following two-way table:

         GarageType
GarageCars Attached Detached None
      1          14       24    0
      2          95       26    0
      3          21        1    0
      <NA>        0        0    4

By looking at this table we can confirm that there are exactly 4 houses that have no garage (GarageType is None) and that these four houses are the same four that were coded as missing (NA) for GarageCars.

One very flexible way to change the NAs into 0 for these four homes is to use the indexing brackets in a new way:

Ames$GarageCars[Ames$GarageType == "None"] <- 0

To read this code, start with the stuff in the brackets: For all rows where GarageType == "None", assign the value of GarageCars to be 0. If we run this code, and then run tally() again, we can see that the 4 homes with no garages now show 0 for the number of cars that can be parked in garages.

         GarageType
GarageCars Attached Detached None
         0        0        0    4
         1       14       24    0
         2       95       26    0
         3       21        1    0

Our use of the tally() command here illustrates an important habit to get into as you develop your R skills: always think of ways to verify that R actually did what you wanted it to do.

Creating Summary Variables

In Ames, we have a variable that shows the year the home was built (YearBuilt). For some analysis purposes, you might want to create a new variable that combines multiple years into broader eras. For example, we might find different sale prices for homes built before or after the year 1900.

We can create a new summary variable called BuiltPre1900 that tells us whether the home was built before the year 1900, and then add this new variable as a new column to the Ames data frame.

Ames$BuiltPre1900 <- Ames$YearBuilt < 1900

Run glimpse(Ames) in the window below to check to see if your new variable is there.

require(coursekata)

# try running this and read the error message
# then fix this code
BuiltPre1900 <- YearBuilt < 1900

# take a glimpse at Ames, is BuildPre1900 there?

# try running this and read the error message
# then fix this code
Ames$BuiltPre1900 <- Ames$YearBuilt < 1900

# take a glimpse at Ames, is BuildPre1900 there?
glimpse(Ames)

# temporary SCT
ex() %>%
  check_object("Ames") %>%
  check_column("BuiltPre1900") %>%
  check_equal()

CK Code: X1_Code_Manipulating_04

Notice that the variable BuiltPre1900 is listed with a new type: <lgl> (short for logical). Just as there are different types of quantitative variables, there also are different types of categorical variables. This particular type is logical, which is also sometimes called Boolean. Logical variables are special in that they can only take the values TRUE or FALSE.

Try running some tally() commands in the window below to make sure your new variable works the way you expected. These two tally() commands, for example, should yield similar results:

tally(~ BuiltPre1900, data=Ames)
tally(~ YearBuilt < 1900, data=Ames)

Try both of these lines of code below and see what happens.

require(coursekata)

# This code creates a variable called BuiltPre1900
Ames$BuiltPre1900 <- Ames$YearBuilt < 1900

# Write code to tally up BuiltPre1900 in Ames in two different ways

# This code creates a variable called BuiltPre1900
Ames$BuiltPre1900 <- Ames$YearBuilt < 1900

# Write code to tally up BuiltPre1900 in Ames in two
tally(Ames$BuiltPre1900)
tally(~BuiltPre1900, data = Ames)

ex() %>% check_correct(
  check_function(., "tally") %>%
    check_result() %>% check_equal(),
  {
    check_error(.)
    check_function(., "tally") %>%
      check_arg("x") %>%
      check_equal(incorrect_msg = "Make sure you are getting `BuiltPre1900` from `Ames` using the `$`.")
  }
)

CK Code: X1_Code_Manipulating_05

Here’s what we got:

BuiltPre1900
 TRUE FALSE 
         5      180

YearBuilt < 1900
 TRUE FALSE 
         5      180

You can also use arithmetic operators to create new summary variables. For example, you might want a variable to indicate how old a house is. For example, we can calculate the difference between the current year and the year the house was built.

require(coursekata)

# Save the current year by editing the code below
CurrentYear <- 1900

# Write code to create a variable that finds how old the house is
# Hint: CurrentYear is not in the Ames data frame so it won’t need Ames$ in front of it
Ames$HowOld <-

# This will print HowOld from the Ames data frame
Ames$HowOld

CurrentYear <- as.numeric(format(Sys.Date()[1],'%Y'))
Ames$HowOld <- CurrentYear - Ames$YearBuilt

ex() %>%
  check_object("Ames") %>%
  check_column("HowOld") %>%
  check_equal()

CK Code: X1_Code_Manipulating_06

1.7 Selecting and Filtering Data in R 1.9 Visualizing and Summarizing Quantitative Variables in R

Course Outline

High School / Statistics and Data Science II (XCD)

1.8 Manipulating Data in R

Handling Missing Data

Recoding Variables

Creating Summary Variables

Responses

list High School / Statistics and Data Science II (XCD)

1.8 Manipulating Data in R

Handling Missing Data

Recoding Variables

Creating Summary Variables

High School / Statistics and Data Science II (XCD)