1.9 Recoding and Creating Variables

Recoding Variables

Sometimes you might want to recode one or more variables in a data frame. There are many ways to do this, and many reasons for doing it. To start, let’s take all the homes with no garage (i.e., GarageType == "None") and recode their GarageCars as 0 instead of NA. That way we can keep them in the analysis in a way that makes sense.

First we might run a tally() to make sure that the homes with NA for the number of cars are the ones without garages. If we run tally(GarageCars ~ GarageType, data=Ames) we get the following two-way table:

         GarageType
GarageCars Attached Detached None
      1          14       24    0
      2          95       26    0
      3          21        1    0
      <NA>        0        0    4

By looking at this table we can confirm that there are exactly 4 houses that have no garage (GarageType is None) and that these four houses are the same four that were coded as missing (NA) for GarageCars.

One very flexible way to change the NAs into 0 for these four homes is to use the indexing brackets in a new way:

Ames$GarageCars[Ames$GarageType == "None"] <- 0

To read this code, start with the stuff in the brackets: For all rows where GarageType == "None", assign the value of GarageCars to be 0. If we run this code, and then run tally() again, we can see that the 4 homes with no garages now show 0 for the number of cars that can be parked in garages.

         GarageType
GarageCars Attached Detached None
         0        0        0    4
         1       14       24    0
         2       95       26    0
         3       21        1    0

Our use of the tally() command here illustrates an important habit to get into as you develop your R skills: always think of ways to verify that R actually did what you wanted it to do.
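The recode-then-verify pattern can be sketched on a small made-up data frame. This is only an illustration: the `homes` data frame and its values are hypothetical, not the real Ames data, and base R's `table()` stands in for `tally()` so the example runs without the coursekata package.

```r
# Hypothetical toy data (not the real Ames data frame)
homes <- data.frame(
  GarageType = c("Attached", "Detached", "None", "Attached", "None"),
  GarageCars = c(2, 1, NA, 3, NA)
)

# Before recoding: check that the NAs line up with GarageType == "None"
table(homes$GarageCars, homes$GarageType,
      useNA = "ifany", dnn = c("GarageCars", "GarageType"))

# Recode: for rows where GarageType is "None", set GarageCars to 0
homes$GarageCars[homes$GarageType == "None"] <- 0

# After recoding: verify that R did what we wanted
table(homes$GarageCars, homes$GarageType,
      useNA = "ifany", dnn = c("GarageCars", "GarageType"))
sum(is.na(homes$GarageCars))  # number of missing values left in GarageCars
```

The expression inside the brackets produces one TRUE or FALSE per row, and only the rows marked TRUE receive the new value.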

Creating Variables

In Ames, we have a variable that shows the year the home was built (YearBuilt). For some analysis purposes, you might want to create a new variable that combines multiple years into broader eras. For example, we might find different sale prices for homes built before or after the year 1900.

We can create a new variable called BuiltPre1900 that tells us whether the home was built before the year 1900, and then add this new variable as a new column to the Ames data frame.

Ames$BuiltPre1900 <- Ames$YearBuilt < 1900

Run glimpse(Ames) in the window below to check to see if your new variable is there.

require(coursekata)
Ames <- coursekata::Ames

# try running this
BuiltPre1900 <- Ames$YearBuilt < 1900

# take a glimpse at Ames to check if BuiltPre1900 is in there
glimpse(Ames)

# fix the first line of code to put BuiltPre1900 in Ames
Ames$BuiltPre1900 <- Ames$YearBuilt < 1900
glimpse(Ames)

Notice that the variable BuiltPre1900 is listed with a new type: <lgl> (short for logical). Just as there are different types of quantitative variables, there also are different types of categorical variables. This particular type is logical, which is also sometimes called Boolean. Logical variables are special in that they can only take the values TRUE or FALSE.
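To see what a logical variable looks like on its own, here is a minimal sketch using a short, made-up vector of years (the values are hypothetical, not taken from Ames). It relies only on base R.

```r
# Hypothetical years, for illustration only
YearBuilt <- c(1885, 1923, 1898, 2004, 1960)

# The comparison is evaluated once per element and returns TRUE or FALSE
BuiltPre1900 <- YearBuilt < 1900

BuiltPre1900         # one TRUE/FALSE value per year
class(BuiltPre1900)  # the type R assigns to this variable
sum(BuiltPre1900)    # TRUE counts as 1, so this is the number of TRUEs
```

Because R treats TRUE as 1 and FALSE as 0 in arithmetic, `sum()` gives a quick count of how many values meet the condition.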

Try running some tally() commands in the window below to make sure your new variable works the way you expected. These two tally() commands, for example, should yield similar results:

tally(~ BuiltPre1900, data=Ames)
tally(~ YearBuilt < 1900, data=Ames)

Try both of these lines of code below and see what happens.

require(coursekata)

# This code creates a variable called BuiltPre1900
Ames$BuiltPre1900 <- Ames$YearBuilt < 1900

# Write code to tally up BuiltPre1900 in Ames in two different ways
tally(Ames$BuiltPre1900)
tally(~ BuiltPre1900, data = Ames)

Here’s what we got:

BuiltPre1900
 TRUE FALSE
    5   180

YearBuilt < 1900
 TRUE FALSE
    5   180

You can also use arithmetic operators to create new variables. For example, you might want a variable that indicates how old a house is. We can calculate it as the difference between the current year and the year the house was built.

require(coursekata)

# Save the current year by editing the code below
CurrentYear <- as.numeric(format(Sys.Date(), "%Y"))

# Write code to create a variable that finds how old the house is
# Hint: CurrentYear is not in the Ames data frame so it won't need Ames$ in front of it
Ames$HowOld <- CurrentYear - Ames$YearBuilt

# This will print HowOld from the Ames data frame
Ames$HowOld
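The same arithmetic can be tried on a small made-up data frame. The `homes` data frame below is hypothetical, and the sketch uses only base R; it also adds a quick check that the new column is consistent with the one it was computed from.

```r
# Hypothetical toy data (not the real Ames data frame)
homes <- data.frame(YearBuilt = c(1885, 1923, 1998, 2004))

# Extract the current year from today's date as a number
CurrentYear <- as.numeric(format(Sys.Date(), "%Y"))

# Subtraction is applied row by row, so each home gets its own age
homes$HowOld <- CurrentYear - homes$YearBuilt
homes

# Verify: adding the age back to YearBuilt should return the current year
all(homes$HowOld + homes$YearBuilt == CurrentYear)
```

Note that `CurrentYear` is a single value while `YearBuilt` is a column; R reuses the single value for every row.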
