1.9 Recoding and Creating Variables
Recoding Variables
Sometimes you might want to recode one or more variables in a data frame. There are many ways to do this, and many reasons for doing it. To start, let’s take all the homes with no garage (i.e., GarageType == "None") and recode their GarageCars as 0 instead of NA. That way we can keep them in the analysis in a way that makes sense.
First we might run a tally() to make sure that the homes with NA for the number of cars are the ones without garages. If we run tally(GarageCars ~ GarageType, data=Ames), we get the following two-way table:
          GarageType
GarageCars Attached Detached None
      1          14       24    0
      2          95       26    0
      3          21        1    0
      <NA>        0        0    4
By looking at this table we can confirm that there are exactly 4 houses that have no garage (GarageType is None) and that these four houses are the same four that were coded as missing (NA) for GarageCars.
One very flexible way to change the NAs into 0 for these four homes is to use the indexing brackets in a new way:
Ames$GarageCars[Ames$GarageType == "None"] <- 0
To read this code, start with the expression in the brackets: for all rows where GarageType == "None", assign the value of GarageCars to be 0. If we run this code, and then run tally() again, we can see that the 4 homes with no garages now show 0 for the number of cars that can be parked in garages.
          GarageType
GarageCars Attached Detached None
         0        0        0    4
         1       14       24    0
         2       95       26    0
         3       21        1    0
Our use of the tally() command here illustrates an important habit to get into as you develop your R skills: always think of ways to verify that R actually did what you wanted it to do.
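The recode-then-verify pattern above can be sketched on a small made-up data frame. This is only an illustration: the data frame homes and its values are hypothetical (not the real Ames data), and base R's table() stands in for tally() so the sketch runs without any packages.

```r
# Toy data frame (hypothetical values, not the real Ames data)
homes <- data.frame(
  GarageType = c("Attached", "Detached", "None", "None"),
  GarageCars = c(2, 1, NA, NA)
)

# The comparison produces one TRUE/FALSE per row; TRUE marks the rows to change
no_garage <- homes$GarageType == "None"

# Assign 0 to GarageCars only in the rows where no_garage is TRUE
homes$GarageCars[no_garage] <- 0

# Verify: the NA row should be gone and the 0s should fall under "None"
table(homes$GarageCars, homes$GarageType, useNA = "ifany")
```

Storing the comparison in its own object (no_garage) is optional; it just makes it easier to see that the brackets contain a vector of TRUE and FALSE values, one per row.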
Creating Variables
In Ames, we have a variable that shows the year the home was built (YearBuilt). For some analysis purposes, you might want to create a new variable that combines multiple years into broader eras. For example, we might find different sale prices for homes built before or after the year 1900.
We can create a new variable called BuiltPre1900 that tells us whether the home was built before the year 1900, and then add this new variable as a new column to the Ames data frame.
Ames$BuiltPre1900 <- Ames$YearBuilt < 1900
Run glimpse(Ames) in the window below to check whether your new variable is there.
require(coursekata)
Ames <- coursekata::Ames
# try running this
BuiltPre1900 <- Ames$YearBuilt < 1900
# take a glimpse at Ames to check if BuiltPre1900 is in there
# fix the first line of code to put BuiltPre1900 in Ames
# corrected first line: BuiltPre1900 is now stored in Ames
Ames$BuiltPre1900 <- Ames$YearBuilt < 1900
# take a glimpse at Ames to check if BuiltPre1900 is in there
glimpse(Ames)
Notice that the variable BuiltPre1900 is listed with a new type: <lgl> (short for logical). Just as there are different types of quantitative variables, there are also different types of categorical variables. This particular type is logical, which is also sometimes called Boolean. Logical variables are special in that they can only take the values TRUE or FALSE.
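To see how a comparison produces a logical variable, here is a minimal sketch using a made-up vector of years. The names year_built and built_pre_1900 are hypothetical and chosen only for this illustration.

```r
# Made-up construction years (not taken from Ames)
year_built <- c(1885, 1920, 1899, 2004, 1900)

# The comparison is evaluated once per element and returns TRUE or FALSE
built_pre_1900 <- year_built < 1900
built_pre_1900
# TRUE FALSE TRUE FALSE FALSE  (1900 itself is not before 1900)

# The result has type logical
class(built_pre_1900)
# "logical"

# R counts TRUE as 1 and FALSE as 0, so sum() gives the number of TRUEs
sum(built_pre_1900)
# 2
```

Note that a home built in exactly 1900 comes out FALSE, because < is a strict comparison.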
Try running some tally() commands in the window below to make sure your new variable works the way you expected. These two tally() commands, for example, should yield similar results:
tally(~ BuiltPre1900, data=Ames)
tally(~ YearBuilt < 1900, data=Ames)
Try both of these lines of code below and see what happens.
require(coursekata)
# This code creates a variable called BuiltPre1900
Ames$BuiltPre1900 <- Ames$YearBuilt < 1900
# Write code to tally up BuiltPre1900 in Ames in two different ways
# This code creates a variable called BuiltPre1900
Ames$BuiltPre1900 <- Ames$YearBuilt < 1900
# Write code to tally up BuiltPre1900 in Ames in two different ways
tally(Ames$BuiltPre1900)
tally(~ BuiltPre1900, data = Ames)
Here’s what we got:
BuiltPre1900
 TRUE FALSE
    5   180

YearBuilt < 1900
 TRUE FALSE
    5   180
You can also use arithmetic operators to create new variables. For example, you might want a variable that indicates how old each house is, which we can calculate as the difference between the current year and the year the house was built.
require(coursekata)
# Save the current year by editing the code below
CurrentYear <- 1900
# Write code to create a variable that finds how old the house is
# Hint: CurrentYear is not in the Ames data frame so it won’t need Ames$ in front of it
Ames$HowOld <-
# This will print HowOld from the Ames data frame
Ames$HowOld
# Save the current year (here taken from the system date)
CurrentYear <- as.numeric(format(Sys.Date(), "%Y"))
# Write code to create a variable that finds how old the house is
# Hint: CurrentYear is not in the Ames data frame so it won’t need Ames$ in front of it
Ames$HowOld <- CurrentYear - Ames$YearBuilt
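Because the current year keeps changing, a variable computed from it will give different values each year the code is run. As a rough sketch, assuming you would rather have a result that stays the same, you could subtract from a fixed reference year instead. The data frame homes, its values, and the name ReferenceYear are all hypothetical.

```r
# Made-up construction years standing in for Ames$YearBuilt
homes <- data.frame(YearBuilt = c(1885, 1950, 2004))

# Hypothetical fixed reference year, so the result does not change over time
ReferenceYear <- 2010

# The subtraction is applied to every row at once
homes$HowOld <- ReferenceYear - homes$YearBuilt
homes$HowOld
# 125 60 6
```

As in the exercise above, ReferenceYear is a standalone object rather than a column, so it does not need homes$ in front of it.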