1.8 Missing Data
We can use R commands to manipulate data in various helpful ways. On this page we will learn to handle missing data; on the next page we will learn to create new variables and recode existing variables.
Identifying Missing Data
Sometimes (in fact, usually) we end up with some missing data in our dataset. R represents missing data with the value NA (Not Available), and also lets you decide how to handle missing data in subsequent analyses. If your dataset represents missing data in some other way (e.g., some people use the value -999), you should recode the values as NA when working in R.
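As a minimal sketch of that recoding step, the code below uses a small made-up data frame called homes (hypothetical, not the real Ames data) in which -999 stands for a missing value. It assumes the column has no NAs to begin with.

```r
# Hypothetical toy data: -999 is used here as a missing-value code
homes <- data.frame(
  PriceK     = c(260, 181, 174, 98),
  GarageCars = c(2, -999, 1, -999)
)

# Replace every -999 in GarageCars with NA
homes$GarageCars[homes$GarageCars == -999] <- NA

# Print the recoded variable
homes$GarageCars

# Count how many values are now missing
sum(is.na(homes$GarageCars))
```

Only GarageCars is changed; the other column is left as it was.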
Let’s consider the variable GarageCars, which describes the number of cars that can fit in each home’s garage. First, let’s arrange the Ames data frame so that rows are in descending order by GarageCars (remembering to save the arranged version back into Ames). Then let’s print out the values of the variable GarageCars from the Ames data frame (let’s use $ rather than select()).
require(coursekata)
# Arrange Ames by GarageCars in descending order
Ames <- arrange(Ames, desc(GarageCars))
# Use $ to print out the values of GarageCars from Ames
Ames$GarageCars
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 NA NA NA NA
We can see that we have four missing values for GarageCars. You can choose to remove these homes from an individual analysis, or you can remove them from the dataset entirely.
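One common way to leave missing values out of a single analysis, without changing the data, is the na.rm argument that many base R summary functions accept. This is a sketch with hypothetical values standing in for GarageCars; it is not necessarily the approach this course uses later.

```r
# Hypothetical values standing in for Ames$GarageCars
garage_cars <- c(3, 2, 2, 1, NA, NA)

# With missing values present, the mean is itself NA
mean(garage_cars)

# na.rm = TRUE drops the NAs for this one calculation only
mean(garage_cars, na.rm = TRUE)

# The original vector is unchanged and still holds all 6 values
length(garage_cars)
```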
Removing Rows with Missing Data
If we wanted to get the data from homes that do not have missing data on any variable, we could use the na.omit() function.
na.omit(Ames)
One issue with using na.omit() is that it will remove rows that have an NA on any variable, not just those with an NA on a specific variable of interest (e.g., GarageCars). Because of this, using na.omit() might remove a lot more rows than you expected.
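To make this concrete, here is a small sketch using a made-up data frame called homes (hypothetical, not the Ames data). Only one row is missing GarageCars, but na.omit() drops two rows because another row is missing PriceK.

```r
# Hypothetical toy data: one NA in PriceK, one NA in GarageCars
homes <- data.frame(
  PriceK     = c(260, 181, NA, 98, 150),
  GarageCars = c(2, NA, 1, 2, 3)
)

# Number of rows before removing anything
nrow(homes)

# na.omit() drops every row with an NA on any variable
homes_complete <- na.omit(homes)
nrow(homes_complete)
```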
To remove only rows that have an NA on GarageCars, we first have to identify those rows. We can then use the filter() function to include only those rows that are not coded NA for the variable GarageCars.
NA is a special value in R; it is not the same as the text string “NA”. For this reason, we use the special function is.na() to identify missing values. The expression is.na(GarageCars) returns TRUE for each case that is missing on the variable GarageCars, and FALSE for each case that is not.
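The short sketch below shows this behavior on a hypothetical three-value vector. It also shows why comparing to NA with == does not help: any comparison with NA is itself NA.

```r
# Hypothetical values: two numbers and one missing value
garage_cars <- c(2, NA, 1)

# is.na() returns one TRUE or FALSE per value
is.na(garage_cars)

# Comparing with == does not find missing values; each result is NA
garage_cars == NA

# The text string "NA" is an ordinary value, not a missing one
is.na("NA")
```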
If we want to get the data from homes that do not have missing data for GarageCars, we could use the filter() function with the argument is.na(GarageCars) == FALSE. This should give us only the rows in which the variable GarageCars has a numerical value.
Let’s try it. Previously we used filter(Ames, PriceK > 300) to filter in homes where PriceK is greater than 300. Modify the code below to filter in homes where GarageCars is not NA.
require(coursekata)
Ames <- Ames %>%
arrange(desc(GarageCars))
# Modify this to filter for homes where GarageCars is not NA
Ames_subset <- filter(Ames, PriceK > 300)
# To check your work, this prints out the variable GarageCars from Ames_subset
# Do you see any NAs?
Ames_subset$GarageCars
# One solution: keep only the rows where GarageCars is not NA
Ames_subset <- filter(Ames, is.na(GarageCars) == FALSE)
# To check your work, this prints out the variable GarageCars from Ames_subset
# Do you see any NAs?
Ames_subset$GarageCars
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
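The condition is.na(GarageCars) == FALSE can also be written with R's negation operator as !is.na(GarageCars). The sketch below checks that the two forms give the same result. It uses a made-up data frame called homes (hypothetical, not the Ames data) and assumes that filter() is the dplyr function, loaded here directly.

```r
library(dplyr)  # assumption: filter() in this course is dplyr's filter()

# Hypothetical toy data, not the real Ames data frame
homes <- data.frame(
  PriceK     = c(260, 181, 174, 98),
  GarageCars = c(2, NA, 1, NA)
)

# Two equivalent ways to keep rows where GarageCars is not missing
subset_a <- filter(homes, is.na(GarageCars) == FALSE)
subset_b <- filter(homes, !is.na(GarageCars))

# Check that both approaches return the same rows
identical(subset_a, subset_b)

# Print the remaining values; no NAs should appear
subset_b$GarageCars
```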
We succeeded in getting rid of the homes for which GarageCars is missing. But sometimes removing cases with missing data may introduce bias into your sample.
To see what kind of bias we might be introducing, it’s often helpful to take a closer look at the observations we intend to remove.
Run the code below to see what happens.
require(coursekata)
Ames <- Ames %>%
arrange(desc(GarageCars))
# try running this code
filter(Ames, is.na(GarageCars))
If you scroll over and look at the variable GarageCars, you will see that these houses all have NA. But notice right next to that variable is another variable, GarageType. It turns out these four houses all have “None” for GarageType, meaning they don’t have garages. This may explain why GarageCars is coded as missing. You can’t measure how many cars will fit into a garage that doesn’t exist!
If we remove these observations, we could bias our analyses by underrepresenting homes without garages. We must be careful when making decisions about removing observations with missing data.
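One quick way to check whether missing values cluster in a particular group is to cross-tabulate the grouping variable against is.na(). The sketch below uses a made-up data frame called homes; the GarageType labels other than “None” are hypothetical and may not match the real Ames data.

```r
# Hypothetical toy data, loosely mirroring the pattern described above
homes <- data.frame(
  GarageType = c("Attached", "Detached", "None", "Attached", "None"),
  GarageCars = c(2, 1, NA, 2, NA)
)

# Cross-tabulate garage type against whether GarageCars is missing
# Rows are garage types; the TRUE column counts missing values
missing_by_type <- table(homes$GarageType, is.na(homes$GarageCars))
missing_by_type
```

If all of the missing values fall in one row of the table, as they do for “None” in this toy example, removing them would remove that whole group from the analysis.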