1.8 Missing Data

We can use R commands to manipulate data in a variety of helpful ways. On this page we will learn to handle missing data; on the next page we will learn to create new variables and recode existing variables.

Identifying Missing Data

Sometimes (in fact, usually) we end up with some missing data in our dataset. R represents missing data with the value NA (Not Available), and also lets you decide how to handle missing data in subsequent analyses. If your dataset represents missing data in some other way (e.g., some people use the value -999), you should recode the values as NA when working in R.
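As a minimal sketch of that recoding step, the example below uses a small made-up vector (not the real Ames data) in which -999 stands for "missing". The vector name and its values are assumptions for illustration only.

```r
# Hypothetical values: -999 was entered wherever the number was unknown
garage_cars <- c(2, 1, -999, 3, -999)

# Replace every -999 with NA so that R treats it as missing
garage_cars[garage_cars == -999] <- NA

# Print the recoded vector
garage_cars
```

Once the placeholder is recoded, functions such as is.na() will recognize these entries as missing, rather than treating -999 as a real number of cars.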

Let’s consider the variable GarageCars, which describes the number of cars that can fit in each home’s garage. First, let’s arrange the Ames data frame so that rows are in descending order by GarageCars (remembering to save the arranged version back into Ames). Then let’s print out the values of the variable GarageCars from the Ames data frame (let’s use $ rather than select()).

require(coursekata)

# Arrange Ames by GarageCars in descending order
Ames <- arrange(Ames, desc(GarageCars))

# Use $ to print out the values of GarageCars from Ames
Ames$GarageCars

3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  2  2  2
2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1  1  1  1  1  1
1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
1  1  1  1  1  1 NA NA NA NA

We can see that we have four missing values for GarageCars. You can choose to remove these homes from an individual analysis, or you can remove them from the dataset entirely.
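To illustrate the first option (leaving the missing values in the dataset but skipping them in one analysis), here is a small sketch using base R's mean() with its na.rm argument. The vector is made up for illustration and is not the Ames data.

```r
# Hypothetical values with one missing entry
garage_cars <- c(3, 2, 2, 1, NA)

# By default, a single NA makes the result NA
mean_with_na <- mean(garage_cars)

# na.rm = TRUE drops the missing values for this calculation only
mean_without_na <- mean(garage_cars, na.rm = TRUE)

mean_with_na
mean_without_na
```

The vector itself is unchanged; the NA is only ignored inside that one call to mean().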

Removing Rows with Missing Data

If we wanted to get the data from homes that do not have missing data on any variable, we could use the na.omit() function.

na.omit(Ames)

One issue with using na.omit() is that it will remove rows that have an NA on any variable, not just those with an NA on a specific variable of interest (e.g., GarageCars). Because of this, using na.omit() might remove a lot more rows than you expected.
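The sketch below shows this effect on a tiny made-up data frame. The data frame homes and the column PoolArea are hypothetical stand-ins, chosen only to show how an NA on a second variable also causes a row to be dropped.

```r
# Hypothetical data: GarageCars has one NA, PoolArea has two
homes <- data.frame(
  GarageCars = c(2, NA, 1, 3, 2),
  PoolArea   = c(NA, 0, 500, NA, 0)
)

# Total number of rows
n_all <- nrow(homes)

# Rows left after dropping any row with an NA on any variable
n_complete <- nrow(na.omit(homes))

# Rows that are not missing on GarageCars specifically
n_garage_known <- sum(!is.na(homes$GarageCars))

c(all = n_all, complete = n_complete, garage_known = n_garage_known)
```

Only one row is missing GarageCars, yet na.omit() drops three of the five rows, because two more rows are missing PoolArea.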

To remove only the rows that have an NA on GarageCars, we first have to identify those rows. We can then use the filter() function to keep only the rows that are not coded NA for the variable GarageCars.

NA is a special value in R; it is not the same as the text string “NA”. For this reason, we use the special function is.na() to identify missing values. The expression is.na(GarageCars) returns TRUE for each case that is missing on the variable GarageCars, and FALSE for each case that is not.
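A short sketch of why is.na() is needed, using a made-up vector rather than the Ames data: comparing with == NA does not produce TRUE or FALSE, and the text string "NA" is not a missing value.

```r
# Hypothetical values with one missing entry
garage_cars <- c(2, NA, 1)

# Comparing to NA with == gives NA for every element, not TRUE/FALSE
compared <- garage_cars == NA

# is.na() gives TRUE only where the value is missing
missing <- is.na(garage_cars)

# The text string "NA" is an ordinary string, not a missing value
string_is_missing <- is.na("NA")

# !is.na(x) is a shorter way to write is.na(x) == FALSE
same_result <- identical(!is.na(garage_cars), is.na(garage_cars) == FALSE)

compared
missing
string_is_missing
same_result
```

Either form, is.na(GarageCars) == FALSE or !is.na(GarageCars), can be used to pick out the cases that are not missing.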

If we want to get the data from homes that do not have missing data for GarageCars, we could use the filter() function with the argument is.na(GarageCars) == FALSE. This should give us only the rows in which the variable GarageCars has a numerical value.

Let’s try it. Previously we used filter(Ames, PriceK > 300) to filter in homes where PriceK is greater than 300. Modify the code below to filter in homes where GarageCars is not NA.

require(coursekata)
Ames <- Ames %>% arrange(desc(GarageCars))

# Filter for homes where GarageCars is not NA
Ames_subset <- filter(Ames, is.na(GarageCars) == FALSE)

# To check your work, this prints out the variable GarageCars from Ames_subset
# Do you see any NAs?
Ames_subset$GarageCars

3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

We succeeded in getting rid of the homes for which GarageCars is missing. But sometimes removing cases with missing data may introduce bias into your sample.

To see what kind of bias we might be introducing, it’s often helpful to take a closer look at the observations we intend to remove.

Run the code below to see what happens.

require(coursekata)
Ames <- Ames %>% arrange(desc(GarageCars))

# try running this code
filter(Ames, is.na(GarageCars))

If you scroll over and look at the variable GarageCars, you will see that these houses all have NA. But notice right next to that variable is another variable, GarageType. It turns out these four houses all have “None” for GarageType, meaning they don’t have garages. This may explain why GarageCars is coded as missing. You can’t measure how many cars will fit into a garage that doesn’t exist!

If we remove these observations, we could bias our analyses by underrepresenting homes without garages. We must be careful when making decisions about removing observations with missing data.
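One way to check for this kind of pattern before removing anything is to cross-tabulate missingness against another variable. The sketch below does this with base R's table() on a small made-up data frame; the data frame homes and its values are hypothetical, not the actual Ames data.

```r
# Hypothetical data: the two homes without a garage are missing GarageCars
homes <- data.frame(
  GarageCars = c(2, 1, NA, 3, NA, 2),
  GarageType = c("Attached", "Detached", "None", "Attached", "None", "Detached")
)

# Count missing (TRUE) and non-missing (FALSE) GarageCars within each GarageType
missing_by_type <- table(
  GarageType = homes$GarageType,
  Missing    = is.na(homes$GarageCars)
)

missing_by_type
```

If all of the missing values fall in a single category, as they do here for "None", removing those rows would remove that whole group of homes from the analysis.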
