High School / Advanced Statistics and Data Science I (ABC)

2.10 Missing Data

Once data are in a tidy format, we can use R commands to manipulate the data in various ways. On this page we will learn to handle missing data; on the next page we will learn to create new variables and recode existing variables.

Identifying Missing Data

Sometimes (in fact, usually) we end up with some missing data in our data set. R represents missing data with the value NA (not available) and lets you decide how to handle those values in subsequent analyses. If your data set represents missing data in some other way (e.g., some people use the value -999), you should recode those values as NA when working in R.
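As a minimal sketch of that recoding step, assume a made-up vector in which -999 was used as the missing-data code. The name ss_raw and its values are hypothetical and are not taken from Fingers. Base R can overwrite the placeholder with NA:

```r
# Hypothetical raw values; -999 stands in for "missing" (an assumed coding scheme)
ss_raw <- c(4, -999, 7, 0, -999, 9)

# Copy the vector, then replace every -999 with NA
ss_clean <- ss_raw
ss_clean[ss_clean == -999] <- NA

# Print the recoded values
print(ss_clean)
```

Working on a copy (ss_clean) leaves the original values available in case the recoding needs to be checked later.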

Let’s consider the last digit of students’ Social Security Numbers (SSLast) in the Fingers data frame. First, arrange the Fingers data frame so that rows are in descending order by SSLast (hint: use the desc() function). We have written some code that will print out just the variable SSLast from the Fingers data frame (remember to use $).

require(coursekata)

# Arrange the Fingers data frame in descending order by SSLast
Fingers_arranged <- arrange(Fingers, desc(SSLast))

# This will print the values of the variable Fingers_arranged$SSLast
print(Fingers_arranged$SSLast)
  [1] 9397 8894 7700 7549 7037 6990 6346 6292 6138 5461 5112 4800 3530 3364 3362
 [16] 2354 2019 1821 1339 1058  791  789  760    9    9    9    9    9    9    9
 [31]    9    9    9    9    9    9    9    9    8    8    8    8    8    8    8
 [46]    8    8    7    7    7    7    7    7    7    7    7    7    7    7    7
 [61]    7    7    6    6    6    6    6    6    6    5    5    5    5    4    4
 [76]    4    4    4    4    4    4    4    4    4    3    3    3    3    3    3
 [91]    3    3    3    3    3    3    3    3    3    3    2    2    2    2    2
[106]    2    2    2    2    2    1    1    1    1    1    1    1    0    0    0
[121]    0    0    0    0    0    0    0    0   NA   NA   NA   NA   NA   NA   NA
[136]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
[151]   NA   NA   NA   NA   NA   NA   NA

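When R reads in data, blank cells are typically given the special value NA (not available). Notice that arrange() placed the NA values at the end of the output, even though we sorted in descending order. You can choose to remove rows (i.e., observations) with missing data from an individual analysis, or you can remove them from the data set entirely.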
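To see what leaving missing values out of an individual analysis can look like, here is a small sketch using a made-up vector (digits is hypothetical). It counts the NA values with is.na() and then computes a mean with and without them, using the na.rm argument that many base R summary functions accept:

```r
# Hypothetical values with two missing entries
digits <- c(9, 3, NA, 7, NA, 1)

# is.na() returns TRUE for each NA; summing the TRUEs counts them
n_missing <- sum(is.na(digits))
print(n_missing)

# By default, the mean of a vector that contains NA is itself NA
print(mean(digits))

# na.rm = TRUE skips the NA values for this one calculation only
mean_observed <- mean(digits, na.rm = TRUE)
print(mean_observed)
```

The vector itself is unchanged; the NA values are skipped only inside that one call.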

Removing Rows with Missing Data

One drastic move is to create a new data frame without any missing data. The function na.omit() will remove all rows on which any variable has the value NA:

Fingers_complete <- na.omit(Fingers)

One issue with using na.omit() is that it will remove rows that have an NA on any variable, not just those with an NA on a specific variable of interest (e.g., SSLast). Because of this, using na.omit() might remove a lot more rows than you expected.
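The sketch below illustrates this on a small made-up data frame (toy and its values are hypothetical, chosen so that each incomplete row is missing just one value). Only one row is missing SSLast, yet na.omit() drops three rows:

```r
# Hypothetical data frame: three of the five rows are missing one value each
toy <- data.frame(
  SSLast = c(9, NA, 7, 3, 0),
  Height = c(64, 70, NA, 68, 66),
  Thumb  = c(60, 58, 62, NA, 55)
)

# Count the rows that have an NA on SSLast specifically
n_missing_sslast <- sum(is.na(toy$SSLast))
print(n_missing_sslast)

# na.omit() drops every row that has an NA on any variable
toy_complete <- na.omit(toy)

# Compare the number of rows before and after
print(nrow(toy))
print(nrow(toy_complete))
```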

To remove only the rows that have an NA in SSLast, we first have to identify those rows. We can then use the filter() function to keep only the rows that are not coded NA for the variable SSLast.

NA is a special value in R; it is not the same as the text string “NA”. For this reason, we use the special function is.na() to identify missing values.
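A short base R sketch of the difference (the object names here are hypothetical, used only for illustration):

```r
# NA is the missing-value marker; "NA" in quotes is ordinary text
missing_marker <- is.na(NA)
text_string <- is.na("NA")
print(missing_marker)
print(text_string)

# Comparing with == does not detect missing values: the result is itself NA
comparison <- NA == NA
print(comparison)

# is.na() checks each element of a vector and returns TRUE or FALSE for each
x <- c(5, NA, 2)
flags <- is.na(x)
print(flags)
```

Because a comparison with NA yields NA rather than TRUE or FALSE, a test like x == NA cannot be used to pick out missing values.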

This is a case where it will be more useful to find the rows where SSLast is not NA instead of those where it is. To keep only these rows we can use this filter command:

filter(Fingers, is.na(SSLast) == FALSE)

This code returns a data frame that includes only the cases for which the variable SSLast is not NA. As a reminder, the filter() function filters in, not out: it keeps the rows for which the condition is TRUE.
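The condition passed to filter() is a vector of TRUE and FALSE values, one per row. The following base R sketch, which uses a made-up vector (SSLast_toy is hypothetical) so that it runs without the coursekata package, shows what that condition looks like and that !is.na() is an equivalent way to write it:

```r
# Hypothetical column with two missing values
SSLast_toy <- c(9, NA, 7, NA, 0)

# The condition used in the text: TRUE where a value is present
keep_a <- is.na(SSLast_toy) == FALSE

# An equivalent, shorter way to write the same condition
keep_b <- !is.na(SSLast_toy)

print(keep_a)
print(identical(keep_a, keep_b))

# Base R subsetting keeps the elements where the condition is TRUE
print(SSLast_toy[keep_a])
```

Either form can serve as the condition in filter(); the rows where it is TRUE are the ones kept.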

As with anything in R, your filtered data frame is only temporary unless you save it to an R object. Go ahead and save the data with no missing SSLast values in a new data frame called Fingers_subset.

require(coursekata)
Fingers <- Fingers %>% arrange(desc(SSLast))

# Filter cases where SSLast is not NA
Fingers_subset <- filter(Fingers, is.na(SSLast) == FALSE)

# Print out the variable Fingers_subset$SSLast
Fingers_subset$SSLast
  [1] 9397 8894 7700 7549 7037 6990 6346 6292 6138 5461 5112 4800 3530 3364 3362
 [16] 2354 2019 1821 1339 1058  791  789  760    9    9    9    9    9    9    9
 [31]    9    9    9    9    9    9    9    9    8    8    8    8    8    8    8
 [46]    8    8    7    7    7    7    7    7    7    7    7    7    7    7    7
 [61]    7    7    6    6    6    6    6    6    6    5    5    5    5    4    4
 [76]    4    4    4    4    4    4    4    4    4    3    3    3    3    3    3
 [91]    3    3    3    3    3    3    3    3    3    3    2    2    2    2    2
[106]    2    2    2    2    2    1    1    1    1    1    1    1    0    0    0
[121]    0    0    0    0    0    0    0    0

Remember, however, that if you remove cases with missing data you may be introducing bias into your sample.
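One way such bias can arise is when the cases that are missing differ systematically from the cases that are present. The sketch below uses made-up scores and an assumed pattern of missingness (both hypothetical, chosen only for illustration) to show the mean shifting once the incomplete cases are dropped:

```r
# Hypothetical scores for ten people (the values we would have seen with no missing data)
scores_true <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Assume, for illustration, that the three lowest scorers did not respond
scores_observed <- scores_true
scores_observed[scores_true <= 3] <- NA

# Mean of everyone versus mean after dropping the missing cases
mean_true <- mean(scores_true)
mean_complete <- mean(scores_observed, na.rm = TRUE)

print(mean_true)
print(mean_complete)
```

Here the mean of the remaining cases is higher than the mean of the full group, because the missing values were not a random subset of the data.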
