2.10 Missing Data
Once data are in a tidy format, we can use R commands to manipulate the data in various ways. On this page we will learn to handle missing data; on the next page we will learn to create new variables and recode existing variables.
Identifying Missing Data
Sometimes (in fact, usually) we end up with some missing data in our data set. R represents missing data with the value NA (not available) and lets you decide how to handle it in subsequent analyses. If your data set represents missing data in some other way (e.g., some people use the value -999), you should recode those values as NA when working in R.
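As a minimal sketch of that recoding step, the code below uses a small made-up data frame. The names survey and score are hypothetical and are not part of the Fingers data; the sentinel value -999 is the one mentioned above.

```r
# Hypothetical toy data: -999 stands in for "missing" in the raw file
survey <- data.frame(
  id = 1:5,
  score = c(12, -999, 7, -999, 3)
)

# Recode the sentinel value as NA so R treats it as missing
survey$score[survey$score == -999] <- NA

# The two -999 entries now print as NA
print(survey$score)
```

After recoding, functions that know about NA can handle these values as missing instead of treating -999 as a real score.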
Let’s consider the last digit of students’ Social Security Numbers (SSLast) in the Fingers data frame. First, arrange the Fingers data frame so that rows are in descending order by SSLast (hint: use the desc() function). We have written some code that will print out just the variable SSLast from the Fingers data frame (remember to use $).
require(coursekata)
# Edit this to arrange Fingers dataframe in descending order by SSLast
Fingers_arranged <- arrange(Fingers, SSLast)
# This will print the values of the variable Fingers_arranged$SSLast
print(Fingers_arranged$SSLast)
# Solution: arrange the Fingers data frame in descending order by SSLast
Fingers_arranged <- arrange(Fingers, desc(SSLast))
# This will print the values of the variable Fingers_arranged$SSLast
print(Fingers_arranged$SSLast)
[1] 9397 8894 7700 7549 7037 6990 6346 6292 6138 5461 5112 4800 3530 3364 3362
[16] 2354 2019 1821 1339 1058 791 789 760 9 9 9 9 9 9 9
[31] 9 9 9 9 9 9 9 9 8 8 8 8 8 8 8
[46] 8 8 7 7 7 7 7 7 7 7 7 7 7 7 7
[61] 7 7 6 6 6 6 6 6 6 5 5 5 5 4 4
[76] 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3
[91] 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2
[106] 2 2 2 2 2 1 1 1 1 1 1 1 0 0 0
[121] 0 0 0 0 0 0 0 0 NA NA NA NA NA NA NA
[136] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[151] NA NA NA NA NA NA NA
In R, blanks are automatically given the special value NA for not available. You can choose to remove rows (i.e., observations) with missing data from an individual analysis, or you can remove them from the data set entirely.
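One common way to leave missing values out of a single analysis, without changing the data itself, is the na.rm argument that many base R summary functions accept. The sketch below uses a made-up vector (ss_last is a hypothetical name, not the real Fingers data).

```r
# Hypothetical values with two missing entries
ss_last <- c(9, 7, NA, 3, NA, 1)

# By default, a single NA makes the result NA
mean_with_na <- mean(ss_last)

# na.rm = TRUE drops the NAs for this one calculation only
mean_without_na <- mean(ss_last, na.rm = TRUE)

print(mean_with_na)     # NA
print(mean_without_na)  # 5
```

The vector ss_last still contains its NA values afterward; only the calculation ignored them.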
Removing Rows with Missing Data
One drastic move is to create a new data frame without any missing data. The function na.omit() will remove all rows on which any variable has the value NA:
Fingers_complete <- na.omit(Fingers)
One issue with using na.omit() is that it will remove rows that have an NA on any variable, not just those with an NA on a specific variable of interest (e.g., SSLast). Because of this, using na.omit() might remove a lot more rows than you expected.
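To see how much more na.omit() can remove, here is a small sketch with a made-up data frame. The data frame toy and its values are hypothetical; only the variable name SSLast comes from the text.

```r
# Hypothetical data: one NA in SSLast, two NAs in Height
toy <- data.frame(
  SSLast = c(9, NA, 4, 2, 0),
  Height = c(65, 70, NA, 62, NA)
)

# na.omit() drops every row that has an NA in any column
toy_complete <- na.omit(toy)

# Count how many rows are missing SSLast specifically
# (is.na() flags missing values with TRUE)
n_missing_sslast <- sum(is.na(toy$SSLast))

print(nrow(toy))            # 5 rows to start
print(nrow(toy_complete))   # 2 rows survive na.omit()
print(n_missing_sslast)     # but only 1 row is missing SSLast
```

In this toy case, na.omit() discards three rows even though only one of them is missing the variable of interest.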
To remove only rows that have an NA in SSLast, we first have to identify those rows. We can then use the filter() function to include only those rows that are not coded NA for the variable SSLast.
NA is a special value in R; it is not the same as the text string “NA”. For this reason, we use the special function is.na() to identify missing values.
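A short sketch of why is.na() is needed: comparing against NA with == does not return TRUE or FALSE, and the text string "NA" is not a missing value. The vector x below is a made-up example.

```r
# Hypothetical vector with one missing value
x <- c(3, NA, 7)

# Comparing to NA with == yields NA for every element, never TRUE
compared <- x == NA
print(compared)

# is.na() gives a usable TRUE/FALSE answer for each element
missing_flags <- is.na(x)
print(missing_flags)

# The text string "NA" is ordinary text, not a missing value
string_is_missing <- is.na("NA")
print(string_is_missing)
```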
This is a case where it will be more useful to find the rows where SSLast is not NA instead of those where it is. To keep only these rows, we can use this filter command:
filter(Fingers, is.na(SSLast) == FALSE)
This code returns a data frame that includes only cases for which the variable SSLast is not NA. As a reminder, the filter() function keeps the rows that meet the condition: it filters in, not out.
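The condition inside filter() is just a vector of TRUE and FALSE values, one per row. The sketch below builds that vector by hand on a made-up data frame and uses base R indexing, rather than filter() itself, to keep the TRUE rows. The names toy and keep are hypothetical.

```r
# Hypothetical data with two missing SSLast values
toy <- data.frame(
  SSLast = c(9, NA, 4, NA, 0),
  Thumb  = c(60, 58, 66, 61, 63)
)

# TRUE marks the rows we want to keep (SSLast is not missing)
keep <- is.na(toy$SSLast) == FALSE
print(keep)

# !is.na() is a shorter way to write the same test
same_test <- identical(keep, !is.na(toy$SSLast))
print(same_test)

# Base R indexing keeps the rows where keep is TRUE
toy_subset <- toy[keep, ]
print(toy_subset)
```

Rows marked TRUE are the ones retained, which is the sense in which the condition describes what to keep rather than what to drop.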
As with anything in R, your filtered data frame is only temporary unless you save it to an R object. Go ahead and save the data with no missing SSLast values in a new data frame called Fingers_subset.
require(coursekata)
Fingers <- Fingers %>%
  arrange(desc(SSLast))
# Filter cases where SSLast is not NA
Fingers_subset <-
# Print out the variable Fingers_subset$SSLast
# Solution: keep only the cases where SSLast is not NA
Fingers_subset <- filter(Fingers, is.na(SSLast) == FALSE)
# Print out the variable Fingers_subset$SSLast
Fingers_subset$SSLast
[1] 9397 8894 7700 7549 7037 6990 6346 6292 6138 5461 5112 4800 3530 3364 3362
[16] 2354 2019 1821 1339 1058 791 789 760 9 9 9 9 9 9 9
[31] 9 9 9 9 9 9 9 9 8 8 8 8 8 8 8
[46] 8 8 7 7 7 7 7 7 7 7 7 7 7 7 7
[61] 7 7 6 6 6 6 6 6 6 5 5 5 5 4 4
[76] 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3
[91] 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2
[106] 2 2 2 2 2 1 1 1 1 1 1 1 0 0 0
[121] 0 0 0 0 0 0 0 0
Remember, however, that if you remove cases with missing data you may be introducing bias into your sample.
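To make the idea of bias concrete, here is a sketch with invented numbers in which the rows missing SSLast happen to belong to taller students. The data frame toy and all of its values are hypothetical and chosen only to illustrate the point; they say nothing about the real Fingers data.

```r
# Hypothetical data: the two missing SSLast values are both tall students
toy <- data.frame(
  Height = c(60, 62, 64, 70, 72, 74),
  SSLast = c(1, 5, 8, NA, NA, 3)
)

# Mean height using every row
mean_all <- mean(toy$Height)

# Mean height after dropping rows with missing SSLast
kept <- toy[!is.na(toy$SSLast), ]
mean_kept <- mean(kept$Height)

print(mean_all)   # 67
print(mean_kept)  # 65
```

Because missingness in this toy example is related to height, dropping those rows shifts the average height downward. When missing values are unrelated to the other variables, the shift would tend to be smaller.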