2.10 Missing Data
Once data are in a tidy format, we can use R commands to manipulate the data in various ways. On this page we will learn to handle missing data; on the next page we will learn to create new variables and recode existing variables.
Identifying Missing Data
Sometimes (in fact, usually) we end up with some missing data in our dataset. R represents missing data with the value NA (not available) and lets you decide how to handle those values in subsequent analyses. If your dataset represents missing data in some other way (e.g., some people put the value -999), you should recode those values as NA when working in R.
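As a minimal sketch of that recoding step, assume a small made-up data frame called survey in which -999 stands for a missing score. The data frame and its column names are hypothetical and are not part of the Fingers data.

```r
# Hypothetical data: -999 was used as a placeholder for "missing"
survey <- data.frame(
  id = 1:5,
  score = c(12, -999, 7, -999, 15)
)

# Replace every -999 in the score column with NA
survey$score[survey$score == -999] <- NA

# Print the recoded column
print(survey$score)
```

After this step, functions such as is.na() will treat the recoded values as missing.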
Let’s consider the last digit of students’ Social Security Numbers (SSLast) in the Fingers data frame. First, arrange the Fingers data frame so that rows are in descending order by SSLast (hint: use the desc() function). We have written some code that will print out just the variable SSLast from the Fingers data frame (remember to use $).
require(coursekata)
# Edit this to arrange Fingers dataframe in descending order by SSLast
Fingers_arranged <- arrange(Fingers, SSLast)
# This will print the values of the variable Fingers_arranged$SSLast
print(Fingers_arranged$SSLast)
# Solution: arrange the Fingers data frame in descending order by SSLast
Fingers_arranged <- arrange(Fingers, desc(SSLast))
# This will print the values of the variable Fingers_arranged$SSLast
print(Fingers_arranged$SSLast)
[1] 9397 8894 7700 7549 7037 6990 6346 6292 6138 5461 5112 4800 3530 3364 3362
[16] 2354 2019 1821 1339 1058 791 789 760 9 9 9 9 9 9 9
[31] 9 9 9 9 9 9 9 9 8 8 8 8 8 8 8
[46] 8 8 7 7 7 7 7 7 7 7 7 7 7 7 7
[61] 7 7 6 6 6 6 6 6 6 5 5 5 5 4 4
[76] 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3
[91] 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2
[106] 2 2 2 2 2 1 1 1 1 1 1 1 0 0 0
[121] 0 0 0 0 0 0 0 0 NA NA NA NA NA NA NA
[136] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[151] NA NA NA NA NA NA NA
In R, blanks are automatically given the special value NA for not available. You can choose to remove rows (i.e., observations) with missing data from an individual analysis, or you can remove them from the dataset entirely.
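To illustrate leaving missing values out of a single analysis, here is a small sketch using a made-up vector (not the course data). Many R summary functions, such as mean(), accept an na.rm argument for this purpose.

```r
# Hypothetical measurements with two missing values
x <- c(4, 8, NA, 6, NA)

# By default, a single NA makes the result NA
mean(x)

# na.rm = TRUE drops the NAs for this one calculation only;
# the vector x itself is unchanged
mean(x, na.rm = TRUE)
```

The second call averages only the three observed values, while x still holds all five entries.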
Removing Rows with Missing Data
One drastic move is to create a new data frame without any missing data. The function na.omit() will remove all rows on which any variable has the value NA:
Fingers_complete <- na.omit(Fingers)
One issue with using na.omit() is that it will remove rows that have an NA on any variable, not just those with an NA on a specific variable of interest (e.g., SSLast). Because of this, using na.omit() might remove a lot more rows than you expected.
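A small sketch makes the difference concrete. The toy data frame below uses made-up values (it is not the Fingers data) with one NA in each of three columns:

```r
# Hypothetical data frame with missing values scattered across columns
toy <- data.frame(
  SSLast = c(3, NA, 7, 1, 9),
  Height = c(65, 70, NA, 62, 68),
  Thumb  = c(60, 58, 61, NA, 55)
)

# Total number of rows
nrow(toy)

# Number of rows missing SSLast specifically
sum(is.na(toy$SSLast))

# na.omit() drops every row that has an NA in any column
toy_complete <- na.omit(toy)
nrow(toy_complete)
```

Only one row is missing SSLast, but na.omit() drops three rows because Height and Thumb each have a missing value as well.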
To remove only rows that have an NA in SSLast, we first have to identify those rows. We can then use the filter() function to include only those rows that are not coded NA for the variable SSLast.
NA is a special value in R; it is not the same as the text string “NA”. For this reason, we use the special function is.na() to identify missing values.
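The following sketch, using a made-up vector, shows why an ordinary comparison with == cannot be used to find missing values and how is.na() behaves instead:

```r
# Hypothetical vector with one missing value
x <- c(5, NA, 2)

# Comparing with == does not find missing values:
# any comparison involving NA is itself NA
x == NA

# is.na() returns TRUE exactly where the value is missing
is.na(x)

# The text string "NA" is an ordinary value, not a missing one
is.na("NA")
```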
This is a case where it will be more useful to find the rows where SSLast is not NA instead of those where it is. To keep only these rows we can use this filter command:
filter(Fingers, is.na(SSLast) == FALSE)
This code returns a data frame that includes only cases for which the variable SSLast is not NA. Just a reminder, the filter() function filters in, not out: it keeps the rows that meet the condition.
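As a hedged sketch of the same idea on a toy data frame (made-up values, not the Fingers data), the condition can also be written with the ! ("not") operator. This example assumes the filter() used on this page is the one from the dplyr package, and loads dplyr directly so the snippet stands on its own.

```r
library(dplyr)  # assumption: filter() here is dplyr's filter()

# Hypothetical data frame with missing values in two columns
toy <- data.frame(
  SSLast = c(3, NA, 7, NA, 9),
  Height = c(65, 70, NA, 62, 68)
)

# Keep rows where SSLast is not missing (the form used on this page)
kept_a <- filter(toy, is.na(SSLast) == FALSE)

# A commonly used shorthand for the same condition
kept_b <- filter(toy, !is.na(SSLast))

# Both approaches keep the same rows
all.equal(kept_a, kept_b)
nrow(kept_a)
```

Note that the row with a missing Height is kept, because the condition looks only at SSLast.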
As with anything in R, your filtered data frame is only temporary unless you save it to an R object. Go ahead and save the data with no missing SSLast values in a new data frame called Fingers_subset.
require(coursekata)
Fingers <- Fingers %>%
  arrange(desc(SSLast))
# Filter cases where SSLast is not NA
Fingers_subset <-
# Print out the variable Fingers_subset$SSLast
# Solution: keep only cases where SSLast is not NA
Fingers_subset <- filter(Fingers, is.na(SSLast) == FALSE)
# Print out the variable Fingers_subset$SSLast
Fingers_subset$SSLast
[1] 9397 8894 7700 7549 7037 6990 6346 6292 6138 5461 5112 4800 3530 3364 3362
[16] 2354 2019 1821 1339 1058 791 789 760 9 9 9 9 9 9 9
[31] 9 9 9 9 9 9 9 9 8 8 8 8 8 8 8
[46] 8 8 7 7 7 7 7 7 7 7 7 7 7 7 7
[61] 7 7 6 6 6 6 6 6 6 5 5 5 5 4 4
[76] 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3
[91] 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2
[106] 2 2 2 2 2 1 1 1 1 1 1 1 0 0 0
[121] 0 0 0 0 0 0 0 0
Remember, however, that if you remove cases with missing data you may be introducing bias into your sample.
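Before removing cases, it can help to check how much of a variable is missing. This is a minimal sketch on a made-up vector; the same two lines could be applied to a column such as Fingers$SSLast.

```r
# Hypothetical values with some entries missing
x <- c(3, NA, 7, NA, 9, 1, NA, 4)

# Number of missing values
sum(is.na(x))

# Proportion of values that are missing
mean(is.na(x))
```

The larger the proportion of missing values, the more reason there is to think about who is being left out when those rows are removed.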