R Find Missing Values (6 Examples for Data Frame, Column & Vector)

Let’s face it:

Missing values are an issue in almost every raw data set!

If we don’t handle our missing data in an appropriate way, our estimates are likely to be biased.

However, before we can deal with missingness, we need to identify in which rows and columns the missing values occur.

In the following, I will show you several examples of how to find missing values in R.

 

Example 1: One of the most common ways in R to find missing values in a vector

expl_vec1 <- c(4, 8, 12, NA, 99, -20, NA) # Create your own example vector with NAs
 
is.na(expl_vec1) # The is.na() function returns a logical vector. The vector is TRUE in case
                 # of a missing value and FALSE in case of an observed value
### [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
which(is.na(expl_vec1)) # The which() function returns the positions of the missing values in your vector.
                        # In our case there are NAs at positions 4 & 7
### [1] 4 7
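
Once the positions are known, a common next step is to check quickly whether a vector contains any NA at all and to keep only the observed values. The following is a minimal sketch based on the vector of Example 1; the object name observed_values is just an illustrative choice of mine.

```r
# Example vector from Example 1, repeated so that this snippet runs on its own
expl_vec1 <- c(4, 8, 12, NA, 99, -20, NA)

anyNA(expl_vec1)                                 # TRUE if the vector contains at least one NA
observed_values <- expl_vec1[!is.na(expl_vec1)]  # Keep only the observed values (hypothetical object name)
observed_values
```

The negation !is.na() turns the logical vector around, so that it can be used directly for subsetting.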

 

Example 2: Find missing values in a column of a data frame

expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3), # Numeric variable with one missing value
                         x2 = c(4, 1, NA, NA, 4), # Numeric variable with two missing values
                         x3 = c(1, 4, 2, 9, 6), # Numeric variable without any missing values
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA)) # Character variable with two missing
                                                                             # values (a factor before R 4.0.0)
expl_data1 # This is what our data with missing values looks like

 

  x1 x2 x3          x4
1 NA  4  1       Hello
2  7  1  4 I am not NA
3  8 NA  2        <NA>
4  9 NA  9    I love R
5  3  4  6        <NA>

Table 1: Example Data Frame with Missing Values

 

which(is.na(expl_data1$x1)) # Same procedure as in Example 1, but this time with a column of a data frame;
                            # Missing value in x1 at position 1
which(is.na(expl_data1$x2)) # Variable x2 has missing values at positions 3 and 4
which(is.na(expl_data1$x3)) # The variable x3 in column 3 has no missing values, so integer(0) is returned
which(is.na(expl_data1$x4)) # Our character variable x4 in column 4 has missing values at positions 3 and 5;
                            # The same procedure can be applied to factors
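
The logical vector returned by is.na() can also be used to display the rows in which a certain column is missing. A small sketch, assuming the example data of this example; rows_x2_missing is a name I made up for illustration.

```r
# Example data from Example 2, repeated so that this snippet runs on its own
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6),
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA))

rows_x2_missing <- expl_data1[is.na(expl_data1$x2), ]  # All rows with a missing value in x2
rows_x2_missing
```

This way you can inspect the other variables of exactly those cases that have an NA in x2.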

 

Example 3: Identify missing values in an R data frame

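# As in Example 1, is.na() returns logical TRUE and FALSE values, this time as a matrix
# with one column per variable; TRUE indicates a missing value and FALSE an observed value
is.na(expl_data1)
apply(is.na(expl_data1), 2, which) # In order to get the positions of the missing values in each column of your
                                   # data set, you can use the apply() function; here the result is a list,
                                   # because the columns contain different numbers of NAs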
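
As an alternative to apply(), the which() function can return row and column indices at once via its arr.ind argument. The following is a sketch under the assumption that the example data of Example 2 is used; na_positions is an illustrative object name.

```r
# Example data from Example 2, repeated so that this snippet runs on its own
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6),
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA))

# One line per missing value, with its row index and its column index
na_positions <- which(is.na(expl_data1), arr.ind = TRUE)
na_positions

# Translate the column indices into variable names
colnames(expl_data1)[na_positions[ , "col"]]
```

Compared to the list returned by apply(), this matrix is sometimes easier to process further, because every missing cell is described by one row.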

 

Example 4: Detect missing values in a column of an R matrix

# Create a matrix based on the first three columns of our example data of Example 2
expl_matrix1 <- as.matrix(expl_data1[ , 1:3])
expl_matrix1
 
which(is.na(expl_matrix1[ , 1])) # The $ operator is invalid for columns of matrices.
                                 # Therefore we have to select our matrix columns with square brackets 
which(is.na(expl_matrix1[ , 2])) # Besides the change from the $ operator to square brackets,
                                 # we can apply the same functions as in the other examples
which(is.na(expl_matrix1[ , 3])) # Again, no missing values in x3

 

Example 5: Identify NA values in a matrix

# We can check the missing values of the whole matrix with the same procedure as in Example 3
apply(is.na(expl_matrix1), 2, which)

 

Example 6: Find missing values in R with the complete.cases() function

# An alternative to the is.na() function is the function complete.cases(),
# which searches for observed values instead of missing values
which(complete.cases(expl_vec1)) # Identify observed values (opposite of the result in Example 1)
which(!complete.cases(expl_vec1)) # Reproduce the result of Example 1 by negating with !
complete.cases(expl_data1) # If a data frame or matrix is checked by complete.cases(),
                           # the function returns a logical vector indicating whether a row is complete
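
The logical vector of complete.cases() can be used to split a data frame into complete and incomplete rows. A minimal sketch, again assuming the example data of Example 2; the object names complete_rows and incomplete_rows are my own illustrative choices.

```r
# Example data from Example 2, repeated so that this snippet runs on its own
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6),
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA))

complete_rows   <- expl_data1[complete.cases(expl_data1), ]   # Rows without any NA
incomplete_rows <- expl_data1[!complete.cases(expl_data1), ]  # Rows with at least one NA
complete_rows
incomplete_rows
```

In our example data only the second row is fully observed, so complete_rows consists of a single row.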

 

Video Example – Detect Missing Values in a Real Data Set

The following video on my YouTube channel shows in a live example how to find NAs, how to count NAs, how to omit NAs, and how to remove missing values.

Have a look at minute 1:05.

There, I show the same approach that I explained in Example 1.

 

 

R – Count Missing Values per Row and Column

Besides the positions of your missing data, you might also want to know how many values are missing per row, per column, or in a single vector. Let’s check how to do this based on our example data above:

# With the sum() and is.na() functions you can count the missing values in your data
sum(is.na(expl_vec1)) # Two missing values in our vector
sum(is.na(expl_data1)) # The same method works for the whole data frame; five missing values overall
sum(is.na(expl_matrix1)) # The procedure also works for matrices; the NA count is three in our case
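
The sum() function returns the overall count only. For counts per column and per row, one possible approach is to combine is.na() with colSums() and rowSums(). The following sketch assumes the example data of Example 2; the object names are hypothetical.

```r
# Example data from Example 2, repeated so that this snippet runs on its own
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6),
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA))

na_per_column <- colSums(is.na(expl_data1))  # Number of NAs in each column
na_per_row    <- rowSums(is.na(expl_data1))  # Number of NAs in each row
na_per_column
na_per_row

colMeans(is.na(expl_data1))                  # Proportion of NAs in each column
```

Since TRUE counts as 1 and FALSE as 0, summing the logical matrix by column or by row directly yields the number of missing values.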

 

How to Handle Missing Data in R?

Once we have found missing values in our data, the question arises how we should treat these not available values. Most data analyses in R require complete case data!

Many R modelling functions (e.g. lm()) apply listwise deletion by default, which deletes all rows with missing values in one or more columns.

Basic data manipulations can be done with the na.omit command or with the is.na R function.
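
To illustrate these two options, here is a small sketch based on the example data of Example 2. The object names data_listwise and data_x1_observed are made up for this illustration.

```r
# Example data from Example 2, repeated so that this snippet runs on its own
expl_data1 <- data.frame(x1 = c(NA, 7, 8, 9, 3),
                         x2 = c(4, 1, NA, NA, 4),
                         x3 = c(1, 4, 2, 9, 6),
                         x4 = c("Hello", "I am not NA", NA, "I love R", NA))

data_listwise <- na.omit(expl_data1)                      # Listwise deletion: drop rows with any NA
data_listwise

data_x1_observed <- expl_data1[!is.na(expl_data1$x1), ]   # Drop only rows with a missing x1
data_x1_observed
```

Note how much data listwise deletion can cost: only one of the five rows of our example data survives.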

A more sophisticated approach – which is usually preferable to a complete case analysis – is the imputation of missing values.

Very simple imputation approaches would be mean imputation (mode imputation in case of categorical variables) or the replacement of NA’s with 0.
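
A minimal sketch of these two simple approaches, assuming the columns x1 and x2 of the example data of Example 2 (the object names are hypothetical, and this is meant as an illustration rather than a recommendation):

```r
# Columns x1 and x2 of the example data, repeated so that this snippet runs on its own
x1 <- c(NA, 7, 8, 9, 3)
x2 <- c(4, 1, NA, NA, 4)

x2_mean_imputed <- x2                                             # Copy the original variable
x2_mean_imputed[is.na(x2_mean_imputed)] <- mean(x2, na.rm = TRUE) # Replace NAs by the observed mean
x2_mean_imputed

x1_zero_imputed <- x1                                             # Copy the original variable
x1_zero_imputed[is.na(x1_zero_imputed)] <- 0                      # Replace NAs by 0
x1_zero_imputed
```

The argument na.rm = TRUE is needed, because mean() would otherwise return NA for a vector with missing values.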

However, in order to create a more reasonable complete data set, missing data imputation usually replaces missing values with estimates that are based on statistical models (e.g. via regression imputation or predictive mean matching).

 

Now It’s Your Turn

So that is how I’m checking for missing values in my data sets.

Now I’d like to hear about your thoughts: What’s your favorite approach?

Are you going to use the is.na function of Example 1? Or will you find NA’s by searching for complete cases?

Let me know by leaving a comment below. I will respond to every question!

 



 

Appendix

How to create the graphic of the header of this page

The header graphic shows a simple scatterplot created with the R package ggplot2.

The dark blue values indicate observed values; the light blue values indicate missingness.

Since the missing values appear more often in the upper right part of the plot, they cannot be considered Missing Completely At Random (MCAR).

library(ggplot2) # Load the ggplot2 package
 
set.seed(8765) # Reproducibility
 
var1 <- rnorm(2000, 10, 3) # Normal distribution
var2 <- var1 + rnorm(2000) # Correlated normal distribution
 
range01 <- function(x){(x - min(x)) / (max(x) - min(x))} # Rescale values to the range between 0 and 1
var2_miss <- rbinom(2000, 1, range01(var1^3)) == 1 # Insert missing values for var2 in dependence on var1
 
data_ggplot_missings <- data.frame(var1, var2) # Store var1 and var2 in a data frame
 
colours <- rep(1, 2000) # Set colours 
colours[var2_miss] <- 2
 
ggplot_missings <- ggplot(data_ggplot_missings, aes(x = var1, y = var2)) + # Create ggplot
  geom_point(aes(col = colours), size = 1.1) + # Numeric colours use the default dark to light blue scale
  theme(legend.position = "none")
ggplot_missings # Draw the plot
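
The dependence of the missingness on var1 can also be checked numerically instead of visually. The following sketch regenerates the simulated data with the same seed and compares the mean of var1 between cases with and without a missing indicator; it does not need ggplot2.

```r
set.seed(8765)                                        # Same seed as in the code above

var1 <- rnorm(2000, 10, 3)                            # Normal distribution
var2 <- var1 + rnorm(2000)                            # Correlated normal distribution

range01 <- function(x) (x - min(x)) / (max(x) - min(x))   # Rescale values to the range between 0 and 1
var2_miss <- rbinom(2000, 1, range01(var1^3)) == 1    # Missingness indicator depending on var1

mean_var1_by_miss <- tapply(var1, var2_miss, mean)    # Mean of var1 for observed (FALSE) and missing (TRUE)
mean_var1_by_miss
```

If the values were Missing Completely At Random, both means would be expected to be similar. Here the probability of missingness increases with var1, so the mean of var1 should be clearly larger in the group with missing values.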

 
