Predictive Mean Matching Imputation (Theory & Example in R)

Predictive mean matching is the new gold standard of imputation methodology!

Forget about all these outdated and crappy methods such as mean substitution or regression imputation

In the following article, I’ll show you why predictive mean matching is heavily outperforming all the other imputation methods for missing data.

I have split the article in several sections. You may jump to the specific topic you are interested in:

Predictive Mean Matching Explained

Before we can dive into the R programming example, let’s first define what predictive mean matching exactly is. The predictive mean matching algorithm can be split into 6 steps (See also Paul Allison or Vink et al., 2014):

1. Estimate a linear regression model:
2. Draw randomly from the posterior predictive distribution of $$\hat{\beta}$$ and produce a new set of coefficients $$\beta^{*}$$.
• This bayesian step is needed for all multiple imputation methods to create some random variability in the imputed values.
• More details, e.g., in Yuan, 2005.
3. Calculate predicted values for observed and missing $$Y$$.
• Use $$\hat{\beta}$$ to calculate predicted values for observed $$Y$$.
• Use $$\beta^{*}$$ to calculate predicted values for missing $$Y$$.
4. For each case where $$Y$$ is missing, find the closest predicted values among cases where $$Y$$ is observed.
• Example:
• $$Y_i$$ is missing. Its predicted value is 10 (based on $$\beta^{*}$$).
• Our data consists of five observed cases of $$Y$$ with the values 6, 3, 22, 7, and 12.
• In step 3, we predicted the values 7, 2, 20, 9, and 13 for these five observed cases (based on $$\hat{\beta}$$).
• The predictive mean matching algorithm selects the closest observed values (typically three cases) to our missing value $$Y_i$$. Hence, the algorithm selects the values 7, 9, and 13 (the closest values to 10).
5. Draw randomly one of these three close cases and impute the missing value $$Y_i$$ with the observed value of this close case.
• Example continued:
• The algorithm draws randomly from 6, 7, and 12 (the observed values that correspond to the predicted values 7, 9, and 13).
• The algorithm chooses 12 and substitutes this value to $$Y_i$$.
6. In case of multiple imputation (which I strongly advise), steps 1-5 are repeated several times.
• Each repetition of steps 1-5 creates a new imputed data set.
• With multiple imputation, missing data is typically imputed 5 times.

Predictive Mean Matching in R (Example)

I have to admit, the predictive mean matching algorithm is not so easy to understand, when you read it the first time. However, in practice it’s easy to apply!

In the following, I’m going to show you how to apply predictive mean matching in R.

Let’s create some random data for our example:

##### Example data #####   set.seed(918273) # Seed   N <- 3000 # Sample size   y <- round(runif(N, -10, 10)) # Target variable Y x1 <- y + round(runif(N, 0, 50)) # Auxiliary variable 1 x2 <- round(y + 0.25 * x1 + rnorm(N, - 3, 15)) # Auxiliary variable 2 x3 <- round(0.1 * x1 + rpois(N, 2)) # Auxiliary variable 3 x4 <- as.factor(round(0.02 * y + runif(N))) # Auxiliary variable 4 (categorical variable)   y[rbinom(N, 1, 0.2) == 1] <- NA # Insert 20% missing data in Y   data <- data.frame(y, x1, x2, x3, x4) # Store data in dataset head(data) # First 6 rows of our data

Our data consists of five variables: A target variable $$Y$$ with 20% missing values and four auxiliary variables $$X_1$$, $$X_2$$, $$X_3$$, and $$X_4$$. The first six rows of our example data look as follows: Table 1: First Six Rows of Our Data with Missing Values in Y

Let’s move on to the application of predictive mean matching to our example data. First, we need to install the R package mice (detailed explanations can be found in van Buuren and Groothuis-Oudshoorn, 2011).

##### Install mice package in R#####   install.packages("mice") # Install mice package library("mice") # Load mice package

With the following code, we can impute our missing data via single imputation.

The function mice is used to impute the data; m = 1 specifies single imputation; and method = “pmm” specifies predictive mean matching as imputation method.

The function complete stores the imputed data in a new data object (in our example, we call it data_imp_single).

##### Impute data via predictive mean matching (single imputation)#####   imp_single <- mice(data, m = 1, method = "pmm") # Impute missing values data_imp_single <- complete(imp_single) # Store imputed data # head(data_imp_single) # First 6 rows of our imputed data

As I mentioned previously, single imputation is almost never a good idea, since it underestimates your standard errors. However, by specifying m = 5 within the function mice, you can easily switch to multiple imputation (i.e. 5 imputed data sets).

The specifications “repeated” and include = TRUE tell the complete function how to store our multiply imputed data in the object data_imp_multi_all. Have a look at the documentation of complete in order to explore different ways of storing your imputed data (by typing ?complete into your R Studio console).

With the remaining code we do some simple data cleaning and store the dataset in the object data_imp_multi.

##### Predictive mean matching (multiple imputation)#####   imp_multi <- mice(data, m = 5, method = "pmm") # Impute missing values multiple times data_imp_multi_all <- complete(imp_multi, # Store multiply imputed data "repeated", include = TRUE)   data_imp_multi <- data.frame( # Combine imputed Y and X1-X4 (for convenience) data_imp_multi_all[ , 1:6], data[, 2:5]) head(data_imp_multi) # First 6 rows of our multiply imputed data

This is how our multiply imputed data set looks like: Table 2: First Six Rows with Multiply Imputed Values

• $$Y.0$$ is equal to the original $$Y$$ with missing values.
• $$Y.1$$–$$Y.5$$ are the five imputed versions of $$Y$$.
• $$X1$$–$$X4$$ are the four auxiliary variables that we used as predictors for the imputation.

By comparing rows 4 and 6, i.e. the rows with NAs, you can see the effect of multiple imputation. While $$Y.0$$ contains missings, $$Y.1$$–$$Y.5$$ are filled with imputed values. Note that the imputed values differ between the five imputed data sets (row 4: -9, -5, 4, -6, 1; row 6: -3, 9, 0, 1, -3).

You might say: “Isn’t that bad?! It seems like the imputed values are very different to each other.

Yes, they are different – But that is exactly what we expect (and what we want) when we do multiple imputation. The difference between the imputed values reflects the uncertainty of our imputation. When we calculate standard errors (i.e. confidence intervals) for our point estimates, we include this uncertainty into the calculation, which leads to more correct standard errors.

Predictive Mean Matching vs. Stochastic Regression Imputation

In a previous post, I discussed pros and cons of stochastic regression imputation. Regression imputation has many advantages, but I have also shown two serious drawbacks:

1. Stochastic regression imputation might lead to implausible values (e.g. negative incomes).
2. Stochastic regression imputation has problems with heteroscedastic data.

In my previous post, I raved about predictive mean matching, as this method is supposed to fix all the problems of regression imputation.

Of course I do not want to deprive you of the proof! Therefore, I’m going to use exactly the same data as in the regression imputation example – but this time I’m also imputing via predictive mean matching.

Plausibility of Imputed Values

Let’s have a closer look at drawback 1 – the plausibility of imputed values. Consider the following example data:

# Income data   set.seed(91919) # Set seed N <- 1000 # Sample size   income <- round(rnorm(N, 0, 500)) # Create some synthetic income data income[income < 0] <- income[income < 0] * (- 1)   x1 <- income + rnorm(N, 1000, 1500) # Auxiliary variables x2 <- income + rnorm(N, - 5000, 2000)   income[rbinom(N, 1, 0.1) == 1] <- NA # Create 10% missingness in income   data_inc_miss <- data.frame(income, x1, x2)

We impute once via stochastic regression imputation…

imp_inc_sri <- mice(data_inc_miss, method = "norm.nob", m = 1) data_inc_sri <- complete(imp_inc_sri)

… and once via predictive mean matching (for the sake of simplicity, both times via single imputation).

imp_inc_pmm <- mice(data_inc_miss, method = "pmm", m = 1) data_inc_pmm <- complete(imp_inc_pmm)

Comparing the plausibility of both imputations, we can see that stochastic regression imputation leads to 10 implausible values (i.e. incomes below 0):

data_inc_sri$income[data_inc_sri$income < 0] # 10 values below 0 (implausible) #  -57.8859452 -58.9457388 -104.1099599 -147.9968026 -70.9192935 -280.2621111 #  -0.5694554 -91.3508690 -76.8628876 -468.5756297

Predictive mean matching, on the other hand, does not lead to implausible values:

data_inc_pmm$income[data_inc_pmm$income < 0] # No values below 0

In respect to the plausibility of imputed values, predictive mean matching clearly outperforms stochastic regression imputation.

Reason: Predictive mean matching only imputes values that were actually observed for other units. The range of imputed values therefore always lies between the minimum and the maximum of the observed values.

Imputation of Heteroscedastic Data

How well does predictive mean matching work for heteroscedastic data (i.e. drawback 2)? Let’s check it with the following example data:

# Heteroscedastic data   set.seed(654654) # Set seed N <- 1:5000 # Sample size   a <- 0 b <- 1 sigma2 <- N^2 eps <- rnorm(N, mean = 0, sd = sqrt(sigma2))   y <- a + b * N + eps # Heteroscedastic variable x <- 30 * N + rnorm(N[length(N)], 1000, 200) # Correlated variable   y[rbinom(N[length(N)], 1, 0.3) == 1] <- NA # 30% missings   data_het_miss <- data.frame(y, x)

As before, we first impute by stochastic regression imputation…

imp_het_sri <- mice(data_het_miss, method = "norm.nob", m = 1) data_het_sri <- complete(imp_het_sri)

… and then by predictive mean matching:

imp_het_pmm <- mice(data_het_miss, method = "pmm", m = 1) data_het_pmm <- complete(imp_het_pmm)

A good way for evaluating the two imputation methods for heteroscedastic data is a simple correlation plot:

par(mfrow = c(1, 2)) # Both plots in one graphic   plot(x[!is.na(data_het_sri$y)], # Plot of observed values data_het_sri$y[!is.na(data_het_sri$y)], main = "", xlab = "X", ylab = "Y") points(x[is.na(y)], data_het_sri$y[is.na(y)], # Plot of missing values col = "red") title("Stochastic Regression Imputation", # Title of plot line = 0.5) abline(lm(y ~ x, data_het_sri), # Regression line col = "#1b98e0", lwd = 2.5) legend("topleft", # Legend c("Observed Values", "Imputed Values", "Regression Y ~ X"), pch = c(1, 1, NA), lty = c(NA, NA, 1), col = c("black", "red", "#1b98e0"))   plot(x[!is.na(data_het_pmm$y)], # Plot of observed values data_het_pmm$y[!is.na(data_het_pmm$y)], main = "", xlab = "X", ylab = "Y") points(x[is.na(y)], data_het_pmm$y[is.na(y)], # Plot of missing values col = "red") title("Predictive Mean Matching", # Title of plot line = 0.5) abline(lm(y ~ x, data_het_pmm), col = "#1b98e0", lwd = 2.5) legend("topleft", # Legend c("Observed Values", "Imputed Values", "Regression Y ~ X"), pch = c(1, 1, NA), lty = c(NA, NA, 1), col = c("black", "red", "#1b98e0"))   mtext("Imputation of Heteroscedastic Data", # Main title of plot side = 3, line = - 1.5, outer = TRUE, cex = 2) Graphic 1: Stochastic Regression Imputation vs. Predictive Mean Matching for Heteroscedastic Data

The plot makes it obvious:

Stochastic regression imputation overestimates the variance of smaller $$X$$ and underestimates the variance of larger $$X$$ – not good.

In contrast, predictive mean matching reflects the structure of our observed values almost perfectly – very good.

In other words: Even though stochastic regression imputation is much more popular among practitioners, there are strong arguments for using predictive mean matching instead!

PMM via Multiple Imputation in Stata (Video)

So far, I have only shown you how to apply predictive mean matching in R. However, the imputation method is implemented in many different software packages, such as SAS or Stata. Personally, I’m not an expert for these software packages, but there are many good instructions out there.

For instance, one tutorial I can recommend was published by the Stata YouTube channel. The speaker, Chuck Huber, explains step by step how to handle missing data by predictive mean matching in Stata.

I Would Like to Hear From You!

In this article, I have shown you why I’m such a big fan of predictive mean matching.

Are you using predictive mean matching or do you prefer other imputation methods? Have I missed any pros or cons in my article?

Let me know in the comments (questions are also very welcome)!

References

van Buuren, S. (2012). Flexible Imputation of Missing Data. Chapman & Hall/CRC Interdisciplinary Statistics.

van Buuren, S., and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software. 45(3).

Vink, G., Frank, L. E., Pannekoek, J., and van Buuren, S. (2014). Predictive mean matching imputation of semicontinuous variables. Statistica Neerlandica. 68(1). 61-90.

Yuan, C. Y. (2005). Multiple Imputation for Missing Data: Concepts and New Development. SAS Institute Inc., Rockville, MD.

Appendix

The header graphic of this page shows a correlation plot of two highly correlated variables. The black line indicates the regression slope. The black stars indicate missing values, which are replaced by a close observed value.

set.seed(123) # Seed   N <- 50000 # Sample size   x <- rnorm(N) # X variable y <- x + rnorm(N, 0, 1.5) # Y variable   par(bg = "#353436") # Background color par(mar = c(0, 0, 0, 0)) # Remove space around plot   plot(x, y, # Plot observed cases col = "#1b98e0", pch = 16, cex = 0.0001)   abline(lm(y ~ x), # Plot regression slope lwd = 5, col = "#353436")   points(x[1:30], y[1:30], # Plot missing cases col = "#353436", pch = 8, lwd = 2, cex = 3)