Be careful: Flawed imputations can heavily reduce the quality of your data! Are you aware that a poor missing value imputation might destroy the correlations between your variables?
If it’s done right, regression imputation can be a good solution for this problem. I‘ll show you all the tricks you need to know in the following article.
But first of all, what is regression imputation?
What are the advantages and drawbacks of regression imputation? How can I apply regression imputation? Are there even better alternatives?
Click on the buttons below to get an answer to your specific question:
Deterministic vs. Stochastic Regression Imputation
Regression imputation consists of two subsequent steps:
- A linear regression model is estimated on the basis of observed values in the target variable Y and some explanatory variables X.
- The model is used to predict values for the missing cases in Y. Missing values of Y are then replaced on the basis of these predictions.
Relationships of X and Y (i.e. correlations, regression coefficients etc.) are preserved, since imputed values are based on regression models. This is a big advantage over simpler imputation methods such as mean imputation or zero substitution.
Regression imputation is classified into two different versions: deterministic and stochastic regression imputation.
Deterministic regression imputation replaces missing values with the exact prediction of the regression model. Random variation (i.e. an error term) around the regression slope is not considered. Imputed values are therefore often too precise and lead to an overestimation of the correlation between X and Y.
Stochastic regression imputation was developed in order to solve this issue of deterministic regression imputation. Stochastic regression imputation adds a random error term to the predicted value and is therefore able to reproduce the correlation of X and Y more appropriately.
In the following example, I’ll show you the differences between the two approaches of deterministic and stochastic regression imputation in more detail.
Regression Imputation in R (Example)
Before we can start with our regression imputation example, we need some data with missing values. So let’s create some synthetic example data with R:
# Example data set.seed(9090909) # Create reproducible data N <- 2000 # Sample size y <- round(rnorm(N, 20, 10)) # Dependent variable x1 <- round(0.2 * y + rnorm(N, 5), 2) # Predictor 1 x2 <- round(y * rpois(N, 5)) # Predictor 2 x3 <- round(0.01 * y + runif(N, 0, 10)) # Predictor 3 data <- data.frame(y, x1, x2, x3) data$y[rbinom(N, 1, 0.2) == 1] <- NA # Aproximately 10% missings in y head(data) # First 6 rows of our example data
We have created a target variable Y and three auxiliary variables X1, X2, and X3. The variable Y has some missing values, displayed as NA in rows 5 and 6. The first rows of our data set look as follows.
Table 1: First 6 Rows of Our Synthetic Example Data in R
A very recommendable R package for regression imputation (and also for other imputation methods) is the mice package. Install and load the package in R.
# Install and load the R package mice install.packages("mice") # Needs to be done only once library("mice") # Load package
Now, let’s apply a deterministic regression imputation to our example data. The function mice() is used to impute the data; method = “norm.predict” is the specification for deterministic regression imputation; and m = 1 specifies the number of imputed data sets (in our case single imputation).
# Deterministic regression imputation imp <- mice(data, method = "norm.predict", m = 1) # Impute data data_det <- complete(imp) # Store data
We can use almost the same code for stochastic regression imputation. We only have to change method = “norm.predict” to method = “norm.nob”.
# Stochastic regression imputation imp <- mice(data, method = "norm.nob", m = 1) # Impute data data_sto <- complete(imp) # Store data
Let’s graphically check how well our missing data imputations worked:
# Graphical comparison of deterministic and stochastic regression imputation par(mfrow = c(1, 2)) # Both plots in one graphic # Deterministic regression imputation plot(x1[!is.na(data$y)], data_det$y[!is.na(data$y)], # Plot of observed values xlim = c(0, 20), ylim = c(- 15, 60), main = "Deterministic Regression Imputation", xlab = "X1", ylab = "Y") points(x1[is.na(data$y)], data_det$y[is.na(data$y)], # Plot of missing values col = "red") abline(lm(y ~ x1, data_det), col = "#1b98e0", lwd = 1.5) # Regression slope legend("topleft", # Legend c("Observed Values", "Imputed Values", "Regression Y ~ X1"), pch = c(1, 1, NA), lty = c(NA, NA, 1), col = c("black", "red", "#1b98e0")) # Stochastic regression imputation plot(x1[!is.na(data$y)], data_sto$y[!is.na(data$y)], # Plot of observed values xlim = c(0, 20), ylim = c(- 15, 60), main = "Stochastic Regression Imputation", xlab = "X1", ylab = "Y") points(x1[is.na(data$y)], data_sto$y[is.na(data$y)], # Plot of missing values col = "red") abline(lm(y ~ x1, data_det), col = "#1b98e0", lwd = 1.5) # Regression slope legend("topleft", # Legend c("Observed Values", "Imputed Values", "Regression Y ~ X1"), pch = c(1, 1, NA), lty = c(NA, NA, 1), col = c("black", "red", "#1b98e0"))
Graphic 1: Imputed Values of Deterministic & Stochastic Regression Imputation (Correlation Plots of X1 & Y)
Graphic 1 visualizes the main drawback of deterministic regression imputation: The imputed values (red bubbles) are way too close to the regression slope (blue line)!
In contrast, the imputation by stochastic regression worked much better. The distribution of imputed values is similar compared to the observed values and, hence, much more realistic.
But what does that actually mean for the analysis of our data? Have we introduced bias with the deterministic regression imputation? Let’s check some numbers:
# Correlation between X1 and Y round(cor(y, x1), 3) # True correlation ##  0.897 round(cor(data_det$y, data_det$x1), 3) # Correlation after deterministic regression imputation ##  0.912 round(cor(data_sto$y, data_sto$x1), 3) # Correlation after stochastic regression imputation ##  0.894
The correlation coefficients between X1 and Y for our data without missings; after deterministic regression imputation; and after stochastic regression imputation confirm what we already saw graphically. While stochastic regression imputation reproduces the true correlation almost prefectly (true = 0.897; stochastic = 0.894), deterministic regression imputation overestimates the correlation (true = 0.897; deterministic = 0.912).
You might say now: OK, but does such a small difference really matter?
Answer: Yes it does. First of all, you should always avoid bias, no matter how small it is. Second, you don’t know how much bias you introduce when applying a flawed statistical method such as deterministic regression imputation. In your specific database, the bias might be much larger!
Video: Deterministic vs. Stochastic Imputation
Do you still have problems to understand the difference between deterministic and stochastic regression imputation? Don’t worry and watch the following video of Iris Eekhout’s YoutTube channel.
The video is only 17 seconds long. However, it shows perfectly the difference between deterministic regression imputation (red dots at the beginning of the video) and stochastic regression imputation (red dots at the end of the video).
Are There Even Better Alternatives? (The Drawbacks of Stochastic Regression Imputation)
At this point, we have learned that stochastic regression imputation outperforms an imputation by deterministic regression. However, the field of statistics is constantly evolving and especially one imputation method has gathered a lot of attention in recent research literature.
The method I’m talking about is called predictive mean matching (click to learn more).
But how can stochastic regression imputation be improved? Are there any drawbacks?! You guessed it, yes there are!
Drawback 1: Stochastic regression imputation may lead to implausible values. Variables are often restricted to a certain range of values (e.g. income should always be positive). Regression imputation is not able to impute according to such restrictions.
Consider the following example of implausible imputed values:
# Income data set.seed(91919) # Set seed N <- 1000 # Sample size income <- round(rnorm(N, 0, 500)) # Create some synthetic income data income[income < 0] <- income[income < 0] * (- 1) x1 <- income + rnorm(N, 1000, 1500) # Auxiliary variables x2 <- income + rnorm(N, - 5000, 2000) income[rbinom(N, 1, 0.1) == 1] <- NA # Create 10% missingness in income data_inc_miss <- data.frame(income, x1, x2) imp_inc <- mice(data_inc_miss, method = "norm.nob", m = 1) # Impute data data_inc <- complete(imp_inc) data_inc$income[data_inc$income < 0] # 10 values below 0 (unplausible) ##  -57.8859452 -58.9457388 -104.1099599 -147.9968026 -70.9192935 -280.2621111 ##  -0.5694554 -91.3508690 -76.8628876 -468.5756297
After imputing our income variable, we end up with 10 implausible values. Clearly a result we wouldn’t want to see in practice.
Drawback 2: Stochastic regression imputation leads to poor results when data is heteroscedastic. The imputation method assumes that the random error has on average the same size for all parts of the distribution, often resulting in too small or too large random error terms for the imputed values.
Consider the following example of heteroscedastic data:
# Heteroscedastic data set.seed(654654) # Set seed N <- 1:5000 # Sample size a <- 0 b <- 1 sigma2 <- N^2 eps <- rnorm(N, mean = 0, sd = sqrt(sigma2)) y <- a + b * N + eps # Heteroscedastic variable x <- 30 * N + rnorm(N[length(N)], 1000, 200) # Correlated variable y[rbinom(N[length(N)], 1, 0.3) == 1] <- NA # 30% missings data_het_miss <- data.frame(y, x) imp_het <- mice(data_het_miss, method = "norm.nob", m = 1) # Impute data data_het <- complete(imp_het) plot(x[!is.na(data_het$y)], data_het$y[!is.na(data_het$y)], # Plot of observed values main = "Stochastic Regression Imputation (Heteroscedastic Data)", xlab = "X", ylab = "Y") points(x[is.na(y)], data_het$y[is.na(y)], # Plot of missing values col = "red") abline(lm(y ~ x, data_het), col = "#1b98e0", lwd = 1.5) legend("topleft", # Legend c("Observed Values", "Imputed Values", "Regression Y ~ X"), pch = c(1, 1, NA), lty = c(NA, NA, 1), col = c("black", "red", "#1b98e0"))
Graphic 2: Stochastic Regression Imputation of Heteroscedastic Data
Graphic 2 reveals the second major drawback of stochastic regression imputation. The observed values (black bubbles) are very heteroscedastic, i.e. the random error is much smaller for smaller values of X an Y and much larger for larger values.
This heteroscedasticity, however, is not reflected by the imputation (red bubbles). The imputed values are equally spread around the regression slope (blue line), no matter how large X and Y are. Another result we wouldn’t want to see in practice.
Find out here whether predictive mean matching is capable of solving the issues of stochastic regression imputation.
It’s Your Turn!
I showed you all the important things I know about regression imputation.
Now, I’d love to hear from your experiences! Have you already used regression imputation in practice? Have you experienced any of the drawbacks I mentioned above? Do you think about switching to newer methods such as predictive mean matching?
Let me know about your questions and experiences in the comments!
The header graphic of this page shows a heteroscedastic relation between two variables.
Do you want to know how to create such a plot? Here’s the R code.
# Create some data n <- 1:3000 a <- 0 b <- 1 sigma2 <- n^1.9 eps <- rnorm(n, mean = 0, sd = sqrt(sigma2)) y <- - 1 * (a + b * n + eps) # Set colors palette <- colorRampPalette(colors = c("#1b98e0", "#FFFFFF")) cols <- palette(3000) par(bg = "#353436") # Produce plot plot(n, y, pch = 18, col = cols) # Estimate regression line mod <- lm(y ~ n) res <- residuals(mod) abline(coef(mod), col = "red", lwd = 3)