5. Missing Data
5.1. Generate data
[1]:
set.seed(37)
suppressMessages({
library('missForest')
})
getData <- function(N=1000) {
x1 <- rnorm(N, mean=0, sd=1)
x2 <- rnorm(N, mean=0, sd=1)
y <- 5 + 3.2 * x1 - 4.2 * x2 + rnorm(N, mean=0, sd=1)
df <- data.frame(x1=x1, x2=x2, y=y)
return(df)
}
df <- getData()
df.mis <- prodNA(df, noNA=0.1)
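The call above removes 10% of all cells completely at random. A minimal base-R sketch of the same idea follows; `make_mcar`, `toy`, and `toy.mis` are hypothetical names used only for illustration, and this is not the package's own implementation.

```r
# Hypothetical helper: set a fraction `frac` of all cells to NA, chosen at random.
make_mcar <- function(data, frac = 0.1) {
  n_cells <- nrow(data) * ncol(data)
  # Logical mask with the same shape as the data frame
  mask <- matrix(FALSE, nrow = nrow(data), ncol = ncol(data))
  mask[sample(n_cells, floor(frac * n_cells))] <- TRUE
  data[mask] <- NA
  data
}

set.seed(1)
toy <- data.frame(a = rnorm(200), b = rnorm(200), c = rnorm(200))
toy.mis <- make_mcar(toy, frac = 0.1)

# The overall share of missing cells matches frac;
# the per-column shares scatter around it.
print(mean(is.na(toy.mis)))
print(colMeans(is.na(toy.mis)))
```

Because the cells are drawn across the whole table, the per-variable shares need not each equal 10%, only their average does.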
5.2. Visualize missingness
[2]:
suppressMessages({
library('mice')
})
options(repr.plot.width=5, repr.plot.height=5)
p <- md.pattern(df.mis)
[3]:
suppressMessages({
library('VIM')
})
options(repr.plot.width=6, repr.plot.height=4)
p <- aggr(
df.mis,
col=c('navyblue','yellow'),
numbers=TRUE,
sortVars=TRUE,
labels=names(df.mis),
cex.axis=.7,
gap=3,
ylab=c('Missing data', 'Pattern')
)
 Variables sorted by number of missings:
 Variable Count
       x1 0.108
        y 0.097
       x2 0.095
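The `Count` column in the output above holds proportions of missing values, not raw counts. As a rough cross-check that needs no plotting package, the same kind of summary can be sketched in base R; the `toy.mis` data frame below is made up for illustration.

```r
# Toy data with hand-placed gaps (illustrative only)
toy.mis <- data.frame(
  x1 = c(1.2, NA, 0.3, NA, -0.5, 0.8),
  x2 = c(NA, 0.4, 1.1, 0.9, -1.3, 0.2),
  y  = c(5.1, 4.0, NA, 2.2, 9.9, 6.3)
)

# Share of missing values per variable, largest first
miss.share <- sort(colMeans(is.na(toy.mis)), decreasing = TRUE)
print(miss.share)

# Missingness pattern per row: 1 = observed, 0 = missing
pattern <- apply(!is.na(toy.mis), 1, function(r) paste(as.integer(r), collapse = ""))
print(table(pattern))
```

The pattern strings use the same 1 = observed, 0 = missing coding as `md.pattern`.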
5.3. MICE
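mice (Multivariate Imputation by Chained Equations) generates m=5 imputed datasets here, using predictive mean matching (pmm) for every variable and up to 50 iterations per imputation. The model is then fit to each imputed dataset and the fits are pooled.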
[4]:
df.imp <- mice(df.mis, m=5, maxit=50, method='pmm', seed=500, print=FALSE)
print(summary(df.imp))
Class: mids
Number of multiple imputations: 5
Imputation methods:
   x1    x2     y
"pmm" "pmm" "pmm"
PredictorMatrix:
   x1 x2 y
x1  0  1 1
x2  1  0 1
y   1  1 0
NULL
[5]:
df.model <- with(data=df.imp, expr=lm(y ~ x1 + x2))
df.combine <- pool(df.model)
print(summary(df.combine))
             estimate  std.error  statistic       df p.value
(Intercept)  4.978536 0.03615236  137.70984 59.17146       0
x1           3.224436 0.03845041   83.85961 28.21157       0
x2          -4.201357 0.04062838 -103.40940 21.56991       0
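The pooled estimates and standard errors above come from combining the five fits with Rubin's rules. The arithmetic can be sketched by hand as below; the `est` and `se` vectors are hypothetical numbers, not values taken from this analysis, and the degrees-of-freedom adjustment that `pool` also reports is not reproduced.

```r
# Hypothetical per-imputation results for one coefficient (m = 5)
est <- c(2.0, 2.1, 1.9, 2.2, 1.8)        # point estimates
se  <- c(0.10, 0.11, 0.09, 0.10, 0.10)   # standard errors

m <- length(est)
q.bar <- mean(est)              # pooled estimate: average of the m estimates
w <- mean(se^2)                 # within-imputation variance
b <- var(est)                   # between-imputation variance
t.var <- w + (1 + 1 / m) * b    # total variance
pooled.se <- sqrt(t.var)

print(c(estimate = q.bar, std.error = pooled.se))
```

The pooled standard error is larger than the average per-fit standard error whenever the estimates differ across imputations, which is how the uncertainty due to missing data enters the result.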
5.4. Amelia
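Amelia generates m=5 imputed datasets, run in parallel here. The regression in this section is fit to the first imputed dataset only, so its standard errors do not account for between-imputation variability.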
[6]:
suppressMessages({
library('Amelia')
})
# amelia() has no print argument; console output is controlled by p2s (default 1, 0 = silent)
df.imp <- amelia(df.mis, m=5, parallel='multicore')
-- Imputation 1 --
1 2 3
-- Imputation 2 --
1 2 3
-- Imputation 3 --
1 2 3
-- Imputation 4 --
1 2 3
-- Imputation 5 --
1 2 3
[7]:
m <- lm(y ~ x1 + x2, data=df.imp$imputations[[1]])
print(summary(m))
Call:
lm(formula = y ~ x1 + x2, data = df.imp$imputations[[1]])
Residuals:
    Min      1Q  Median      3Q     Max
-2.7700 -0.6489  0.0266  0.6592  3.4818
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.01673    0.03138   159.9   <2e-16 ***
x1           3.23168    0.03058   105.7   <2e-16 ***
x2          -4.19395    0.03082  -136.1   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9918 on 997 degrees of freedom
Multiple R-squared: 0.9691, Adjusted R-squared: 0.969
F-statistic: 1.563e+04 on 2 and 997 DF, p-value: < 2.2e-16
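The fit above uses a single imputed dataset. One way to use all of them is to fit the model to every element of the list of completed data frames and then compare or combine the coefficients. The sketch below builds a stand-in list named `imputations` with a crude random hot-deck fill; that fill is an assumption made only to keep the example self-contained and is not Amelia's algorithm.

```r
set.seed(2)
n <- 200
full <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
full$y <- 5 + 3.2 * full$x1 - 4.2 * full$x2 + rnorm(n)

# Stand-in for a list of completed datasets: gaps in x1 are filled with
# random draws from the observed x1 values (crude hot-deck, illustration only)
gaps <- sample(n, 20)
imputations <- lapply(1:5, function(i) {
  d <- full
  d$x1[gaps] <- sample(full$x1[-gaps], length(gaps), replace = TRUE)
  d
})

# Fit the same model to every completed dataset
fits <- lapply(imputations, function(d) lm(y ~ x1 + x2, data = d))

# One column of coefficients per imputation, then their average
coefs <- sapply(fits, coef)
print(round(coefs, 3))
print(rowMeans(coefs))
```

The spread of each row across columns gives a feel for how much the imputation step alone moves the estimates.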
5.5. missForest
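missForest imputes missing values with random forests and returns a single completed dataset in `ximp`. Because only one completed dataset is produced, the standard errors of the regression fit to it do not reflect imputation uncertainty.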
[8]:
df.imp <- missForest(df.mis, verbose=FALSE)
missForest iteration 1 in progress...done!
missForest iteration 2 in progress...done!
missForest iteration 3 in progress...done!
missForest iteration 4 in progress...done!
missForest iteration 5 in progress...done!
[9]:
m <- lm(y ~ x1 + x2, data=df.imp$ximp)
print(summary(m))
Call:
lm(formula = y ~ x1 + x2, data = df.imp$ximp)
Residuals:
    Min      1Q  Median      3Q     Max
-3.2084 -0.5477  0.0094  0.5762  3.4938
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.99766    0.02881   173.5   <2e-16 ***
x1           3.24033    0.02854   113.5   <2e-16 ***
x2          -4.19705    0.02872  -146.1   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9104 on 997 degrees of freedom
Multiple R-squared: 0.9733, Adjusted R-squared: 0.9733
F-statistic: 1.82e+04 on 2 and 997 DF, p-value: < 2.2e-16
5.6. Hmisc
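aregImpute from Hmisc draws 5 imputations for each missing value (n.impute=5). fit.mult.impute then fits the model to each completed dataset and combines the fits.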
[10]:
suppressMessages({
library('Hmisc')
})
df.imp <- aregImpute(~ x1 + x2 + y, data=df.mis, n.impute=5)
Iteration 8
[11]:
print(fit.mult.impute(y ~ x1 + x2, glm, df.imp, data=df.mis))
Variance Inflation Factors Due to Imputation:
(Intercept)          x1          x2
       1.32        1.40        1.03
Rate of Missing Information:
(Intercept)          x1          x2
       0.24        0.29        0.03
d.f. for t-distribution for Tests of Single Coefficients:
(Intercept)          x1          x2
      68.07       48.58     6034.09
The following fit components were averaged over the 5 model fits:
fitted.values linear.predictors
Call: fit.mult.impute(formula = y ~ x1 + x2, fitter = glm, xtrans = df.imp,
data = df.mis)
Coefficients:
(Intercept) x1 x2
4.986 3.224 -4.200
Degrees of Freedom: 999 Total (i.e. Null); 997 Residual
Null Deviance: 31230
Residual Deviance: 1058 AIC: 2902
5.7. mi
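mi runs multiple imputation with chained regression models. The summary compares imputed and observed values for each variable; the values appear to be on mi's standardized scale rather than the original scale, which is why every observed mean is 0.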
[12]:
suppressMessages({
library('mi')
})
df.imp <- mi(df.mis, seed=37)
[13]:
summary(df.imp)
$x1
$x1$is_missing
missing
FALSE TRUE
892 108
$x1$imputed
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.31092 -0.29634 0.05360 0.01444 0.32347 1.03555
$x1$observed
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.370380 -0.340412 -0.007915 0.000000 0.346205 1.855986
$x2
$x2$is_missing
missing
FALSE TRUE
905 95
$x2$imputed
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.16181 -0.38240 -0.09256 -0.06233 0.25807 1.45394
$x2$observed
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.63933 -0.30576 0.02156 0.00000 0.32167 1.53286
$y
$y$is_missing
missing
FALSE TRUE
903 97
$y$imputed
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.57526 -0.27385 0.07847 0.05184 0.37816 1.17524
$y$observed
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.5729109 -0.3407619 0.0007829 0.0000000 0.3281380 1.5120664
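The observed values in the summary above have mean 0 and a range of roughly ±1.5, which is consistent with centering each variable and dividing by two standard deviations. Assuming that is the transformation in use, the scale and its inverse can be sketched in base R as follows; `x`, `z`, and `x.back` are illustrative names.

```r
set.seed(3)
x <- rnorm(1000, mean = 5, sd = 3)

# Assumed standardization: center, then divide by two standard deviations
z <- (x - mean(x)) / (2 * sd(x))
print(summary(z))

# Inverse transformation back to the original scale
x.back <- z * 2 * sd(x) + mean(x)
print(all.equal(x, x.back))
```

On this scale a standard normal variable spans about ±1.5, so imputed values should be compared with the observed ones on the same scale before being interpreted in original units.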