I: Ad hoc methods and mice

This is the first vignette in the Winnipeg series.

This vignette will give you an introduction to the R package mice, an open-source tool for flexible imputation of incomplete data, developed by Stef van Buuren and Karin Groothuis-Oudshoorn (2011). Over the last decade, mice has become an important piece of imputation software, offering a very flexible environment for dealing with incomplete data. Moreover, the ability to integrate mice with other packages in R, and vice versa, offers many options for applied researchers.

The aim of this introduction is to enhance your understanding of multiple imputation in general. You will learn how to multiply impute simple datasets and how to obtain the imputed data for further analysis. The main objective is to increase your knowledge and understanding of applications of multiple imputation.

No previous experience with R is required.

Working with mice

1. Open R and load the packages mice and lattice

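library(mice)     # multiple imputation; stops with an error if not installed
library(lattice)  # provides densityplot()
set.seed(123)     # make the random parts of the imputations reproducible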

If mice is not yet installed, run:

install.packages("mice")

2. Inspect the incomplete data

The mice package contains several datasets. Once the package is loaded, these datasets can be used. Have a look at the nhanes dataset (Schafer, 1997, Table 6.14) by typing:

nhanes

##    age  bmi hyp chl
## 1    1   NA  NA  NA
## 2    2 22.7   1 187
## 3    1   NA   1 187
## 4    3   NA  NA  NA
## 5    1 20.4   1 113
## 6    3   NA  NA 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2   NA  NA  NA
## 11   1   NA  NA  NA
## 12   2   NA  NA  NA
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1  NA
## 16   1   NA  NA  NA
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2  NA
## 21   1   NA  NA  NA
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1  NA
## 25   2 27.4   1 186

The nhanes dataset is a small data set with non-monotone missing values. It contains 25 observations on four variables: age group, body mass index, hypertension and cholesterol (mg/dL).

To learn more about the data, use one of the two following help commands:

help(nhanes)
?nhanes

3. Get an overview of the data by the summary() command:

summary(nhanes)

##       age            bmi            hyp            chl     
##  Min.   :1.00   Min.   :20.4   Min.   :1.00   Min.   :113  
##  1st Qu.:1.00   1st Qu.:22.6   1st Qu.:1.00   1st Qu.:185  
##  Median :2.00   Median :26.8   Median :1.00   Median :187  
##  Mean   :1.76   Mean   :26.6   Mean   :1.24   Mean   :191  
##  3rd Qu.:2.00   3rd Qu.:28.9   3rd Qu.:1.00   3rd Qu.:212  
##  Max.   :3.00   Max.   :35.3   Max.   :2.00   Max.   :284  
##                 NA's   :9      NA's   :8      NA's   :10

4. Inspect the missing data pattern

Check the missingness pattern for the nhanes dataset:

md.pattern(nhanes)

##    age hyp bmi chl   
## 13   1   1   1   1  0
##  1   1   1   0   1  1
##  3   1   1   1   0  1
##  1   1   0   0   1  2
##  7   1   0   0   0  3
##      0   8   9  10 27

The missingness pattern shows that there are 27 missing values in total: 10 for chl, 9 for bmi and 8 for hyp. Moreover, there are thirteen completely observed rows, four rows with 1 missing value, one row with 2 missing values and seven rows with 3 missing values. Looking at the missing data pattern is always useful (but may be difficult for datasets with many variables). It can give you an indication of how much information is missing and how the missingness is distributed.
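The counts read from the pattern can also be checked with base R. This is a minimal sketch that only assumes the nhanes data from mice are available; it is a cross-check, not a replacement for md.pattern().

```r
library(mice)  # only needed here for the nhanes data

# Missing values per variable (column)
colSums(is.na(nhanes))

# Number of rows, grouped by how many values each row is missing
table(rowSums(is.na(nhanes)))

# Total number of missing cells
sum(is.na(nhanes))
```

The column counts correspond to the bottom row of the md.pattern() output, and the row counts to its first column.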

Ad Hoc imputation methods

5. Form a regression model where age is predicted from bmi.

fit <- with(nhanes, lm(age ~ bmi))
summary(fit)

## 
## Call:
## lm(formula = age ~ bmi)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.266 -0.561 -0.122  0.466  1.234 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   3.7672     1.3194    2.86    0.013 *
## bmi          -0.0736     0.0491   -1.50    0.156  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.802 on 14 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.138,  Adjusted R-squared:  0.0767 
## F-statistic: 2.25 on 1 and 14 DF,  p-value: 0.156

6. Impute the missing data in the nhanes dataset with mean imputation.

imp <- mice(nhanes, method = "mean", m = 1, maxit = 1)

## 
##  iter imp variable
##   1   1  bmi  hyp  chl

The imputations are now done. As you can see, the algorithm ran for 1 iteration (maxit = 1) and presented us with only 1 imputation (m = 1) for each missing datum. This is correct, as substituting each missing datum multiple times with the observed data mean would not make any sense (the inference would be equal, no matter which imputed dataset we would analyze). Likewise, more iterations would be computationally inefficient as the observed data mean does not change based on our imputations. We named the imputed object imp following the convention used in mice, but you can name it anything you like.
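To see what mean imputation amounts to, it can be mimicked by hand for a single variable. The sketch below is an illustration in base R, not the code mice runs internally; the names bmi_mean and bmi_filled are made up for this example.

```r
library(mice)  # only needed here for the nhanes data

bmi <- nhanes$bmi
bmi_mean <- mean(bmi, na.rm = TRUE)   # observed-data mean

# Replace every missing bmi by the observed mean
bmi_filled <- ifelse(is.na(bmi), bmi_mean, bmi)

mean(bmi_filled)        # equal to the observed mean
sd(bmi, na.rm = TRUE)   # spread of the observed values
sd(bmi_filled)          # smaller: the filled-in values add no spread
```

Because every missing value receives the same number, a second imputed dataset would be identical to the first, which is why m = 1 suffices here.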

7. Explore the imputed data with the complete() function. What do you think the variable means are? What happened to the regression equation after imputation?

complete(imp)

##    age  bmi  hyp chl
## 1    1 26.6 1.24 191
## 2    2 22.7 1.00 187
## 3    1 26.6 1.00 187
## 4    3 26.6 1.24 191
## 5    1 20.4 1.00 113
## 6    3 26.6 1.24 184
## 7    1 22.5 1.00 118
## 8    1 30.1 1.00 187
## 9    2 22.0 1.00 238
## 10   2 26.6 1.24 191
## 11   1 26.6 1.24 191
## 12   2 26.6 1.24 191
## 13   3 21.7 1.00 206
## 14   2 28.7 2.00 204
## 15   1 29.6 1.00 191
## 16   1 26.6 1.24 191
## 17   3 27.2 2.00 284
## 18   2 26.3 2.00 199
## 19   1 35.3 1.00 218
## 20   3 25.5 2.00 191
## 21   1 26.6 1.24 191
## 22   1 33.2 1.00 229
## 23   1 27.5 1.00 131
## 24   3 24.9 1.00 191
## 25   2 27.4 1.00 186

We see the repetitive numbers 26.5625 for bmi, 1.2352941 for hyp, and 191.4 for chl (shown rounded in the output above). These can be confirmed as the means of the respective variables (columns):

colMeans(nhanes, na.rm = TRUE)

##    age    bmi    hyp    chl 
##   1.76  26.56   1.24 191.40

We saw during the inspection of the missing data pattern that variable age has no missing values. Therefore nothing is imputed for age, because we would not want to alter the observed (and bona fide) values.

To inspect the regression model with the imputed data, run:

fit <- with(imp, lm(age ~ bmi))
summary(fit)

## 
##  ## summary of imputation 1 :
## 
## Call:
## lm(formula = age ~ bmi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2135 -0.7600 -0.0957  0.3973  1.2869 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   3.7147     1.3290    2.80     0.01 *
## bmi          -0.0736     0.0497   -1.48     0.15  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.811 on 23 degrees of freedom
## Multiple R-squared:  0.0872, Adjusted R-squared:  0.0475 
## F-statistic:  2.2 on 1 and 23 DF,  p-value: 0.152

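The slope for bmi did not change (-0.0736), and this is not surprising: the imputed values sit exactly at the mean of bmi, so we are just adding weight to the mean without adding information about the relation between age and bmi. Other parts of the model did change: the intercept moved from 3.7672 to 3.7147, R-squared dropped from 0.138 to 0.0872, and the residual degrees of freedom went from 14 to 23, as if all 25 cases had been observed. Variable bmi itself is somewhat normally distributed: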

densityplot(nhanes$bmi)
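The fit on the observed data and the fit after mean imputation can be placed side by side. This is a small sketch; the object names imp_mean, fit_obs and fit_mean are illustrative and not part of the original analysis.

```r
library(mice)

imp_mean <- mice(nhanes, method = "mean", m = 1, maxit = 1, printFlag = FALSE)

fit_obs  <- lm(age ~ bmi, data = nhanes)              # rows with missing bmi are dropped
fit_mean <- lm(age ~ bmi, data = complete(imp_mean))  # all 25 rows after mean imputation

# Slope for bmi in both fits
c(observed = unname(coef(fit_obs)["bmi"]),
  mean_imputed = unname(coef(fit_mean)["bmi"]))

# R-squared in both fits
c(observed = summary(fit_obs)$r.squared,
  mean_imputed = summary(fit_mean)$r.squared)

# Residual degrees of freedom in both fits
c(observed = df.residual(fit_obs), mean_imputed = df.residual(fit_mean))
```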

8. Impute the missing data in the nhanes dataset with regression imputation.

imp <- mice(nhanes, method = "norm.predict", m = 1, maxit = 1)

## 
##  iter imp variable
##   1   1  bmi  hyp  chl

The imputations are now done. This code imputes the missing values in the data set by the regression imputation method. With method = "norm.predict", mice first fits a regression model for each incomplete variable on its observed values, using the other variables as predictors, and then imputes the missing values with the values predicted by that model.
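The idea behind regression imputation can be sketched by hand for bmi. The example below is a simplification under stated assumptions: it uses only age as predictor (because age is complete), whereas mice uses the other variables as well, and the names obs, fit_bmi and bmi_reg are made up for this illustration.

```r
library(mice)  # only needed here for the nhanes data

obs <- !is.na(nhanes$bmi)   # TRUE where bmi is observed

# Fit the imputation model on the rows with observed bmi
fit_bmi <- lm(bmi ~ age, data = nhanes[obs, ])

# Fill in the missing bmi values with the model predictions
bmi_reg <- nhanes$bmi
bmi_reg[!obs] <- predict(fit_bmi, newdata = nhanes[!obs, ])

round(bmi_reg[!obs], 1)   # one predicted bmi per age group
```

All filled-in values lie exactly on the fitted regression line, so cases in the same age group receive the same bmi in this simplified version.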

9. Again, inspect the completed data and investigate the imputed data regression model.

complete(imp)

##    age  bmi  hyp chl
## 1    1 32.0 1.13 198
## 2    2 22.7 1.00 187
## 3    1 28.8 1.00 187
## 4    3 23.2 1.53 229
## 5    1 20.4 1.00 113
## 6    3 21.1 1.48 184
## 7    1 22.5 1.00 118
## 8    1 30.1 1.00 187
## 9    2 22.0 1.00 238
## 10   2 31.1 1.42 239
## 11   1 31.4 1.12 194
## 12   2 25.1 1.26 195
## 13   3 21.7 1.00 206
## 14   2 28.7 2.00 204
## 15   1 29.6 1.00 181
## 16   1 28.6 1.04 174
## 17   3 27.2 2.00 284
## 18   2 26.3 2.00 199
## 19   1 35.3 1.00 218
## 20   3 25.5 2.00 243
## 21   1 34.8 1.21 219
## 22   1 33.2 1.00 229
## 23   1 27.5 1.00 131
## 24   3 24.9 1.00 244
## 25   2 27.4 1.00 186

The repetitive numbering is gone. We have now obtained a more natural looking set of imputations: instead of filling in the same bmi for all ages, we now take age (as well as hyp and chl) into account when imputing bmi.

To inspect the regression model with the imputed data, run:

fit <- with(imp, lm(age ~ bmi))
summary(fit)

## 
##  ## summary of imputation 1 :
## 
## Call:
## lm(formula = age ~ bmi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4799 -0.4596  0.0109  0.5951  1.2354 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.6258     0.9245    5.00  4.6e-05 ***
## bmi          -0.1052     0.0335   -3.14   0.0046 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.71 on 23 degrees of freedom
## Multiple R-squared:   0.3,   Adjusted R-squared:  0.269 
## F-statistic: 9.84 on 1 and 23 DF,  p-value: 0.00462

It is clear that something has changed: the slope for bmi moved from -0.0736 to -0.1052 and its p-value dropped from 0.156 to 0.0046. In fact, we extrapolated (part of) the regression model for the observed data to the missing data in bmi. In other words, the relation (read: information) gets stronger and we appear to have obtained more observations.

10. Impute the missing data in the nhanes dataset with stochastic regression imputation.

imp <- mice(nhanes, method = "norm.nob", m = 1, maxit = 1)

## 
##  iter imp variable
##   1   1  bmi  hyp  chl

The imputations are now done. This code imputes the missing values in the data set by the stochastic regression imputation method. The function does not incorporate the variability of the regression weights, so it is not ‘proper’ in the sense of Rubin (1987). For small samples, the variability of the imputed data will be underestimated.

11. Again, inspect the completed data and investigate the imputed data regression model.

complete(imp)

##    age  bmi   hyp chl
## 1    1 35.5 1.119 208
## 2    2 22.7 1.000 187
## 3    1 28.0 1.000 187
## 4    3 28.4 1.189 210
## 5    1 20.4 1.000 113
## 6    3 16.3 0.923 184
## 7    1 22.5 1.000 118
## 8    1 30.1 1.000 187
## 9    2 22.0 1.000 238
## 10   2 28.7 1.499 252
## 11   1 30.2 1.317 163
## 12   2 27.2 1.363 194
## 13   3 21.7 1.000 206
## 14   2 28.7 2.000 204
## 15   1 29.6 1.000 209
## 16   1 31.1 1.532 181
## 17   3 27.2 2.000 284
## 18   2 26.3 2.000 199
## 19   1 35.3 1.000 218
## 20   3 25.5 2.000 216
## 21   1 27.7 1.927 167
## 22   1 33.2 1.000 229
## 23   1 27.5 1.000 131
## 24   3 24.9 1.000 247
## 25   2 27.4 1.000 186

We have once more obtained a more natural looking set of imputations, where instead of filling in the same bmi for all ages, we now take age (as well as hyp and chl) into account when imputing bmi. We also add a random error to allow for our imputations to be off the regression line.
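The added random error can be illustrated by hand for bmi. As before, this is a simplified sketch and not the internals of "norm.nob": it uses age as the only predictor, and the names fit_bmi, pred, noise and bmi_stoch are hypothetical.

```r
library(mice)  # only needed here for the nhanes data

set.seed(123)  # arbitrary seed, so the random noise is reproducible

obs <- !is.na(nhanes$bmi)
fit_bmi <- lm(bmi ~ age, data = nhanes[obs, ])

# Predicted value plus a random draw with the residual standard deviation
pred  <- predict(fit_bmi, newdata = nhanes[!obs, ])
noise <- rnorm(sum(!obs), mean = 0, sd = summary(fit_bmi)$sigma)

bmi_stoch <- nhanes$bmi
bmi_stoch[!obs] <- pred + noise

round(bmi_stoch[!obs], 1)   # imputations now scatter around the regression line
```

The regression coefficients themselves are still treated as fixed here, which is the limitation noted for this method.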

To inspect the regression model with the imputed data, run:

fit <- with(imp, lm(age ~ bmi))
summary(fit)

## 
##  ## summary of imputation 1 :
## 
## Call:
## lm(formula = age ~ bmi)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.371 -0.490 -0.017  0.383  1.353 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.2261     0.9182    4.60  0.00013 ***
## bmi          -0.0909     0.0334   -2.72  0.01218 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.738 on 23 degrees of freedom
## Multiple R-squared:  0.244,  Adjusted R-squared:  0.211 
## F-statistic: 7.41 on 1 and 23 DF,  p-value: 0.0122

12. Re-run the stochastic imputation model with seed 123 and verify that your results are the same as the ones below:

## 
##  ## summary of imputation 1 :
## 
## Call:
## lm(formula = age ~ bmi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2814 -0.6142 -0.0749  0.4688  1.3332 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   4.1252     1.1261    3.66   0.0013 **
## bmi          -0.0904     0.0426   -2.12   0.0449 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.776 on 23 degrees of freedom
## Multiple R-squared:  0.164,  Adjusted R-squared:  0.127 
## F-statistic:  4.5 on 1 and 23 DF,  p-value: 0.0449

The imputation procedure uses random sampling, and therefore, the results will be (perhaps slightly) different if we repeat the imputations. In order to get exactly the same result, you can use the seed argument:

imp <- mice(nhanes, method = "norm.nob", m = 1, maxit = 1, seed = 123)
fit <- with(imp, lm(age ~ bmi))
summary(fit)

where 123 is some arbitrary number that you can choose yourself. Re-running this command will always yield the same imputed values. The ability to replicate one’s findings exactly is considered essential in today’s reproducible science.
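The effect of the seed can be checked directly by running the same call twice. In this sketch the object names imp_a, imp_b and imp_c, and the second seed 456, are arbitrary choices for illustration.

```r
library(mice)

imp_a <- mice(nhanes, method = "norm.nob", m = 1, maxit = 1,
              seed = 123, printFlag = FALSE)
imp_b <- mice(nhanes, method = "norm.nob", m = 1, maxit = 1,
              seed = 123, printFlag = FALSE)

# Same call, same seed: compare the completed data sets
identical(complete(imp_a), complete(imp_b))

# Same call, different seed: the imputed values differ
imp_c <- mice(nhanes, method = "norm.nob", m = 1, maxit = 1,
              seed = 456, printFlag = FALSE)
identical(complete(imp_a), complete(imp_c))
```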

Multiple imputation

13. Let us impute the missing data in the nhanes dataset

imp <- mice(nhanes)

## 
##  iter imp variable
##   1   1  bmi  hyp  chl
##   1   2  bmi  hyp  chl
##   1   3  bmi  hyp  chl
##   1   4  bmi  hyp  chl
##   1   5  bmi  hyp  chl
##   2   1  bmi  hyp  chl
##   2   2  bmi  hyp  chl
##   2   3  bmi  hyp  chl
##   2   4  bmi  hyp  chl
##   2   5  bmi  hyp  chl
##   3   1  bmi  hyp  chl
##   3   2  bmi  hyp  chl
##   3   3  bmi  hyp  chl
##   3   4  bmi  hyp  chl
##   3   5  bmi  hyp  chl
##   4   1  bmi  hyp  chl
##   4   2  bmi  hyp  chl
##   4   3  bmi  hyp  chl
##   4   4  bmi  hyp  chl
##   4   5  bmi  hyp  chl
##   5   1  bmi  hyp  chl
##   5   2  bmi  hyp  chl
##   5   3  bmi  hyp  chl
##   5   4  bmi  hyp  chl
##   5   5  bmi  hyp  chl

imp

## Multiply imputed data set
## Call:
## mice(data = nhanes)
## Number of multiple imputations:  5
## Missing cells per column:
## age bmi hyp chl 
##   0   9   8  10 
## Imputation methods:
##   age   bmi   hyp   chl 
##    "" "pmm" "pmm" "pmm" 
## VisitSequence:
## bmi hyp chl 
##   2   3   4 
## PredictorMatrix:
##     age bmi hyp chl
## age   0   0   0   0
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0
## Random generator seed value:  NA

The imputations are now done. As you can see, the algorithm ran for 5 iterations (the default) and presented us with 5 imputations for each missing datum. For the rest of this document we will omit printing of the iteration cycle when we run mice. We do so by adding print = FALSE to the mice call.

The object imp contains a multiply imputed data set (of class mids). It encapsulates all information from imputing the nhanes dataset, such as the original data, the imputed values, the number of missing values, number of iterations, and so on.

To obtain an overview of the information stored in the object imp, use the attributes() function:

attributes(imp)

## $names
##  [1] "call"            "data"            "m"              
##  [4] "nmis"            "imp"             "method"         
##  [7] "predictorMatrix" "visitSequence"   "form"           
## [10] "post"            "seed"            "iteration"      
## [13] "lastSeedValue"   "chainMean"       "chainVar"       
## [16] "loggedEvents"    "pad"            
## 
## $class
## [1] "mids"

For example, the original data are stored as

imp$data

##    age  bmi hyp chl
## 1    1   NA  NA  NA
## 2    2 22.7   1 187
## 3    1   NA   1 187
## 4    3   NA  NA  NA
## 5    1 20.4   1 113
## 6    3   NA  NA 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2   NA  NA  NA
## 11   1   NA  NA  NA
## 12   2   NA  NA  NA
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1  NA
## 16   1   NA  NA  NA
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2  NA
## 21   1   NA  NA  NA
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1  NA
## 25   2 27.4   1 186

and the imputations are stored as

imp$imp

## $age
## NULL
## 
## $bmi
##       1    2    3    4    5
## 1  30.1 27.2 29.6 35.3 29.6
## 3  29.6 29.6 29.6 26.3 30.1
## 4  27.4 20.4 21.7 27.4 25.5
## 6  24.9 24.9 20.4 21.7 20.4
## 10 27.5 27.5 27.4 24.9 22.0
## 11 30.1 28.7 29.6 22.0 33.2
## 12 27.5 29.6 29.6 27.5 28.7
## 16 26.3 30.1 29.6 28.7 27.2
## 21 26.3 22.0 27.2 35.3 24.9
## 
## $hyp
##    1 2 3 4 5
## 1  1 1 1 1 1
## 4  2 1 1 2 2
## 6  2 1 2 2 1
## 10 2 1 1 2 1
## 11 1 1 1 1 1
## 12 2 1 2 1 1
## 16 1 1 1 1 1
## 21 1 1 1 1 1
## 
## $chl
##      1   2   3   4   5
## 1  187 131 187 206 199
## 4  184 187 186 204 186
## 10 218 187 186 131 187
## 11 199 187 238 131 204
## 12 186 187 218 204 218
## 15 199 187 238 229 199
## 16 187 238 131 187 187
## 20 184 218 218 186 206
## 21 187 131 187 204 187
## 24 186 187 206 218 218

14. Extract the completed data

By default, mice() calculates five (m = 5) imputed data sets. In order to get the third imputed data set, use the complete() function:

c3 <- complete(imp, 3) 
md.pattern(c3)

##      age bmi hyp chl  
## [1,]   1   1   1   1 0
## [2,]   0   0   0   0 0

The collection of the m imputed data sets can be exported by function complete() in long, broad and repeated formats. For example,

c.long <- complete(imp, "long")  
c.long

##     .imp .id age  bmi hyp chl
## 1      1   1   1 30.1   1 187
## 2      1   2   2 22.7   1 187
## 3      1   3   1 29.6   1 187
## 4      1   4   3 27.4   2 184
## 5      1   5   1 20.4   1 113
## 6      1   6   3 24.9   2 184
## 7      1   7   1 22.5   1 118
## 8      1   8   1 30.1   1 187
## 9      1   9   2 22.0   1 238
## 10     1  10   2 27.5   2 218
## 11     1  11   1 30.1   1 199
## 12     1  12   2 27.5   2 186
## 13     1  13   3 21.7   1 206
## 14     1  14   2 28.7   2 204
## 15     1  15   1 29.6   1 199
## 16     1  16   1 26.3   1 187
## 17     1  17   3 27.2   2 284
## 18     1  18   2 26.3   2 199
## 19     1  19   1 35.3   1 218
## 20     1  20   3 25.5   2 184
## 21     1  21   1 26.3   1 187
## 22     1  22   1 33.2   1 229
## 23     1  23   1 27.5   1 131
## 24     1  24   3 24.9   1 186
## 25     1  25   2 27.4   1 186
## 26     2   1   1 27.2   1 131
## 27     2   2   2 22.7   1 187
## 28     2   3   1 29.6   1 187
## 29     2   4   3 20.4   1 187
## 30     2   5   1 20.4   1 113
## 31     2   6   3 24.9   1 184
## 32     2   7   1 22.5   1 118
## 33     2   8   1 30.1   1 187
## 34     2   9   2 22.0   1 238
## 35     2  10   2 27.5   1 187
## 36     2  11   1 28.7   1 187
## 37     2  12   2 29.6   1 187
## 38     2  13   3 21.7   1 206
## 39     2  14   2 28.7   2 204
## 40     2  15   1 29.6   1 187
## 41     2  16   1 30.1   1 238
## 42     2  17   3 27.2   2 284
## 43     2  18   2 26.3   2 199
## 44     2  19   1 35.3   1 218
## 45     2  20   3 25.5   2 218
## 46     2  21   1 22.0   1 131
## 47     2  22   1 33.2   1 229
## 48     2  23   1 27.5   1 131
## 49     2  24   3 24.9   1 187
## 50     2  25   2 27.4   1 186
## 51     3   1   1 29.6   1 187
## 52     3   2   2 22.7   1 187
## 53     3   3   1 29.6   1 187
## 54     3   4   3 21.7   1 186
## 55     3   5   1 20.4   1 113
## 56     3   6   3 20.4   2 184
## 57     3   7   1 22.5   1 118
## 58     3   8   1 30.1   1 187
## 59     3   9   2 22.0   1 238
## 60     3  10   2 27.4   1 186
## 61     3  11   1 29.6   1 238
## 62     3  12   2 29.6   2 218
## 63     3  13   3 21.7   1 206
## 64     3  14   2 28.7   2 204
## 65     3  15   1 29.6   1 238
## 66     3  16   1 29.6   1 131
## 67     3  17   3 27.2   2 284
## 68     3  18   2 26.3   2 199
## 69     3  19   1 35.3   1 218
## 70     3  20   3 25.5   2 218
## 71     3  21   1 27.2   1 187
## 72     3  22   1 33.2   1 229
## 73     3  23   1 27.5   1 131
## 74     3  24   3 24.9   1 206
## 75     3  25   2 27.4   1 186
## 76     4   1   1 35.3   1 206
## 77     4   2   2 22.7   1 187
## 78     4   3   1 26.3   1 187
## 79     4   4   3 27.4   2 204
## 80     4   5   1 20.4   1 113
## 81     4   6   3 21.7   2 184
## 82     4   7   1 22.5   1 118
## 83     4   8   1 30.1   1 187
## 84     4   9   2 22.0   1 238
## 85     4  10   2 24.9   2 131
## 86     4  11   1 22.0   1 131
## 87     4  12   2 27.5   1 204
## 88     4  13   3 21.7   1 206
## 89     4  14   2 28.7   2 204
## 90     4  15   1 29.6   1 229
## 91     4  16   1 28.7   1 187
## 92     4  17   3 27.2   2 284
## 93     4  18   2 26.3   2 199
## 94     4  19   1 35.3   1 218
## 95     4  20   3 25.5   2 186
## 96     4  21   1 35.3   1 204
## 97     4  22   1 33.2   1 229
## 98     4  23   1 27.5   1 131
## 99     4  24   3 24.9   1 218
## 100    4  25   2 27.4   1 186
## 101    5   1   1 29.6   1 199
## 102    5   2   2 22.7   1 187
## 103    5   3   1 30.1   1 187
## 104    5   4   3 25.5   2 186
## 105    5   5   1 20.4   1 113
## 106    5   6   3 20.4   1 184
## 107    5   7   1 22.5   1 118
## 108    5   8   1 30.1   1 187
## 109    5   9   2 22.0   1 238
## 110    5  10   2 22.0   1 187
## 111    5  11   1 33.2   1 204
## 112    5  12   2 28.7   1 218
## 113    5  13   3 21.7   1 206
## 114    5  14   2 28.7   2 204
## 115    5  15   1 29.6   1 199
## 116    5  16   1 27.2   1 187
## 117    5  17   3 27.2   2 284
## 118    5  18   2 26.3   2 199
## 119    5  19   1 35.3   1 218
## 120    5  20   3 25.5   2 206
## 121    5  21   1 24.9   1 187
## 122    5  22   1 33.2   1 229
## 123    5  23   1 27.5   1 131
## 124    5  24   3 24.9   1 218
## 125    5  25   2 27.4   1 186

and

c.broad <- complete(imp, "broad")
c.broad

##    age.1 bmi.1 hyp.1 chl.1 age.2 bmi.2 hyp.2 chl.2 age.3 bmi.3 hyp.3 chl.3
## 1      1  30.1     1   187     1  27.2     1   131     1  29.6     1   187
## 2      2  22.7     1   187     2  22.7     1   187     2  22.7     1   187
## 3      1  29.6     1   187     1  29.6     1   187     1  29.6     1   187
## 4      3  27.4     2   184     3  20.4     1   187     3  21.7     1   186
## 5      1  20.4     1   113     1  20.4     1   113     1  20.4     1   113
## 6      3  24.9     2   184     3  24.9     1   184     3  20.4     2   184
## 7      1  22.5     1   118     1  22.5     1   118     1  22.5     1   118
## 8      1  30.1     1   187     1  30.1     1   187     1  30.1     1   187
## 9      2  22.0     1   238     2  22.0     1   238     2  22.0     1   238
## 10     2  27.5     2   218     2  27.5     1   187     2  27.4     1   186
## 11     1  30.1     1   199     1  28.7     1   187     1  29.6     1   238
## 12     2  27.5     2   186     2  29.6     1   187     2  29.6     2   218
## 13     3  21.7     1   206     3  21.7     1   206     3  21.7     1   206
## 14     2  28.7     2   204     2  28.7     2   204     2  28.7     2   204
## 15     1  29.6     1   199     1  29.6     1   187     1  29.6     1   238
## 16     1  26.3     1   187     1  30.1     1   238     1  29.6     1   131
## 17     3  27.2     2   284     3  27.2     2   284     3  27.2     2   284
## 18     2  26.3     2   199     2  26.3     2   199     2  26.3     2   199
## 19     1  35.3     1   218     1  35.3     1   218     1  35.3     1   218
## 20     3  25.5     2   184     3  25.5     2   218     3  25.5     2   218
## 21     1  26.3     1   187     1  22.0     1   131     1  27.2     1   187
## 22     1  33.2     1   229     1  33.2     1   229     1  33.2     1   229
## 23     1  27.5     1   131     1  27.5     1   131     1  27.5     1   131
## 24     3  24.9     1   186     3  24.9     1   187     3  24.9     1   206
## 25     2  27.4     1   186     2  27.4     1   186     2  27.4     1   186
##    age.4 bmi.4 hyp.4 chl.4 age.5 bmi.5 hyp.5 chl.5
## 1      1  35.3     1   206     1  29.6     1   199
## 2      2  22.7     1   187     2  22.7     1   187
## 3      1  26.3     1   187     1  30.1     1   187
## 4      3  27.4     2   204     3  25.5     2   186
## 5      1  20.4     1   113     1  20.4     1   113
## 6      3  21.7     2   184     3  20.4     1   184
## 7      1  22.5     1   118     1  22.5     1   118
## 8      1  30.1     1   187     1  30.1     1   187
## 9      2  22.0     1   238     2  22.0     1   238
## 10     2  24.9     2   131     2  22.0     1   187
## 11     1  22.0     1   131     1  33.2     1   204
## 12     2  27.5     1   204     2  28.7     1   218
## 13     3  21.7     1   206     3  21.7     1   206
## 14     2  28.7     2   204     2  28.7     2   204
## 15     1  29.6     1   229     1  29.6     1   199
## 16     1  28.7     1   187     1  27.2     1   187
## 17     3  27.2     2   284     3  27.2     2   284
## 18     2  26.3     2   199     2  26.3     2   199
## 19     1  35.3     1   218     1  35.3     1   218
## 20     3  25.5     2   186     3  25.5     2   206
## 21     1  35.3     1   204     1  24.9     1   187
## 22     1  33.2     1   229     1  33.2     1   229
## 23     1  27.5     1   131     1  27.5     1   131
## 24     3  24.9     1   218     3  24.9     1   218
## 25     2  27.4     1   186     2  27.4     1   186

are completed data sets in long and broad format, respectively. See ?complete for more detail.
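The long format is convenient for summaries per imputed data set, because the .imp column identifies the imputation each row belongs to. A minimal sketch, re-running the imputation with an arbitrary seed so that it stands on its own:

```r
library(mice)

imp <- mice(nhanes, seed = 123, printFlag = FALSE)
c.long <- complete(imp, "long")

# Mean bmi within each of the completed data sets
tapply(c.long$bmi, c.long$.imp, mean)

# Rows per completed data set: every one contains all 25 cases
table(c.long$.imp)
```

The means differ slightly between the completed data sets, which reflects the uncertainty about the missing values.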

Conclusion

We have seen that (multiple) imputation is straightforward with mice. However, don’t let the simplicity of the software fool you into thinking that the problem itself is also straightforward. In the next vignette we will therefore explore how the mice package can flexibly provide us with the tools to assess and control the imputation of missing data.

References

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.

Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall. Table 6.14.

Van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67.

- End of Vignette

Gerko Vink and Stef van Buuren

Practical 1 of 4