This is the second vignette in the Winnipeg series.
In this vignette we will focus on analyzing the relation between the data and the missingness, and on creating imputations. For non-R
users: In R
one can simply call the help function for a any specific function func
by typing help(func)
. E.g. help(mice)
directs you to the help page of the mice
function.
1. Open R
and load the packages mice
and lattice
. Also, fix the random seed.
require(mice)
require(lattice)
set.seed(123)
We choose seed value 123
. This is an arbitrary value; any value would be an equally good seed value. Fixing the random seed enables you (and others) to exactly replicate anything that involves random number generators. If you set the seed in your R
instance to 123
, you will get the exact same results and plots as we present in this document.
2. The boys
dataset is part of mice
. It is a subset of a large Dutch dataset containing growth measures from the Fourth Dutch Growth Study. Inspect the help for boys
dataset and make yourself familiar with its contents.
To learn more about the contents of the data, use one of the two following help commands:
help(boys)
?boys
3. Get an overview of the data. Find information about the size of the data, the variables measured and the amount of missingness.
head(boys)
## age hgt wgt bmi hc gen phb tv reg
## 3 0.035 50.1 3.65 14.5 33.7 <NA> <NA> NA south
## 4 0.038 53.5 3.37 11.8 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.14 12.6 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.27 14.4 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.03 15.2 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.66 15.1 37.0 <NA> <NA> NA south
nrow(boys)
## [1] 748
summary(boys)
## age hgt wgt bmi
## Min. : 0.04 Min. : 50.0 Min. : 3.1 Min. :11.8
## 1st Qu.: 1.58 1st Qu.: 84.9 1st Qu.: 11.7 1st Qu.:15.9
## Median :10.50 Median :147.3 Median : 34.6 Median :17.4
## Mean : 9.16 Mean :132.2 Mean : 37.2 Mean :18.1
## 3rd Qu.:15.27 3rd Qu.:175.2 3rd Qu.: 59.6 3rd Qu.:19.5
## Max. :21.18 Max. :198.0 Max. :117.4 Max. :31.7
## NA's :20 NA's :4 NA's :21
## hc gen phb tv reg
## Min. :33.7 G1 : 56 P1 : 63 Min. : 1 north: 81
## 1st Qu.:48.1 G2 : 50 P2 : 40 1st Qu.: 4 east :161
## Median :53.0 G3 : 22 P3 : 19 Median :12 west :239
## Mean :51.5 G4 : 42 P4 : 32 Mean :12 south:191
## 3rd Qu.:56.0 G5 : 75 P5 : 50 3rd Qu.:20 city : 73
## Max. :65.0 NA's:503 P6 : 41 Max. :25 NA's : 3
## NA's :46 NA's:503 NA's :522
4. As we have seen before, the function md.pattern()
can be used to display all different missing data patterns. How many different missing data patterns are present in the boys dataframe and which pattern occurs most frequently in the data?
md.pattern(boys)
## age reg wgt hgt bmi hc gen phb tv
## 223 1 1 1 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1 1 0 1 1
## 19 1 1 1 1 1 1 1 1 0 1
## 1 1 1 1 1 1 1 0 1 0 2
## 1 1 1 0 0 0 1 1 1 1 3
## 437 1 1 1 1 1 1 0 0 0 3
## 1 1 1 0 0 0 0 1 1 1 4
## 43 1 1 1 1 1 0 0 0 0 4
## 3 1 0 1 1 1 1 0 0 0 4
## 16 1 1 1 0 0 1 0 0 0 5
## 1 1 1 0 1 0 1 0 0 0 5
## 1 1 1 1 0 0 0 0 0 0 6
## 1 1 1 0 0 0 0 0 0 0 7
## 0 3 4 20 21 46 503 503 522 1622
There are 13 patterns in total, with the pattern where gen
, phb
and tv
are missing occuring the most.
5. How many patterns occur for which the variable gen
(genital Tannerstage) is missing?
mpat <- md.pattern(boys)
sum(mpat[, "gen"] == 0)
## [1] 8
Answer: 8 patterns (503 cases)
6. Let us focus more precisely on the missing data patterns. Does the missing data of gen
depend on age
? One could for example check this by making a histogram of age
separately for the cases with known genital stages and for cases with missing genital stages.
To create said histogram in R
, a missingness indicator for gen
has to be created. A missingness indicator is a dummy variable with value 1
for observed values (in this case genital status) and 0
for missing values. Create a missingness indicator for gen
by typing
R <- is.na(boys$gen)
R
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [34] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [45] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [56] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [67] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [78] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [89] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [100] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [111] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [122] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [133] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [144] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [155] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [166] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [177] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [188] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [199] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [210] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [221] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [232] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [243] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [254] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [265] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [276] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [287] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [298] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [309] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [320] TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
## [331] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [342] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [353] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
## [364] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [375] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
## [386] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [397] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE
## [408] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [419] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## [430] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
## [441] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [452] TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
## [463] FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
## [474] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
## [485] FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
## [496] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE
## [507] FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE
## [518] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE
## [529] FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE
## [540] TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [551] TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE
## [562] TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
## [573] FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [584] TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## [595] FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE
## [606] FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [617] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [628] TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
## [639] FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## [650] TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE
## [661] TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE
## [672] TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE
## [683] TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE
## [694] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE
## [705] TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE TRUE
## [716] TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
## [727] FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE
## [738] FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
As we can see, the missingness indicator tells us for each value in gen
whether it is missing (TRUE
) or observed (FALSE
).
A histogram can be made with the function histogram()
.
histogram(boys$gen)
or, equivalently, one could use
histogram(~gen, data = boys)
Writing the latter line of code for plots is more efficient than selecting every part of the boys
data with the boys$...
command, especially if plots become more advanced. The code for a conditional histogram of age
given R
is
histogram(~age|R, data = boys)
The histogram shows that the missingness in gen
is not equally distributed across age
.
7. Impute the boys
dataset with mice using all default settings and name the mids
(multiply imputed data set) object imp1
.
imp1 <- mice(boys, print = FALSE)
8. Compare the means of the imputed data with the means of the incomplete data. One can use the function complete()
with a mids
-object as argument to obtain an imputed dataset. As default, the first imputed dataset will be given by this function.
summary(boys)
## age hgt wgt bmi
## Min. : 0.04 Min. : 50.0 Min. : 3.1 Min. :11.8
## 1st Qu.: 1.58 1st Qu.: 84.9 1st Qu.: 11.7 1st Qu.:15.9
## Median :10.50 Median :147.3 Median : 34.6 Median :17.4
## Mean : 9.16 Mean :132.2 Mean : 37.2 Mean :18.1
## 3rd Qu.:15.27 3rd Qu.:175.2 3rd Qu.: 59.6 3rd Qu.:19.5
## Max. :21.18 Max. :198.0 Max. :117.4 Max. :31.7
## NA's :20 NA's :4 NA's :21
## hc gen phb tv reg
## Min. :33.7 G1 : 56 P1 : 63 Min. : 1 north: 81
## 1st Qu.:48.1 G2 : 50 P2 : 40 1st Qu.: 4 east :161
## Median :53.0 G3 : 22 P3 : 19 Median :12 west :239
## Mean :51.5 G4 : 42 P4 : 32 Mean :12 south:191
## 3rd Qu.:56.0 G5 : 75 P5 : 50 3rd Qu.:20 city : 73
## Max. :65.0 NA's:503 P6 : 41 Max. :25 NA's : 3
## NA's :46 NA's:503 NA's :522
summary(complete(imp1))
## age hgt wgt bmi
## Min. : 0.04 Min. : 50 Min. : 3.1 Min. :11.8
## 1st Qu.: 1.58 1st Qu.: 83 1st Qu.: 11.7 1st Qu.:15.9
## Median :10.50 Median :146 Median : 34.7 Median :17.5
## Mean : 9.16 Mean :131 Mean : 37.2 Mean :18.1
## 3rd Qu.:15.27 3rd Qu.:175 3rd Qu.: 59.6 3rd Qu.:19.5
## Max. :21.18 Max. :198 Max. :117.4 Max. :31.7
## hc gen phb tv reg
## Min. :33.7 G1:322 P1:320 Min. : 1.00 north: 81
## 1st Qu.:48.3 G2:141 P2:122 1st Qu.: 2.00 east :161
## Median :53.2 G3: 48 P3: 51 Median : 4.00 west :241
## Mean :51.6 G4: 84 P4: 67 Mean : 8.56 south:192
## 3rd Qu.:56.0 G5:153 P5: 98 3rd Qu.:15.00 city : 73
## Max. :65.0 P6: 90 Max. :25.00
Most means are roughly equal, except the mean of tv
, which is much lower in the first imputed data set, when compared to the incomplete data. This makes sense because most genital measures are unobserved for the lower ages. When imputing these values, the means should decrease.
Investigating univariate properties by using functions such as summary()
, may not be ideal in the case of hundreds of variables. To extract just the information you need, for all imputed datasets, we can make use of the with()
function. To obtain summaries for each imputed tv
only, type
summary(with(imp1, mean(tv)))
##
## ## summary of imputation 1 :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.56 8.56 8.56 8.56 8.56 8.56
##
## ## summary of imputation 2 :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.43 8.43 8.43 8.43 8.43 8.43
##
## ## summary of imputation 3 :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.26 8.26 8.26 8.26 8.26 8.26
##
## ## summary of imputation 4 :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.72 8.72 8.72 8.72 8.72 8.72
##
## ## summary of imputation 5 :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.26 8.26 8.26 8.26 8.26 8.26
We have seen that creating multiple imputations is straightforward with the R
package mice
. The package is designed to allow you to assess and control the imputations themselves, and relations between the observed and imputed data.
It is important to ‘gain’ this control as a user. After all, we are imputing values and we aim to properly adress the uncertainty about the missingness problem.
- End of Vignette