Combine estimates by pooling rulesSource:
pool() function combines the estimates from
repeated complete data analyses. The typical sequence of steps to
perform a multiple imputation analysis is:
Impute the missing data by the
mice()function, resulting in a multiple imputed data set (class
Fit the model of interest (scientific model) on each imputed data set by the
with()function, resulting an object of class
Pool the estimates from each model into a single set of estimates and standard errors, resulting in an object of class
A common error is to reverse steps 2 and 3, i.e., to pool the
multiply-imputed data instead of the estimates. Doing so may severely bias
the estimates of scientific interest and yield incorrect statistical
intervals and p-values. The
pool() function will detect
pool(object, dfcom = NULL, rule = NULL, custom.t = NULL) pool.syn(object, dfcom = NULL, rule = "reiter2003")
A positive number representing the degrees of freedom in the complete-data analysis. Normally, this would be the number of independent observation minus the number of fitted parameters. The default (
dfcom = NULL) extract this information in the following order: 1) the component
glance()function is found, 2) the result of
df.residual(applied to the first fitted model, and 3) as
999999. In the last case, the warning
"Large sample assumed"is printed. If the degrees of freedom is incorrect, specify the appropriate value manually.
A string indicating the pooling rule. Currently supported are
"rubin1987"(default, for missing data) and
"reiter2003"(for synthetic data created from a complete data set).
A custom character string to be parsed as a calculation rule for the total variance
t. The custom rule can use the other calculated pooling statistics where the dimensions must come from
.data$. The default
tcalculation would have the form
".data$ubar + (1 + 1 / .data$m) * .data$b". See examples for an example.
An object of class
mipo, which stands for 'multiple imputation
"reiter2003" values for
set to `NA`, as these statistics do not apply for data synthesised from
fully observed data.
pool() function averages the estimates of the complete
data model, computes the total variance over the repeated analyses
by Rubin's rules (Rubin, 1987, p. 76), and computes the following
diagnostic statistics per estimate:
Relative increase in variance due to nonresponse
Residual degrees of freedom for hypothesis testing
Proportion of total variance due to missingness
Fraction of missing information
The degrees of freedom calculation for the pooled estimates uses the Barnard-Rubin adjustment for small samples (Barnard and Rubin, 1999).
pool.syn() function combines estimates by Reiter's partially
synthetic data pooling rules (Reiter, 2003). This combination rule
assumes that the data that is synthesised is completely observed.
Pooling differs from Rubin's method in the calculation of the total
variance and the degrees of freedom.
Pooling requires the following input from each fitted model:
the estimates of the model;
the standard error of each estimate;
the residual degrees of freedom of the model.
mice 3.0+, the
package takes care of filtering out the relevant parts of the
complete-data analysis. It may happen that you'll see the messages
Error: No tidy method for objects of class ... or
Error: No glance method for objects of class .... The message
means that your complete-data method used in
with(imp, ...) has
glance method defined in the
broom.mixed package contains
for mixed models. If you are using a mixed model, first run
library(broom.mixed) before calling
glance methods are defined for your analysis
m parameter estimates and their variance
estimates (the square of the standard errors) from the
models stored in
fit$analyses. For each parameter, run
pool.scalar to obtain the pooled parameters estimate, its variance, the
degrees of freedom, the relative increase in variance and the fraction of missing
An alternative is to write your own
methods and add these to
broom according to the specifications
given in https://broom.tidymodels.org.
In versions prior to
mice 3.0 pooling required that
vcov() methods were available for fitted
objects. This feature is no longer supported. The reason is that
vcov() methods are inconsistent across packages, leading to
buggy behaviour of the
mice 3.13.2 function
pool() uses the robust
the standard error estimate for pooling when it can extract
robust.se from the
Barnard, J. and Rubin, D.B. (1999). Small sample degrees of freedom with multiple imputation. Biometrika, 86, 948-955.
Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley and Sons.
Reiter, J.P. (2003). Inference for Partially Synthetic, Public Use Microdata Sets. Survey Methodology, 29, 181-189.
van Buuren S and Groothuis-Oudshoorn K (2011).
Imputation by Chained Equations in
R. Journal of Statistical
Software, 45(3), 1-67. doi:10.18637/jss.v045.i03
# impute missing data, analyse and pool using the classic MICE workflow imp <- mice(nhanes, maxit = 2, m = 2) #> #> iter imp variable #> 1 1 bmi hyp chl #> 1 2 bmi hyp chl #> 2 1 bmi hyp chl #> 2 2 bmi hyp chl fit <- with(data = imp, exp = lm(bmi ~ hyp + chl)) summary(pool(fit)) #> term estimate std.error statistic df p.value #> 1 (Intercept) 20.60793363 4.64881537 4.4329430 10.12935 0.001229288 #> 2 hyp 0.21992681 1.99646262 0.1101582 19.71919 0.913397289 #> 3 chl 0.02823202 0.02531475 1.1152403 11.47827 0.287552697 # generate fully synthetic data, analyse and pool imp <- mice(cars, maxit = 2, m = 2, where = matrix(TRUE, nrow(cars), ncol(cars)) ) #> #> iter imp variable #> 1 1 speed dist #> 1 2 speed dist #> 2 1 speed dist #> 2 2 speed dist fit <- with(data = imp, exp = lm(speed ~ dist)) summary(pool.syn(fit)) #> term estimate std.error statistic df p.value #> 1 (Intercept) 11.45981833 1.35768034 8.440734 9.843846 8.128227e-06 #> 2 dist 0.08662892 0.03546026 2.442986 2.896935 9.529340e-02 # use a custom pooling rule for the total variance about the estimate # e.g. use t = b + b/m instead of t = ubar + b + b/m imp <- mice(nhanes, maxit = 2, m = 2) #> #> iter imp variable #> 1 1 bmi hyp chl #> 1 2 bmi hyp chl #> 2 1 bmi hyp chl #> 2 2 bmi hyp chl fit <- with(data = imp, exp = lm(bmi ~ hyp + chl)) pool(fit, custom.t = ".data$b + .data$b / .data$m") #> Class: mipo m = 2 #> term m estimate ubar b t dfcom df #> 1 (Intercept) 2 24.13311859 2.244648e+01 6.5406334371 9.8109501556 22 0 #> 2 hyp 2 -2.26888722 4.854083e+00 0.2991634667 0.4487452001 22 0 #> 3 chl 2 0.02693878 6.328178e-04 0.0001986903 0.0002980354 22 0 #> riv lambda fmi #> 1 0.43708196 1 0.7680485 #> 2 0.09244696 1 0.6948746 #> 3 0.47096565 1 0.7733915