Selects predictors according to simple statistics
Arguments
- data
Matrix or data frame with incomplete data.
- mincor
A scalar, numeric vector (of size ncol(data)), or numeric matrix (square, of size ncol(data)) specifying the minimum threshold(s) against which the absolute correlation in the data is compared.
- minpuc
A scalar, vector (of size ncol(data)), or matrix (square, of size ncol(data)) specifying the minimum threshold(s) for the proportion of usable cases.
- include
A string or a vector of strings containing one or more variable names from names(data). Variables specified are always included as a predictor.
- exclude
A string or a vector of strings containing one or more variable names from names(data). Variables specified are always excluded as a predictor.
- method
A string specifying the type of correlation. Use 'pearson' (default), 'kendall' or 'spearman'. Can be abbreviated.
Details
This function creates a predictor matrix using the variable selection procedure described in Van Buuren et al. (1999, pp. 687–688). The function is designed to aid in setting up a good imputation model for data with many variables.
Basic workings: For each variable pair (i.e. target-predictor pair), the
procedure calculates two correlations using all available cases per pair.
The first correlation uses the values of the target and the predictor
directly. The second correlation uses the (binary) response indicator of the
target and the values of the predictor. If the largest (in absolute value) of
these correlations exceeds mincor, the predictor will be added to the
imputation set. The default value for mincor is 0.1.
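The two correlations can be illustrated for a single target-predictor pair in base R. This is only a minimal sketch of the rule described above, not the code of quickpred() itself; the toy data and the names dat, r_y, cor_values and cor_response are made up for illustration.

```r
# Toy data (hypothetical): y is the target, x is a candidate predictor.
dat <- data.frame(
  y = c(2.1, NA, 3.9, NA, 6.2, 7.1, NA, 9.0),
  x = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Response indicator of the target: TRUE where y is observed.
r_y <- !is.na(dat$y)

# Correlation 1: values of the target vs values of the predictor,
# using all cases available for this pair.
cor_values <- cor(dat$y, dat$x, use = "pairwise.complete.obs")

# Correlation 2: response indicator of the target vs values of the predictor.
cor_response <- cor(as.numeric(r_y), dat$x, use = "pairwise.complete.obs")

# The predictor is selected if the larger absolute correlation exceeds mincor.
mincor <- 0.1
max_abs_cor <- max(abs(cor_values), abs(cor_response))
selected <- max_abs_cor > mincor
selected
```

Taking the maximum of the two means a predictor is kept either because it is related to the target values or because it is related to whether the target is observed.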
In addition, the procedure eliminates predictors whose proportion of usable
cases fails to meet the minimum specified by minpuc. The default value
is 0, so predictors are retained even if they have no usable case.
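As a rough illustration, the proportion of usable cases for one pair can be computed by hand. The sketch below assumes that this proportion is the share of cases with a missing target in which the predictor is observed; that definition is not spelled out in this section, and the data and variable names are hypothetical.

```r
# Toy data (hypothetical): y is the target, x is a candidate predictor.
dat <- data.frame(
  y = c(NA, NA, NA, NA, 5, 6, 7, 8),  # target is missing in rows 1 to 4
  x = c(1, NA, NA, NA, 5, 6, 7, 8)    # predictor is observed in one of those rows
)

y_missing  <- is.na(dat$y)
x_observed <- !is.na(dat$x)

# Assumed definition: among cases where the target is missing,
# the fraction in which the predictor is observed.
puc <- sum(y_missing & x_observed) / sum(y_missing)
puc  # 1 usable case out of 4 missing targets

# The predictor is dropped when puc falls below the threshold.
minpuc <- 0.3
keep <- puc >= minpuc
keep
```

With the default threshold of 0 the comparison is always satisfied, which matches the statement that predictors are retained even without a usable case.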
Finally, the procedure includes any predictors named in the include
argument (which is useful for background variables like age and sex) and
eliminates any predictor named in the exclude argument. If a variable
is listed in both the include and exclude arguments, the include
argument takes precedence.
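One way to picture the precedence rule is to clear the excluded columns first and set the included columns afterwards. The sketch below works on a made-up 3 x 3 predictor matrix with hypothetical variable names; it illustrates the described behaviour and is not taken from the quickpred() source.

```r
# Hypothetical predictor matrix: rows are targets, columns are predictors.
vars <- c("v1", "v2", "v3")
pred <- matrix(
  c(0, 1, 0,
    1, 0, 1,
    0, 0, 0),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

include <- "v2"
exclude <- c("v2", "v3")  # v2 is deliberately listed in both arguments

# Apply exclude first, then include, so that include takes precedence.
pred[, exclude] <- 0
pred[, include] <- 1

# A variable is never used as a predictor of itself.
diag(pred) <- 0
pred
```

Here v2 ends up as a predictor for v1 and v3 even though it is also listed in exclude, while v3 is removed as a predictor everywhere.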
Advanced topic: mincor and minpuc are typically specified as
scalars, but vectors and square matrices of appropriate size will also work.
Each element of the vector corresponds to a row of the predictor matrix, so
the procedure can effectively differentiate between different target
variables. Setting high threshold values can be useful for auxiliary, less
important, variables. The set of predictors for those variables can then
remain relatively small. Using a square matrix extends the idea to the
columns, so that one can also apply cellwise thresholds.
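The row-wise effect of a threshold vector follows from how R recycles a vector when it is compared with a matrix. The sketch below uses a made-up matrix of absolute correlations (max_abs_cor) and hypothetical variable names; it only demonstrates the recycling behaviour and makes no claim about the internals of quickpred().

```r
# Hypothetical matrix of largest absolute correlations per target-predictor pair.
vars <- c("v1", "v2", "v3")
max_abs_cor <- matrix(
  c(1.0, 0.3, 0.6,
    0.3, 1.0, 0.2,
    0.6, 0.2, 1.0),
  nrow = 3, dimnames = list(vars, vars)
)

# One threshold per target variable (that is, per row).
mincor <- c(0.1, 0.5, 0.25)

# R recycles the vector down the columns, so element i is applied to row i.
pred <- (max_abs_cor > mincor) * 1
diag(pred) <- 0
pred
```

Row v2, with the high threshold of 0.5, ends up with no predictors, while row v1 keeps both. Replacing mincor by a 3 x 3 matrix in the same comparison would apply a separate threshold to every cell.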
Note
quickpred() uses data.matrix to convert factors to numbers through their
internal codes. Especially for unordered factors, the resulting
quantification may not make sense.
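A small base R example shows what this conversion does to an unordered factor. The data frame and its column names are made up for illustration.

```r
# Hypothetical data with an unordered factor.
dat <- data.frame(
  colour = factor(c("red", "green", "blue", "green")),
  size   = c(10, 20, 30, 40)
)

# Levels are sorted alphabetically by default: blue, green, red.
levels(dat$colour)

# data.matrix() replaces each factor value by its internal integer code.
m <- data.matrix(dat)
m[, "colour"]  # red = 3, green = 2, blue = 1
```

The codes reflect only the alphabetical order of the level labels, so a correlation computed from them treats "red" as larger than "blue" for no substantive reason.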
References
van Buuren, S., Boshuizen, H.C., Knook, D.L. (1999) Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine, 18, 681–694.
van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67. doi:10.18637/jss.v045.i03
Examples
# default: include all predictors with absolute correlation over 0.1
quickpred(nhanes)
#> age bmi hyp chl
#> age 0 0 0 0
#> bmi 1 0 1 1
#> hyp 1 0 0 1
#> chl 1 1 1 0
# all predictors with absolute correlation over 0.4
quickpred(nhanes, mincor = 0.4)
#> age bmi hyp chl
#> age 0 0 0 0
#> bmi 0 0 0 0
#> hyp 1 0 0 1
#> chl 1 0 1 0
# include age and bmi, exclude chl
quickpred(nhanes, mincor = 0.4, include = c("age", "bmi"), exclude = "chl")
#> age bmi hyp chl
#> age 0 0 0 0
#> bmi 1 0 0 0
#> hyp 1 1 0 0
#> chl 1 1 1 0
# only include predictors with at least 30% usable cases
quickpred(nhanes, minpuc = 0.3)
#> age bmi hyp chl
#> age 0 0 0 0
#> bmi 1 0 0 0
#> hyp 1 0 0 0
#> chl 1 1 1 0
# use low threshold for bmi, and high thresholds for hyp and chl
pred <- quickpred(nhanes, mincor = c(0, 0.1, 0.5, 0.5))
pred
#> age bmi hyp chl
#> age 0 0 0 0
#> bmi 1 0 1 1
#> hyp 1 0 0 0
#> chl 1 0 0 0
# use it directly from mice
imp <- mice(nhanes, pred = quickpred(nhanes, minpuc = 0.25, include = "age"))
#>
#> iter imp variable
#> 1 1 bmi hyp chl
#> 1 2 bmi hyp chl
#> 1 3 bmi hyp chl
#> 1 4 bmi hyp chl
#> 1 5 bmi hyp chl
#> 2 1 bmi hyp chl
#> 2 2 bmi hyp chl
#> 2 3 bmi hyp chl
#> 2 4 bmi hyp chl
#> 2 5 bmi hyp chl
#> 3 1 bmi hyp chl
#> 3 2 bmi hyp chl
#> 3 3 bmi hyp chl
#> 3 4 bmi hyp chl
#> 3 5 bmi hyp chl
#> 4 1 bmi hyp chl
#> 4 2 bmi hyp chl
#> 4 3 bmi hyp chl
#> 4 4 bmi hyp chl
#> 4 5 bmi hyp chl
#> 5 1 bmi hyp chl
#> 5 2 bmi hyp chl
#> 5 3 bmi hyp chl
#> 5 4 bmi hyp chl
#> 5 5 bmi hyp chl