Imputes univariate missing data using Bayesian linear regression following a preprocessing lasso variable selection step.

Usage

mice.impute.lasso.select.norm(y, ry, x, wy = NULL, nfolds = 10, ...)

Arguments

y

Vector to be imputed

ry

Logical vector of length length(y) indicating the subset y[ry] of elements in y to which the imputation model is fitted. The ry generally distinguishes the observed (TRUE) and missing values (FALSE) in y.

x

Numeric design matrix with length(y) rows with predictors for y. Matrix x must not contain missing values.

wy

Logical vector of length length(y). A TRUE value indicates locations in y for which imputations are created.

nfolds

The number of folds for the cross-validation of the lasso penalty. The default is 10.

...

Other named arguments.

Value

Vector with imputed data, same type as y, and of length sum(wy)
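
As a rough illustration of the arguments and return value described above, the function can be called directly on simulated data. This is only a sketch: the data, the variable names, and the choice of nfolds = 5 are assumptions made for the example, and it presumes that both the mice and glmnet packages are installed.

```r
library(mice)  # the lasso step additionally requires the glmnet package

set.seed(2021)
n <- 100

# Hypothetical complete predictor matrix with five numeric columns
x <- matrix(rnorm(n * 5), nrow = n,
            dimnames = list(NULL, paste0("x", 1:5)))

# Outcome that depends on x1 only; the first 20 values are set to missing
y <- 1 + 2 * x[, "x1"] + rnorm(n)
ry <- rep(TRUE, n)
ry[1:20] <- FALSE
y[!ry] <- NA

# wy is left at NULL, so imputations are created for the entries with ry == FALSE
imp <- mice.impute.lasso.select.norm(y, ry, x, nfolds = 5)

# One imputed value per missing entry
length(imp) == sum(!ry)
```

In regular use the function is not called directly; it is selected through the method argument of mice.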

Details

The method consists of the following steps:

  1. For a given y variable under imputation, fit a linear regression with lasso penalty using y[ry] as dependent variable and x[ry, ] as predictors. Coefficients that are not shrunk to 0 define an active set of predictors that will be used for imputation.

  2. Define a Bayesian linear model using y[ry] as the dependent variable, the active set of x[ry, ] as predictors, and standard non-informative priors

  3. Draw parameter values for the intercept, regression weights, and error variance from their posterior distribution

  4. Draw imputations from the posterior predictive distribution
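
The four steps above can be sketched by hand as follows. This is a simplified illustration and not the package's internal code: the simulated data, the use of the cross-validated penalty lambda.min to define the active set, and the plain least-squares posterior draw (without the small ridge term that mice's normal-draw routines typically apply) are all assumptions of the sketch.

```r
library(glmnet)

set.seed(123)
n <- 200
p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 1 + 2 * x[, 1] - 1.5 * x[, 2] + rnorm(n)
ry <- rep(TRUE, n)
ry[sample(n, 40)] <- FALSE
y[!ry] <- NA

# Step 1: lasso regression on the observed cases; nonzero slopes form the active set
cv_fit <- cv.glmnet(x[ry, ], y[ry], family = "gaussian", alpha = 1, nfolds = 10)
slopes <- as.matrix(coef(cv_fit, s = "lambda.min"))[-1, 1]  # drop the intercept
active <- which(slopes != 0)

# Step 2: linear model on the active set (least-squares quantities that
# parameterise the posterior under standard non-informative priors)
X_obs <- cbind(1, x[ry, active, drop = FALSE])
X_mis <- cbind(1, x[!ry, active, drop = FALSE])
y_obs <- y[ry]
XtX_inv <- solve(crossprod(X_obs))
coef_hat <- XtX_inv %*% crossprod(X_obs, y_obs)
res <- y_obs - X_obs %*% coef_hat
df <- length(y_obs) - ncol(X_obs)

# Step 3: draw the error standard deviation, then intercept and regression weights
sigma_star <- sqrt(sum(res^2) / rchisq(1, df))
beta_star <- coef_hat + sigma_star * t(chol(XtX_inv)) %*% rnorm(ncol(X_obs))

# Step 4: draw imputations from the posterior predictive distribution
imputations <- as.vector(X_mis %*% beta_star + rnorm(sum(!ry)) * sigma_star)
```

Because the parameters are redrawn in step 3 before the imputations are generated, repeated runs reflect uncertainty about the regression parameters as well as residual noise.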

The user can specify a predictorMatrix in the mice call to define which predictors are provided to this univariate imputation method. The lasso regularization will select, among the variables indicated by the user, the ones that are important for imputation at any given iteration. Therefore, users may force the exclusion of a predictor from a given imputation model by specifying a 0 entry. However, a non-zero entry does not guarantee the variable will be used, as this decision is ultimately made by the lasso variable selection procedure.
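
A minimal sketch of this interplay between predictorMatrix and the lasso selection is given below. The data frame and its variable names (x1, x2, x3, y) are hypothetical, and the small values of m and maxit are chosen only to keep the example fast.

```r
library(mice)  # the lasso step additionally requires the glmnet package

set.seed(42)
n <- 150
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + rnorm(n)
dat$y[sample(n, 30)] <- NA

# Start from the default predictor matrix, then force x3 out of the model for y
pred <- make.predictorMatrix(dat)
pred["y", "x3"] <- 0

# Use the lasso-select method for y; complete variables need no method
meth <- make.method(dat)
meth["y"] <- "lasso.select.norm"

imp <- mice(dat, method = meth, predictorMatrix = pred,
            m = 2, maxit = 2, printFlag = FALSE, seed = 42)

# x3 can never enter the imputation model for y; x1 and x2 are only candidates
completed <- complete(imp, 1)
```

Here x1 and x2 keep their entries of 1 in the row for y, so at each iteration the lasso decides whether either of them is actually used.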

The method is based on the Indirect Use of Regularized Regression (IURR) proposed by Zhao & Long (2016) and Deng et al. (2016).

References

Deng, Y., Chang, C., Ido, M. S., & Long, Q. (2016). Multiple imputation for general missing data patterns in the presence of high-dimensional data. Scientific Reports, 6(1), 1-10.

Zhao, Y., & Long, Q. (2016). Multiple imputation in the presence of high-dimensional data. Statistical Methods in Medical Research, 25(5), 2021-2035.

Author

Edoardo Costantini, 2021