Find index of matched donor units

Usage

matchindex(d, t, k = 5L)

Arguments

d: Numeric vector with values from donor cases.
t: Numeric vector with values from target cases.
k: Integer, number of unique donors from which a random draw is made. For k = 1 the function returns the index in d corresponding to the closest unit. For multiple imputation, the recommended range is k = 5 to k = 10.

Value

An integer vector with length(t) elements. Each element is an index into the vector d.

Details

For each element in t, the method finds the k nearest neighbours in d, randomly draws one of these neighbours, and returns its position in vector d.
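
To make the description above concrete, here is a minimal brute-force sketch in base R. The name matchindex_naive is hypothetical and the code is not the implementation used by the package. It assumes that k is capped at length(d), it makes no attempt to handle duplicated donor values specially, and it breaks ties by position rather than by shuffling.

```r
# Hypothetical reference implementation for illustration only;
# this is NOT the code behind matchindex().
matchindex_naive <- function(d, t, k = 5L) {
  # Assumption: never draw from more donors than are available
  k <- min(as.integer(k), length(d))
  vapply(t, function(ti) {
    # Positions in d, ordered by absolute distance to the target value
    nearest <- order(abs(d - ti))[seq_len(k)]
    # Draw one candidate; sample.int() avoids the sample(x, 1)
    # pitfall that occurs when x has length one
    nearest[sample.int(length(nearest), 1L)]
  }, integer(1))
}

d <- c(-5, 5, 0, 10, 12)
t <- c(-6, -4, 0, 2, 4, -2, 6)

# With k = 1 no random choice is left: each target gets its closest donor
matchindex_naive(d, t, k = 1L)
#> [1] 1 1 3 3 2 3 2
```

This version computes all length(d) distances for every target, so its cost grows with length(d) * length(t). It is useful mainly as a check against a faster routine.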

Fast predictive mean matching algorithm in seven steps:

1. Shuffle records to remove effects of ties
2. Obtain sorting order on shuffled data
3. Calculate index on input data and sort it
4. Pre-sample vector h with values between 1 and k

For each of the n0 elements in t:

5. find the two adjacent neighbours
6. find the h_i'th nearest neighbour
7. store the index of that neighbour

Return vector of n0 positions in d.
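
The steps listed above can be sketched in base R as follows. This is an illustrative reading of the algorithm, not the package's own code. The name matchindex_sorted is hypothetical, findInterval() stands in for the neighbour search, k is assumed to be capped at length(d), and exact distance ties are resolved in favour of the lower neighbour.

```r
# Hypothetical sketch of the sort-and-search idea; NOT the mice source.
matchindex_sorted <- function(d, t, k = 5L) {
  n1 <- length(d)
  k <- min(as.integer(k), n1)          # assumption: cap k at the donor count
  shuffle <- sample.int(n1)            # shuffle so tied donors get a random order
  ord <- shuffle[order(d[shuffle])]    # positions in d, sorted by donor value
  ds <- d[ord]                         # donor values in ascending order
  h <- sample.int(k, length(t), replace = TRUE)  # pre-sampled ranks in 1..k
  out <- integer(length(t))
  for (i in seq_along(t)) {
    lo <- findInterval(t[i], ds)       # ds[lo] <= t[i] < ds[lo + 1]; 0 if below all
    hi <- lo + 1L
    pick <- NA_integer_
    for (j in seq_len(h[i])) {         # walk outward until the h[i]-th neighbour
      take_lo <- hi > n1 || (lo >= 1L && t[i] - ds[lo] <= ds[hi] - t[i])
      if (take_lo) {
        pick <- lo
        lo <- lo - 1L
      } else {
        pick <- hi
        hi <- hi + 1L
      }
    }
    out[i] <- ord[pick]                # translate back to a position in d
  }
  out
}

d <- c(-5, 5, 0, 10, 12)
t <- c(-6, -4, 0, 2, 4, -2, 6)
matchindex_sorted(d, t, k = 1L)
#> [1] 1 1 3 3 2 3 2
```

Sorting the donors once and locating each target by binary search means each target costs roughly log(length(d)) + k operations, rather than a full scan of d.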

We may use the function to perform predictive mean matching under a given predictive model. To do so, specify both d and t as predictions from the same model. Suppose that y contains the observed outcomes of the donor cases (in the same sequence as d), then y[matchindex(d, t)] returns one matched outcome for every target case.

See https://github.com/amices/mice/issues/236. This function is a replacement for the matcher() function that has been the default in mice since version 2.22 (June 2014).

Author

Stef van Buuren, Nasinski Maciej, Alexander Robitzsch

Examples

set.seed(1)

# Inputs need not be sorted
d <- c(-5, 5, 0, 10, 12)
t <- c(-6, -4, 0, 2, 4, -2, 6)

# Index (in vector d) of closest match
idx <- matchindex(d, t, 1)
idx
#> [1] 1 1 3 3 2 3 2

# To check: show values of closest match
d[idx]
#> [1] -5 -5  0  0  5  0  5

# Random draw among indices of the 5 closest predictors
matchindex(d, t)
#> [1] 3 1 5 5 2 3 1

# An example
train <- mtcars[1:20, ]
test <- mtcars[21:32, ]
fit <- lm(mpg ~ disp + cyl, data = train)
d <- fitted.values(fit)
t <- predict(fit, newdata = test)  # note: not using mpg
idx <- matchindex(d, t)

# Borrow values from train to produce 12 synthetic values for mpg in test.
# Synthetic values are plausible values that could have been observed if
# they had been measured.
train$mpg[idx]
#>  [1] 22.8 16.4 15.2 14.3 14.7 30.4 22.8 22.8 14.3 21.0 17.3 24.4

# Exercise: Create a distribution of 1000 plausible values for each of the
# twelve mpg entries in test, and count how many times the true value
# (which we know here) is located within the inter-quartile range of each
# distribution. Is your count anywhere close to 500? Why? Why not?