In this document we explore the confidence validity of bootstrap estimates. The goal of this investigation is to study the possibility of using the bootstrap to simulate a monte carlo experiment on a single (small) data set.
This section covers the global parameters and the necessary packages for executing the simulation study.
Later on, we will be using random numbers. We fix the seed of the random number generator, so that we can always reproduce the results obtained in this document.
set.seed(123)
The random seed is a number used to initialize the pseudo-random number generator. The initial value (the seed) itself does not need to be random. The resulting process is random because the seed value is not used to generate randomness - it merely forms the starting point of the algorithm for which the results are random.
library(dplyr) # Data manipulation
library(magrittr) # Pipes
library(purrr) # Functional programming
library(mice) # Data imputation
library(psych) # Descriptives
library(knitr) #
library(kableExtra) #Cool tables
In this document we perform a Monte Carlo experiment. Here we define the number of replications that makes up this experiment.
nsim = 10000
We define a completed version of the mice::boys
data set as the origin - conventionally we would consider this to be an infinitely large population - against which we compare our estimates.
origin <- mice(boys, m=1, print = FALSE) %>% # impute only once
mice::complete() # extract completed data
We define the following model to be evaluated in simulation \[\text{wgt} = \beta_0 + \beta_1\text{hgt} + \beta2\text{age} + \epsilon\] For the origin
set, we extract the linear model as follows:
truemodel <- origin %$%
lm(wgt ~ hgt + age)
We use the replicate()
function to draw nsim =
\(10^4\) bootstrap sets from the 748 rows in the origin
.
simdata <- replicate(nsim,
origin[sample(1:748, 748, replace = TRUE), ],
simplify = FALSE)
First we evaluate the model of interest on each of the drawn bootstrap samples. We do this with the purrr::map()
function, which maps the evaluation on each of the listed elements in the simdata
object.
model <- simdata %>%
map(~ lm(wgt ~ hgt + age, data = .x))
Then, we extract the estimates from the model
evaluations.
estimates <- model %>%
map(coef) %>%
do.call(rbind, args = .) # bind rows into matrix
The obtained average estimates are
estimates %>%
describe(quant = c(.025, .975)) %>%
.[, c(2:4, 8:9, 11:12, 14:15)] %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = F)
n | mean | sd | min | max | skew | kurtosis | Q0.025 | Q0.975 | |
---|---|---|---|---|---|---|---|---|---|
(Intercept) | 10000 | -9.1646043 | 2.1277711 | -18.4881782 | -0.2193191 | -0.0459868 | -0.0025258 | -13.3901047 | -5.0878368 |
hgt | 10000 | 0.1903412 | 0.0306338 | 0.0587059 | 0.3172430 | 0.0225346 | -0.0157850 | 0.1310087 | 0.2506239 |
age | 10000 | 2.3309900 | 0.2092572 | 1.5951185 | 3.2581944 | 0.0085180 | -0.0440822 | 1.9243443 | 2.7339516 |
We find that the estimates display on average a low bias.
estimates %>%
colMeans
## (Intercept) hgt age
## -9.1646043 0.1903412 2.3309900
R
We use the confint()
function to extract the confidence intervals for each of the linear models in the model
object.
ci <- model %>%
map(confint)
We then calculate the proportion of confidence intervals that cover the truemodel
parameters
cov <- ci %>%
map(function(x) x[, 1] <= coef(truemodel) & coef(truemodel) <= x[, 2]) %>%
do.call(rbind, args = .) # bind rows into matrix
For unknown \(\sigma^2_\beta\), the \(1-\alpha\) confidence interval for \(\beta\) is defined as
\[\mu = \hat{\beta} \pm t_{n-1, a/2}\frac{S}{\sqrt{N}}.\]. The following code extracts the estimates and the standard error
manual <- model %>%
map(function(x) cbind(vcov(x) %>% diag %>% sqrt(.), coef(x))) %>% # est and resp. vars
map(function(x) cbind(x[, 2] - qt(.975, 747) * x[, 1], x[, 2] + qt(.975, 747) * x[, 1])) %>%
map(function(x) x[, 1] <= coef(truemodel) & coef(truemodel) <= x[, 2]) %>%
do.call(rbind, args = .) # bind rows into matrix
Instead of drawing the standard errors from every model, we can also make use of the bootstrap estimate of the standard error - i.e. the square root of the variance of the vector of estimates. This gives us the empirical standard deviation of the realized bootstrap sampling distribution of the estimates.
manual2 <- model %>%
map(function(x) cbind(sqrt(diag(var(estimates))), coef(x))) %>% # sd of
map(function(x) cbind(x[, 2] - qt(.975, 747) * x[, 1], x[, 2] + qt(.975, 747) * x[, 1])) %>%
map(function(x) x[, 1] <= coef(truemodel) & coef(truemodel) <= x[, 2]) %>%
do.call(rbind, args = .)
Another approach is to calculate the bootstrap CI based on the quantiles of the vectors of estimates. Simply dividing the confidence interval width would then yield an empirical equivalent to \(\pm t_{n-1, a/2}\frac{S}{\sqrt{N}}\).
# confidence intervals and widths
intercept <- data.frame(est = mean(estimates[, 1]),
ciw = diff(quantile(estimates[, 1], probs = c(.025, .975)))) %>%
mutate(low = est - ciw/2,
up = est + ciw/2)
hgt <- data.frame(est = mean(estimates[, 2]),
ciw = diff(quantile(estimates[, 2], probs = c(.025, .975)))) %>%
mutate(low = est - ciw/2,
up = est + ciw/2)
age <- data.frame(est = mean(estimates[, 3]),
ciw = diff(quantile(estimates[, 3], probs = c(.025, .975)))) %>%
mutate(low = est - ciw/2,
up = est + ciw/2)
# coverages
covint <- data.frame(est = estimates[, 1],
low = estimates[, 1] - intercept$ciw/2,
up = estimates[, 1] + intercept$ciw/2) %>%
mutate(cov = low <= coef(truemodel)[1] & coef(truemodel)[1] <= up)
covhgt <- data.frame(est = estimates[, 2],
low = estimates[, 2] - hgt$ciw/2,
up = estimates[, 2] + hgt$ciw/2) %>%
mutate(cov = low <= coef(truemodel)[2] & coef(truemodel)[2] <= up)
covage <- data.frame(est = estimates[, 3],
low = estimates[, 3] - age$ciw/2,
up = estimates[, 3] + age$ciw/2) %>%
mutate(cov = low <= coef(truemodel)[3] & coef(truemodel)[3] <= up)
Below are the coverages for each of the investigated scenarios. Results are depicted for the coverage of the confidence intervals over the regression estimates for \(10^4\) simulations.
output <- rbind(colMeans(cov),
colMeans(manual),
colMeans(manual2),
colMeans(cbind(covint$cov, covhgt$cov, covage$cov)))
rownames(output) <- c("confint",
"manual",
"bootstrap SE",
"bootstrap CI")
kable(output, "html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = F)
(Intercept) | hgt | age | |
---|---|---|---|
confint | 0.9396 | 0.9273 | 0.9243 |
manual | 0.9396 | 0.9273 | 0.9243 |
bootstrap SE | 0.9514 | 0.9514 | 0.9528 |
bootstrap CI | 0.9496 | 0.9500 | 0.9501 |
The replication of these results can be found here
END OF DOCUMENT
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.2.1 knitr_1.29 psych_2.0.7 mice_3.11.0
## [5] purrr_0.3.4 magrittr_1.5 dplyr_1.0.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 highr_0.8 pillar_1.4.6 compiler_4.0.2
## [5] tools_4.0.2 digest_0.6.25 viridisLite_0.3.0 evaluate_0.14
## [9] lifecycle_0.2.0 tibble_3.0.3 nlme_3.1-148 lattice_0.20-41
## [13] pkgconfig_2.0.3 rlang_0.4.7 rstudioapi_0.11 yaml_2.2.1
## [17] parallel_4.0.2 xfun_0.16 stringr_1.4.0 httr_1.4.2
## [21] xml2_1.3.2 generics_0.0.2 vctrs_0.3.2 nnet_7.3-14
## [25] webshot_0.5.2 grid_4.0.2 tidyselect_1.1.0 glue_1.4.1
## [29] R6_2.4.1 rmarkdown_2.3 tidyr_1.1.1 MASS_7.3-51.6
## [33] scales_1.1.1 backports_1.1.8 ellipsis_0.3.1 htmltools_0.5.0
## [37] mnormt_2.0.1 rvest_0.3.6 colorspace_1.4-1 stringi_1.4.6
## [41] munsell_0.5.0 broom_0.7.0 tmvnsim_1.0-2 crayon_1.3.4