1 Aim

In this document we explore the confidence validity of bootstrap estimates. The goal of this investigation is to study the possibility of using the bootstrap to simulate a monte carlo experiment on a single (small) data set.

2 Set up

This section covers the global parameters and the necessary packages for executing the simulation study.

2.1 Fixing the random seed

Later on, we will be using random numbers. We fix the seed of the random number generator, so that we can always reproduce the results obtained in this document.

set.seed(123)

The random seed is a number used to initialize the pseudo-random number generator. The initial value (the seed) itself does not need to be random. The resulting process is random because the seed value is not used to generate randomness - it merely forms the starting point of the algorithm for which the results are random.

2.2 Loading the required packages

library(dplyr)    # Data manipulation
library(magrittr) # Pipes
library(purrr)    # Functional programming
library(mice)     # Data imputation
library(psych)    # Descriptives
library(knitr)    # 
library(kableExtra) #Cool tables

2.3 Number of simulations

In this document we perform a Monte Carlo experiment. Here we define the number of replications that makes up this experiment.

nsim = 10000

3 Establish the origin

We define a completed version of the mice::boys data set as the origin - conventionally we would consider this to be an infinitely large population - against which we compare our estimates.

origin <- mice(boys, m=1, print = FALSE) %>% # impute only once
  mice::complete() # extract completed data

3.1 Model of interest

We define the following model to be evaluated in simulation \[\text{wgt} = \beta_0 + \beta_1\text{hgt} + \beta2\text{age} + \epsilon\] For the origin set, we extract the linear model as follows:

truemodel <- origin %$%
  lm(wgt ~ hgt + age)

4 Draw the bootstrap sets

We use the replicate() function to draw nsim =\(10^4\) bootstrap sets from the 748 rows in the origin.

simdata <- replicate(nsim, 
                     origin[sample(1:748, 748, replace = TRUE), ], 
                     simplify = FALSE)

5 Bias

First we evaluate the model of interest on each of the drawn bootstrap samples. We do this with the purrr::map() function, which maps the evaluation on each of the listed elements in the simdata object.

model <- simdata %>% 
  map(~ lm(wgt ~ hgt + age, data = .x))

Then, we extract the estimates from the model evaluations.

estimates <- model %>% 
  map(coef) %>% 
  do.call(rbind, args = .) # bind rows into matrix

The obtained average estimates are

estimates %>% 
  describe(quant = c(.025, .975)) %>%
  .[, c(2:4, 8:9, 11:12, 14:15)] %>%
  kable("html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = F)

	n	mean	sd	min	max	skew	kurtosis	Q0.025	Q0.975
(Intercept)	10000	-9.1646043	2.1277711	-18.4881782	-0.2193191	-0.0459868	-0.0025258	-13.3901047	-5.0878368
hgt	10000	0.1903412	0.0306338	0.0587059	0.3172430	0.0225346	-0.0157850	0.1310087	0.2506239
age	10000	2.3309900	0.2092572	1.5951185	3.2581944	0.0085180	-0.0440822	1.9243443	2.7339516

We find that the estimates display on average a low bias.

estimates %>% 
  colMeans

## (Intercept)         hgt         age 
##  -9.1646043   0.1903412   2.3309900

6 Confidence Validity

7 Approach 1: CI’s with base `R`

We use the confint() function to extract the confidence intervals for each of the linear models in the model object.

ci <- model %>% 
  map(confint)

We then calculate the proportion of confidence intervals that cover the truemodel parameters

cov <- ci %>% 
  map(function(x) x[, 1] <= coef(truemodel) & coef(truemodel) <= x[, 2]) %>%
  do.call(rbind, args = .) # bind rows into matrix

8 Approach 2: manually calculated CI’s

For unknown \(\sigma^2_\beta\), the \(1-\alpha\) confidence interval for \(\beta\) is defined as
\[\mu = \hat{\beta} \pm t_{n-1, a/2}\frac{S}{\sqrt{N}}.\]. The following code extracts the estimates and the standard error

manual <- model %>% 
  map(function(x) cbind(vcov(x) %>% diag %>% sqrt(.), coef(x))) %>% # est and resp. vars
  map(function(x) cbind(x[, 2] - qt(.975, 747) * x[, 1],  x[, 2] + qt(.975, 747) * x[, 1])) %>%
  map(function(x) x[, 1] <= coef(truemodel) & coef(truemodel) <= x[, 2]) %>%
  do.call(rbind, args = .) # bind rows into matrix

9 Approach 3: bootstrap CI’s cf. bootstrap SE

Instead of drawing the standard errors from every model, we can also make use of the bootstrap estimate of the standard error - i.e. the square root of the variance of the vector of estimates. This gives us the empirical standard deviation of the realized bootstrap sampling distribution of the estimates.

manual2 <- model %>% 
  map(function(x) cbind(sqrt(diag(var(estimates))), coef(x))) %>% # sd of 
  map(function(x) cbind(x[, 2] - qt(.975, 747) * x[, 1],  x[, 2] + qt(.975, 747) * x[, 1])) %>%
  map(function(x) x[, 1] <= coef(truemodel) & coef(truemodel) <= x[, 2]) %>%
  do.call(rbind, args = .)

10 Approach 4: bootstrap CI’s cf average CI

Another approach is to calculate the bootstrap CI based on the quantiles of the vectors of estimates. Simply dividing the confidence interval width would then yield an empirical equivalent to \(\pm t_{n-1, a/2}\frac{S}{\sqrt{N}}\).

# confidence intervals and widths
intercept <- data.frame(est = mean(estimates[, 1]), 
                        ciw = diff(quantile(estimates[, 1], probs = c(.025, .975)))) %>%
  mutate(low = est - ciw/2,
         up = est + ciw/2)
hgt <- data.frame(est = mean(estimates[, 2]), 
                  ciw = diff(quantile(estimates[, 2], probs = c(.025, .975)))) %>%
  mutate(low = est - ciw/2,
         up = est + ciw/2)
age <- data.frame(est = mean(estimates[, 3]), 
                  ciw = diff(quantile(estimates[, 3], probs = c(.025, .975)))) %>%
  mutate(low = est - ciw/2,
         up = est + ciw/2)
# coverages
covint <- data.frame(est = estimates[, 1], 
                     low = estimates[, 1] - intercept$ciw/2,
                     up = estimates[, 1] + intercept$ciw/2) %>%
  mutate(cov = low <= coef(truemodel)[1] & coef(truemodel)[1] <= up)
covhgt <- data.frame(est = estimates[, 2], 
                     low = estimates[, 2] - hgt$ciw/2,
                     up = estimates[, 2] + hgt$ciw/2) %>%
  mutate(cov = low <= coef(truemodel)[2] & coef(truemodel)[2] <= up)
covage <- data.frame(est = estimates[, 3], 
                     low = estimates[, 3] - age$ciw/2,
                     up = estimates[, 3] + age$ciw/2) %>%
  mutate(cov = low <= coef(truemodel)[3] & coef(truemodel)[3] <= up)

11 Results

Below are the coverages for each of the investigated scenarios. Results are depicted for the coverage of the confidence intervals over the regression estimates for \(10^4\) simulations.

output <- rbind(colMeans(cov), 
                colMeans(manual), 
                colMeans(manual2), 
                colMeans(cbind(covint$cov, covhgt$cov, covage$cov)))
rownames(output) <- c("confint", 
                      "manual", 
                      "bootstrap SE", 
                      "bootstrap CI")
kable(output, "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = F)

	(Intercept)	hgt	age
confint	0.9396	0.9273	0.9243
manual	0.9396	0.9273	0.9243
bootstrap SE	0.9514	0.9514	0.9528
bootstrap CI	0.9496	0.9500	0.9501

12 Replication

The replication of these results can be found here

END OF DOCUMENT

sessionInfo()

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] kableExtra_1.2.1 knitr_1.29       psych_2.0.7      mice_3.11.0     
## [5] purrr_0.3.4      magrittr_1.5     dplyr_1.0.1     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5        highr_0.8         pillar_1.4.6      compiler_4.0.2   
##  [5] tools_4.0.2       digest_0.6.25     viridisLite_0.3.0 evaluate_0.14    
##  [9] lifecycle_0.2.0   tibble_3.0.3      nlme_3.1-148      lattice_0.20-41  
## [13] pkgconfig_2.0.3   rlang_0.4.7       rstudioapi_0.11   yaml_2.2.1       
## [17] parallel_4.0.2    xfun_0.16         stringr_1.4.0     httr_1.4.2       
## [21] xml2_1.3.2        generics_0.0.2    vctrs_0.3.2       nnet_7.3-14      
## [25] webshot_0.5.2     grid_4.0.2        tidyselect_1.1.0  glue_1.4.1       
## [29] R6_2.4.1          rmarkdown_2.3     tidyr_1.1.1       MASS_7.3-51.6    
## [33] scales_1.1.1      backports_1.1.8   ellipsis_0.3.1    htmltools_0.5.0  
## [37] mnormt_2.0.1      rvest_0.3.6       colorspace_1.4-1  stringi_1.4.6    
## [41] munsell_0.5.0     broom_0.7.0       tmvnsim_1.0-2     crayon_1.3.4

Confidence validity of bootstrap regression estimates

Gerko Vink

Markup Languages and Reproducible Programming in Statistics

1 Aim

2 Set up

2.1 Fixing the random seed

2.2 Loading the required packages

2.3 Number of simulations

3 Establish the origin

3.1 Model of interest

4 Draw the bootstrap sets

5 Bias

6 Confidence Validity

7 Approach 1: CI’s with base `R`

8 Approach 2: manually calculated CI’s

9 Approach 3: bootstrap CI’s cf. bootstrap SE

10 Approach 4: bootstrap CI’s cf average CI

11 Results

12 Replication

Confidence validity of bootstrap regression estimates

Gerko Vink

Markup Languages and Reproducible Programming in Statistics

1 Aim

2 Set up

2.1 Fixing the random seed

2.2 Loading the required packages

2.3 Number of simulations

3 Establish the origin

3.1 Model of interest

4 Draw the bootstrap sets

5 Bias

6 Confidence Validity

7 Approach 1: CI’s with base R

8 Approach 2: manually calculated CI’s

9 Approach 3: bootstrap CI’s cf. bootstrap SE

10 Approach 4: bootstrap CI’s cf average CI

11 Results

12 Replication

7 Approach 1: CI’s with base `R`