Confidence intervals via bootstrapping

Confidence intervals

A plausible range of values for the population parameter is called a confidence interval.

  • If we report a point estimate, we probably won’t hit the exact population parameter.

  • If we report a range of plausible values we have a good shot at capturing the parameter.

Bootstrapping

  • One method of constructing a confidence interval, via simulation, is bootstrapping.

  • This term comes from the phrase "pulling oneself up by one's bootstraps", which is a metaphor for accomplishing an impossible task without any outside help.

  • In this case the "impossible" task is estimating a sampling distribution with a single sample and no specific theoretical results.

Bootstrapping scheme

  1. Generate a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample

  2. Calculate the bootstrap statistic - a statistic (such as the mean, median, or proportion) computed on the bootstrap sample

  3. Repeat steps (1) and (2) many times to create a bootstrap distribution

Kissing data

One of the earliest examples of behavioral asymmetry is a preference in humans for turning the head to the right, rather than to the left, during the final weeks of gestation and for the first 6 months after birth. This is thought to influence subsequent development of perceptual and motor preferences. A study of 124 couples found that 64.5% turned their heads to the right when kissing.


Reference: Güntürkün, Onur. “Human behaviour: adult persistence of head-turning asymmetry.” Nature 421.6924 (2003): 711.

Load the data

library(dplyr)    # for %>%, tbl_df(), sample_n()
library(ggplot2)  # for the plots below

(kissing = read.csv("https://stat.duke.edu/~mc301/data/kissing.csv", 
                    stringsAsFactors = FALSE) %>% tbl_df())
## # A tibble: 124 × 1
##     side
##    <chr>
## 1   left
## 2  right
## 3  right
## 4  right
## 5  right
## 6  right
## 7  right
## 8  right
## 9  right
## 10 right
## # ... with 114 more rows

Bootstrapping in action

Let's create the bootstrap distribution for the kissing dataset.

set.seed(11012016)
nsim = 100
boot_dist = data.frame(stat = rep(NA, nsim))
for(i in 1:nsim)
{
  # step 1: resample the original data, with replacement, at the original size
  boot_sample = sample(kissing$side, size = nrow(kissing), replace = TRUE)
  # step 2: compute the bootstrap statistic (proportion turning right)
  boot_dist$stat[i] = sum(boot_sample == "right") / nrow(kissing)
}
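
For reference, here is one equivalent, more compact way to build the same bootstrap distribution using replicate(); this is a sketch of an alternative, not a change to the approach:

# same bootstrap distribution, using replicate() instead of an explicit loop;
# mean() of a logical vector gives the proportion of TRUEs
boot_dist = data.frame(
  stat = replicate(nsim, mean(sample(kissing$side, size = nrow(kissing),
                                     replace = TRUE) == "right"))
)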

Plotting the bootstrap distribution

ggplot(boot_dist, aes(x = stat)) + geom_dotplot()

Shape of the bootstrap distribution

To make the shape a little easier to detect, let's increase the number of simulations.

nsim = 15000
boot_dist = data.frame(stat = rep(NA, nsim))
for(i in 1:nsim)
{
  boot_sample = sample(kissing$side, size = nrow(kissing), replace = TRUE)
  boot_dist$stat[i] = sum(boot_sample == "right") / nrow(kissing)
}
ggplot(data = boot_dist, aes(x = stat)) + geom_histogram(bins=20)

The normal distribution, \(N(\mu, \sigma)\)

  • Unimodal and symmetric (bell curve)

  • Follows very strict guidelines about how variably the data are distributed around the mean

  • Many variables are nearly normal, but almost never are they exactly normal


68-95-99.7% rule

[Figure: the normal curve with approximately 68%, 95%, and 99.7% of observations falling within 1, 2, and 3 standard deviations of the mean]

To be more exact:

percentiles <- c(0.0015, 0.025, 0.16, 0.84, 0.975, 0.9985)
round(qnorm(percentiles), 2)
## [1] -2.97 -1.96 -0.99  0.99  1.96  2.97
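
We can also verify the 68-95-99.7% rule itself directly with pnorm():

# area under the standard normal curve within 1, 2, and 3 SDs of the mean
round(pnorm(c(1, 2, 3)) - pnorm(c(-1, -2, -3)), 4)
## [1] 0.6827 0.9545 0.9973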

95% bootstrap interval

The approximate 95% bootstrap confidence interval can be calculated as follows:

\[ \hat{p} \pm 1.96 \times SE_{boot} \]

  • \(\hat{p}\): observed sample proportion
  • \(SE_{boot}\): variability (sample standard deviation) of the bootstrap distribution

The term standard error is reserved for quantifying the variability of a sample statistic; the variability of individual observations is measured by the standard deviation.
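
As a check, the bootstrap standard error for the kissing data should be close to the formula-based standard error of a proportion, \(\sqrt{\hat{p}(1-\hat{p})/n}\). A quick sketch, assuming the boot_dist built above is still in the workspace:

# formula-based SE of the sample proportion
p_hat = sum(kissing$side == "right") / nrow(kissing)
sqrt(p_hat * (1 - p_hat) / nrow(kissing))

# bootstrap SE: the standard deviation of the bootstrap statistics
sd(boot_dist$stat)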

Calculating the 95% bootstrap interval

# observed sample proportion
stat = sum(kissing$side == "right") / nrow(kissing)
# bootstrap SE: standard deviation of the bootstrap distribution
se_boot = sd(boot_dist$stat)
(boot_int = round(stat + c(-1,1) * 1.96 * se_boot, 4))
## [1] 0.5607 0.7296
ggplot(data = boot_dist, aes(x = stat)) +
  geom_histogram() +
  geom_vline(xintercept = boot_int, lty = 2)
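
An alternative to the SE-based interval, which does not lean on the normal shape of the bootstrap distribution, is the percentile method: take the middle 95% of the bootstrap statistics directly. The loess example at the end of these notes uses this same idea. A minimal sketch:

# percentile method: the 2.5th and 97.5th percentiles of the bootstrap distribution
quantile(boot_dist$stat, probs = c(0.025, 0.975))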

Interpretation

We are 95% confident that between 56% and 73% of couples turn to the right when kissing.

  • Note we do not need a null hypothesis, but we can still achieve a similar decision rule

  • 95% confident does not mean there is a 95% chance that the interval contains the true \(p\); rather, it describes how often intervals built this way capture \(p\), as the simulation sketch below illustrates
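
To see what the 95% does mean, we can simulate drawing many samples from a population with a known proportion and check how often the resulting intervals capture it. This sketch uses the formula-based SE of a proportion in place of a full bootstrap for each sample, purely to keep the simulation fast; the true value p = 0.645 is assumed known here for illustration:

set.seed(1)
p_true = 0.645
n = 124
n_rep = 1000
covered = logical(n_rep)
for(i in 1:n_rep)
{
  # draw a new sample of couples from the assumed population
  samp = sample(c("right", "left"), size = n, replace = TRUE,
                prob = c(p_true, 1 - p_true))
  p_hat = sum(samp == "right") / n
  se = sqrt(p_hat * (1 - p_hat) / n)  # stand-in for SE_boot
  ci = p_hat + c(-1, 1) * 1.96 * se
  covered[i] = ci[1] <= p_true & p_true <= ci[2]
}
mean(covered)  # should be close to 0.95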

Testing means

Rent in Durham

Data from a random sample of twenty 1+ bedroom apartments in Durham in 2012.

(durham_apts = read.csv("https://stat.duke.edu/~mc301/data/durham_apts.csv") %>% tbl_df())
## # A tibble: 20 × 1
##     rent
##    <int>
## 1    625
## 2    733
## 3    895
## 4    929
## 5    775
## 6   1349
## 7    599
## 8    749
## 9   1020
## 10   799
## 11   705
## 12   665
## 13  1282
## 14  1143
## 15  1209
## 16   500
## 17  1495
## 18  1076
## 19   975
## 20   879

Exploratory analysis

ggplot(data = durham_apts, aes(x = rent)) +
  geom_dotplot()

durham_apts %>%
  summarise(xbar = mean(rent), med = median(rent))
## # A tibble: 1 × 2
##    xbar   med
##   <dbl> <dbl>
## 1 920.1   887

Bootstrap distribution of the mean

nsim = 15000
boot_dist = data.frame(mean = rep(NA, nsim))
for(i in 1:nsim)
{
  boot_sample = sample(durham_apts$rent, size = nrow(durham_apts), replace = TRUE)
  boot_dist$mean[i] = mean(boot_sample)
}
ggplot(data = boot_dist, aes(x = mean)) + geom_histogram(bins=20)

Bootstrap distribution of the median

nsim = 15000
boot_dist = data.frame(med = rep(NA, nsim))
for(i in 1:nsim)
{
  boot_sample = sample(durham_apts$rent, size = nrow(durham_apts), replace = TRUE)
  boot_dist$med[i] = median(boot_sample)
}
ggplot(data = boot_dist, aes(x = med)) + geom_histogram(bins=20)

Generalized Approach - Helper functions

source("https://stat.duke.edu/~cr173/Sta112_Fa16/code/one_num_boot.R")
source("https://stat.duke.edu/~cr173/Sta112_Fa16/code/one_num_test.R")

Inference for one numeric variable

one_num_boot
## function (data, statistic = NULL, nsim = 15000, conf_level = 0.95, 
##     seed = NULL, digits = 4, print_summ = TRUE, print_plot = TRUE, 
##     print_output = TRUE, return_invis = TRUE) 
## {
##     installed_packages = names(installed.packages()[, "Package"])
##     if (!("ggplot2" %in% installed_packages)) {
##         stop("Install the ggplot2 package.")
##     }
##     if (is.null(data)) {
##         stop("Missing data.")
##     }
##     if (is.data.frame(data)) {
##         stop("Data should be a vector, not a data frame.")
##     }
##     if (is.null(statistic)) {
##         stop("Missing statistic.")
##     }
##     data = data[!is.na(data)]
##     n = length(data)
##     if (!is.null(seed)) {
##         set.seed(seed)
##     }
##     calc_stat = function(x, statistic) {
##         statistic(x)
##     }
##     stat = calc_stat(data, statistic)
##     boot_dist = data.frame(stat = rep(NA, nsim))
##     for (i in 1:nsim) {
##         boot_sample = sample(data, size = n, replace = TRUE)
##         boot_dist$stat[i] = calc_stat(boot_sample, statistic)
##     }
##     crit = qt(conf_level + (1 - conf_level)/2, df = n - 1)
##     boot_se = sd(boot_dist$stat)
##     ci = stat + c(-1, 1) * crit * boot_se
##     if (print_summ == TRUE) {
##         cat(paste0("Summary stats: n = ", n, ", sample ", paste(substitute(statistic)), 
##             " = ", round(stat, digits), "\n"))
##     }
##     if (print_output == TRUE) {
##         ci_pretty = round(ci, digits)
##         conf_level_perc = as.character(conf_level * 100)
##         cat(paste0(conf_level_perc, "% CI: (", ci_pretty[1], 
##             ", ", ci_pretty[2], ")"))
##     }
##     if (print_plot == TRUE) {
##         boot_plot = ggplot(data = boot_dist, aes(x = stat), environment = environment()) + 
##             geom_histogram() + xlab(paste0("bootstrap ", paste(substitute(statistic)), 
##             "s")) + geom_vline(xintercept = ci, lty = 2)
##         suppressWarnings(suppressMessages(print(boot_plot)))
##     }
##     if (return_invis == TRUE) {
##         return(invisible(list(n = n, statistic = stat, SE_boot = boot_se, 
##             conf_level = conf_level, ci = ci)))
##     }
##     else {
##         cat("\n")
##         return(list(n = n, statistic = stat, SE_boot = boot_se, 
##             conf_level = conf_level, ci = ci))
##     }
## }

Bootstrap CI for mean rent in Durham

Estimate the average rent in Durham for 1+ bedroom apartments using a 95% confidence interval.

one_num_boot(durham_apts$rent, statistic = mean, seed = 195729)
## Summary stats: n = 20, sample mean = 920.1
## 95% CI: (795.968, 1044.232)

Bootstrap CI for median rent in Durham

Estimate the median rent in Durham for 1+ bedroom apartments using a 95% confidence interval.

one_num_boot(durham_apts$rent, statistic = median, seed = 571035)
## Summary stats: n = 20, sample median = 887
## 95% CI: (712.8174, 1061.1826)

Bootstrap testing for a mean

  • Construct the bootstrap distribution

  • Shift it to be centered at the null value

  • Calculate the p-value as usual: the probability of the observed or a more extreme outcome (more extreme in the direction of the alternative hypothesis), given that the null value is true; see the sketch below
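
The one_num_test helper sourced earlier implements this scheme; its source is not printed here, but the shift-and-count logic for a mean might look like the following sketch (illustrative only, not the helper's actual code):

null = 800
obs_mean = mean(durham_apts$rent)

# step 1: bootstrap distribution of the mean, as before
nsim = 15000
boot_means = rep(NA, nsim)
for(i in 1:nsim)
{
  boot_means[i] = mean(sample(durham_apts$rent, size = nrow(durham_apts),
                              replace = TRUE))
}

# step 2: shift the distribution so it is centered at the null value
null_dist = boot_means - obs_mean + null

# step 3: p-value for HA: mu > null is the proportion of the shifted
# distribution at or beyond the observed mean
mean(null_dist >= obs_mean)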

Bootstrap test for average rent in Durham

Do these data provide convincing evidence that the average rent in Durham for 1+ bedroom apartments is greater than $800?

one_num_test(durham_apts$rent, statistic = mean, null = 800, alt = "greater", seed = 28732)
## H0: mu = 800
## HA: mu > 800
## Summary stats: n = 20, sample mean = 920.1
## p-value =  0.0251

Bootstrap test for median rent in Durham

Do these data provide convincing evidence that the median rent in Durham for 1+ bedroom apartments is greater than $800?

one_num_test(durham_apts$rent, statistic = median, null = 800, alt = "greater", seed = 235671)
## H0: pop_median = 800
## HA: pop_median > 800
## Summary stats: n = 20, sample median = 887
## p-value =  0.1443

Other helper functions

For future use …

source("https://stat.duke.edu/~cr173/Sta112_Fa16/code/one_cat_boot.R")
source("https://stat.duke.edu/~cr173/Sta112_Fa16/code/one_cat_test.R")

More Bootstrapping

Generalizing Bootstrapping

There is nothing special about proportions, means, or medians with regard to using the bootstrap - it is applicable to almost any statistical procedure you can imagine.

  • Bootstrapping is simple, but it is not guaranteed to converge to the correct answer (its justification is asymptotic); see the sketch after this list

  • Data must be "representative" of the true population

  • Limited sample sizes are always an issue
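
A classic example of the convergence caveat is the sample maximum: no matter how large \(n\) is, a bootstrap sample contains the observed maximum with probability \(1 - (1 - 1/n)^n \approx 63\%\), so the bootstrap distribution of the max has a large point mass at a single value and does not approximate the true sampling distribution. A quick sketch:

set.seed(1)
x = runif(100)  # sample from Uniform(0, 1)

# bootstrap distribution of the sample maximum
boot_max = replicate(10000, max(sample(x, length(x), replace = TRUE)))

# a large point mass sits exactly at the observed maximum
mean(boot_max == max(x))  # roughly 1 - (1 - 1/100)^100, about 0.63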

Loess - Local Regression

Below we fit a local regression (loess) model to some synthetic data and estimate a 95% confidence band for the fitted curve.

set.seed(3212016)
# synthetic data: one sine cycle plus uniform noise on [-1, 1]
d = data.frame(x = 1:120) %>%
    mutate(y = sin(2*pi*x/120) + runif(length(x),-1,1))

l = loess(y ~ x, data=d)
d$pred_y = predict(l)
d$pred_y_se = predict(l, se=TRUE)$se.fit  # pointwise SE of the fitted curve

ggplot(d, aes(x,y)) +
  geom_point() +
  geom_line(aes(y=pred_y)) +
  geom_line(aes(y=pred_y + 1.96 * pred_y_se), color="red") +
  geom_line(aes(y=pred_y - 1.96 * pred_y_se), color="red")

Loess bootstrap

Just as for the mean or median, we can use the bootstrap to estimate the uncertainty of the fitted curve at any point.

n_rep = 10000
res = matrix(NA, ncol=n_rep, nrow=nrow(d))

for(i in 1:ncol(res))
{ 
  # resample rows (x, y pairs) with replacement and refit the loess model
  bootstrap_samp = d %>% select(x,y) %>% sample_n(nrow(d), replace=TRUE)
  # evaluate the refit curve at the original x values
  res[,i] = predict(loess(y ~ x, data=bootstrap_samp), newdata=d)
}

# Calculate the 95% bootstrap confidence interval for the fitted curve
d$bs_low = apply(res,1,quantile,probs=c(0.025), na.rm=TRUE)
d$bs_up  = apply(res,1,quantile,probs=c(0.975), na.rm=TRUE)

ggplot(d, aes(x,y)) +
  geom_point() +
  geom_line(aes(y=pred_y)) +
  geom_line(aes(y=pred_y + 1.96 * pred_y_se), color="red") +
  geom_line(aes(y=pred_y - 1.96 * pred_y_se), color="red") +
  geom_line(aes(y=bs_low), color="blue") +
  geom_line(aes(y=bs_up), color="blue")