A plausible range of values for the population parameter is called a confidence interval.
If we report a point estimate, we probably won’t hit the exact population parameter.
If we report a range of plausible values we have a good shot at capturing the parameter.
One method of constructing a confidence interval, via simulation, is bootstrapping.
This term comes from the phrase "pulling oneself up by one's bootstraps", which is a metaphor for accomplishing an impossible task without any outside help.
In this case the "impossible" task is estimating a sampling distribution with a single sample and no specific theoretical results.
Generate a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample
Calculate the bootstrap statistic - a statistic such as mean, median, proportion, etc. computed on the bootstrap samples
Repeat steps (1) and (2) many times to create a bootstrap distribution
One of the earliest examples of behavioral asymmetry is a preference in humans for turning the head to the right, rather than to the left, during the final weeks of gestation and for the first 6 months after birth. This is thought to influence subsequent development of perceptual and motor preferences. A study of 124 couples found that 64.5% turned their heads to the right when kissing.
Reference: Güntürkün, Onur. “Human behaviour: adult persistence of head-turning asymmetry.” Nature 421.6924 (2003): 711-711.
(kissing = read.csv("https://stat.duke.edu/~mc301/data/kissing.csv", stringsAsFactors = FALSE) %>% tbl_df())
## # A tibble: 124 × 1 ## side ## <chr> ## 1 left ## 2 right ## 3 right ## 4 right ## 5 right ## 6 right ## 7 right ## 8 right ## 9 right ## 10 right ## # ... with 114 more rows
Let's create the bootstrap distribution for the kissing
dataset.
set.seed(11012016) nsim = 100 boot_dist = data.frame(stat = rep(NA, nsim)) for(i in 1:nsim) { boot_sample = sample(kissing$side, size = nrow(kissing), replace = TRUE) boot_dist$stat[i] = sum(boot_sample == "right") / nrow(kissing) }
ggplot(boot_dist, aes(x = stat)) + geom_dotplot()
To make the shape a little easier to detect, let's increase the number of simulations.
nsim = 15000 boot_dist = data.frame(stat = rep(NA, nsim)) for(i in 1:nsim) { boot_sample = sample(kissing$side, size = nrow(kissing), replace = TRUE) boot_dist$stat[i] = sum(boot_sample == "right") / nrow(kissing) } ggplot(data = boot_dist, aes(x = stat)) + geom_histogram(bins=20)
Unimodal and symmetric (bell curve)
Follows very strict guidelines about how variably the data are distributed around the mean many variables are nearly normal, but almost never are they exactly normal
To be more exact:
percentiles <- c(0.0015, 0.025, 0.16, 0.84, 0.975, 0.9985) round(qnorm(percentiles), 2)
## [1] -2.97 -1.96 -0.99 0.99 1.96 2.97
The approximate 95% bootstrap confidence interval can be calculated as follows:
\[ \hat{p} \pm 1.96 \times SE_{boot} \]
Standard error is a term reserved for quantifying the variability of the sample statistic, as opposed to variability of individual observations, that is measured by the standard deviation.
stat = sum(kissing$side == "right") / 124 se_boot = sd(boot_dist$stat) (boot_int = round(stat + c(-1,1) * 1.96 * se_boot, 4))
## [1] 0.5607 0.7296
ggplot(data = boot_dist, aes(x = stat)) + geom_histogram() + geom_vline(xintercept = boot_int, lty = 2)
We are 95% confident that between 56% to 73% of couples turn to the right when kissing.
Note we do not need a null hypothesis, but we can still achieve a similar decision rule
95% confident does not mean there is a 95% chance that the interval contains the true \(p\)
Data from a random sample of 20 1+ bedroom apartments in Durham in 2012.
(durham_apts = read.csv("https://stat.duke.edu/~mc301/data/durham_apts.csv") %>% tbl_df())
## # A tibble: 20 × 1 ## rent ## <int> ## 1 625 ## 2 733 ## 3 895 ## 4 929 ## 5 775 ## 6 1349 ## 7 599 ## 8 749 ## 9 1020 ## 10 799 ## 11 705 ## 12 665 ## 13 1282 ## 14 1143 ## 15 1209 ## 16 500 ## 17 1495 ## 18 1076 ## 19 975 ## 20 879
ggplot(data = durham_apts, aes(x = rent)) + geom_dotplot()
durham_apts %>% summarise(xbar = mean(rent), med = median(rent))
## # A tibble: 1 × 2 ## xbar med ## <dbl> <dbl> ## 1 920.1 887
nsim = 15000 boot_dist = data.frame(mean = rep(NA, nsim)) for(i in 1:nsim) { boot_sample = sample(durham_apts$rent, size = nrow(durham_apts), replace = TRUE) boot_dist$mean[i] = mean(boot_sample) } ggplot(data = boot_dist, aes(x = mean)) + geom_histogram(bins=20)
nsim = 15000 boot_dist = data.frame(med = rep(NA, nsim)) for(i in 1:nsim) { boot_sample = sample(durham_apts$rent, size = nrow(durham_apts), replace = TRUE) boot_dist$med[i] = median(boot_sample) } ggplot(data = boot_dist, aes(x = med)) + geom_histogram(bins=20)
source("https://stat.duke.edu/~cr173/Sta112_Fa16/code/one_num_boot.R") source("https://stat.duke.edu/~cr173/Sta112_Fa16/code/one_num_test.R")
one_num_boot
## function (data, statistic = NULL, nsim = 15000, conf_level = 0.95, ## seed = NULL, digits = 4, print_summ = TRUE, print_plot = TRUE, ## print_output = TRUE, return_invis = TRUE) ## { ## installed_packages = names(installed.packages()[, "Package"]) ## if (!("ggplot2" %in% installed_packages)) { ## stop("Install the ggplot2 package.") ## } ## if (is.null(data)) { ## stop("Missing data.") ## } ## if (is.data.frame(data)) { ## stop("Data should be a vector, not a data frame.") ## } ## if (is.null(statistic)) { ## stop("Missing statistic.") ## } ## data = data[!is.na(data)] ## n = length(data) ## if (!is.null(seed)) { ## set.seed(seed) ## } ## calc_stat = function(x, statistic) { ## statistic(x) ## } ## stat = calc_stat(data, statistic) ## boot_dist = data.frame(stat = rep(NA, nsim)) ## for (i in 1:nsim) { ## boot_sample = sample(data, size = n, replace = TRUE) ## boot_dist$stat[i] = calc_stat(boot_sample, statistic) ## } ## crit = qt(conf_level + (1 - conf_level)/2, df = n - 1) ## boot_se = sd(boot_dist$stat) ## ci = stat + c(-1, 1) * crit * boot_se ## if (print_summ == TRUE) { ## cat(paste0("Summary stats: n = ", n, ", sample ", paste(substitute(statistic)), ## " = ", round(stat, digits), "\n")) ## } ## if (print_output == TRUE) { ## ci_pretty = round(ci, digits) ## conf_level_perc = as.character(conf_level * 100) ## cat(paste0(conf_level_perc, "% CI: (", ci_pretty[1], ## ", ", ci_pretty[2], ")")) ## } ## if (print_plot == TRUE) { ## boot_plot = ggplot(data = boot_dist, aes(x = stat), environment = environment()) + ## geom_histogram() + xlab(paste0("bootstrap ", paste(substitute(statistic)), ## "s")) + geom_vline(xintercept = ci, lty = 2) ## suppressWarnings(suppressMessages(print(boot_plot))) ## } ## if (return_invis == TRUE) { ## return(invisible(list(n = n, statistic = stat, SE_boot = boot_se, ## conf_level = conf_level, ci = ci))) ## } ## else { ## cat("\n") ## return(list(n = n, statistic = stat, SE_boot = boot_se, ## conf_level = conf_level, ci = ci)) ## } ## }
Estimate the average rent in Durham for 1+ bedroom apartments using a 95% confidence interval.
one_num_boot(durham_apts$rent, statistic = mean, seed = 195729)
## Summary stats: n = 20, sample mean = 920.1 ## 95% CI: (795.968, 1044.232)
Estimate the median rent in Durham for 1+ bedroom apartments using a 95% confidence interval.
one_num_boot(durham_apts$rent, statistic = median, seed = 571035)
## Summary stats: n = 20, sample median = 887 ## 95% CI: (712.8174, 1061.1826)
Construct the bootstrap distribution
Shift it to be centered at the null value
Calculate the p-value as usual: observed or more extreme outcome (more extreme in favor of the null hypothesis) given that the null value is true
Do these data provide convincing evidence that the average rent in Durham for 1+ bedroom apartments is greater than $800?
one_num_test(durham_apts$rent, statistic = mean, null = 800, alt = "greater", seed = 28732)
## H0: mu = 800 ## HA: mu > 800 ## Summary stats: n = 20, sample mean = 920.1 ## p-value = 0.0251
Do these data provide convincing evidence that the median rent in Durham for 1+ bedroom apartments is greater than $800?
one_num_test(durham_apts$rent, statistic = median, null = 800, alt = "greater", seed = 235671)
## H0: pop_median = 800 ## HA: pop_median > 800 ## Summary stats: n = 20, sample median = 887 ## p-value = 0.1443
For future use …
source("https://stat.duke.edu/~cr173/Sta112_Fa16/code/one_cat_boot.R") source("https://stat.duke.edu/~cr173/Sta112_Fa16/code/one_cat_test.R")
There is nothing special about proportions, means, or medians with regard to using the bootstrap - it is applicable almost any statistical procedure you can imagine.
Bootstrapping is simple but it is not guaranteed to converge to the correct answer (asymptotic theory)
Data must be "representative" of the true population
Limited sample sizes are always an issue
Below we fit a local regression model to some synthetic data along with estimating the 95% confidence interval for the best fit line.
set.seed(3212016) d = data.frame(x = 1:120) %>% mutate(y = sin(2*pi*x/120) + runif(length(x),-1,1)) l = loess(y ~ x, data=d) d$pred_y = predict(l) d$pred_y_se = predict(l,se=TRUE)$se.fit ggplot(d, aes(x,y)) + geom_point() + geom_line(aes(y=pred_y)) + geom_line(aes(y=pred_y + 1.96 * pred_y_se), color="red") + geom_line(aes(y=pred_y - 1.96 * pred_y_se), color="red")
Just like for the mean or median we can use a bootstrap to estimate the uncertainty for any point.
n_rep = 10000 res = matrix(NA, ncol=n_rep, nrow=nrow(d)) for(i in 1:ncol(res)) { bootstrap_samp = d %>% select(x,y) %>% sample_n(nrow(d), replace=TRUE) res[,i] = predict(loess(y ~ x, data=bootstrap_samp), newdata=d) } # Calculate the 95% bootstrap prediction interval d$bs_low = apply(res,1,quantile,probs=c(0.025), na.rm=TRUE) d$bs_up = apply(res,1,quantile,probs=c(0.975), na.rm=TRUE) ggplot(d, aes(x,y)) + geom_point() + geom_line(aes(y=pred_y)) + geom_line(aes(y=pred_y + 1.96 * pred_y_se), color="red") + geom_line(aes(y=pred_y - 1.96 * pred_y_se), color="red") + geom_line(aes(y=bs_low), color="blue") + geom_line(aes(y=bs_up), color="blue")