class: center, middle, inverse, title-slide # Estimation via Bootstrap ### Yue Jiang ### 03.04.20 --- layout: true <div class="my-footer"> <span> <a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a> </span> </div> --- ## Announcements - Homework 4 due Thursday, Mar 5 at 11:59pm - Lab 6 due Friday, Mar 6 at 11:59pm --- class: center, middle # Inference --- ## Terminology **Population**: a group of individuals or objects we are interested in studying **Sample**: a representative (we assume) subset of our population of interest **Parameter**: a numerical quantity derived from the population (almost always unknown) **Statistic**: a numerical quantity derived from a sample -- Common population parameters of interest and their corresponding sample statistic: | Quantity | Parameter | Statistic | |--------------------|------------|-------------| | Mean | `\(\mu\)` | `\(\bar{x}\)` | | Variance | `\(\sigma^2\)` | `\(s^2\)` | | Standard deviation | `\(\sigma\)` | `\(s\)` | | Proportion | `\(p\)` | `\(\hat{p}\)` | --- ## What does inference mean? - **Statistical inference** is the process of using sample data to make conclusions about the underlying population the sample came from. - Types of inference: **testing** and **estimation**. - Today we will focus on estimation. --- class: center, middle # Estimation --- ## Point estimate A point estimate is a single value computed from the sample data to serve as the "best guess", or estimate, for the population parameter. -- <br/><br/> **What is the downside to using point estimates?** --- ## Confidence intervals A plausible range of values for the population parameter is an interval estimate. One type of interval estimate is known as a **confidence interval**. -- .pull-left[  ] .pull-right[  ] - If we report a point estimate, we probably won’t hit the exact population parameter. - If we report a range of plausible values, we have a good shot at capturing the parameter. --- ## Variability of sample statistics - In order to construct a confidence interval we need to quantify the variability of our sample statistic. - For example, if we want to construct a confidence interval for a population mean, we need to come up with a plausible range of values around our observed sample mean. - This range will depend on how precise and how accurate our sample mean is as an estimate of the population mean. - Quantifying this requires a measurement of how much we would expect the sample mean to vary from sample to sample. -- <br/> .question[ Suppose you randomly sample 50 students and 5 of them are left handed. If you were to take another random sample of 50 students, how many would you expect to be left handed? Would you be surprised if only 3 of them were left handed? Would you be surprised if 40 of them were left handed? ] --- ## Quantifying the variability of a sample statistic We can quantify the variability of sample statistics using 1. **Simulation**: via bootstrapping, or "resampling" (today); 2. **Theory**: via the Central Limit Theorem (later in the course). --- class: center, middle # Bootstrapping --- ## Bootstrapping <img src="img/13/boot.png" style="float:right"> - The term **bootstrapping** comes from the phrase "pulling oneself up by one’s bootstraps", to help oneself without the aid of others. - In this case, we are estimating a population parameter, and we’ll accomplish it using data from **only from the given sample**. - This notion of saying something about a population parameter using only information from an observed sample is the crux of statistical inference, it is not limited to bootstrapping. --- ## Rent in Manhattan How much do you think it costs to rent a typical 1 bedroom apartment in Manhattan? --- ## Sample data Consider 20 1 bedroom apartments that were randomly selected on Craigslist Manhattan from apartments listed as "by owner". ```r library(tidyverse) manhattan <- read_csv("data/manhattan.csv") ``` .small[ .pull-left[ ```r manhattan %>% slice(1:10) ``` ``` #> # A tibble: 10 x 1 #> rent #> <dbl> #> 1 3850 #> 2 3800 #> 3 2350 #> 4 3200 #> 5 2150 #> 6 3267 #> 7 2495 #> 8 2349 #> 9 3950 #> 10 1795 ``` ] .pull-right[ ```r manhattan %>% slice(11:20) ``` ``` #> # A tibble: 10 x 1 #> rent #> <dbl> #> 1 2145 #> 2 2300 #> 3 1775 #> 4 2000 #> 5 2175 #> 6 2350 #> 7 2550 #> 8 4195 #> 9 1470 #> 10 2350 ``` ] ] --- ## Parameter of interest .tiny[ <img src="13-bootstrap_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> ] -- **Is the mean or the median a better measure of typical rent in Manhattan?** --- ## Observed sample vs. population .pull-left[ <img src="img/13/rent-bootsamp.png"> ```r manhattan %>% pull(rent) %>% median() ``` ``` #> [1] 2350 ``` ] -- .pull-right[ <img src="img/13/rent-bootpop.png" height="250"> We don't have this data! ] --- ## Bootstrapping scheme 1. **Take a bootstrap sample** - a random sample taken with replacement from the original sample, of the same size as the original sample. 2. **Calculate the bootstrap statistic** - a statistic such as mean, median, proportion, slope, etc. computed on the bootstrap samples. 3. **Repeat steps (1) and (2) many times to create a bootstrap distribution** - a distribution of bootstrap statistics. 4. **Calculate the bounds of the XX% confidence interval** as the middle XX% of the bootstrap distribution. --- class: center, middle # Bootstrapping in R --- ## Package `infer` .pull-left[  <br><br> [infer.netlify.com](http://infer.netlify.com) ] .pull-right[ <br/><br/><br/><br/> The objective of package `infer` is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse design framework. ] --- ## Package `infer`  ```r library(infer) ``` -- Also, let's set a seed: ```r set.seed(03052020) ``` Function `set.seed()` is a base R function that allows us to control R's random number generation. Use this to make your simulation work reproducible. --- ## Generate bootstrap medians 1. `specify()` the variable of interest. <br/><br/> ```r boot_dist <- manhattan %>% * specify(response = rent) ``` --- ## Generate bootstrap medians 1. `specify()` the variable of interest. <br/><br/> 2. `generate()` a fixed number of bootstrap samples. <br/><br/> ```r boot_dist <- manhattan %>% specify(response = rent) %>% * generate(reps = 15000, type = "bootstrap") ``` --- ## Generate bootstrap medians 1. `specify()` the variable of interest. <br/><br/> 2. `generate()` a fixed number of bootstrap samples. <br/><br/> 3. `calculate()` the bootstrapped statistic(s). <br/><br/> ```r boot_dist <- manhattan %>% specify(response = rent) %>% generate(reps = 15000, type = "bootstrap") %>% calculate(stat = "median") ``` --- ## Sample medians How many observations are there in `boot_dist`? What does each observation represent? -- ```r boot_dist ``` ``` #> # A tibble: 15,000 x 2 #> replicate stat #> <int> <dbl> #> 1 1 2350 #> 2 2 2238. #> 3 3 2262. #> 4 4 2350 #> 5 5 2450 #> 6 6 2324. #> 7 7 2350. #> 8 8 2775 #> 9 9 2350 #> 10 10 2450 #> # ... with 14,990 more rows ``` --- ## Visualize the distribution <img src="13-bootstrap_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> --- ## Calculate the confidence interval A 95% confidence interval is bounded by the middle 95% of the bootstrap distribution. -- Use `dplyr` functions: ```r boot_dist %>% summarize(lower_bound = quantile(stat, 0.025), upper_bound = quantile(stat, 0.975)) ``` ``` #> # A tibble: 1 x 2 #> lower_bound upper_bound #> <dbl> <dbl> #> 1 2162. 2875 ``` -- Use `get_ci()` from `infer`: ```r percentile_ci <- get_ci(boot_dist, level = .95) percentile_ci ``` ``` #> # A tibble: 1 x 2 #> `2.5%` `97.5%` #> <dbl> <dbl> #> 1 2162. 2875 ``` --- ## Visualize a confidence interval Using `geom_vline()`. <img src="13-bootstrap_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> --- ## Interpret the confidence interval <b> The 95% confidence interval for the median rent of one bedroom apartments in Manhattan was calculated as (2162.5, 2875). Which of the following is the correct interpretation of this interval? </b> <ol type="a"> <li>95% of the time the median rent one bedroom apartments in this sample is between $2162.5 and $2875.</li> <br/><br/> <li>95% of all one bedroom apartments in Manhattan have rents between $2162.5 and $2875.</li> <br/><br/> <li>We are 95% confident that the median rent of all one bedroom apartments is between $2162.5 and $2875.</li> <br/><br/> <li>We are 95% confident that the median rent of one bedroom apartments in this sample is between $2162.5 and $2875.</li> </ol> --- ## Caution We **cannot** say "There is a 95% *chance* that the median lies in the confidence interval." .question[ Why? ] --- class: center, middle # Accuracy vs. precision --- ## So what IS a confidence interval? **We are 95% confident that ...** - Suppose we took many samples from the original population and built a 95% confidence interval based on each sample. - We would expect about 95% of those intervals would contain the true population parameter. - However, we never know if in fact 95% of them do, or whether any particular interval contains the true parameter (maybe none of them do!) --- ## Common confidence levels Commonly used confidence levels in practice are 90%, 95%, and 99%. Which lines represent which of the aforementioned levels? <img src="13-bootstrap_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> ```r l90 <- boot_dist %>% summarize(lower_bound = quantile(stat, 0.05)) %>% round(2) %>% pull() u90 <- boot_dist %>% summarize(lower_bound = quantile(stat, 0.95)) %>% round(2) %>% pull() l99 <- boot_dist %>% summarize(lower_bound = quantile(stat, 0.005)) %>% round(2) %>% pull() u99 <- boot_dist %>% summarize(lower_bound = quantile(stat, 0.995)) %>% round(2) %>% pull() ggplot(data = boot_dist, mapping = aes(x = stat)) + geom_histogram(binwidth = 50, alpha = .5) + geom_vline(xintercept = c(lower_bound, upper_bound), color = "darkgreen", lty = 2, size = 1) + geom_vline(xintercept = c(l90, u90), color = "purple", lty = 3, size = 1) + geom_vline(xintercept = c(l99, u99), color = "red", lty = 6, size = 1) + labs(title = "Bootstrap distribution of medians is skewed right", x = "Median", y = "Count") + theme_minimal(base_size = 16) ``` --- ## Precision vs. accuracy If we want to be very certain that we capture the population parameter, should we use a wider interval or a narrower interval? What drawbacks are associated with using a wider interval? -- <br/><br/> How can we get best of both worlds -- high precision and high accuracy? --- ## Calculating confidence intervals at various confidence levels How would you modify the following code to calculate a 90% confidence interval? How would you modify it for a 99% confidence interval? ```r manhattan %>% specify(response = rent) %>% generate(reps = 15000, type = "bootstrap") %>% calculate(stat = "median") %>% summarize(lower_bound = quantile(stat, 0.025), upper_bound = quantile(stat, 0.975)) ``` -- What would you modify to compute a confidence interval for the population mean? --- ## Recap - Sample statistic `\(\ne\)` population parameter, but if it is a random sample, it can be a good estimate. - We report that estimate with a confidence band around it, and the relative width of this band depends on the sample statistic's variability. - Since we can't continue sampling from the population, we instead bootstrap from the one sample we have to estimate the sampling variability. --- ## Your turn Go to [https://classroom.github.com/a/eULa07lW](https://classroom.github.com/a/eULa07lW) to try it yourself! --- ## References 1. Tidy Statistical Inference. (2020). Infer.netlify.com. Retrieved 29 February 2020, from https://infer.netlify.com/index.html