class: center, middle, inverse, title-slide # Estimation via bootstrapping ## Intro to Data Science ### Shawn Santo ### 03-05-20 --- ## Announcements - Homework 3 due Thursday, Mar 5 at 11:59pm - Lab 6 due Friday, Mar 6 at 11:59pm --- class: center, middle, inverse # Inference --- ## Terminology **Population**: a group of individuals or objects we are interested in studying **Sample**: a representative (we assume) subset of our population of interest **Parameter**: a numerical quantity derived from the population (almost always unknown) **Statistic**: a numerical quantity derived from a sample <br/> Common population parameters of interest and their corresponding sample statistic: | Quantity | Parameter | Statistic | |--------------------|------------|-------------| | Mean | `\(\mu\)` | `\(\bar{x}\)` | | Variance | `\(\sigma^2\)` | `\(s^2\)` | | Standard deviation | `\(\sigma\)` | `\(s\)` | | Median | `\(M\)` | `\(\tilde{x}\)` | | Proportion | `\(p\)` | `\(\hat{p}\)` | --- ## What does inference mean? - **Statistical inference** is the process of using sample data to make conclusions about the underlying population the sample came from. - Types of inference: **testing** and **estimation**. - Today we will focus on estimation. --- <img src="images/inference.png"> --- class: center, middle, inverse # Estimation --- ## Point estimate A point estimate is a single value computed from the sample data to serve as the "best guess", or estimate, for the population parameter. -- <br/><br/> **What is the downside to using point estimates?** --- ## Confidence intervals A plausible range of values for the population parameter is an interval estimate. One type of interval estimate is known as a **confidence interval**. -- .pull-left[ ![spear](img/09a/spear.png) ] .pull-right[ ![net](img/09a/net.png) ] - If we report a point estimate, we probably won’t hit the exact population parameter. - If we report a range of plausible values, we have a good shot at capturing the parameter. --- ## Variability of sample statistics - In order to construct a confidence interval we need to quantify the variability of our sample statistic. - For example, if we want to construct a confidence interval for a population mean, we need to come up with a plausible range of values around our observed sample mean. - This range will depend on how precise and how accurate our sample mean is as an estimate of the population mean. - Quantifying this requires a measurement of how much we would expect the sample mean to vary from sample to sample. -- <br/> Suppose you randomly sample 50 students and 5 of them are left handed. If you were to take another random sample of 50 students, how many would you expect to be left handed? Would you be surprised if only 3 of them were left handed? Would you be surprised if 40 of them were left handed? --- ## Quantifying the variability of a sample statistic We can quantify the variability of sample statistics using 1. **simulation**: via bootstrapping (today); 2. **theory**: via Central Limit Theorem (later in the course). --- class: center, middle, inverse # Bootstrapping --- ## Bootstrapping <img src="img/09a/boot.png" style="float:right"> - The term **bootstrapping** comes from the phrase "pulling oneself up by one’s bootstraps", to help oneself without the aid of others. - In this case, we are estimating a population parameter, and we’ll accomplish it using data from only from the given sample. - This notion of saying something about a population parameter using only information from an observed sample is the crux of statistical inference, it is not limited to bootstrapping. --- ## Rent in Manhattan How much do you think it costs to rent a typical 1 bedroom apartment in Manhattan? --- ## Sample data Consider 20 1 bedroom apartments that were randomly selected on Craigslist Manhattan from apartments listed as "by owner". ```r library(tidyverse) manhattan <- read_csv("data/manhattan.csv") ``` .small[ .pull-left[ ```r manhattan %>% slice(1:10) ``` ``` #> # A tibble: 10 x 1 #> rent #> <dbl> #> 1 3850 #> 2 3800 #> 3 2350 #> 4 3200 #> 5 2150 #> 6 3267 #> 7 2495 #> 8 2349 #> 9 3950 #> 10 1795 ``` ] .pull-right[ ```r manhattan %>% slice(11:20) ``` ``` #> # A tibble: 10 x 1 #> rent #> <dbl> #> 1 2145 #> 2 2300 #> 3 1775 #> 4 2000 #> 5 2175 #> 6 2350 #> 7 2550 #> 8 4195 #> 9 1470 #> 10 2350 ``` ] ] --- ## Parameter of interest .tiny[ <img src="lec-09b-bootstrap_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> ] -- **Is the mean or the median a better measure of typical rent in Manhattan?** ??? ```r ggplot(data = manhattan, mapping = aes(x = rent)) + geom_histogram(binwidth = 250, fill = "purple") + labs(title = "There is high variability in Manhattan rent", subtitle = "1 bedroom apartments", x = "Rent", y = "Count") + theme_minimal(base_size = 16) ``` --- ## Observed sample vs. population .pull-left[ <img src="img/09a/rent-bootsamp.png"> ```r manhattan %>% pull(rent) %>% median() ``` ``` #> [1] 2350 ``` ] -- .pull-right[ <img src="img/09a/rent-bootpop.png" height="250"> We don't have this data! ] --- ## Bootstrapping scheme 1. **Take a bootstrap sample** - a random sample taken with replacement from the original sample, of the same size as the original sample. 2. **Calculate the bootstrap statistic** - a statistic such as mean, median, proportion, slope, etc. computed on the bootstrap samples. 3. **Repeat steps (1) and (2) many times to create a bootstrap distribution** - a distribution of bootstrap statistics. 4. **Calculate the bounds of the XX% confidence interval** as the middle XX% of the bootstrap distribution. --- class: center, middle, inverse # Bootstrapping in R --- ## Package `infer` .pull-left[ ![](img/09a/infer.png) <br><br> [infer.netlify.com](http://infer.netlify.com) ] .pull-right[ <br/><br/><br/><br/> The objective of package `infer` is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse design framework. ] --- ## Package `infer` ![ht-diagram](img/09a/ht-diagram.png) ```r library(infer) ``` -- Also, let's set a seed: ```r set.seed(03052020) ``` Function `set.seed()` is a base R function that allows us to control R's random number generation. Use this to make your simulation work reproducible. --- ## Generate bootstrap medians 1. `specify()` the variable of interest. <br/><br/> ```r boot_dist <- manhattan %>% * specify(response = rent) ``` --- ## Generate bootstrap medians 1. `specify()` the variable of interest. <br/><br/> 2. `generate()` a fixed number of bootstrap samples. <br/><br/> ```r boot_dist <- manhattan %>% specify(response = rent) %>% * generate(reps = 15000, type = "bootstrap") ``` --- ## Generate bootstrap medians 1. `specify()` the variable of interest. <br/><br/> 2. `generate()` a fixed number of bootstrap samples. <br/><br/> 3. `calculate()` the bootstrapped statistic(s). <br/><br/> ```r boot_dist <- manhattan %>% specify(response = rent) %>% generate(reps = 15000, type = "bootstrap") %>% * calculate(stat = "median") ``` --- ## Sample medians How many observations are there in `boot_dist`? What does each observation represent? -- ```r boot_dist ``` ``` #> # A tibble: 15,000 x 2 #> replicate stat #> <int> <dbl> #> 1 1 2350 #> 2 2 2238. #> 3 3 2262. #> 4 4 2350 #> 5 5 2450 #> 6 6 2324. #> 7 7 2350. #> 8 8 2775 #> 9 9 2350 #> 10 10 2450 #> # … with 14,990 more rows ``` --- ## Visualize the distribution <img src="lec-09b-bootstrap_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> ??? ```r ggplot(data = boot_dist, mapping = aes(x = stat)) + geom_histogram(binwidth = 100) + labs(title = "Bootstrap distribution of medians is skewed right", x = "Median", y = "Count", caption = "Binwidth of 100") + theme_minimal(base_size = 16) ``` --- ## Calculate the confidence interval A 95% confidence interval is bounded by the middle 95% of the bootstrap distribution. -- Use `dplyr` functions: ```r boot_dist %>% summarize(lower_bound = quantile(stat, 0.025), upper_bound = quantile(stat, 0.975)) ``` ``` #> # A tibble: 1 x 2 #> lower_bound upper_bound #> <dbl> <dbl> #> 1 2162. 2875 ``` -- Use `get_ci()` from `infer`: ```r percentile_ci <- get_ci(boot_dist, level = .95) percentile_ci ``` ``` #> # A tibble: 1 x 2 #> `2.5%` `97.5%` #> <dbl> <dbl> #> 1 2162. 2875 ``` --- ## Visualize a confidence interval Using `geom_vline()`. <img src="lec-09b-bootstrap_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> ??? ```r lower_bound <- boot_dist %>% summarize(lower_bound = quantile(stat, 0.025)) %>% pull() upper_bound <- boot_dist %>% summarize(upper_bound = quantile(stat, 0.975)) %>% pull() ``` ```r ggplot(data = boot_dist, mapping = aes(x = stat)) + geom_histogram(binwidth = 50, alpha = .5) + geom_vline(xintercept = c(lower_bound, upper_bound), color = "darkgreen", lty = 2, size = 1) + labs(title = "Bootstrap distribution of medians is skewed right", x = "Median", y = "Count", caption = "Green lines represent 95% C.I. bounds") + theme_minimal(base_size = 16) ``` --- ## Visualize a confidence interval Using `visualize()` and `shade_confidence_interval()`. .tiny[ ```r visualize(boot_dist) + shade_confidence_interval(endpoints = percentile_ci) + labs(title = "Bootstrap distribution of medians is skewed right", x = "Median", y = "Count", caption = "Green lines represent 95% C.I. bounds") + theme_minimal(base_size = 16) ``` <img src="lec-09b-bootstrap_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" /> ] --- ## Interpret the confidence interval <b> The 95% confidence interval for the median rent of one bedroom apartments in Manhattan was calculated as (2162.5, 2875). Which of the following is the correct interpretation of this interval? </b> <ol type="a"> <li>95% of the time the median rent one bedroom apartments in this sample is between $2162.5 and $2875.</li> <br/><br/> <li>95% of all one bedroom apartments in Manhattan have rents between $2162.5 and $2875.</li> <br/><br/> <li>We are 95% confident that the median rent of all one bedroom apartments is between $2162.5 and $2875.</li> <br/><br/> <li>We are 95% confident that the median rent of one bedroom apartments in this sample is between $2162.5 and $2875.</li> </ol> -- **C is correct.** --- ## Caution! We **cannot** say "there is a 95% *chance* that the median lies in the confidence interval". <br/><br/> **Why?** --- class: center, middle, inverse # Accuracy vs. precision --- ## Confidence level **We are 95% confident that ...** - Suppose we took many samples from the original population and built a 95% confidence interval based on each sample. - About 95% of those intervals would contain the true population parameter. --- ## Common confidence levels Commonly used confidence levels in practice are 90%, 95%, and 99%. Which lines represent which of the aforementioned levels? <img src="lec-09b-bootstrap_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" /> ??? ```r l90 <- boot_dist %>% summarize(lower_bound = quantile(stat, 0.05)) %>% round(2) %>% pull() u90 <- boot_dist %>% summarize(lower_bound = quantile(stat, 0.95)) %>% round(2) %>% pull() l99 <- boot_dist %>% summarize(lower_bound = quantile(stat, 0.005)) %>% round(2) %>% pull() u99 <- boot_dist %>% summarize(lower_bound = quantile(stat, 0.995)) %>% round(2) %>% pull() ggplot(data = boot_dist, mapping = aes(x = stat)) + geom_histogram(binwidth = 50, alpha = .5) + geom_vline(xintercept = c(lower_bound, upper_bound), color = "darkgreen", lty = 2, size = 1) + geom_vline(xintercept = c(l90, u90), color = "purple", lty = 3, size = 1) + geom_vline(xintercept = c(l99, u99), color = "red", lty = 6, size = 1) + labs(title = "Bootstrap distribution of medians is skewed right", x = "Median", y = "Count") + theme_minimal(base_size = 16) ``` --- ## Precision vs. accuracy If we want to be very certain that we capture the population parameter, should we use a wider interval or a narrower interval? What drawbacks are associated with using a wider interval? -- <br/><br/> How can we get best of both worlds -- high precision and high accuracy? --- ## Calculating confidence intervals at various confidence levels How would you modify the following code to calculate a 90% confidence interval? How would you modify it for a 99% confidence interval? ```r manhattan %>% specify(response = rent) %>% generate(reps = 15000, type = "bootstrap") %>% calculate(stat = "median") %>% summarize(lower_bound = quantile(stat, 0.025), upper_bound = quantile(stat, 0.975)) ``` -- What would you modify to compute a confidence interval for the population mean? --- ## Recap - Sample statistic `\(\ne\)` population parameter, but if it is a random sample, it can be a good estimate. - We report that estimate with a confidence band around it, and the relative width of this band depends on the sample statistic's variability. - Since we can't continue sampling from the population, we instead bootstrap from the one sample we have to estimate the sampling variability. --- ## Application exercise https://classroom.github.com/a/PKqJ_2S6 1. Create a 98% bootstrap confidence interval for the population median price of 1 bedroom apartments in Manhattan. 2. Create a 98% bootstrap confidence interval for the population mean price of 1 bedroom apartments in Manhattan. Write out an appropriate interpretation of this interval. 3. Compare the two confidence intervals. Which interval is wider? 4. Can we use these intervals to make an accurate statement about the mean/median price of 1 bedroom apartments in the Bronx? --- ## References 1. Tidy Statistical Inference. (2020). Infer.netlify.com. Retrieved 29 February 2020, from https://infer.netlify.com/index.html