class: center, middle, inverse, title-slide

# Estimation via bootstrapping part II
## Intro to Data Science
### Shawn Santo
### 03-24-20

---

## Announcements

- DataFest cancelled, no virtual event

- Lab 7 due Friday, Mar 27 at 11:59pm Eastern Time

<center>
<iframe height="360" width="640" src="https://warpwire.duke.edu/w/3X4DAA/" frameborder="0" scrolling="0" allow="autoplay; encrypted-media; fullscreen; picture-in-picture;" allowfullscreen></iframe>
</center>

---

class: center, middle, inverse

# Recall

---

class: center, middle

<iframe height="360" width="640" src="https://warpwire.duke.edu/w/334DAA/" frameborder="0" scrolling="0" allow="autoplay; encrypted-media; fullscreen; picture-in-picture;" allowfullscreen></iframe>

---

## Terminology

**Population**: a group of individuals or objects we are interested in studying

**Sample**: a representative (we assume) subset of our population of interest

**Parameter**: a numerical quantity derived from the population (almost always unknown)

**Statistic**: a numerical quantity derived from a sample

<br/>

Common population parameters of interest and their corresponding sample statistics:

| Quantity           | Parameter      | Statistic       |
|--------------------|----------------|-----------------|
| Mean               | `\(\mu\)`      | `\(\bar{x}\)`   |
| Variance           | `\(\sigma^2\)` | `\(s^2\)`       |
| Standard deviation | `\(\sigma\)`   | `\(s\)`         |
| Median             | `\(M\)`        | `\(\tilde{x}\)` |
| Proportion         | `\(p\)`        | `\(\hat{p}\)`   |

---

## What does inference mean?

- **Statistical inference** is the process of using sample data to make conclusions about the underlying population the sample came from.

- Types of inference: **testing** and **estimation**.

- We will continue to focus on estimation.

---

## Confidence intervals

A plausible range of values for the population parameter is an interval estimate. One type of interval estimate is known as a **confidence interval**.
--

.pull-left[
![spear](img/09a/spear.png)
]

.pull-right[
![net](img/09a/net.png)
]

- If we report a point estimate, we probably won’t hit the exact population parameter.

- If we report a range of plausible values, we have a good shot at capturing the parameter.

---

## Bootstrapping scheme

1. **Take a bootstrap sample** - a random sample taken with replacement from the original sample, of the same size as the original sample.

2. **Calculate the bootstrap statistic** - a statistic such as mean, median, proportion, slope, etc. computed on the bootstrap sample.

3. **Repeat steps (1) and (2) many times to create a bootstrap distribution** - a distribution of bootstrap statistics.

4. **Calculate the bounds of the XX% confidence interval** as the middle XX% of the bootstrap distribution.

---

## Bootstrap scheme steps 1 - 3

Consider a sample of size `\(n = 3\)`, and suppose we are interested in conducting inference for the population mean.

<img src="images/bootstrap_sample.png">

---

## Package `infer`

![ht-diagram](img/09a/ht-diagram.png)

```r
library(infer)
```

--

Also, let's set a seed:

```r
set.seed(03172020)
```

Function `set.seed()` is a base R function that allows us to control R's random number generation. Use this to make your simulation work reproducible.

---

class: center, middle, inverse

# More inference

---

## Small business owner optimism

Data is from a Jan. 15-24 nationwide Square/Gallup online survey of small-business owners with annual revenues between $50,000 and $25 million.

<br/>

U.S. small businesses were asked, "Has your business benefited from the increased small business deduction that resulted from the 2017 tax reform law, or not?"

<br/><br/>

--

<b>
Given the sample data, can we conclude that at least 65% of U.S. small businesses benefited from the 2017 tax reform?
</b>

---

## Data

```r
library(tidyverse)
library(infer)
set.seed(03172020)
```

```r
small_business <- read_csv("data/gallup_sb.csv")
```

```r
small_business
```

```
#> # A tibble: 1,234 x 1
#>    benefit
#>    <chr>
#>  1 yes
#>  2 yes
#>  3 yes
#>  4 no
#>  5 yes
#>  6 yes
#>  7 yes
#>  8 yes
#>  9 yes
#> 10 yes
#> # … with 1,224 more rows
```

---

## Data wrangling

What is the survey response distribution of the sample?

--

```r
small_business %>%
  count(benefit)
```

```
#> # A tibble: 2 x 2
#>   benefit     n
#>   <chr>   <int>
#> 1 no        382
#> 2 yes       852
```

--

What are the proportions?

--

```r
small_business %>%
  count(benefit) %>%
  mutate(prop_benefit = n / sum(n))
```

```
#> # A tibble: 2 x 3
#>   benefit     n prop_benefit
#>   <chr>   <int>        <dbl>
#> 1 no        382        0.310
#> 2 yes       852        0.690
```

---

## Conclusion?

We see that the sample proportion is 0.69. That is, 69% of small businesses in the sample stated they did benefit from the 2017 tax reform.

--

<br/><br/>

<b>
Can we now conclude: yes, at least 65% of all U.S. small businesses benefited from the 2017 tax reform?
</b>

--

<br/><br/>

No! Why not?

--

Let's create a confidence interval for the population proportion, `\(p\)`.

---

## Bootstrap confidence interval

1. `specify()` the `response` variable of interest along with what we define as `success`.

<br/><br/>

```r
small_business %>%
* specify(response = benefit, success = "yes")
```

---

## Bootstrap confidence interval

1. `specify()` the `response` variable of interest along with what we define as `success`.
<br/><br/>

2. `generate()` a fixed number of `reps` for bootstrap `type` samples.

<br/><br/>

```r
small_business %>%
  specify(response = benefit, success = "yes") %>%
* generate(reps = 5000, type = "bootstrap")
```

---

## Bootstrap confidence interval

1. `specify()` the `response` variable of interest along with what we define as `success`.
<br/><br/>

2. `generate()` a fixed number of `reps` for bootstrap `type` samples.
<br/><br/>

3. `calculate()` the bootstrapped `stat`.
```r
small_business %>%
  specify(response = benefit, success = "yes") %>%
  generate(reps = 5000, type = "bootstrap") %>%
* calculate(stat = "prop")
```

---

## Bootstrap confidence interval

1. `specify()` the `response` variable of interest along with what we define as `success`.
<br/><br/>

2. `generate()` a fixed number of `reps` for bootstrap `type` samples.
<br/><br/>

3. `calculate()` the bootstrapped `stat`.
<br/><br/>

4. `summarize()` our results by getting the middle XX% of the bootstrap distribution in order to produce an XX% confidence interval.

```r
small_business %>%
  specify(response = benefit, success = "yes") %>%
  generate(reps = 5000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
* summarize(lower_bound = quantile(stat, 0.025),
*           upper_bound = quantile(stat, 0.975))
```

```
#> # A tibble: 1 x 2
#>   lower_bound upper_bound
#>         <dbl>       <dbl>
#> 1       0.664       0.716
```

---

## Conclusion

Our 95% confidence interval is `\((0.664, 0.716)\)`.

**What can we conclude now?**

---

## Confidence level exploration

```
#> # A tibble: 6 x 4
#>   lower_bound mean_p_hat upper_bound confidence_level
#>         <dbl>      <dbl>       <dbl>            <dbl>
#> 1       0.656      0.690       0.723             0.99
#> 2       0.659      0.690       0.720             0.98
#> 3       0.664      0.690       0.716             0.95
#> 4       0.669      0.690       0.712             0.9
#> 5       0.672      0.690       0.709             0.85
#> 6       0.673      0.690       0.707             0.8
```

--

<br/><br/>

**What do you notice?**

---

## Confidence interval interpretation

What do we mean when we say, "We are 98% confident that our interval captures the true population parameter."?

--

<br/>

Suppose we could do all of the following:

1. Take a random sample of size `\(n\)` from our population.

2. Create a bootstrap confidence interval.

3. Repeat steps 1 - 2 ad infinitum.

--

Then, we would expect 98% of the intervals we computed to cover the true population parameter's value. However, in practice, we only do steps 1 - 2 one time.
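
The thought experiment above can be approximated by simulation. Below is a minimal base R sketch, not part of the original analysis: the population proportion `true_p`, the sample size `n`, and the repetition counts `n_samp` and `reps` are all assumed values chosen only for illustration, and the "population" is a hypothetical one where we happen to know the parameter.

```r
# Hypothetical setup: none of these values come from the survey data.
set.seed(03242020)
true_p <- 0.69   # assumed "true" population proportion
n      <- 200    # size of each original sample
n_samp <- 200    # how many times steps 1 - 2 are repeated
reps   <- 500    # bootstrap samples used for each interval
level  <- 0.98   # confidence level

covers <- replicate(n_samp, {
  # Step 1: take a random sample of size n from the population
  x <- rbinom(n, size = 1, prob = true_p)

  # Step 2: bootstrap the sample proportion, then take the middle 98%
  boot_props <- replicate(reps, mean(sample(x, size = n, replace = TRUE)))
  ci <- quantile(boot_props, c((1 - level) / 2, 1 - (1 - level) / 2))

  # Did this interval capture the true proportion?
  ci[1] <= true_p && true_p <= ci[2]
})

# Proportion of the simulated intervals that cover true_p
coverage <- mean(covers)
coverage
```

With only a finite number of repetitions, `coverage` will typically be close to the stated confidence level rather than exactly equal to it.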
---

## Interpretation visualization

<img src="lec-12a-bootstrap2_files/figure-html/unnamed-chunk-16-.gif" style="display: block; margin: auto;" />

---

## A closer look at `specify()`

| Argument      | Description |
|---------------|-------------|
| `x`           | data frame |
| `formula`     | a formula in the form `response ~ explanatory` |
| `response`    | variable in `x` that is the response variable <br>(not needed if you use argument `formula`) |
| `explanatory` | variable in `x` that is the explanatory variable<br>(not needed if you use argument `formula`) |
| `success`     | level of response considered a success - needed for inference on <br>one proportion, a difference in proportions, and corresponding z stats. |

---

## A closer look at `calculate()`

Argument `stat` takes a string for the type of statistic to calculate for each of our bootstrap samples. Below are the available options.

<br/>

- "mean", "median", "sum", "sd", "prop", "count",
- "diff in means", "diff in medians", "diff in props",
- "slope", "correlation"
- "Chisq", "F", "t", "z"

<br/>

Depending on the chosen stat, you may need to provide a value to the subsequent argument - `order`.

---

## Bootstrap discussion

**When can I create bootstrap confidence intervals as we have defined?**

- Your original sample is representative of the population, and

- your original sample size is not too small.

<br/>

--

**When should I not create a bootstrap confidence interval?**

- Your original sample size is very small (< 5).

- You are interested in the maximum or minimum.

- The theoretical distribution of the sample statistic is known, and the assumptions are satisfied to conduct meaningful inference.

---

## Bootstrap discussion (continued)

- There is no set number of bootstrap samples you should take to create a bootstrap confidence interval.
  The number you choose should be a function of your sample size and parameter of interest for inference.

- Choose a computationally feasible number of bootstrap samples.

- Remember, creating a bootstrap confidence interval is a simulation-based inference method. To ensure your work is reproducible, always control R's random number generation process with function `set.seed()`.

---

## Application exercise

Today's application exercise can be found at the link below.

https://classroom.github.com/a/5bfEx7l-

<br/>

As a reminder, these application exercises must be attempted within 24 hours of the scheduled lecture date in US Eastern Time.

---

## References

1. Tidy Statistical Inference. (2020). Infer.netlify.com. Retrieved 29 February 2020, from https://infer.netlify.com/index.html

2. Gallup, I. (2020). Small-Business Owners Highly Engaged in 2020 Election. Gallup.com. Retrieved 6 March 2020, from https://news.gallup.com/poll/284396/small-business-owners-highly-engaged-2020-election.aspx