# Estimation via bootstrapping <br> 👢

---

layout: true
  
<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel -
<a href="http://www2.stat.duke.edu/courses/Fall18/sta112.01/schedule" target="_blank">stat.duke.edu/courses/Fall18/sta112.01
</a>
</span>
</div>

---

# Inference

---

## Inference

- Statistical inference is the process of using sample data to draw conclusions 
about the underlying population the sample came from
- Types of inference: testing and estimation
- Today we discuss estimation, next time testing

---

# Confidence intervals

---

## Confidence intervals

A plausible range of values for the population parameter is a **confidence interval**.

- If we report a point estimate, we probably won’t hit the exact population 
parameter.
- If we report a range of plausible values we have a good shot at capturing 
the parameter.

---

## Variability of sample statistics

- In order to construct a confidence interval we need to quantify the variability 
of our sample statistic.
- For example, if we want to construct a confidence interval for a population mean, 
we need to come up with a plausible range of values around our observed sample mean.
- This range will depend on how precise and how accurate our sample mean is as an 
estimate of the population mean.
- Quantifying this requires a measurement of how much we would expect the sample 
mean to vary from sample to sample.

.question[
Suppose you randomly sample 50 students and 5 of them are left handed. If you 
were to take another random sample of 50 students, how many would you expect to 
be left handed? Would you be surprised if only 3 of them were left handed? Would 
you be surprised if 40 of them were left handed?
]
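
One way to get a feel for the question above is to simulate it. The sketch below assumes, purely for illustration, that the true proportion of left-handed students equals the observed 5/50 = 0.10; the names `n_students`, `p_assumed`, and `left_counts` are hypothetical.

```r
set.seed(1234)

n_students <- 50      # size of each random sample
p_assumed  <- 5 / 50  # ASSUMPTION: true proportion equals the observed 10%

# number of left-handed students in each of 10,000 simulated samples
left_counts <- rbinom(10000, size = n_students, prob = p_assumed)

# share of simulated samples with 3 or fewer left-handed students
mean(left_counts <= 3)

# share of simulated samples with 40 or more left-handed students
mean(left_counts >= 40)
```

Under this assumption, a count of 3 or fewer is fairly common, while a count of 40 or more essentially never occurs.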

---

## Quantifying the variability of a sample statistic

We can quantify the variability of sample statistics using

- simulation: via bootstrapping (today)

- theory: via Central Limit Theorem (later in the course)

---

# Bootstrapping

---

## Bootstrapping

- The term **bootstrapping** comes from the phrase "pulling oneself up by one’s 
bootstraps", which is a metaphor for accomplishing an impossible task without 
any outside help.
- In this case the impossible task is estimating a population parameter, and we’ll 
accomplish it using data from only the given sample.
- Note that this notion of saying something about a population parameter using 
only information from an observed sample is the crux of statistical inference; 
it is not limited to bootstrapping.

---

## Rent in Manhattan

---

## Sample

On a given day, twenty 1 BR apartments were randomly selected on Craigslist 
Manhattan from apartments listed as "by owner".

```r
library(tidyverse)
manhattan <- read_csv("data/manhattan.csv")
```

```r
manhattan %>% slice(1:10)
```

```
## # A tibble: 10 x 1
##     rent
##    <int>
##  1  3850
##  2  3800
##  3  2350
##  4  3200
##  5  2150
##  6  3267
##  7  2495
##  8  2349
##  9  3950
## 10  1795
```

---

## Parameter of interest

---

## Observed sample vs. bootstrap population

---

## Bootstrapping scheme

1. Take a bootstrap sample - a random sample taken with replacement from the 
original sample, of the same size as the original sample.
2. Calculate the bootstrap statistic - a statistic such as mean, median, 
proportion, slope, etc. computed on the bootstrap sample.
3. Repeat steps (1) and (2) many times to create a bootstrap distribution - 
a distribution of bootstrap statistics.
4. Calculate the bounds of the XX% confidence interval as the middle XX% 
of the bootstrap distribution.
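
The four steps above can be sketched in base R. This is only an illustration: it uses the ten rents printed earlier rather than the full sample of twenty, and the names `rents`, `n_reps`, and `boot_medians` are hypothetical.

```r
set.seed(2018)

# ILLUSTRATION ONLY: the ten rents printed earlier, not the full sample of twenty
rents <- c(3850, 3800, 2350, 3200, 2150, 3267, 2495, 2349, 3950, 1795)

n_reps <- 5000                   # number of bootstrap samples (an arbitrary choice)
boot_medians <- numeric(n_reps)  # preallocate storage for the bootstrap statistics

for (i in seq_len(n_reps)) {
  # step 1: resample with replacement, same size as the original sample
  boot_sample <- sample(rents, size = length(rents), replace = TRUE)
  # step 2: compute the bootstrap statistic
  boot_medians[i] <- median(boot_sample)
}

# step 3: boot_medians now holds the bootstrap distribution
# step 4: the middle 95% gives the bounds of a 95% interval
ci_95 <- quantile(boot_medians, probs = c(0.025, 0.975))
ci_95
```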

---

## Let's bootstrap

![boot-by-hand](img/bootstrap-by-hand.png)

---

# Bootstrapping in R

---

## Two ways

1. Using `for` loops
2. Using the **infer** package

---

## Bootstrapping with `for` loops

```r
set.seed(11012018)

boot_df <- tibble(
  replicate = 1:15000,
  stat = rep(NA, 15000)
  )

for (i in 1:15000){
   boot_df$stat[i] <- manhattan %>% 
     sample_n(20, replace = TRUE) %>% 
     summarise(stat = median(rent)) %>% 
     pull()
}
```

---

## Bootstrap results

```r
ggplot(boot_df, aes(x = stat)) +
  geom_histogram(binwidth = 50)
```

![](u2_d07-bootstrapping_files/figure-html/unnamed-chunk-5-1.png)

.pull-left[
```r
boot_df %>%
  summarise(
    lower = quantile(stat, 0.025),
    upper = quantile(stat, 0.975)
    )
```
]
.pull-right[

```
## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1 2162.  2875
```
]

---

## infer `$\in$` tidymodels

.pull-left[
![](img/infer-part-of-tidymodels.png)
]
.pull-right[
The objective of `infer` is to perform statistical inference using an expressive statistical grammar that coheres with the `tidyverse` design framework.

```r
library(infer)
```
]

---

## Generate bootstrap medians

```r
manhattan %>%
  # specify the variable of interest
  specify(response = rent)
```

---

## Generate bootstrap medians

```r
manhattan %>%
  # specify the variable of interest
  specify(response = rent) %>%
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap")
```

---

## Generate bootstrap medians

```r
manhattan %>%
  # specify the variable of interest
  specify(response = rent) %>%
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap") %>%
  # calculate the median of each bootstrap sample
  calculate(stat = "median")
```

---

## Generate bootstrap medians

```r
# save resulting bootstrap distribution
boot_df <- manhattan %>%
  # specify the variable of interest
  specify(response = rent) %>% 
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap") %>% 
  # calculate the median of each bootstrap sample
  calculate(stat = "median")
```

---

## The bootstrap distribution

```r
boot_df
```

```
## # A tibble: 15,000 x 2
##    replicate  stat
##        <int> <dbl>
##  1         1 2350 
##  2         2 2300 
##  3         3 2550 
##  4         4 2550 
##  5         5 2350 
##  6         6 2350 
##  7         7 2262.
##  8         8 2422.
##  9         9 2350.
## 10        10 2350 
## # ... with 14,990 more rows
```

---

## Visualize the bootstrap distribution

```r
ggplot(data = boot_df, mapping = aes(x = stat)) +
  geom_histogram(binwidth = 50) +
  labs(title = "Bootstrap distribution of medians")
```

![](u2_d07-bootstrapping_files/figure-html/unnamed-chunk-14-1.png)

---

## Calculate the confidence interval

A 95% confidence interval is bounded by the middle 95% of the bootstrap distribution.

```r
boot_df %>%
  summarize(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975))
```

```
## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1 2162.  2875
```
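
As an alternative to calling `quantile()` by hand, **infer** provides `get_confidence_interval()`. The sketch below assumes a reasonably recent version of the package is installed and, for illustration only, uses the ten rents printed earlier rather than the full sample; `rent_sample`, `boot_medians`, and `ci` are hypothetical names.

```r
library(dplyr)
library(infer)

set.seed(2018)

# ILLUSTRATION ONLY: the ten rents printed earlier, not the full sample of twenty
rent_sample <- tibble(
  rent = c(3850, 3800, 2350, 3200, 2150, 3267, 2495, 2349, 3950, 1795)
)

# bootstrap distribution of the median (fewer reps than the slides, to run quickly)
boot_medians <- rent_sample %>%
  specify(response = rent) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "median")

# percentile interval: the middle 95% of the bootstrap distribution
ci <- boot_medians %>%
  get_confidence_interval(level = 0.95, type = "percentile")
ci
```

The names of the two columns in the returned tibble have differed between package versions, so it is safer to refer to them by position.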

---

## Visualize the confidence interval

![](u2_d07-bootstrapping_files/figure-html/unnamed-chunk-17-1.png)

---

## Interpret the confidence interval

.question[
The 95% confidence interval for the median rent of one bedroom apartments in 
Manhattan was calculated as (2162.5, 2875). Which of the 
following is the correct interpretation of this interval?
]

(a) 95% of the time the median rent of one bedroom apartments in this sample is between $2162.5 and $2875.

(b) 95% of all one bedroom apartments in Manhattan have rents between $2162.5 and $2875.

(c) We are 95% confident that the median rent of all one bedroom apartments in Manhattan is between $2162.5 and $2875.

(d) We are 95% confident that the median rent of one bedroom apartments in this sample is between $2162.5 and $2875.

---

# Accuracy vs. precision

---

## Confidence level

**We are 95% confident that ...**

- Suppose we took many samples from the original population and built a 95% confidence interval based on each sample.
- Then about 95% of those intervals would contain the true population parameter.
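
The idea of repeated sampling can be checked by simulation. The sketch below is built on assumptions: a made-up normal population with a known mean stands in for the real one, and the intervals are bootstrap percentile intervals for the mean. All names (`population`, `one_interval`, `coverage`) are hypothetical.

```r
set.seed(112)

# ASSUMPTION: a simulated population with a known mean, so coverage can be checked
population <- rnorm(100000, mean = 3000, sd = 800)
true_mean  <- mean(population)

# build one 95% bootstrap percentile interval for the mean from one random sample
one_interval <- function(n = 50, reps = 500) {
  samp <- sample(population, size = n)
  boot_means <- replicate(reps, mean(sample(samp, size = n, replace = TRUE)))
  quantile(boot_means, probs = c(0.025, 0.975))
}

# repeat for 300 samples; each column is one interval (lower, upper)
intervals <- replicate(300, one_interval())

# proportion of intervals that contain the true population mean
coverage <- mean(intervals[1, ] <= true_mean & true_mean <= intervals[2, ])
coverage
```

The resulting proportion should land in the neighborhood of 0.95, though with small samples percentile intervals can cover slightly less often than the nominal level.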

---

## Commonly used confidence levels

Commonly used confidence levels in practice are 90%, 95%, and 99%.

![](u2_d07-bootstrapping_files/figure-html/unnamed-chunk-18-1.png)

---

## Precision vs. accuracy

.question[
If we want to be very certain that we capture the population parameter, should 
we use a wider interval or a narrower interval? What drawbacks are associated 
with using a wider interval?
]

![garfield](img/garfield.png)
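
The trade-off can be seen by computing intervals at several confidence levels from the same bootstrap distribution. This is a minimal sketch using the ten rents printed earlier (not the full sample) and the mean as the statistic; `boot_means` and `interval_width` are hypothetical names.

```r
set.seed(99)

# ILLUSTRATION ONLY: the ten rents printed earlier, not the full sample of twenty
rents <- c(3850, 3800, 2350, 3200, 2150, 3267, 2495, 2349, 3950, 1795)

# bootstrap distribution of the mean
boot_means <- replicate(5000, mean(sample(rents, replace = TRUE)))

# width of the middle `level` share of the bootstrap distribution
interval_width <- function(level) {
  alpha  <- 1 - level
  bounds <- quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2))
  unname(bounds[2] - bounds[1])
}

# widths of the 90%, 95%, and 99% intervals, in that order
widths <- sapply(c(0.90, 0.95, 0.99), interval_width)
widths
```

Higher confidence levels give wider intervals: we are more likely to capture the parameter, but the interval says less about where it is.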

---

## Calculating confidence intervals at various confidence levels

.question[
How would you modify the following code to calculate a 90% confidence interval? 
How would you modify it for a 99% confidence interval?
]

```r
manhattan %>%
  specify(response = rent) %>% 
  generate(reps = 15000, type = "bootstrap") %>% 
  calculate(stat = "median") %>%
  summarize(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975))
```

---

## Recap

- Sample statistic `$\ne$` population parameter, but if the sample is good, it can be a good estimate.
- We report that estimate with a confidence interval around it, and the width of this interval depends on how much the sample statistic would vary from sample to sample.
- Since we can't continue sampling from the population, we instead bootstrap from the one sample we have to estimate the sampling variability.
- We can do this for any sample statistic:
  - We did it for a median today, `calculate(stat = "median")`
  - Doing it for a mean would just take `calculate(stat = "mean")`
  - And you'll learn about calculating bootstrap intervals for other statistics 
  in your homework