# Estimation via Bootstrap
### Yue Jiang
### 03.04.20

---

<div class="my-footer">

<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>

</div>

---

## Announcements

- Homework 4 due Thursday, Mar 5 at 11:59pm

- Lab 6 due Friday, Mar 6 at 11:59pm

---

# Inference

---

## Terminology

**Population**: a group of individuals or objects we are interested in
studying

**Sample**: a representative (we assume) subset of our population of interest

**Parameter**: a numerical quantity derived from the population
(almost always unknown)

**Statistic**: a numerical quantity derived from a sample

Common population parameters of interest and their corresponding
sample statistics:

| Quantity           | Parameter  | Statistic   |
|--------------------|------------|-------------|
| Mean               | `$\mu$`      | `$\bar{x}$`   |
| Variance           | `$\sigma^2$` | `$s^2$`       |
| Standard deviation | `$\sigma$`   | `$s$`         |
| Proportion         | `$p$`        | `$\hat{p}$`   |
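
As a rough illustration of the statistic column, the sketch below computes each sample statistic in base R. The vector `x` and the cutoff of 4 are made-up values used only for demonstration; they do not come from any data set in these slides.

```r
# Hypothetical sample of six measurements (illustrative values only)
x <- c(2, 4, 4, 5, 7, 8)

x_bar <- mean(x)      # sample mean, estimates mu
s2    <- var(x)       # sample variance (n - 1 denominator), estimates sigma^2
s     <- sd(x)        # sample standard deviation, estimates sigma
p_hat <- mean(x > 4)  # sample proportion of values above 4, estimates p

c(x_bar = x_bar, s2 = s2, s = s, p_hat = p_hat)
```

Each of these numbers would change if a different sample were drawn, which is why a statistic is only an estimate of its parameter.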

---

## What does inference mean?

- **Statistical inference** is the process of using sample data to draw 
  conclusions about the underlying population the sample came from.

- Types of inference: **testing** and **estimation**.

- Today we will focus on estimation.

---

# Estimation

---

## Point estimate

A point estimate is a single value computed from the sample data to serve
as the "best guess", or estimate, for the population parameter.

**What is the downside to using point estimates?**

---

## Confidence intervals

A plausible range of values for the population parameter is an interval 
estimate. One type of interval estimate is known as a **confidence interval**.

- If we report a point estimate, we probably won’t hit the exact 
  population parameter.

- If we report a range of plausible values, we have a good shot at 
  capturing the parameter.

---

## Variability of sample statistics

- In order to construct a confidence interval we need to quantify the 
  variability of our sample statistic.

- For example, if we want to construct a confidence interval for a population 
  mean, we need to come up with a plausible range of values around our observed 
  sample mean.

- This range will depend on how precise and how accurate our sample mean is as 
  an estimate of the population mean.

- Quantifying this requires a measurement of how much we would expect the 
  sample mean to vary from sample to sample.
    
--

.question[
Suppose you randomly sample 50 students and 5 of them are left handed. If you 
were to take another random sample of 50 students, how many would you expect 
to be left handed? Would you be surprised if only 3 of them were left handed? 
Would you be surprised if 40 of them were left handed?
]
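
One way to build intuition for this question is a small simulation. The sketch below assumes, purely for illustration, that the true proportion of left-handed students is 0.10 (the value suggested by the first sample); the true proportion is unknown, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the simulation is reproducible

# Assumption: true proportion of left-handed students is 0.10.
# This is a guess based on 5 out of 50 in the first sample.
p_assumed <- 0.10
n <- 50

# Number of left-handed students in each of 10,000 simulated samples of 50
counts <- rbinom(10000, size = n, prob = p_assumed)

mean(counts <= 3)   # share of samples with 3 or fewer left-handed students
mean(counts >= 40)  # share of samples with 40 or more left-handed students
range(counts)       # smallest and largest counts seen in the simulation
```

Under this assumption, a count of 3 is fairly common, while a count of 40 essentially never occurs.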

---

## Quantifying the variability of a sample statistic

We can quantify the variability of sample statistics using

1. **Simulation**: via bootstrapping, or "resampling" (today);

2. **Theory**: via the Central Limit Theorem (later in the course).

---

# Bootstrapping

---

## Bootstrapping

- The term **bootstrapping** comes from the phrase "pulling oneself up 
  by one’s bootstraps", meaning to help oneself without the aid of others.

- In this case, we are estimating a population parameter, and we’ll 
  accomplish it using data **only from the given sample**.

- This notion of saying something about a population parameter using 
  only information from an observed sample is the crux of statistical inference; 
  it is not limited to bootstrapping.

---

## Rent in Manhattan

How much do you think it costs to rent a typical 1 bedroom apartment in 
Manhattan?

---

## Sample data

Consider 20 one-bedroom apartments that were randomly selected on 
Craigslist Manhattan from apartments listed as "by owner".

```r
library(tidyverse)
manhattan <- read_csv("data/manhattan.csv")
```

```r
manhattan %>% slice(1:10)
```

```
#> # A tibble: 10 x 1
#>     rent
#>    <dbl>
#>  1  3850
#>  2  3800
#>  3  2350
#>  4  3200
#>  5  2150
#>  6  3267
#>  7  2495
#>  8  2349
#>  9  3950
#> 10  1795
```

```r
manhattan %>% slice(11:20)
```

```
#> # A tibble: 10 x 1
#>     rent
#>    <dbl>
#>  1  2145
#>  2  2300
#>  3  1775
#>  4  2000
#>  5  2175
#>  6  2350
#>  7  2550
#>  8  4195
#>  9  1470
#> 10  2350
```

---

## Parameter of interest

.tiny[
<img src="13-bootstrap_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]

**Is the mean or the median a better measure of typical rent in Manhattan?**

---

## Observed sample vs. population

```r
manhattan %>% 
  pull(rent) %>% 
  median()
```

```
#> [1] 2350
```

The value above is the median of the observed sample. For the population
median: we don't have this data!

---

## Bootstrapping scheme

1. **Take a bootstrap sample** - a random sample taken with replacement 
   from the original sample, of the same size as the original sample.

2. **Calculate the bootstrap statistic** - a statistic such as mean, median,
   proportion, slope, etc. computed on the bootstrap sample.

3. **Repeat steps (1) and (2) many times to create a bootstrap distribution** - 
   a distribution of bootstrap statistics.

4. **Calculate the bounds of the XX% confidence interval** as the middle XX% 
   of the bootstrap distribution.
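
The four steps above can be sketched in base R as follows. This is a minimal illustration, not the approach used later in these slides: the `rent` vector is typed in from the 20 values printed earlier, and the seed and the number of repetitions (5,000) are arbitrary choices.

```r
set.seed(1)  # arbitrary seed for reproducibility

# The 20 sampled rents, copied from the output shown earlier
rent <- c(3850, 3800, 2350, 3200, 2150, 3267, 2495, 2349, 3950, 1795,
          2145, 2300, 1775, 2000, 2175, 2350, 2550, 4195, 1470, 2350)

n_reps <- 5000  # number of bootstrap samples (arbitrary choice)

# Step 3: repeat steps 1 and 2 many times
boot_medians <- replicate(n_reps, {
  # Step 1: resample with replacement, same size as the original sample
  boot_sample <- sample(rent, size = length(rent), replace = TRUE)
  # Step 2: compute the bootstrap statistic
  median(boot_sample)
})

# Step 4: the middle 95% of the bootstrap distribution
ci_95 <- quantile(boot_medians, probs = c(0.025, 0.975))
ci_95
```

Changing `median()` to another function, such as `mean()`, gives a bootstrap distribution for a different statistic.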

---

# Bootstrapping in R

---

## Package `infer`

.pull-right[
 
The objective of package `infer` is to perform statistical inference using an 
expressive statistical grammar that coheres with the tidyverse design framework.
]

---

## Package `infer`

![ht-diagram](img/13/ht-diagram.png)

```r
library(infer)
```

Also, let's set a seed:

```r
set.seed(03052020)
```

Function `set.seed()` is a base R function that allows us to control R's
random number generation. Use this to make your simulation work reproducible.

---

## Generate bootstrap medians

1. `specify()` the variable of interest.

```r
boot_dist <- manhattan %>%
  specify(response = rent)
```

---

## Generate bootstrap medians

1. `specify()` the variable of interest.
 
2. `generate()` a fixed number of bootstrap samples.

```r
boot_dist <- manhattan %>%
 specify(response = rent) %>% 
 generate(reps = 15000, type = "bootstrap")
```

---

## Generate bootstrap medians

1. `specify()` the variable of interest.
 
2. `generate()` a fixed number of bootstrap samples.
 
3. `calculate()` the bootstrapped statistic(s).

```r
boot_dist <- manhattan %>%
 specify(response = rent) %>% 
 generate(reps = 15000, type = "bootstrap") %>% 
 calculate(stat = "median")
```

---

## Sample medians

How many observations are there in `boot_dist`? What does each observation 
represent?

```r
boot_dist
```

```
#> # A tibble: 15,000 x 2
#>    replicate  stat
#>        <int> <dbl>
#>  1         1 2350 
#>  2         2 2238.
#>  3         3 2262.
#>  4         4 2350 
#>  5         5 2450 
#>  6         6 2324.
#>  7         7 2350.
#>  8         8 2775 
#>  9         9 2350 
#> 10        10 2450 
#> # ... with 14,990 more rows
```

---

## Visualize the distribution
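
A histogram is one reasonable way to look at the bootstrap distribution. The sketch below is self-contained: it rebuilds a bootstrap distribution in base R under the name `boot_dist_sketch` (a stand-in for `boot_dist`), using the 20 rents printed earlier, an arbitrary seed, and 5,000 repetitions. The bin width and labels are illustrative choices.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed for reproducibility

# The 20 sampled rents, copied from the output shown earlier
rent <- c(3850, 3800, 2350, 3200, 2150, 3267, 2495, 2349, 3950, 1795,
          2145, 2300, 1775, 2000, 2175, 2350, 2550, 4195, 1470, 2350)

# Stand-in for boot_dist: one bootstrap median per row
boot_dist_sketch <- data.frame(
  stat = replicate(5000, median(sample(rent, replace = TRUE)))
)

# Histogram of the bootstrap medians
p_hist <- ggplot(data = boot_dist_sketch, mapping = aes(x = stat)) +
  geom_histogram(binwidth = 50) +
  labs(title = "Bootstrap distribution of medians",
       x = "Median rent", y = "Count")
p_hist
```

With the `boot_dist` object created above, the same plotting code should work by passing `boot_dist` as the data.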

---

## Calculate the confidence interval

A 95% confidence interval is bounded by the middle 95% of the bootstrap 
distribution.

Use `dplyr` functions:

```r
boot_dist %>%
  summarize(lower_bound = quantile(stat, 0.025),
            upper_bound = quantile(stat, 0.975))
```

```
#> # A tibble: 1 x 2
#>   lower_bound upper_bound
#>         <dbl>       <dbl>
#> 1       2162.        2875
```

Use `get_ci()` from `infer`:

```r
percentile_ci <- get_ci(boot_dist, level = .95)
percentile_ci
```

```
#> # A tibble: 1 x 2
#>   `2.5%` `97.5%`
#>    <dbl>   <dbl>
#> 1  2162.    2875
```

---

## Visualize a confidence interval

Using `geom_vline()`.
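
A possible version of this plot is sketched below. To keep it self-contained, it recomputes a bootstrap distribution in base R (`boot_dist_sketch`, a stand-in for `boot_dist`) from the 20 rents printed earlier; the seed, repetition count, colors, and line type are arbitrary choices.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed for reproducibility

# The 20 sampled rents, copied from the output shown earlier
rent <- c(3850, 3800, 2350, 3200, 2150, 3267, 2495, 2349, 3950, 1795,
          2145, 2300, 1775, 2000, 2175, 2350, 2550, 4195, 1470, 2350)

# Stand-in for boot_dist: one bootstrap median per row
boot_dist_sketch <- data.frame(
  stat = replicate(5000, median(sample(rent, replace = TRUE)))
)

# Bounds of the 95% percentile interval
bounds <- unname(quantile(boot_dist_sketch$stat, probs = c(0.025, 0.975)))

# Histogram with a dashed vertical line at each bound
p_ci <- ggplot(data = boot_dist_sketch, mapping = aes(x = stat)) +
  geom_histogram(binwidth = 50, alpha = 0.5) +
  geom_vline(xintercept = bounds, color = "darkgreen", linetype = "dashed") +
  labs(title = "Bootstrap medians with 95% interval bounds",
       x = "Median rent", y = "Count")
p_ci
```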

---

## Interpret the confidence interval

The 95% confidence interval for the median rent of one bedroom apartments in Manhattan was calculated as (2162.5, 2875). Which of the following is the correct interpretation of this interval?

<ol type="a">
 <li>95% of the time the median rent of one bedroom apartments in this sample is 
 between $2162.5 and $2875.</li>
 
 <li>95% of all one bedroom apartments in Manhattan have rents between 
 $2162.5 and $2875.</li>
 
 <li>We are 95% confident that the median rent of all one bedroom apartments 
 is between $2162.5 and $2875.</li>
 
 <li>We are 95% confident that the median rent of one bedroom apartments in 
 this sample is between $2162.5 and $2875.</li>
</ol>

---

## Caution

We **cannot** say

"There is a 95% *chance* that the median lies in the 
confidence interval."

---

# Accuracy vs. precision

---

## So what IS a confidence interval?

**We are 95% confident that ...**

- Suppose we took many samples from the original population and built a 95%
  confidence interval based on each sample.

- We would expect about 95% of those intervals to contain the true population 
  parameter.

- However, we never know whether in fact 95% of them do, or whether any particular
  interval contains the true parameter (maybe none of them do!).
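
The repeated-sampling idea can be checked by simulation when the population is known. The sketch below invents a population (a right-skewed log-normal distribution, chosen only for illustration), draws many samples from it, and counts how often a 95% bootstrap percentile interval contains the true median. The helper `boot_ci()`, the sample size of 50, and the repetition counts are all hypothetical choices.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical population of 10,000 rents (illustrative, not real data)
population  <- rlnorm(10000, meanlog = log(2500), sdlog = 0.3)
true_median <- median(population)

# Hypothetical helper: percentile bootstrap interval for the median
boot_ci <- function(x, reps = 500, level = 0.95) {
  meds  <- replicate(reps, median(sample(x, replace = TRUE)))
  alpha <- 1 - level
  quantile(meds, probs = c(alpha / 2, 1 - alpha / 2))
}

# Draw 200 samples of size 50; record whether each interval covers the truth
covered <- replicate(200, {
  ci <- boot_ci(sample(population, size = 50))
  ci[[1]] <= true_median && true_median <= ci[[2]]
})

mean(covered)  # proportion of intervals that contain the true median
```

The proportion should land in the neighborhood of 0.95, though it will not equal it exactly, and bootstrap intervals built from small samples can fall somewhat short of the nominal level.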

---

## Common confidence levels

Commonly used confidence levels in practice are 90%, 95%, and 99%. Which lines
represent which of the aforementioned levels?

```r
# 95% bounds: the plot needs these as standalone values
l95 <- boot_dist %>% 
 summarize(lower_bound = quantile(stat, 0.025)) %>% 
 round(2) %>% 
 pull()
u95 <- boot_dist %>% 
 summarize(upper_bound = quantile(stat, 0.975)) %>% 
 round(2) %>% 
 pull()

# 90% bounds
l90 <- boot_dist %>% 
 summarize(lower_bound = quantile(stat, 0.05)) %>% 
 round(2) %>% 
 pull()
u90 <- boot_dist %>% 
 summarize(upper_bound = quantile(stat, 0.95)) %>% 
 round(2) %>% 
 pull()

# 99% bounds
l99 <- boot_dist %>% 
 summarize(lower_bound = quantile(stat, 0.005)) %>% 
 round(2) %>% 
 pull()
u99 <- boot_dist %>% 
 summarize(upper_bound = quantile(stat, 0.995)) %>% 
 round(2) %>% 
 pull()

# Histogram of bootstrap medians with one pair of lines per confidence level
ggplot(data = boot_dist, mapping = aes(x = stat)) +
  geom_histogram(binwidth = 50, alpha = .5) +
  geom_vline(xintercept = c(l95, u95), 
             color = "darkgreen", lty = 2, size = 1) +
  geom_vline(xintercept = c(l90, u90), color = "purple", lty = 3, size = 1) +
  geom_vline(xintercept = c(l99, u99), color = "red", lty = 6, size = 1) +
  labs(title = "Bootstrap distribution of medians is skewed right",
       x = "Median", y = "Count") +
  theme_minimal(base_size = 16)
```

---

## Precision vs. accuracy

If we want to be very certain that we capture the population parameter, should 
we use a wider interval or a narrower interval? What drawbacks are associated 
with using a wider interval?

How can we get the best of both worlds -- high precision and high accuracy?

---

## Calculating confidence intervals at various confidence levels

How would you modify the following code to calculate a 90% confidence interval? 
How would you modify it for a 99% confidence interval?

```r
manhattan %>%
  specify(response = rent) %>% 
  generate(reps = 15000, type = "bootstrap") %>% 
  calculate(stat = "median") %>%
  summarize(lower_bound = quantile(stat, 0.025),
            upper_bound = quantile(stat, 0.975))
```

What would you modify to compute a confidence interval for the population
mean?
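
One possible way to think about these questions is sketched below in base R. The tail probabilities follow from the confidence level: a level of `level` leaves `(1 - level) / 2` in each tail. The helper `ci_at()` is hypothetical, the `rent` vector is typed in from the values printed earlier, and the seed and repetition count are arbitrary. In the `infer` pipeline above, the corresponding changes would be to the two `quantile()` probabilities and to `stat = "mean"` in `calculate()`.

```r
set.seed(1)  # arbitrary seed for reproducibility

# The 20 sampled rents, copied from the output shown earlier
rent <- c(3850, 3800, 2350, 3200, 2150, 3267, 2495, 2349, 3950, 1795,
          2145, 2300, 1775, 2000, 2175, 2350, 2550, 4195, 1470, 2350)

# Hypothetical helper: middle `level` share of a set of bootstrap statistics
ci_at <- function(stats, level) {
  alpha <- 1 - level
  quantile(stats, probs = c(alpha / 2, 1 - alpha / 2))
}

# Bootstrap distributions of the median and of the mean
boot_medians <- replicate(5000, median(sample(rent, replace = TRUE)))
boot_means   <- replicate(5000, mean(sample(rent, replace = TRUE)))

ci_90 <- ci_at(boot_medians, 0.90)      # uses the 0.05 and 0.95 quantiles
ci_99 <- ci_at(boot_medians, 0.99)      # uses the 0.005 and 0.995 quantiles
ci_mean_95 <- ci_at(boot_means, 0.95)   # interval for the population mean

ci_90
ci_99
ci_mean_95
```

Because both median intervals come from the same bootstrap distribution, the 99% interval is at least as wide as the 90% interval.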

---

## Recap

- Sample statistic `$\ne$` population parameter, but if it is a random sample, it 
  can be a good estimate.

- We report that estimate with a confidence band around it, and the relative
  width of this band depends on the sample statistic's variability.

- Since we can't continue sampling from the population, we instead bootstrap 
  from the one sample we have to estimate the sampling variability.
  
---

## Your turn

Go to [https://classroom.github.com/a/eULa07lW](https://classroom.github.com/a/eULa07lW) to try it yourself!

---

## References

1. Tidy Statistical Inference. (2020). Infer.netlify.com. Retrieved 29 February 
   2020, from https://infer.netlify.com/index.html