Estimation via bootstrapping

# Estimation via bootstrapping
## Intro to Data Science
### Shawn Santo
### 03-05-20

---

## Announcements

- Homework 3 due Thursday, Mar 5 at 11:59pm

- Lab 6 due Friday, Mar 6 at 11:59pm

---

# Inference

---

## Terminology

**Population**: a group of individuals or objects we are interested in
studying

**Sample**: a representative (we assume) subset of our population of interest

**Parameter**: a numerical quantity derived from the population
(almost always unknown)

**Statistic**: a numerical quantity derived from a sample

Common population parameters of interest and their corresponding
sample statistic:

| Quantity           | Parameter  | Statistic   |
|--------------------|------------|-------------|
| Mean               | `$\mu$`      | `$\bar{x}$`   |
| Variance           | `$\sigma^2$` | `$s^2$`       |
| Standard deviation | `$\sigma$`   | `$s$`         |
| Median             | `$M$`        | `$\tilde{x}$` |
| Proportion         | `$p$`        | `$\hat{p}$`   |

---

## What does inference mean?

- **Statistical inference** is the process of using sample data to make 
  conclusions about the underlying population the sample came from.

- Types of inference: **testing** and **estimation**.

- Today we will focus on estimation.

---

---

# Estimation

---

## Point estimate

A point estimate is a single value computed from the sample data to serve
as the "best guess", or estimate, for the population parameter.

**What is the downside to using point estimates?**

---

## Confidence intervals

A plausible range of values for the population parameter is an interval 
estimate. One type of interval estimate is known as a **confidence interval**.

- If we report a point estimate, we probably won’t hit the exact 
  population parameter.

- If we report a range of plausible values, we have a good shot at 
  capturing the parameter.

---

## Variability of sample statistics

- In order to construct a confidence interval we need to quantify the 
  variability of our sample statistic.

- For example, if we want to construct a confidence interval for a population 
  mean, we need to come up with a plausible range of values around our observed 
  sample mean.

- This range will depend on how precise and how accurate our sample mean is as 
  an estimate of the population mean.

- Quantifying this requires a measurement of how much we would expect the 
  sample mean to vary from sample to sample.
    
--

Suppose you randomly sample 50 students and 5 of them are left handed. If you 
were to take another random sample of 50 students, how many would you expect 
to be left handed? Would you be surprised if only 3 of them were left handed? 
Would you be surprised if 40 of them were left handed?

---

## Quantifying the variability of a sample statistic

We can quantify the variability of sample statistics using

1. **simulation**: via bootstrapping (today);

2. **theory**: via Central Limit Theorem (later in the course).

---

# Bootstrapping

---

## Bootstrapping

- The term **bootstrapping** comes from the phrase "pulling oneself up 
  by one’s bootstraps", to help oneself without the aid of others.

- In this case, we are estimating a population parameter, and we’ll 
  accomplish it using data from only from the given sample.

- This notion of saying something about a population parameter using 
  only information from an observed sample is the crux of statistical inference, 
  it is not limited to bootstrapping.

---

## Rent in Manhattan

How much do you think it costs to rent a typical 1 bedroom apartment in 
Manhattan?

---

## Sample data

Consider 20 1 bedroom apartments that were randomly selected on 
Craigslist Manhattan from apartments listed as "by owner".

```r
library(tidyverse)
manhattan <- read_csv("data/manhattan.csv")
```

```r
manhattan %>% slice(1:10)
```

```
#> # A tibble: 10 x 1
#> rent
#> <dbl>
#> 1 3850
#> 2 3800
#> 3 2350
#> 4 3200
#> 5 2150
#> 6 3267
#> 7 2495
#> 8 2349
#> 9 3950
#> 10 1795
```
]

```r
manhattan %>% slice(11:20)
```

```
#> # A tibble: 10 x 1
#> rent
#> <dbl>
#> 1 2145
#> 2 2300
#> 3 1775
#> 4 2000
#> 5 2175
#> 6 2350
#> 7 2550
#> 8 4195
#> 9 1470
#> 10 2350
```
]
]

---

## Parameter of interest

.tiny[
<img src="lec-09b-bootstrap_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]

**Is the mean or the median a better measure of typical rent in Manhattan?**

???

```r
ggplot(data = manhattan, mapping = aes(x = rent)) +
  geom_histogram(binwidth = 250, fill = "purple") +
  labs(title = "There is high variability in Manhattan rent", 
       subtitle = "1 bedroom apartments", x = "Rent", y = "Count") +
  theme_minimal(base_size = 16)
```

---

## Observed sample vs. population

```r
manhattan %>% 
  pull(rent) %>% 
  median()
```

```
#> [1] 2350
```

]

We don't have this data!

]

---

## Bootstrapping scheme

1. **Take a bootstrap sample** - a random sample taken with replacement 
   from the original sample, of the same size as the original sample.

2. **Calculate the bootstrap statistic** - a statistic such as mean, median,
   proportion, slope, etc. computed on the bootstrap samples.

3. **Repeat steps (1) and (2) many times to create a bootstrap distribution** - 
   a distribution of bootstrap statistics.

4. **Calculate the bounds of the XX% confidence interval** as the middle XX% 
   of the bootstrap distribution.

---

# Bootstrapping in R

---

## Package `infer`

.pull-right[
 
The objective of package `infer` is to perform statistical inference using an 
expressive statistical grammar that coheres with the tidyverse design framework.
]

---

## Package `infer`

![ht-diagram](img/09a/ht-diagram.png)

```r
library(infer)
```

Also, let's set a seed:

```r
set.seed(03052020)
```

Function `set.seed()` is a base R function that allows us to control R's
random number generation. Use this to make your simulation work reproducible.

---

## Generate bootstrap medians

1. `specify()` the variable of interest.

```r
boot_dist <- manhattan %>%
* specify(response = rent)
```

---

## Generate bootstrap medians

1. `specify()` the variable of interest.
 
2. `generate()` a fixed number of bootstrap samples.

```r
boot_dist <- manhattan %>%
 specify(response = rent) %>% 
* generate(reps = 15000, type = "bootstrap")
```

---

## Generate bootstrap medians

1. `specify()` the variable of interest.
 
2. `generate()` a fixed number of bootstrap samples.
 
3. `calculate()` the bootstrapped statistic(s).

```r
boot_dist <- manhattan %>%
 specify(response = rent) %>% 
 generate(reps = 15000, type = "bootstrap") %>% 
* calculate(stat = "median")
```

---

## Sample medians

How many observations are there in `boot_dist`? What does each observation 
represent?

```r
boot_dist
```

```
#> # A tibble: 15,000 x 2
#> replicate stat
#> <int> <dbl>
#> 1 1 2350 
#> 2 2 2238.
#> 3 3 2262.
#> 4 4 2350 
#> 5 5 2450 
#> 6 6 2324.
#> 7 7 2350.
#> 8 8 2775 
#> 9 9 2350 
#> 10 10 2450 
#> # … with 14,990 more rows
```

---

## Visualize the distribution

???

```r
ggplot(data = boot_dist, mapping = aes(x = stat)) +
  geom_histogram(binwidth = 100) +
  labs(title = "Bootstrap distribution of medians is skewed right",
       x = "Median", y = "Count", caption = "Binwidth of 100") +
  theme_minimal(base_size = 16)
```

---

## Calculate the confidence interval

A 95% confidence interval is bounded by the middle 95% of the bootstrap 
distribution.

Use `dplyr` functions:

```r
boot_dist %>%
  summarize(lower_bound = quantile(stat, 0.025),
            upper_bound = quantile(stat, 0.975))
```

```
#> # A tibble: 1 x 2
#> lower_bound upper_bound
#> <dbl> <dbl>
#> 1 2162. 2875
```

Use `get_ci()` from `infer`:

```r
percentile_ci <- get_ci(boot_dist, level = .95)
percentile_ci
```

```
#> # A tibble: 1 x 2
#> `2.5%` `97.5%`
#> <dbl> <dbl>
#> 1 2162. 2875
```

---

## Visualize a confidence interval

Using `geom_vline()`.

???

```r
lower_bound <- boot_dist %>% 
 summarize(lower_bound = quantile(stat, 0.025)) %>% 
 pull()

upper_bound <- boot_dist %>% 
 summarize(upper_bound = quantile(stat, 0.975)) %>% 
 pull()
```

```r
ggplot(data = boot_dist, mapping = aes(x = stat)) +
  geom_histogram(binwidth = 50, alpha = .5) +
  geom_vline(xintercept = c(lower_bound, upper_bound), 
             color = "darkgreen", lty = 2, size = 1) +
  labs(title = "Bootstrap distribution of medians is skewed right",
       x = "Median", y = "Count", 
       caption = "Green lines represent 95% C.I. bounds") +
  theme_minimal(base_size = 16)
```

---

## Visualize a confidence interval

Using `visualize()` and `shade_confidence_interval()`.

```r
visualize(boot_dist) +
  shade_confidence_interval(endpoints = percentile_ci) +
  labs(title = "Bootstrap distribution of medians is skewed right",
       x = "Median", y = "Count", 
       caption = "Green lines represent 95% C.I. bounds") +
  theme_minimal(base_size = 16)
```

<img src="lec-09b-bootstrap_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />
]

---

## Interpret the confidence interval

The 95% confidence interval for the median rent of one bedroom apartments in Manhattan was calculated as (2162.5, 2875). Which of the following is the correct interpretation of this interval?

<ol type="a">
 <li>95% of the time the median rent one bedroom apartments in this sample is 
 between $2162.5 and $2875.</li>
 
 <li>95% of all one bedroom apartments in Manhattan have rents between 
 $2162.5 and $2875.</li>
 
 <li>We are 95% confident that the median rent of all one bedroom apartments 
 is between $2162.5 and $2875.</li>
 
 <li>We are 95% confident that the median rent of one bedroom apartments in 
 this sample is between $2162.5 and $2875.</li>
</ol>

**C is correct.**

---

## Caution!

We **cannot** say "there is a 95% *chance* that the median lies in the
confidence interval".

**Why?**

---

# Accuracy vs. precision

---

## Confidence level

**We are 95% confident that ...**

- Suppose we took many samples from the original population and built a 95%
  confidence interval based on each sample.

- About 95% of those intervals would contain the true population parameter.

---

## Common confidence levels

Commonly used confidence levels in practice are 90%, 95%, and 99%. Which lines
represent which of the aforementioned levels?

???

```r
l90 <- boot_dist %>% 
 summarize(lower_bound = quantile(stat, 0.05)) %>% 
 round(2) %>% 
 pull()
u90 <- boot_dist %>% 
 summarize(lower_bound = quantile(stat, 0.95)) %>% 
 round(2) %>% 
 pull()

l99 <- boot_dist %>% 
 summarize(lower_bound = quantile(stat, 0.005)) %>% 
 round(2) %>% 
 pull()
u99 <- boot_dist %>% 
 summarize(lower_bound = quantile(stat, 0.995)) %>% 
 round(2) %>% 
 pull()

ggplot(data = boot_dist, mapping = aes(x = stat)) +
  geom_histogram(binwidth = 50, alpha = .5) +
  geom_vline(xintercept = c(lower_bound, upper_bound), 
             color = "darkgreen", lty = 2, size = 1) +
  geom_vline(xintercept = c(l90, u90), color = "purple", lty = 3, size = 1) +
  geom_vline(xintercept = c(l99, u99), color = "red", lty = 6, size = 1) +
  labs(title = "Bootstrap distribution of medians is skewed right",
       x = "Median", y = "Count") +
  theme_minimal(base_size = 16)
```

---

## Precision vs. accuracy

If we want to be very certain that we capture the population parameter, should 
we use a wider interval or a narrower interval? What drawbacks are associated 
with using a wider interval?

How can we get best of both worlds -- high precision and high accuracy?

---

## Calculating confidence intervals at various confidence levels

How would you modify the following code to calculate a 90% confidence interval? 
How would you modify it for a 99% confidence interval?

```r
manhattan %>%
  specify(response = rent) %>% 
  generate(reps = 15000, type = "bootstrap") %>% 
  calculate(stat = "median") %>%
  summarize(lower_bound = quantile(stat, 0.025),
            upper_bound = quantile(stat, 0.975))
```

What would you modify to compute a confidence interval for the population
mean?

---

## Recap

- Sample statistic `$\ne$` population parameter, but if it is a random sample, it 
  can be a good estimate.

- We report that estimate with a confidence band around it, and the relative
  width of this band depends on the sample statistic's variability.

- Since we can't continue sampling from the population, we instead bootstrap 
  from the one sample we have to estimate the sampling variability.

---

## Application exercise

https://classroom.github.com/a/PKqJ_2S6

1. Create a 98% bootstrap confidence interval for the population median price of
   1 bedroom apartments in Manhattan.

2. Create a 98% bootstrap confidence interval for the population mean price of
   1 bedroom apartments in Manhattan. Write out an appropriate interpretation of
   this interval.

3. Compare the two confidence intervals. Which interval is wider?

4. Can we use these intervals to make an accurate statement about the mean/median 
   price of 1 bedroom apartments in the Bronx?

---

## References

1. Tidy Statistical Inference. (2020). Infer.netlify.com. Retrieved 29 February 
   2020, from https://infer.netlify.com/index.html