class: center, middle, inverse, title-slide

# Simulation-Based Inference I

## Intro to Data Science

### Shawn Santo

### 03-26-20

---

## Video table of contents

1. Introduction and overview: [slides 3 - 11](https://warpwire.duke.edu/w/D4UDAA/)
<br/><br/>
2. Hypothesis testing steps: [slides 12 - 19](https://warpwire.duke.edu/w/E4UDAA/)
<br/><br/>
3. Using `infer` for hypothesis testing: [slides 20 - 38](https://warpwire.duke.edu/w/F4UDAA/)

---

## Announcements

- Lab 7 due Friday, Mar 27 at 11:59pm Eastern Time
- Project introduced tomorrow

---

## Packages and seed

```r
library(tidyverse)
library(infer)
```

```r
set.seed(200325)
```

---

class: center, middle, inverse

# The hypothesis testing framework

---

## Review of terminology

A **parameter** is a numerical quantity derived from the population; it is the "true" value of interest. We typically estimate the parameter using a **sample statistic** as a point estimate.

--

Common population parameters of interest and their corresponding sample statistics:

| Quantity           | Parameter      | Statistic     |
|--------------------|----------------|---------------|
| Mean               | `\(\mu\)`      | `\(\bar{x}\)` |
| Variance           | `\(\sigma^2\)` | `\(s^2\)`     |
| Standard deviation | `\(\sigma\)`   | `\(s\)`       |
| Proportion         | `\(p\)`        | `\(\hat{p}\)` |

---

## How can we answer research questions using statistics?

.question[
**Statistical hypothesis testing** is the procedure that assesses the evidence provided by the data in favor of or against some claim about the population (often about a population parameter or potential associations).
]

---

## Motivating example: organ donors

People providing an organ for donation sometimes seek the help of a special medical consultant. These consultants assist the patient in all aspects of the surgery, with the goal of reducing the possibility of complications during the medical procedure and recovery. Patients might choose a consultant based in part on the historical complication rate of the consultant's clients.
--

One consultant tried to attract patients by noting that the average complication rate for liver donor surgeries in the US is about 10%, but her clients have had only 3 complications in the 62 liver donor surgeries she has facilitated. She claims this is strong evidence that her work meaningfully contributes to reducing complications (and therefore she should be hired!).

---

## Correlation vs. causation

.question[
Is it possible to assess the consultant's claim using the data?
]

--

**No.** The claim is that there is a causal connection, but the data are observational. For example, maybe patients who can afford a medical consultant can also afford better medical care, which can also lead to a lower complication rate.

While it is not possible to assess the causal claim, it is still possible to test for an association using these data. For this question we ask: how likely is it that the low observed complication rate of `\(\hat{p} = 0.048\)` is due solely to chance?

---

## The hypothesis testing framework

--

1. Start with two hypotheses about the population: the null hypothesis and the alternative hypothesis.

--

2. Choose a (representative) sample, collect data, and analyze the data.

--

3. Figure out how likely it is to see data like what we observed, IF the null hypothesis were in fact true.

--

4. If our data would have been extremely unlikely if the null claim were true, then we reject it and deem the alternative claim worthy of further study. Otherwise, we cannot reject the null claim.

---

## Emperor Antoninus Pius

<img src="images/pius.jpg" width="50%" style="display: block; margin: auto;" />

<center>
<i>Ei incumbit probatio qui dicit, non qui negat</i>
</center>

---

## Two competing hypotheses

The null hypothesis (often denoted `\(H_0\)`) states that "nothing unusual is happening" or "there is no relationship," etc.
On the other hand, the alternative hypothesis (often denoted `\(H_1\)` or `\(H_A\)`) states the opposite: that there is some sort of relationship (usually this is what we want to check or really think is happening).

.question[
Remember: in statistical hypothesis testing we always first assume that the null hypothesis is true, and then see whether we reject or fail to reject this claim.
]

---

## 1. Defining the hypotheses

Remember, the null and alternative hypotheses are defined for **parameters**, not statistics.

What will our null and alternative hypotheses be for this example?

--

- `\(H_0\)`: the true proportion of complications among her patients is equal to the US population rate
- `\(H_1\)`: the true proportion of complications among her patients is lower than the US population rate

Expressed in symbols:

- `\(H_0: p = 0.10\)`
- `\(H_1: p < 0.10\)`

where `\(p\)` is the true proportion of transplants with complications among her patients.

---

## 2. Collecting and summarizing data

With these two hypotheses, we now take our sample and summarize the data. The choice of summary statistic depends on the type of data. In our example, we use the sample proportion, `\(\hat{p} = 3/62 \approx 0.048\)`.

---

## 3. Assessing the evidence observed

Next, we calculate the probability of getting data like ours, *or more extreme*, if `\(H_0\)` were in fact actually true. This is a conditional probability: "given that `\(H_0\)` is true (i.e., if `\(p = 0.10\)`), what would be the probability of observing `\(\hat{p} = 3/62\)` or lower?"

.question[
This probability is known as the **p-value**.
]

--

We have assumed from the start that the null hypothesis is true, and the p-value is conditioned on this assumption. Importantly, a p-value does **not** provide information on the probability that the null hypothesis is true given our observed data.

We will calculate this p-value during the second half of today's lecture.

---

## 4. Making a conclusion

We reject the null hypothesis if this conditional probability is small enough, i.e., if it would be very unlikely to observe our data, or data more extreme, if `\(H_0\)` were actually true.

What is "small enough"? We often consider a cutpoint (the **significance level** or `\(\alpha\)` **level**) defined *prior* to conducting the analysis.

--

Many analyses use `\(\alpha = 0.05\)`. If `\(H_0\)` were in fact true, we would expect to make the wrong decision only 5% of the time (why?).

If the p-value is less than `\(\alpha\)`, we say the results are *statistically significant* and we *reject the null hypothesis*.

---

## What do we conclude when the p-value is greater than `\(\alpha\)`?

If the p-value is `\(\alpha\)` or greater, we say the results are not statistically significant and we *fail to reject* `\(H_0\)`.

--

Importantly, we never "accept" the null hypothesis -- we assumed that `\(H_0\)` was true to begin with and assessed the probability of obtaining our observed data, or data more extreme, under this assumption.

--

When we fail to reject the null hypothesis, we are stating that there is **insufficient evidence** to assert that it is false. This could be due to any number of reasons:

- There truly is no effect
- There truly is an effect (and we happened to get unlucky with our sample or didn't have enough data to tell that there was one)

---

## What can go wrong?

Suppose we test a certain null hypothesis, which can be either true or false (we never know for sure!). We make one of two decisions given our data: either reject or fail to reject `\(H_0\)`.

--

We have the following four scenarios:

| Decision                 | `\(H_0\)` is true | `\(H_0\)` is false |
|--------------------------|-------------------|--------------------|
| Fail to reject `\(H_0\)` | Correct decision  | *Type II Error*    |
| Reject `\(H_0\)`         | *Type I Error*    | Correct decision   |

It is important to weigh the consequences of making each type of error. In fact, `\(\alpha\)` is precisely the probability of making a Type I error.
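We can check that claim about `\(\alpha\)` with a quick simulation. This is a sketch, not part of the lab data: it assumes the organ-donor settings (`\(n = 62\)`, `\(p = 0.10\)`, a one-sided test with `\(\alpha = 0.05\)`), and the number of simulated studies is arbitrary.

```r
# Simulate many studies in which H0 is actually true, and record how
# often the one-sided p-value falls below alpha = 0.05 (a Type I error).
set.seed(200325)
x <- rbinom(100000, size = 62, prob = 0.10)    # complication counts under H0
p_values <- pbinom(x, size = 62, prob = 0.10)  # P(X <= x) when H0 is true
mean(p_values < 0.05)                          # share of (false) rejections
```

Because `\(\hat{p}\)` is discrete, the realized rate lands slightly below 0.05 here; with a continuous test statistic it would be exactly `\(\alpha\)`.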
We will talk about this (and the associated probability of making a Type II error) in next week's class.

---

## Ok, so what **isn't** a p-value?

- "A p-value of 0.05 means the null hypothesis has a probability of only 5% of being true" **FALSE**

- "A p-value of 0.05 means there is a 95% chance or greater that the null hypothesis is incorrect" **FALSE**

--

Remember, a p-value is calculated *assuming* that `\(H_0\)` is true. It cannot tell us how likely that assumption is to be correct.

Even worse, while we want to know whether we are seeing "something real" or a Type I or Type II error, hypothesis testing does NOT give us the tools to determine this.

---

class: center, middle, inverse

# Conducting hypothesis tests

---

## Simulating the null distribution

Let's return to the organ transplant scenario. Since `\(H_0: p = 0.10\)`, we need to simulate a distribution for `\(\hat{p}\)` under the null hypothesis such that the probability of complication for each of the 62 patients is 0.10.

This null distribution for `\(\hat{p}\)` represents the distribution of observed proportions we might expect, if the null hypothesis were true.

.question[
When sampling from the null distribution, what is the expected proportion of complications? What would the expected count be of patients experiencing complications?
]

---

## Data

```r
organ_donor <- read_csv("data/organ-donor.csv")
glimpse(organ_donor)
```

```
#> Observations: 62
#> Variables: 1
#> $ outcome <chr> "complication", "complication", "complication", "no comp…
```

```r
organ_donor %>%
  count(outcome)
```

```
#> # A tibble: 2 x 2
#>   outcome             n
#>   <chr>           <int>
#> 1 complication        3
#> 2 no complication    59
```

---

## Simulation #1

```
#> sim1
#>    complication no complication 
#>               3              59
```

```
#> [1] 0.0483871
```

<img src="lec-12b-simulation-inference-1_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Simulation #2

```
#> sim2
#>    complication no complication 
#>               9              53
```

```
#> [1] 0.1451613
```

<img src="lec-12b-simulation-inference-1_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Simulation #3

```
#> sim3
#>    complication no complication 
#>               8              54
```

```
#> [1] 0.1290323
```

<img src="lec-12b-simulation-inference-1_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

## This is tedious...

Can we automate this process?

---

## Using the `infer` package

We will generate the null distribution using `infer`, a member of the `tidyverse` collection of packages. We'll walk through this example (one-sample test of proportion) to familiarize ourselves with the basic syntax. Next class, we will show examples of other types of hypothesis tests.
.small[
```r
null_dist <- organ_donor %>%
  specify(response = outcome, success = "complication") %>%
  hypothesize(null = "point",
              p = c("complication" = 0.10, "no complication" = 0.90)
              ) %>%
  generate(reps = 100, type = "simulate") %>%
  calculate(stat = "prop")
```
]

---

## Specify

.small[
```r
null_dist <- organ_donor %>%
* specify(response = outcome, success = "complication") %>%
  hypothesize(null = "point",
              p = c("complication" = 0.10, "no complication" = 0.90)
              ) %>%
  generate(reps = 100, type = "simulate") %>%
  calculate(stat = "prop")
```
]

- `response`: **`outcome`** in the **`organ_donor`** data frame
- `success`: **`"complication"`**, the level of `outcome` we're interested in studying

---

## Hypothesize

.small[
```r
null_dist <- organ_donor %>%
  specify(response = outcome, success = "complication") %>%
* hypothesize(null = "point",
*             p = c("complication" = 0.10, "no complication" = 0.90)
*             ) %>%
  generate(reps = 100, type = "simulate") %>%
  calculate(stat = "prop")
```
]

- `null`: Since we're testing the point null hypothesis `\(H_0: p = 0.10\)`, we choose **`"point"`**
- Next, we provide the probability of success, 0.10, and the probability of failure, 0.90

---

## Generate

.small[
```r
null_dist <- organ_donor %>%
  specify(response = outcome, success = "complication") %>%
  hypothesize(null = "point",
              p = c("complication" = 0.10, "no complication" = 0.90)
              ) %>%
* generate(reps = 100, type = "simulate") %>%
  calculate(stat = "prop")
```
]

- `reps`: We will generate 100 repetitions here
- `type`: Choose **`"simulate"`** for testing a point null
    - Choose `"bootstrap"` for estimation (from last class)
    - Choose `"permute"` for testing independence (will discuss later)

---

## Calculate

.small[
```r
null_dist <- organ_donor %>%
  specify(response = outcome, success = "complication") %>%
  hypothesize(null = "point",
              p = c("complication" = 0.10, "no complication" = 0.90)
              ) %>%
  generate(reps = 100, type = "simulate") %>%
* calculate(stat = "prop")
```
]

- Calculate a sample statistic. Here, the sample proportion.

---

## Store simulated null distribution

.small[
```r
*null_dist <- organ_donor %>%
  specify(response = outcome, success = "complication") %>%
  hypothesize(null = "point",
              p = c("complication" = 0.10, "no complication" = 0.90)
              ) %>%
  generate(reps = 100, type = "simulate") %>%
  calculate(stat = "prop")
```
]

```
#> # A tibble: 100 x 2
#>    replicate  stat
#>        <dbl> <dbl>
#>  1         1 0.048
#>  2         2 0.097
#>  3         3 0.081
#>  4         4 0.081
#>  5         5 0.161
#>  6         6 0.081
#>  7         7 0.081
#>  8         8 0.032
#>  9         9 0.113
#> 10        10 0.113
#> # … with 90 more rows
```

---

## Visualizing the null distribution

.question[
What would you expect the center of the null distribution to be?
]

--

.tiny[
```r
ggplot(data = null_dist, mapping = aes(x = stat)) +
  geom_histogram(binwidth = 0.01) +
  labs(title = "Null distribution", x = "Sample proportion", y = "Count") +
  theme_bw(base_size = 16)
```

<img src="lec-12b-simulation-inference-1_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />
]

---

## Calculating the p-value, visually

.question[
What is the p-value (just eyeball it)?
]

.tiny[
```r
ggplot(data = null_dist, mapping = aes(x = stat)) +
  geom_histogram(binwidth = 0.01) +
  labs(title = "Null distribution", x = "Sample proportion", y = "Count") +
  theme_bw(base_size = 16)
```

<img src="lec-12b-simulation-inference-1_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
]

---

## Calculating the p-value, directly

```r
null_dist %>%
  filter(stat <= (3/62)) %>%
  summarise(p_value = n() / nrow(null_dist))
```

```
#> # A tibble: 1 x 1
#>   p_value
#>     <dbl>
#> 1    0.15
```

.question[
Is this sufficient evidence for you to reject the null hypothesis in favor of the alternative? What might you conclude?
]

--

At the `\(\alpha = 0.05\)` level, we have insufficient evidence to reject the null hypothesis. There is insufficient evidence to suggest that the consultant's complication rate is lower than the U.S. average of 10%.
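---

## Aside: `infer` helper functions

`infer` also provides helpers that replace the manual `filter()`/`summarise()` step and the hand-built histogram. The sketch below assumes `infer` version 0.4 or later (in the newest versions, `generate()` prefers `type = "draw"` over `type = "simulate"`); it rebuilds the data so it runs on its own.

```r
library(tidyverse)
library(infer)

# Rebuild the observed data (3 complications in 62 surgeries) so this
# chunk stands alone; it mirrors the organ_donor data frame used earlier.
organ_donor <- tibble(
  outcome = c(rep("complication", 3), rep("no complication", 59))
)

null_dist <- organ_donor %>%
  specify(response = outcome, success = "complication") %>%
  hypothesize(null = "point",
              p = c("complication" = 0.10, "no complication" = 0.90)) %>%
  generate(reps = 1000, type = "simulate") %>%
  calculate(stat = "prop")

# Simulation-based p-value, computed directly from the null distribution
null_dist %>%
  get_p_value(obs_stat = 3/62, direction = "less")

# Null distribution plot with the observed tail shaded
visualize(null_dist) +
  shade_p_value(obs_stat = 3/62, direction = "less")
```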
---

## How many simulations to use?

It's always a good idea to use more simulations when generating the null distribution. Let's try 5,000:

```r
null_dist <- organ_donor %>%
  specify(response = outcome, success = "complication") %>%
  hypothesize(null = "point",
              p = c("complication" = 0.10, "no complication" = 0.90)
              ) %>%
  generate(reps = 5000, type = "simulate") %>%
  calculate(stat = "prop")

null_dist %>%
  filter(stat <= (3/62)) %>%
  summarise(p_value = n() / nrow(null_dist))
```

```
#> # A tibble: 1 x 1
#>   p_value
#>     <dbl>
#> 1   0.124
```

--

<i>As an aside, the exact probability of observing 3 or fewer complications out of 62 (given independent patients who each have a complication rate of 10%) is approximately 0.121. You will learn how to calculate this p-value in further statistics courses!</i>

---

## Application exercise

Today's application exercise examines the proportion of Duke students who are left-handed. Go to the link below to start today's application exercise.

https://classroom.github.com/a/zHoibbM4

---

## References

1. Tidy Statistical Inference. (2020). Infer.netlify.com. Retrieved 29 February 2020, from https://infer.netlify.com/index.html
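---

## Appendix: the exact p-value

The aside earlier quoted an exact probability of about 0.121 for observing 3 or fewer complications out of 62 when each independent patient has a 10% complication rate. As a preview of later statistics courses, this comes from the binomial distribution, and base R computes it in one line:

```r
# P(X <= 3) for X ~ Binomial(n = 62, p = 0.10): the exact one-sided p-value
pbinom(3, size = 62, prob = 0.10)  # approximately 0.121
```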