class: center, middle, inverse, title-slide

# The Hypothesis Testing Framework

### Yue Jiang

### Duke University

---

## Review

.vocab[Population]: a group of individuals or objects we are interested in studying

.vocab[Parameter]: a numerical quantity derived from the population (almost always unknown)

If we had data from every unit in the population, we could just calculate population parameters and be done! **Unfortunately, we usually cannot do this.**

.vocab[Sample]: a subset of our population of interest

.vocab[Statistic]: a numerical quantity derived from a sample

If the sample is .vocab[representative], then we can use the tools of probability and statistical inference to draw .vocab[generalizable] conclusions about the broader population of interest.

---

## How can we answer research questions using statistics?

.question[
**Statistical hypothesis testing** is the procedure that assesses the evidence provided by the data in favor of or against some claim about the population (often about a population parameter or potential associations).
]

---

## Back to Asheville

.center[
Your friend claims that the mean price per guest per night for Airbnbs in Asheville is $100. **What do you make of this statement?**
]

---

## The hypothesis testing framework

1. Start with two hypotheses about the population: the null hypothesis and the alternative hypothesis.

2. Choose a (representative) sample, collect data, and analyze the data.

3. Figure out how likely it is to see data like what we observed, **IF** the null hypothesis were in fact true.

4. If our data would have been extremely unlikely if the null claim were true, then we reject it and deem the alternative claim worthy of further study. Otherwise, we cannot reject the null claim.

---

## Emperor Antoninus Pius

.center[
<img src="img/11/pius.jpg" width="50%" />
]

> *Ei incumbit probatio qui dicit, non qui negat*
> ("The burden of proof lies upon the one who asserts, not the one who denies.")

---

## Two competing hypotheses

The null hypothesis (often denoted `\(H_0\)`) states that "nothing unusual is happening" or "there is no relationship," etc.

On the other hand, the alternative hypothesis (often denoted `\(H_1\)` or `\(H_A\)`) states the opposite: that there *is* some sort of relationship (usually, this is what we want to check or really think is happening).

.question[
In statistical hypothesis testing, we always first assume that the null hypothesis is true, and then see whether we reject or fail to reject this claim.
]

---

## 1. Defining the hypotheses

The null and alternative hypotheses are defined for **parameters,** not statistics.

.question[
What will our null and alternative hypotheses be for this example?
]

--

- `\(H_0\)`: the true mean price per guest is $100 per night
- `\(H_1\)`: the true mean price per guest is NOT $100 per night

Expressed in symbols:

- `\(H_0: \mu = 100\)`
- `\(H_1: \mu \neq 100\)`

where `\(\mu\)` is the true population mean price per guest per night among Airbnb listings in Asheville.

---

## 2. Collecting and summarizing data

With these two hypotheses, we now take our sample and summarize the data. The choice of summary statistic depends on the type of data. In our example, we use the sample mean, `\(\bar{x} = 76.6\)`:

```r
asheville <- read_csv("data/asheville.csv")

asheville %>% 
  summarize(mean_price = mean(ppg))
```

```
## # A tibble: 1 x 1
##   mean_price
##        <dbl>
## 1       76.6
```
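As a side note, the sample mean alone says nothing about how much it might vary from sample to sample; a minimal sketch (the added column names are illustrative) that also reports the spread and sample size:

```r
# also report the sample SD and the sample size, which together
# govern how variable the sample mean is
asheville %>% 
  summarize(mean_price = mean(ppg),
            sd_price   = sd(ppg),
            n          = n())
```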
---

## 3. Assessing the evidence observed

Next, we calculate the probability of getting data like ours, *or more extreme*, if `\(H_0\)` were in fact actually true.

This is a conditional probability:

> Given that `\(H_0\)` is true (i.e., if `\(\mu\)` were *actually* 100), what would
> be the probability of observing `\(\bar{x} = 76.6\)`?

.question[
This probability is known as the **p-value**.
]

---

## 4. Making a conclusion

We reject the null hypothesis if this conditional probability is small enough. If it would be very unlikely to observe our data (or more extreme) when `\(H_0\)` is actually true, then that might give us enough evidence to suggest that it is actually false (and that `\(H_1\)` is true).

--

What is "small enough"?

- We often consider a numeric cutpoint (the .vocab[significance level]) defined *prior* to conducting the analysis.
- Many analyses use `\(\alpha = 0.05\)`. This means that if `\(H_0\)` were in fact true, we would expect to (incorrectly) reject it only 5% of the time.

If the p-value is less than `\(\alpha\)`, we say the results are .vocab[statistically significant]. In this case, we would make the decision to .vocab[reject the null hypothesis].

---

## What do we conclude when `\(p \ge \alpha\)`?

If the p-value is `\(\alpha\)` or greater, we say the results are not statistically significant and we .vocab[fail to reject] `\(H_0\)`.

Importantly, we never "accept" the null hypothesis: we performed the analysis assuming that `\(H_0\)` was true to begin with, and assessed the probability of seeing our observed data, or more extreme, under this assumption.

---

## Ok, so what **isn't** a p-value?

> *"A p-value of 0.05 means the null hypothesis has a probability of only 5% of*
> *being true"*

> *"A p-value of 0.05 means there is a 95% chance or greater that the null*
> *hypothesis is incorrect"*

--

# <center><span style="color:red">NO</span></center>

p-values do **not** provide information on the probability that the null hypothesis is true given our observed data.

---

## Ok, so what **isn't** a p-value?

Again, a p-value is calculated *assuming* that `\(H_0\)` is true. It cannot be used to tell us how likely that assumption is to be correct.

When we fail to reject the null hypothesis, we are stating that there is **insufficient evidence** to assert that it is false. This could be because...

- ... `\(H_0\)` actually *is* true!
- ... `\(H_0\)` is false, but we got unlucky and happened to get a sample that didn't give us enough reason to say that `\(H_0\)` was false.

To make matters worse, hypothesis testing does NOT give us the tools to determine which of the two scenarios occurred.

---

class: center, middle

# Conducting hypothesis tests

---

## Simulating the null distribution

Let's return to the Asheville data. We know that our sample mean was 76.6, but we also know that if we were to take another random sample of size 50 from all Airbnb listings, we might get a different sample mean. There is some variability in the .vocab[sampling distribution] of the mean, and we want to make sure we quantify it.

.question[
How might we quantify the sampling distribution of the mean using only the data that we have from our original sample?
]

---

## Bootstrap distribution of the mean

```r
n_sims <- 5000
boot_dist <- numeric(n_sims)

for (i in 1:n_sims) {
  # seed each iteration so the resamples are reproducible
  set.seed(i)
  # resample row indices with replacement
  indices <- sample(1:nrow(asheville), replace = TRUE)
  boot_mean <- asheville %>% 
    slice(indices) %>% 
    summarize(boot_mean = mean(ppg)) %>% 
    pull()
  boot_dist[i] <- boot_mean
}

boot_means <- tibble(boot_dist)

ggplot(data = boot_means, aes(x = boot_dist)) +
  geom_histogram(binwidth = 2, color = "darkblue", fill = "skyblue") +
  labs(x = "Price per night", y = "Count") + 
  geom_vline(xintercept = mean(boot_means$boot_dist), lwd = 2, color = "red")
```
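---

## Aside: a more compact bootstrap

A minimal sketch (an aside; `boot_means_alt` is an illustrative name) showing that the same kind of bootstrap distribution can be built without an explicit loop, using `purrr::map_dbl()` from the tidyverse:

```r
library(tidyverse)

set.seed(12345)
n_sims <- 5000

# resample the ppg values with replacement and take the mean, n_sims times
boot_means_alt <- tibble(
  boot_dist = map_dbl(1:n_sims, ~ mean(sample(asheville$ppg, replace = TRUE)))
)
```

Because only one variable is involved, resampling the values of `ppg` directly is equivalent to resampling row indices and then averaging.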
---

## Bootstrap distribution of the mean

*(figure: histogram of the 5,000 bootstrap sample means, with a vertical line at their mean)*

---

## Shifting the distribution

We've captured the variability in the sample mean among samples of size 50 from Asheville area Airbnbs, but remember that in the hypothesis testing paradigm, we must assess our observed evidence under the assumption that the null hypothesis is true.

```r
boot_means %>% 
  summarize(mean(boot_dist))
```

```
## # A tibble: 1 x 1
##   `mean(boot_dist)`
##               <dbl>
## 1              76.6
```

Remember,

- `\(H_0: \mu = 100\)`
- `\(H_1: \mu \neq 100\)`

.question[
Where should the bootstrap distribution of means be centered if in fact `\(H_0\)` were actually true?
]

---

## Shifting the distribution

```r
mu_0 <- 100

offset <- boot_means %>% 
* summarize(mu_0 - mean(boot_dist)) %>% 
  pull()

*boot_means <- boot_means %>% 
* mutate(shifted_means = boot_dist + offset)
```

If we shift the bootstrap distribution by `offset`, then it will be centered at `\(\mu_0\)`, the null-hypothesized value for the mean.

```r
*ggplot(data = boot_means, aes(x = shifted_means)) +
  geom_histogram(binwidth = 2, color = "darkblue", fill = "skyblue") +
  labs(x = "Price per night", y = "Count")
```

---

## Distribution of `\(\bar{x}\)` under `\(H_0\)`

*(figure: histogram of the shifted bootstrap means, centered at 100)*

If `\(H_0\)` were true and we repeatedly sampled from the population, then this is what we might expect if we calculated `\(\bar{x}\)` from these samples.

.question[
How might we calculate the p-value?
]

---

## Calculating the p-value

*(figure: the shifted bootstrap distribution, with two vertical lines marking values as far from 100 as the observed mean)*

.question[
Why are there two vertical lines depicted?
]

---

## Calculating the p-value

```r
obs_mean <- asheville %>% 
  summarize(mean(ppg)) %>% 
  pull()
obs_diff <- mu_0 - obs_mean

boot_means %>% 
  mutate(extreme = ifelse(shifted_means <= mu_0 - obs_diff |
                            shifted_means >= mu_0 + obs_diff, 1, 0)) %>% 
  count(extreme) %>% 
  mutate(prob = n / sum(n))
```

```
## # A tibble: 2 x 3
##   extreme     n   prob
##     <dbl> <int>  <dbl>
## 1       0  4992 0.998 
## 2       1     8 0.0016
```

Supposing that the true mean price per guest were $100 a night, only 8 out of 5,000 bootstrap sample means were as extreme as, or more extreme than, our originally observed sample mean price per guest of $76.6.

.question[
What is the p-value? What might we conclude?
]

---

## Your turn!

[https://classroom.github.com/a/15lb0lW_](https://classroom.github.com/a/15lb0lW_)
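---

## Appendix: the p-value in one step

As a supplementary sketch, the indicator-and-count computation above can be condensed: a shifted bootstrap mean is "as extreme or more so" exactly when its distance from `\(\mu_0\)` is at least the observed distance `\(|\bar{x} - \mu_0|\)`, and averaging that logical condition yields the proportion directly:

```r
# proportion of shifted bootstrap means at least as far from mu_0
# as the observed sample mean (the two-sided p-value)
boot_means %>% 
  summarize(p_value = mean(abs(shifted_means - mu_0) >= abs(obs_mean - mu_0)))
```

This should reproduce the 8 / 5000 = 0.0016 found above.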
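---

## Appendix: the classical counterpart

As a further aside, the same hypotheses are often tested with a one-sample t-test, which swaps the bootstrap for a closed-form approximation to the sampling distribution of `\(\bar{x}\)`; a minimal sketch:

```r
# two-sided one-sample t-test of H0: mu = 100
t.test(asheville$ppg, mu = 100)
```

Its p-value should be broadly in line with the simulation-based one, though the two approaches rest on different assumptions.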