class: center, middle, inverse, title-slide

# Hypothesis Testing

### Yue Jiang

### Duke University

---

### Statistical inference

.copper[Point estimation]: estimating an unknown parameter using a single number calculated from the sample

.copper[Interval estimation]: estimating an unknown parameter using a range of values that is likely to contain the true parameter

.copper[Hypothesis testing]: evaluating whether our observed sample data provide evidence against some claim about the population

---

### Why should we care about hypothesis testing?

<img src="pvalues.png" width="50%" style="display: block; margin: auto;" />

---

### Emperor Antoninus Pius

<img src="pius.jpg" width="50%" style="display: block; margin: auto;" />

.eno[**Ei incumbit probatio qui dicit, non qui negat**]

("The burden of proof lies upon the one who asserts, not the one who denies")

---

### The hypothesis testing framework

1. Start with two hypotheses about the population: the .copper[null hypothesis] and the .copper[alternative hypothesis]

2. Choose a sample, collect data, and analyze the data

3. Figure out how likely it would be to see data like what we observed, **if** the null hypothesis were true

4. If our data would have been extremely unlikely under the null claim, then we reject it and deem the alternative claim worthy of further study. Otherwise, we cannot reject the null claim

---

### Ultra-low dose contraception

.pull-left[
<img src="yaz.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
Oral contraceptive pills work well, but they must contain a precise dose of estrogen. If the dose is too high, women risk side effects such as headaches, nausea, and rare but potentially fatal blood clots. If the dose is too low, women may become pregnant.
]

---

### Ultra-low dose contraception

<img src="factory.jpg" width="70%" style="display: block; margin: auto;" />

A certain contraceptive pill is supposed to contain precisely 0.020 `\(\mu\)`g of estrogen. During quality control, 50 randomly selected pills are tested, with sample mean dose 0.017 `\(\mu\)`g and sample SD 0.008 `\(\mu\)`g.

.question[
Do you think this is cause for concern? Why or why not? (Don't worry about calculations for now)
]

---

### Two competing hypotheses

The null hypothesis, `\(H_0\)`, states that "nothing unusual is happening": there is no change from the status quo, no relationship, etc.

The alternative hypothesis, `\(H_A\)` or `\(H_1\)`, states the opposite: that there *is* some sort of change or relationship (usually, this is what we want to check or really think is happening)

Remember: in statistical hypothesis testing, we **always first assume the null hypothesis is true**, and then see whether we reject or fail to reject this claim

---

### Defining the null and alternative hypotheses

Stated in words:

- `\(H_0\)`: The pills are consistent with a population that has a mean of 0.020 `\(\mu\)`g estrogen
- `\(H_1\)`: The pills are not consistent with a population that has a mean of 0.020 `\(\mu\)`g estrogen

Stated in symbols:

- `\(H_0: \mu = 0.020\)`
- `\(H_1: \mu \neq 0.020\)`,

where `\(\mu\)` is the mean estrogen level of the manufactured pills, in `\(\mu\)`g

---

### Collecting and summarizing the data

With these two hypotheses, we now take a sample and summarize the data

The choice of .copper[summary statistic] depends on the type of data as well as its distribution

In our example, quality control technicians randomly selected a sample of 50 pills and calculated the sample mean `\(\bar{x} = 0.017\)` `\(\mu\)`g and sample standard deviation `\(s = 0.008\)` `\(\mu\)`g

---

### Assessing the evidence observed

Next, we calculate the probability of getting data like ours, or more extreme, if `\(H_0\)` were actually true

This is a conditional probability: *if `\(H_0\)` were true* (i.e., if `\(\mu\)` were truly 0.020), what would be the probability of observing a sample mean at least as far from 0.020 as `\(\bar{x} = 0.017\)`?

This probability is the .copper[p-value]
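---

### Assessing the evidence: a simulation sketch

To build intuition for this conditional probability, here is a minimal simulation sketch. It assumes, purely for illustration, that doses are normally distributed and that the population SD is known to be 0.008 `\(\mu\)`g (the t-based test on the following slides does not require the latter assumption):

```r
# Illustration only: simulate a world where H_0 holds, assuming a normal
# population with the null mean and a (hypothetically known) SD of 0.008
set.seed(42)
sample_means <- replicate(10000, mean(rnorm(50, mean = 0.020, sd = 0.008)))

# How often is a null-world sample mean at least as far from 0.020 as ours?
mean(abs(sample_means - 0.020) >= abs(0.017 - 0.020))   # roughly 0.008
```

This proportion approximates the two-sided p-value; the t-based calculation coming up, which accounts for estimating the SD from the sample, gives a similar answer of about 0.01.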
---

### Some philosophical details

The p-value we obtain is specific to the test itself, so the same data can result in different p-values or confidence intervals depending on which test is used (for instance, what if we assumed that `\(\sigma^2\)` were known?)

Importantly, we have assumed from the start that the null hypothesis is true, and the p-value is calculated conditionally on that event

p-values do **NOT** provide information on the probability that the null hypothesis is true given our observed data

---

### Making a conclusion

We reject the null hypothesis if the conditional probability of obtaining a test statistic as extreme as ours, or more extreme, given that `\(H_0\)` is true, is very small

What is very small? We often consider a cutpoint (the .copper[significance level] or `\(\alpha\)` level) defined prior to conducting the analysis

Many analyses use `\(\alpha = 0.05\)`: if `\(H_0\)` were in fact true, we would expect to make the wrong decision only 5% of the time (why?)

If the p-value is less than `\(\alpha\)`, we say the results are .copper[statistically significant] and we .copper[reject the null hypothesis]. On the other hand, if the p-value is `\(\alpha\)` or greater, we say the results are not statistically significant and we .copper[fail to reject] `\(H_0\)`

---

### But wait...

What if `\(p \ge \alpha\)`? We **never** "accept" the null hypothesis -- we assumed that `\(H_0\)` was true to begin with and assessed the probability of obtaining our test statistic (or more extreme) under this assumption

When we fail to reject the null hypothesis, we are stating that there is *insufficient evidence* to assert that it is false

---

### Two-sided tests of hypotheses

To conduct the hypothesis test, we use what we learned about the sampling distribution of the sample mean `\(\bar{X}\)`.

If the underlying population is normally distributed (or `\(n\)` is reasonably large), then the random variable

`\begin{align*} t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} \end{align*}`

has a `\(t_{n-1}\)` distribution

---

### Breaking down the test statistic

`\begin{align*} t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} \end{align*}`

- `\(\bar{X} - \mu_0\)` tells us how far our sample mean is from the hypothesized population mean

- Whether `\(\bar{X} - \mu_0\)` is big depends on the variability involved: a difference of `\(\bar{X} - \mu_0 = 1\)` is small if we are measuring weight in grams, but large if we are measuring height in meters. This is why we standardize the difference by the estimated SD of the mean, `\(s/\sqrt{n}\)`

Thus, the test statistic `\(t\)` estimates how many (estimated) SDs of the mean separate `\(\bar{X}\)` and `\(\mu_0\)`

---

### Getting the p-value graphically

(see board)

---

### Back to the oral contraceptives

As it turns out, the probability of observing a sample mean at least as far from 0.020 as 0.017 (with sample SD 0.008) in 50 pills, if `\(H_0\)` were actually true, is approximately 0.01.

.question[
What might we conclude?
]
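---

### Back to the oral contraceptives: checking the number

As a sanity check on that 0.01 figure, here is a minimal R sketch (any statistical software would do) using only the summary statistics:

```r
x_bar <- 0.017   # sample mean (micrograms)
s     <- 0.008   # sample standard deviation
n     <- 50      # sample size
mu_0  <- 0.020   # mean under H_0

# How many estimated SDs of the mean separate x_bar from mu_0?
t_stat <- (x_bar - mu_0) / (s / sqrt(n))
t_stat                              # about -2.65

# Two-sided p-value: both tails of the t distribution with n - 1 df
2 * pt(-abs(t_stat), df = n - 1)    # about 0.011
```

With the raw measurements in hand, `t.test(doses, mu = 0.020)` would carry out the same test in one call (`doses` here being a hypothetical vector of the 50 recorded doses).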
---

### What could go wrong?

Suppose we test the null hypothesis `\(H_0: \mu = \mu_0\)`. We could potentially make two types of errors:

| Decision | Truth: `\(\mu = \mu_0\)` | Truth: `\(\mu \neq \mu_0\)` |
| ----- | ------------- | ---------------- |
| Fail to reject `\(H_0\)` | .eno[Correct decision] | .copper[Type II Error] |
| Reject `\(H_0\)` | .copper[Type I Error] | .eno[Correct decision] |

- .copper[Type I Error]: rejecting `\(H_0\)` when it is actually true (falsely rejecting the null hypothesis)
- .copper[Type II Error]: not rejecting `\(H_0\)` when it is false (falsely failing to reject the null hypothesis)

While we of course want to know whether any one study is showing us something real, a Type I error, or a Type II error, hypothesis testing does NOT give us the tools to determine this

---

### Type I vs. Type II errors

.pull-left[
<img src="oraquickstick.jpg" width="30%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="pregnancy.jpg" width="30%" style="display: block; margin: auto;" />
]

---

### Different sets of hypotheses

We set up the hypotheses to cover *all* possibilities for `\(\mu\)`, and consider three forms:

- Two-sided: `\(H_0: \mu = \mu_0\)`; `\(H_1: \mu \neq \mu_0\)`
- One-sided (lower): `\(H_0: \mu \ge \mu_0\)`; `\(H_1: \mu < \mu_0\)`
- One-sided (upper): `\(H_0: \mu \le \mu_0\)`; `\(H_1: \mu > \mu_0\)`

---

### Why not use a one-sided test?

<img src="cast.png" width="100%" style="display: block; margin: auto;" />
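---

### One-sided vs. two-sided p-values

To make the contrast concrete for the pill example, here is a short sketch (an added illustration, not a recommendation to test one-sided here). A one-sided alternative puts the entire rejection region in one tail, so its p-value is half the two-sided one whenever the data fall in the hypothesized direction:

```r
t_stat <- (0.017 - 0.020) / (0.008 / sqrt(50))   # about -2.65, as before

# Two-sided test, H_1: mu != 0.020 -- counts both tails
2 * pt(-abs(t_stat), df = 49)    # about 0.011

# One-sided test, H_1: mu < 0.020 -- counts only the lower tail
pt(t_stat, df = 49)              # about 0.005, half the two-sided p-value
```

This is exactly why the direction of a one-sided test must be chosen *before* seeing the data: picking `\(H_1: \mu < 0.020\)` only after noticing `\(\bar{x} < 0.020\)` makes rejection look easier than it legitimately is.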