class: center, middle, inverse, title-slide

# The Hypothesis Testing Framework

### Yue Jiang

### Duke University

---

## Review

.vocab[Population]: a group of individuals or objects we are interested in studying

.vocab[Parameter]: a numerical quantity derived from the population (almost always unknown)

If we had data from every unit in the population, we could just calculate population parameters and be done! **Unfortunately, we usually cannot do this.**

.vocab[Sample]: a subset of our population of interest

.vocab[Statistic]: a numerical quantity derived from a sample

If the sample is .vocab[representative], then we can use the tools of probability and statistical inference to draw .vocab[generalizable] conclusions about the broader population of interest.

---

## How can we answer research questions using statistics?

.question[
**Statistical hypothesis testing** is the procedure that assesses the evidence provided by the data in favor of or against some claim about the population (often about a population parameter or potential associations).
]

---

## Back to Asheville

.center[
Your friend claims that the mean price per guest per night for Airbnbs in Asheville is $100. **What do you make of this statement?**
]

---

## The hypothesis testing framework

1. Start with two hypotheses about the population: the null hypothesis and the alternative hypothesis.

2. Choose a (representative) sample, collect data, and analyze the data.

3. Figure out how likely it is to see data like what we observed, **IF** the null hypothesis were in fact true.

4. If our data would have been extremely unlikely if the null claim were true, then we reject it and deem the alternative claim worthy of further study. Otherwise, we cannot reject the null claim.

---

## Emperor Antoninus Pius

.center[
<img src="img/11/pius.jpg" width="50%" />
]

> *Ei incumbit probatio qui dicit, non qui negat*
> ("The burden of proof lies upon the one who asserts, not the one who denies.")

---

## Two competing hypotheses

The null hypothesis (often denoted `\(H_0\)`) states that "nothing unusual is happening" or "there is no relationship," etc.

On the other hand, the alternative hypothesis (often denoted `\(H_1\)` or `\(H_A\)`) states the opposite: that there *is* some sort of relationship (usually, this is what we want to check or really think is happening).

.question[
In statistical hypothesis testing, we always first assume that the null hypothesis is true, and then see whether we reject or fail to reject this claim.
]

---

## 1. Defining the hypotheses

The null and alternative hypotheses are defined for **parameters,** not statistics.

.question[
What will our null and alternative hypotheses be for this example?
]

--

- `\(H_0\)`: the true mean price per guest is $100 per night
- `\(H_1\)`: the true mean price per guest is NOT $100 per night

Expressed in symbols:

- `\(H_0: \mu = 100\)`
- `\(H_1: \mu \neq 100\)`

where `\(\mu\)` is the true population mean price per guest per night among Airbnb listings in Asheville.

---

## 2. Collecting and summarizing data

With these two hypotheses, we now take our sample and summarize the data. The choice of summary statistic depends on the type of data. In our example, we use the sample mean, `\(\bar{x} = 76.6\)`:

```r
asheville <- read_csv("data/asheville.csv")

asheville %>% 
  summarize(mean_price = mean(ppg))
```

```
## # A tibble: 1 x 1
##   mean_price
##        <dbl>
## 1       76.6
```
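As a side note, the sample mean alone says nothing about how much it might vary from sample to sample; a minimal sketch (the added column names are illustrative) that also reports the spread and sample size:

```r
# also report the sample SD and the sample size, which together
# govern how variable the sample mean is
asheville %>% 
  summarize(mean_price = mean(ppg),
            sd_price   = sd(ppg),
            n          = n())
```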
---

## 3. Assessing the evidence observed

Next, we calculate the probability of getting data like ours, *or more extreme*, if `\(H_0\)` were in fact actually true.

This is a conditional probability:

> Given that `\(H_0\)` is true (i.e., if `\(\mu\)` were *actually* 100), what would
> be the probability of observing `\(\bar{x} = 76.6\)`?

.question[
This probability is known as the **p-value**.
]

---

## 4. Making a conclusion

We reject the null hypothesis if this conditional probability is small enough. If it would be very unlikely to observe our data (or more extreme) when `\(H_0\)` is actually true, then that might give us enough evidence to suggest that it is actually false (and that `\(H_1\)` is true).

--

What is "small enough"?

- We often consider a numeric cutpoint (the .vocab[significance level]) defined *prior* to conducting the analysis.
- Many analyses use `\(\alpha = 0.05\)`. This means that if `\(H_0\)` were in fact true, we would expect to (incorrectly) reject it only 5% of the time.

If the p-value is less than `\(\alpha\)`, we say the results are .vocab[statistically significant]. In this case, we would make the decision to .vocab[reject the null hypothesis].

---

## What do we conclude when `\(p \ge \alpha\)`?

If the p-value is `\(\alpha\)` or greater, we say the results are not statistically significant and we .vocab[fail to reject] `\(H_0\)`.

Importantly, we never "accept" the null hypothesis: we performed the analysis assuming that `\(H_0\)` was true to begin with, and assessed the probability of seeing our observed data, or more extreme, under this assumption.

---

## Ok, so what **isn't** a p-value?

> *"A p-value of 0.05 means the null hypothesis has a probability of only 5% of*
> *being true"*

> *"A p-value of 0.05 means there is a 95% chance or greater that the null*
> *hypothesis is incorrect"*

--

# <center><span style="color:red">NO</span></center>

p-values do **not** provide information on the probability that the null hypothesis is true given our observed data.

---

## Ok, so what **isn't** a p-value?

Again, a p-value is calculated *assuming* that `\(H_0\)` is true. It cannot be used to tell us how likely that assumption is to be correct.

When we fail to reject the null hypothesis, we are stating that there is **insufficient evidence** to assert that it is false. This could be because...

- ... `\(H_0\)` actually *is* true!
- ... `\(H_0\)` is false, but we got unlucky and happened to get a sample that didn't give us enough reason to say that `\(H_0\)` was false.

To make matters worse, hypothesis testing does NOT give us the tools to determine which of the two scenarios occurred.

---

class: center, middle

# Conducting hypothesis tests

---

## Simulating the null distribution

Let's return to the Asheville data. We know that our sample mean was 76.6, but we also know that if we were to take another random sample of size 50 from all Airbnb listings, we might get a different sample mean. There is some variability in the .vocab[sampling distribution] of the mean, and we want to make sure we quantify it.

.question[
How might we quantify the sampling distribution of the mean using only the data that we have from our original sample?
]

---

## Bootstrap distribution of the mean

```r
n_sims <- 5000
boot_dist <- numeric(n_sims)

for (i in 1:n_sims) {
  # seed each iteration so the resamples are reproducible
  set.seed(i)
  # resample row indices with replacement
  indices <- sample(1:nrow(asheville), replace = TRUE)
  boot_mean <- asheville %>% 
    slice(indices) %>% 
    summarize(boot_mean = mean(ppg)) %>% 
    pull()
  boot_dist[i] <- boot_mean
}

boot_means <- tibble(boot_dist)

ggplot(data = boot_means, aes(x = boot_dist)) +
  geom_histogram(binwidth = 2, color = "darkblue", fill = "skyblue") +
  labs(x = "Price per night", y = "Count") + 
  geom_vline(xintercept = mean(boot_means$boot_dist), lwd = 2, color = "red")
```
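---

## Aside: a more compact bootstrap

A minimal sketch (an aside; `boot_means_alt` is an illustrative name) showing that the same kind of bootstrap distribution can be built without an explicit loop, using `purrr::map_dbl()` from the tidyverse:

```r
library(tidyverse)

set.seed(12345)
n_sims <- 5000

# resample the ppg values with replacement and take the mean, n_sims times
boot_means_alt <- tibble(
  boot_dist = map_dbl(1:n_sims, ~ mean(sample(asheville$ppg, replace = TRUE)))
)
```

Because only one variable is involved, resampling the values of `ppg` directly is equivalent to resampling row indices and then averaging.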
---

## Bootstrap distribution of the mean

*(figure: histogram of the 5,000 bootstrap sample means, with a vertical line at their mean)*

---

## Shifting the distribution

We've captured the variability in the sample mean among samples of size 50 from Asheville area Airbnbs, but remember that in the hypothesis testing paradigm, we must assess our observed evidence under the assumption that the null hypothesis is true.

```r
boot_means %>% 
  summarize(mean(boot_dist))
```

```
## # A tibble: 1 x 1
##   `mean(boot_dist)`
##               <dbl>
## 1              76.6
```

Remember,

- `\(H_0: \mu = 100\)`
- `\(H_1: \mu \neq 100\)`

.question[
Where should the bootstrap distribution of means be centered if in fact `\(H_0\)` were actually true?
]

---

## Shifting the distribution

```r
mu_0 <- 100

offset <- boot_means %>% 
* summarize(mu_0 - mean(boot_dist)) %>% 
  pull()

*boot_means <- boot_means %>% 
* mutate(shifted_means = boot_dist + offset)
```

If we shift the bootstrap distribution by `offset`, then it will be centered at `\(\mu_0\)`, the null-hypothesized value for the mean.

```r
*ggplot(data = boot_means, aes(x = shifted_means)) +
  geom_histogram(binwidth = 2, color = "darkblue", fill = "skyblue") +
  labs(x = "Price per night", y = "Count")
```

---

## Distribution of `\(\bar{x}\)` under `\(H_0\)`

*(figure: histogram of the shifted bootstrap means, centered at 100)*

If `\(H_0\)` were true and we repeatedly sampled from the population, then this is what we might expect if we calculated `\(\bar{x}\)` from these samples.

.question[
How might we calculate the p-value?
]

---

## Calculating the p-value

*(figure: the shifted bootstrap distribution, with two vertical lines marking values as far from 100 as the observed mean)*

.question[
Why are there two vertical lines depicted?
]

---

## Calculating the p-value

```r
obs_mean <- asheville %>% 
  summarize(mean(ppg)) %>% 
  pull()
obs_diff <- mu_0 - obs_mean

boot_means %>% 
  mutate(extreme = ifelse(shifted_means <= mu_0 - obs_diff |
                            shifted_means >= mu_0 + obs_diff, 1, 0)) %>% 
  count(extreme) %>% 
  mutate(prob = n / sum(n))
```

```
## # A tibble: 2 x 3
##   extreme     n   prob
##     <dbl> <int>  <dbl>
## 1       0  4992 0.998 
## 2       1     8 0.0016
```

Supposing that the true mean price per guest were $100 a night, only 8 out of 5,000 bootstrap sample means were as extreme as, or more extreme than, our originally observed sample mean price per guest of $76.6.

.question[
What is the p-value? What might we conclude?
]

---

## Your turn!

[https://classroom.github.com/a/15lb0lW_](https://classroom.github.com/a/15lb0lW_)
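---

## Appendix: the p-value in one step

As a supplementary sketch, the indicator-and-count computation above can be condensed: a shifted bootstrap mean is "as extreme or more so" exactly when its distance from `\(\mu_0\)` is at least the observed distance `\(|\bar{x} - \mu_0|\)`, and averaging that logical condition yields the proportion directly:

```r
# proportion of shifted bootstrap means at least as far from mu_0
# as the observed sample mean (the two-sided p-value)
boot_means %>% 
  summarize(p_value = mean(abs(shifted_means - mu_0) >= abs(obs_mean - mu_0)))
```

This should reproduce the 8 / 5000 = 0.0016 found above.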
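---

## Appendix: the classical counterpart

As a further aside, the same hypotheses are often tested with a one-sample t-test, which swaps the bootstrap for a closed-form approximation to the sampling distribution of `\(\bar{x}\)`; a minimal sketch:

```r
# two-sided one-sample t-test of H0: mu = 100
t.test(asheville$ppg, mu = 100)
```

Its p-value should be broadly in line with the simulation-based one, though the two approaches rest on different assumptions.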