class: center, middle, inverse, title-slide # Hypothesis Testing ### Yue Jiang ### STA 210 / Duke University / Spring 2024 --- ### Statistical inference A process that converts data into useful information, whereby practitioners - form a question of interest, - collect, summarize, and analyze the data, - and interpret the results --- ### Statistical inference The .vocab[population] is the group we'd like to learn something about. If we had data from every unit in the population, we could just calculate what we wanted and be done! Unfortunately, we (usually) have to settle for a .vocab[sample] from the population. Ideally, the sample is .vocab[representative], allowing us to use .vocab[probability and statistical inference] to make conclusions that are .vocab[generalizable] to the broader population of interest. We want to make inferences regarding population .vocab[parameters], which we do with .vocab[sample statistics]. --- ### *Ei incumbit probatio qui dicit* <img src="img/pius.jpg" width="50%" style="display: block; margin: auto;" /> --- ### The hypothesis testing framework 1. Start with two hypotheses about the population: the .vocab[null hypothesis] and the .vocab[alternative hypothesis] 2. Choose a sample, collect data, and analyze the data 3. Figure out how likely it is to see data like what we got/observed, IF the null hypothesis were true 4. If our data would have been extremely unlikely if the null claim were true, then we reject it and deem the alternative claim worthy of further study. Otherwise, we cannot reject the null claim --- ### The sampling distribution of the mean Suppose we're interested in the resting heart rate of students at Duke, and are able to do the following: 1. Take a random sample of size `\(n\)` from this population - Calculate the mean resting heart rate *in this sample*, `\(\bar{x}_1\)` 2.
Put the sample back, take a second random sample of size `\(n\)` - Calculate the mean resting heart rate from this new sample, `\(\bar{x}_2\)` 3. Put the sample back, take a third random sample of size `\(n\)` - Calculate the mean resting heart rate from this sample, too... 4. ...and so on. After repeating this `\(M\)` many times, we have a dataset of `\(M\)` sample means: `\(\{ \bar{x}_1, \bar{x}_2, \cdots, \bar{x}_M\}\)`. .question[ Can we say anything about the distribution of these sample means? ] --- ### This guy again...same book! <br> <img src="img/normal.png" width="100%" style="display: block; margin: auto;" /> --- ### The central limit theorem The .vocab[central limit theorem] states that for a population with a well-defined mean and standard deviation: 1. The mean of the sampling distribution is identical to the population mean 2. The standard deviation of the distribution of these sample averages, the .vocab[standard error] (.vocab[SE]) of the mean, is related to the population standard deviation and gets smaller as the sample size `\(n\)` gets larger: `\(\sigma/\sqrt{n}\)` 3. As the sample size `\(n\)` gets larger and larger, the shape of the sampling distribution becomes closer and closer to the normal (Gaussian) distribution -- **Importantly,** the central limit theorem tells us that **sample averages** are normally distributed if we have enough data. This is true *even if* our original variables are not normally distributed. [Interactive central limit theorem demonstration](http://onlinestatbook.com/stat_sim/sampling_dist/) --- ### The normal (Gaussian) distribution <img src="testing_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> --- ### What if we don't know `\(\sigma\)`?
<img src="img/guinness.jpg" width="100%" style="display: block; margin: auto;" /> --- ### The t distribution <img src="testing_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- ### The hypothesis testing framework 1. Start with two hypotheses about the population: the .vocab[null hypothesis] and the .vocab[alternative hypothesis] 2. Choose a sample, collect data, and analyze the data 3. Figure out how likely it is to see data like what we got/observed, IF the null hypothesis were true 4. If our data would have been extremely unlikely if the null claim were true, then we reject it and deem the alternative claim worthy of further study. Otherwise, we cannot reject the null claim --- ### Do pricy jeans have differentially sized pockets? ```r jeans |> filter(menWomen == "women") |> ggplot(aes(x = maxHeightFront, y = price)) + geom_point() + labs(x = "Max front pocket height (cm.)", y = "Price (dollars)", title = "Evidence for relationship between price and pockets...?") + geom_smooth(method = "lm", se = FALSE) + theme_bw() ``` --- ### Do pricy jeans have differentially sized pockets? <img src="testing_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> --- ### Setting up the hypotheses First, what parameter(s) do we *actually* care about in the linear model? `\(Price_i = \beta_0 + \beta_1(Height)_i + \epsilon_i\)` -- - `\(H_0: \beta_1 = 0\)` - `\(H_1: \beta_1 \neq 0\)` .question[ What are these null and alternative hypotheses saying *in words*?
] --- ### Collecting our data ```r jeans |> filter(menWomen == "women") |> select(price, maxHeightFront) ``` ``` ## price maxHeightFront ## 1 42.00 14.5 ## 2 42.00 14.5 ## 3 89.50 13.0 ## 4 89.50 13.0 ## 5 39.90 13.0 ## 6 39.90 15.5 ## 7 79.50 12.0 ## 8 69.50 14.0 ## 9 99.00 13.0 ## 10 79.50 15.0 ## 11 54.50 12.5 ## 12 54.50 12.2 ## 13 19.94 13.7 ## 14 19.94 13.5 ## 15 199.00 13.0 ## 16 159.00 14.0 ## 17 179.00 14.0 ## 18 249.00 14.5 ## 19 98.00 11.5 ## 20 89.00 12.5 ## 21 69.50 12.5 ## 22 69.50 15.0 ## 23 79.90 15.7 ## 24 79.90 15.5 ## 25 49.99 14.5 ## 26 9.99 19.0 ## 27 29.99 12.0 ## 28 29.99 14.5 ## 29 125.00 14.0 ## 30 110.00 11.5 ## 31 69.95 14.0 ## 32 69.95 13.0 ## 33 39.95 19.0 ## 34 39.95 16.0 ## 35 88.00 21.5 ## 36 78.00 18.0 ## 37 99.00 14.0 ## 38 99.00 15.0 ## 39 89.95 15.0 ## 40 92.95 14.5 ``` --- ### Collecting our data <img src="testing_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- ### What do we expect if the null is true? `\begin{align*} t_{n-2} = \frac{\hat{\beta}_1 - \beta_{1, H_0}}{SE(\hat{\beta}_1)} \end{align*}` --- ### What do we expect if the null is true? ```r options(digits = 3) linear_reg() |> set_engine("lm") |> fit(price ~ maxHeightFront, data = jeans |> filter(menWomen == "women")) |> tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 131. 54.0 2.42 0.0204 ## 2 maxHeightFront -3.52 3.73 -0.943 0.352 ``` --- ### What do we expect if the null is true? <img src="testing_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> --- ### Making a conclusion We reject the null hypothesis if the conditional probability of obtaining a test statistic as extreme as ours, or more extreme, given that the null hypothesis is true, is very small. What is very small?
We often consider a cutpoint (the .vocab[significance level] or `\(\alpha\)` level) defined prior to conducting the analysis Many analyses use `\(\alpha = 0.05\)`: if `\(H_0\)` were in fact true, we would expect to make the wrong decision only 5% of the time (why?) If the .vocab[p-value] is less than `\(\alpha\)`, we say the results are .copper[statistically significant] and we .copper[reject the null hypothesis]. On the other hand, if the p-value is `\(\alpha\)` or greater, we say the results are not statistically significant and .vocab[fail to reject] `\(H_0\)`. --- ### So what do we conclude? ```r options(digits = 3) linear_reg() |> set_engine("lm") |> fit(price ~ maxHeightFront, data = jeans |> filter(menWomen == "women")) |> tidy() ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 131. 54.0 2.42 0.0204 ## 2 maxHeightFront -3.52 3.73 -0.943 0.352 ``` --- ### But wait... What if `\(p \ge \alpha\)`? We **never** "accept" the null hypothesis - we assumed that `\(H_0\)` was true to begin with and assessed the probability of obtaining our test statistic (or more extreme) under this assumption When we fail to reject the null hypothesis, we are stating that there is *insufficient evidence* to assert that it is false -- Importantly, we have assumed from the start that the null hypothesis is true, and the p-value is calculated conditional on that event p-values do **NOT** provide information on the probability that the null hypothesis is true given our observed data --- ### Some philosophical details The p-value we obtain is specific to the test itself, so use of the same data can result in different p-values or confidence intervals depending on which test is used. --- ### What do we expect if the null is true?
``` ## price maxHeightFront ## 1 42.00 13.0 ## 2 42.00 15.0 ## 3 89.50 14.5 ## 4 89.50 16.0 ## 5 39.90 15.7 ## 6 39.90 13.5 ## 7 79.50 14.5 ## 8 69.50 19.0 ## 9 99.00 12.5 ## 10 79.50 13.0 ## 11 54.50 15.0 ## 12 54.50 15.0 ## 13 19.94 21.5 ## 14 19.94 11.5 ## 15 199.00 12.0 ## 16 159.00 13.0 ## 17 179.00 13.0 ## 18 249.00 14.0 ## 19 98.00 13.0 ## 20 89.00 14.5 ## 21 69.50 14.5 ## 22 69.50 14.0 ## 23 79.90 11.5 ## 24 79.90 14.5 ## 25 49.99 12.0 ## 26 9.99 12.5 ## 27 29.99 18.0 ## 28 29.99 12.2 ## 29 125.00 15.0 ## 30 110.00 14.5 ## 31 69.95 13.0 ## 32 69.95 15.5 ## 33 39.95 14.0 ## 34 39.95 14.0 ## 35 88.00 15.5 ## 36 78.00 12.5 ## 37 99.00 14.0 ## 38 99.00 19.0 ## 39 89.95 14.0 ## 40 92.95 13.7 ``` --- ### What do we expect if the null is true? ``` ## price maxHeightFront ## 1 42.00 12.5 ## 2 42.00 13.0 ## 3 89.50 15.5 ## 4 89.50 15.0 ## 5 39.90 13.0 ## 6 39.90 14.0 ## 7 79.50 14.0 ## 8 69.50 14.0 ## 9 99.00 16.0 ## 10 79.50 12.2 ## 11 54.50 19.0 ## 12 54.50 13.0 ## 13 19.94 14.5 ## 14 19.94 12.5 ## 15 199.00 14.5 ## 16 159.00 13.0 ## 17 179.00 15.0 ## 18 249.00 14.0 ## 19 98.00 14.5 ## 20 89.00 11.5 ## 21 69.50 21.5 ## 22 69.50 12.0 ## 23 79.90 19.0 ## 24 79.90 11.5 ## 25 49.99 15.7 ## 26 9.99 13.0 ## 27 29.99 12.0 ## 28 29.99 14.0 ## 29 125.00 18.0 ## 30 110.00 12.5 ## 31 69.95 13.7 ## 32 69.95 14.5 ## 33 39.95 15.5 ## 34 39.95 14.5 ## 35 88.00 13.0 ## 36 78.00 15.0 ## 37 99.00 14.5 ## 38 99.00 15.0 ## 39 89.95 14.0 ## 40 92.95 13.5 ``` --- ### What do we expect if the null is true? 
``` ## price maxHeightFront ## 1 42.00 13.0 ## 2 42.00 12.2 ## 3 89.50 18.0 ## 4 89.50 14.0 ## 5 39.90 14.0 ## 6 39.90 12.5 ## 7 79.50 15.0 ## 8 69.50 15.0 ## 9 99.00 14.5 ## 10 79.50 14.5 ## 11 54.50 14.0 ## 12 54.50 15.0 ## 13 19.94 13.7 ## 14 19.94 13.0 ## 15 199.00 19.0 ## 16 159.00 12.0 ## 17 179.00 13.0 ## 18 249.00 15.5 ## 19 98.00 15.0 ## 20 89.00 14.5 ## 21 69.50 14.0 ## 22 69.50 11.5 ## 23 79.90 19.0 ## 24 79.90 15.5 ## 25 49.99 12.5 ## 26 9.99 13.0 ## 27 29.99 16.0 ## 28 29.99 12.5 ## 29 125.00 12.0 ## 30 110.00 13.0 ## 31 69.95 14.5 ## 32 69.95 11.5 ## 33 39.95 14.0 ## 34 39.95 14.0 ## 35 88.00 13.0 ## 36 78.00 14.5 ## 37 99.00 14.5 ## 38 99.00 15.7 ## 39 89.95 21.5 ## 40 92.95 13.5 ``` --- ### What do we expect if the null is true? <img src="testing_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> --- ### What do we expect if the null is true? <img src="testing_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> --- ### What do we expect if the null is true? <img src="testing_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> --- ### What do we expect if the null is true? <img src="testing_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> --- ### Assessing the evidence (again) (in this case, 380 out of 1000 permuted slopes are as extreme as, or more extreme than, the slope observed in our data). --- ### What could go wrong? Suppose we test the null hypothesis `\(H_0: \mu = \mu_0\)`.
We could potentially make two types of errors: | Decision | Truth: `\(\mu = \mu_0\)` | Truth: `\(\mu \neq \mu_0\)` | | ----- | ------------- | ---------------- | | Fail to reject `\(H_0\)` | .eno[Correct decision] | .copper[Type II Error] | | Reject `\(H_0\)` | .copper[Type I Error] | .eno[Correct decision] | - .copper[Type I Error]: rejecting `\(H_0\)` when it is actually true (falsely rejecting the null hypothesis) - .copper[Type II Error]: failing to reject `\(H_0\)` when it is false (falsely failing to reject the null hypothesis) While we of course want to know whether any one study is showing us something real or a Type I or Type II error, hypothesis testing does NOT give us the tools to determine this.
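---

### Seeing the Type I error rate in action

The Type I error rate can be seen directly in simulation. Here is a minimal sketch (using simulated data, not the jeans dataset; the names `rejects`, `x`, and `y` are illustrative): generate a predictor and a response that are truly unrelated, so `\(H_0\)` holds, and count how often a level-0.05 test for the slope rejects.

```r
# When H0 is true, a test at alpha = 0.05 should reject about 5% of the time
set.seed(210)                    # arbitrary seed, for reproducibility
rejects <- replicate(1000, {
  x <- rnorm(40)                 # predictor
  y <- rnorm(40)                 # response, generated independently of x
  p <- summary(lm(y ~ x))$coefficients[2, 4]  # p-value for the slope
  p < 0.05
})
mean(rejects)                    # close to 0.05
```

The long-run rejection rate hovers near `\(\alpha\)` exactly because that is what `\(\alpha\)` means: the probability of falsely rejecting a true null.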
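---

### Where did that p-value come from?

As a quick sanity check (a sketch, not part of the original analysis), we can recover the slope's p-value by hand from the regression output shown earlier: the t statistic was `\(-0.943\)`, and with `\(n = 40\)` jeans the degrees of freedom are `\(n - 2 = 38\)`.

```r
# Two-sided p-value: the probability of a t statistic at least this
# extreme under the null, using the t distribution with 38 df
t_stat <- -0.943
p_value <- 2 * pt(-abs(t_stat), df = 38)
p_value  # approximately 0.352, matching the tidy() output
```

This is exactly the calculation `tidy()` performs behind the scenes for each coefficient.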