class: center, middle, inverse, title-slide

# Comparing multiple means

### Yue Jiang

### Duke University

---

### Review: hypothesis testing steps

1. State the null and alternative hypotheses. The null hypothesis states that "nothing unusual is happening," and the alternative challenges it.
2. Collect relevant data and summarize it.
3. Assess how surprising it would be to see data like that **if the null hypothesis were really true**.
4. Draw conclusions.

---

### Two-sample t-tests

Last time, we compared one sample to a hypothesized population value.

.question[
What if we wanted to compare two samples to each other?
]

---

### Independent samples

The type of t-test we use to compare two means depends on how the samples were obtained.

One approach would be to obtain two independent samples and test the equality of the means `\(\mu_1\)` and `\(\mu_2\)`.

<img src="independent.png" width="50%" style="display: block; margin: auto;" />

---

### Paired or matched samples

An alternative would be to obtain paired or matched samples and test the equality of the means `\(\mu_1\)` and `\(\mu_2\)`. Matching could be within a person (e.g., before and after measures) or between individuals who belong together in some other way (e.g., born on the same date in the same hospital; husband and wife; etc.).

<img src="paired.png" width="50%" style="display: block; margin: auto;" />

---

### Paired samples

Samples are often paired for a variety of reasons:

- Measurements are taken on a single subject at two distinct points in time (e.g., baseline and follow-up)
- Subjects may be matched so that the members of each pair are as much alike as possible with respect to important characteristics such as age and gender (e.g., a matched case-control study)

Pairing can control for unwanted sources of variation that might otherwise influence the results of a comparison. Matching within subject (e.g., baseline and follow-up) is a powerful way to eliminate subject-specific factors.

---

### Designing a study: impaired driving

The Department of Motor Vehicles wishes to compare the impairment of drivers while texting to their impairment after being sleep deprived for 24 hours.

.question[
Describe an independent samples design and a matched pairs design for this question of interest.
]

---

### Case study: licorice and surgery

Ruetzler et al. (2013) performed an experiment among patients having surgery who required intubation. Prior to anesthesia, patients were randomly assigned to gargle either a licorice-based solution or sugar water (as a placebo). Sore throat was evaluated 30 minutes, 90 minutes, and 4 hours after the conclusion of the surgery on a pain scale from 0 to 10 (0 = no pain; 10 = worst pain).

.question[
Let's evaluate whether gargling licorice before surgery led to different mean pain scores when swallowing, 30 minutes after arrival in the PACU (post-anesthesia care unit).
]

---

### Case study: hypothesis testing step 1

The null hypothesis is that patients receiving the licorice gargle and those receiving the sugar-water placebo have the same mean pain score when swallowing 30 minutes after arrival in the PACU (treatment is unrelated to mean pain), while the alternative is that they do not.

.question[
What are the null and alternative hypotheses written out in symbols?
]

---

### Case study: hypothesis testing step 2

The researchers enrolled 233 subjects: 116 receiving placebo and 117 receiving licorice.

Analyzing the data, we obtained `\(\bar{x}_L = 0.307, \bar{x}_S = 1.379, s_L = 0.825, s_S = 2.287\)`.

---

### Two-sample t-test, independent samples

The two-sample t-test for independent samples is given by

`\begin{align*}
t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
\end{align*}`

The degrees of freedom of the `\(t\)` statistic depend on whether or not `\(\sigma_1 = \sigma_2\)`.

---

### Equal or unequal variances?

The choice of `\(df\)` depends on whether the independent samples have the same or different variances.

If the variances are equal, then we can use a pooled estimate of `\(s^2\)`, and the degrees of freedom are given by `\((n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2.\)`

If the variances are unequal, the degrees of freedom are difficult to derive, and the Satterthwaite approximation is often used (use software).

Unequal variances should be the default choice, as the t-test assuming equal variances can be quite unreliable if the variances differ, especially when the group sizes differ as well.

---

### Case study: hypothesis testing step 3

Carrying out the two-sample t-test for independent samples with unequal variances using software, we get `\(t = 4.75\)` with `\(df \approx 144.21\)` and a corresponding p-value `\(<\)` 0.001 (you will learn how to carry this out in R on your homework).
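As a preview, here is a minimal sketch of the calculation using only the summary statistics above (the object names are ours; with the raw data, a single call to `t.test()` with `var.equal = FALSE` does all of this at once):

```r
# Welch (unequal-variance) two-sample t-test from summary statistics
xbar_L <- 0.307; s_L <- 0.825; n_L <- 117   # licorice group
xbar_S <- 1.379; s_S <- 2.287; n_S <- 116   # sugar-water (placebo) group

se <- sqrt(s_L^2 / n_L + s_S^2 / n_S)    # standard error of the difference
t_stat <- (xbar_L - xbar_S) / se         # -4.75; the sign depends on group order

# Satterthwaite approximation to the degrees of freedom (about 144.21)
df <- se^4 / ((s_L^2 / n_L)^2 / (n_L - 1) + (s_S^2 / n_S)^2 / (n_S - 1))

2 * pt(-abs(t_stat), df)                 # two-sided p-value, well below 0.001
```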
---

### Case study: hypothesis testing step 4

Based on our observed data, we conclude that there is evidence of a difference in mean pain score between the two groups. In particular, we have evidence that those receiving the licorice gargle before their surgery reported a lower mean pain score than the placebo patients.

<img src="licorice.jpg" width="50%" style="display: block; margin: auto;" />

---

### Case study: body temperature

You may believe that normal body temperature is 98.6 degrees Fahrenheit. A 1992 *JAMA* article examined this assumption and published temperature data from a large cohort of people. We aim to answer whether the mean body temperature is different for men and women.

Describe the null and alternative hypotheses, explain how you could carry out a specific test of interest and what information you would need, and explain how you would draw a conclusion in light of the observed data.

In the actual article, the sample mean for women was 98.4 and the sample mean for men was 98.1. The t-statistic (with 127.51 degrees of freedom) was found to be 2.285, corresponding to a two-sided p-value of 0.024.

.question[
What might we conclude?
]

---

### Paired sample t-test

The paired sample t-test is easy to carry out. All we do is create a new outcome variable, `\(d\)`, that contains the differences in outcomes between the members of each pair. Then we analyze the differences `\(d\)` using the usual one-sample t-test, as sketched below.
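A minimal sketch with made-up numbers (ten hypothetical before/after measurements on the same subjects; none of these values come from real data):

```r
# Hypothetical before/after measurements on the same ten subjects
before <- c(14.2, 13.8, 15.1, 14.6, 13.9, 14.4, 15.0, 14.1, 13.7, 14.8)
after  <- c(13.9, 13.9, 14.7, 14.2, 13.6, 14.5, 14.6, 13.8, 13.5, 14.4)

# The paired t-test...
t.test(before, after, paired = TRUE)

# ...is identical to the one-sample t-test on the differences d
t.test(before - after, mu = 0)
```

Both calls return exactly the same t statistic, degrees of freedom (here 9), and p-value.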
---

### Case study: athletic training

A school athletics department has hired a new instructor and wants to test the effectiveness of the proposed new type of training by comparing the average 100-meter times (in seconds) of 10 runners before and after the new training is implemented.

.question[
What type of t-test might be best for this scenario?
]

---

### Puppies!

<img src="puppies.jpg" width="50%" style="display: block; margin: auto;" />

---

### Distribution of heart rate data

<img src="lec-11_files/figure-html/unnamed-chunk-6-1.png" width="80%" style="display: block; margin: auto;" />

.question[
How do we compare across three groups? Which groups are different?
]

---

### Example: pets and stress

We are interested in testing

`\begin{align*}
H_0: \mu_P = \mu_F = \mu_N
\end{align*}`

against the alternative that at least one mean is different from the others.

One way to do this would be to carry out t-tests on all possible pairs of groups (here there are just three pairs). However, if we have more groups, this becomes quite burdensome. For example, with 10 groups we would need to do `\(\binom{10}{2} = 45\)` tests!

---

### Multiple comparisons

In addition to being time-consuming, carrying out multiple tests leads to an inflated Type I error rate, which calls into question the validity of a study if these .copper[multiple comparisons] are not accounted for.

---

### Multiple comparisons in action: overall test

<img src="jellybean1.png" width="50%" style="display: block; margin: auto;" />

---

### Extra tests...whoa!

<img src="jellybean2.png" width="50%" style="display: block; margin: auto;" />

---

### Green jelly beans!

<img src="jellybean3.png" width="50%" style="display: block; margin: auto;" />

---

### So what went wrong?

<img src="stop.jpg" width="50%" style="display: block; margin: auto;" />

---

### Multiple comparisons

Let's revisit the pets / stress example, where we had three pairwise comparisons.

- Suppose all means are truly equal and we conduct all three pairwise tests
- Suppose also that the tests are independent and each is done at the 0.05 significance level
- Then the probability that we fail to reject in all three tests (the correct decision) is `\((1-0.05)^3 = 0.95^3 = 0.857\)`, and so the probability of rejecting at least one of the three null hypotheses, called the .copper[family-wise error rate], is `\(1-0.857 = 0.143 > 0.05\)`
- With 45 tests, the probability of (incorrectly!) rejecting at least one of them is over 90%!

---

### Multiple comparisons

.copper[ANOVA] extends the `\(t\)`-test and is one way to control the overall Type I error rate at a fixed level `\(\alpha\)`: we only test pairwise differences when the overall ANOVA test is rejected.

ANOVA stands for .copper[analysis of variance]. We use ANOVA when we want to compare more than two groups.

---

### ANOVA null hypothesis

In ANOVA, we typically follow this testing procedure:

1. First, we conduct an **overall** test of the null hypothesis that the means of all of the groups are equal.
2. If this is rejected, then we .vocab[step down] to see which means are different from each other. A .copper[multiple comparisons] correction is sometimes used for these pairwise comparisons of means.
3. If we fail to reject the null hypothesis, then no further testing should be done.

---

### ANOVA alternative hypothesis

For ANOVA with three groups, our null hypothesis is `\(H_0: \mu_P = \mu_F = \mu_N\)`. What could happen under the alternative?

- `\(\mu_P \neq \mu_F \neq \mu_N\)`
- `\(\mu_P = \mu_F\)`, but `\(\mu_N\)` is different
- `\(\mu_P = \mu_N\)`, but `\(\mu_F\)` is different
- `\(\mu_F = \mu_N\)`, but `\(\mu_P\)` is different

The alternative hypothesis for ANOVA is that at least one of the means is different from the others.

---

### Assumptions of ANOVA

1. Outcomes within groups are normally distributed
2. Homoscedasticity (the within-group variance is the same for all groups)
3. Samples are independent

If these assumptions are violated, then the results from ANOVA may not be valid. We will discuss some alternatives later in the course.

---

### Validity of ANOVA for pet data

<img src="lec-11_files/figure-html/unnamed-chunk-11-1.png" width="80%" style="display: block; margin: auto;" />

The variances appear to be similar, but normality looks questionable! For now, let's proceed despite this problem.

---

### Why analyze variance when we want means?

Remember, ANOVA stands for analysis of variance. What does variance have to do with our null hypothesis, which is about the equality of means (say, `\(H_0: \mu_1 = \mu_2 = \cdots = \mu_K\)`)?

---

### F-test

If the group-specific means vary around the overall grand mean more than the individual observations vary around their group-specific sample means, then we have evidence that the corresponding population means are in fact different (see board).

---

### F-test

How do we formally compare these variances? Consider the `\(F\)` statistic given by

`$$F = \frac{s_{between}^2}{s_{within}^2},$$`

which, if `\(H_0\)` is true, has an `\(F\)` distribution with `\(K-1\)` .copper[numerator] degrees of freedom and `\(n-K\)` .copper[denominator] degrees of freedom, where `\(K\)` is the number of groups and `\(n\)` is the total number of observations.

---

### F-test

The F-test for ANOVA is inherently one-tailed: we reject `\(H_0\)` only if `\(F\)` is considerably larger than one. Importantly, this does not mean we have a one-sided alternative; we just look at one tail of the `\(F\)` distribution to get the p-value.

If there are only two groups, then the F-test gives the same result as the independent-samples t-test assuming equal variances (in fact, `\(F = t^2\)`).

---

### Output

```
##             Df Sum Sq Mean Sq F value   Pr(>F)
## pets$group   2  499.8  249.89   13.11 3.77e-05
## Residuals   42  800.6   19.06
```

.question[
Which rows correspond to the between-group and within-group variances here, respectively?
]
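This is the kind of table produced by R's `aov()` function. A minimal sketch, assuming the data frame is called `pets` (as the row label above suggests) with the outcome in a column we'll call `rate` (that name is our assumption):

```r
# One-way ANOVA table; the outcome column name `rate` is hypothetical
summary(aov(pets$rate ~ pets$group))

# The p-value is the upper tail of the F distribution with
# ndf = K - 1 = 2 and ddf = n - K = 42:
pf(13.11, df1 = 2, df2 = 42, lower.tail = FALSE)   # about 3.77e-05
```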
---

### F-test for pet data

Note that `\(F = s_B^2/s_W^2 = 249.89/19.06 = 13.11\)`, with ndf `\(= 3-1=2\)` and ddf `\(= 45-3=42\)`. This corresponds to a p-value `\(<\)` 0.0001.

At `\(\alpha = 0.05\)`, we reject the null hypothesis. There is sufficient evidence that at least one of the three groups comes from a population with a different mean from the others.

.question[
Which groups might be different?
]

---

### Bonferroni correction

As we showed earlier, conducting multiple tests on a data set increases the .vocab[family-wise error rate]. One simple, very conservative way to control it is to divide `\(\alpha\)` by the number of tests to be done and use that as the significance level for each individual test. This procedure is called the .vocab[Bonferroni correction].

---

### Bonferroni correction

For example, with two tests, to preserve an overall 0.05 Type I error rate, the Bonferroni correction would use `\(\alpha/2 = 0.025\)` as the significance level for each individual test instead of 0.05.

Bonferroni is a conservative correction, making it harder to reject the null hypothesis, but it is a safe bet for controlling the Type I error rate.

---

### Pets and stress: group differences

We can compare the groups using a Bonferroni correction (here we have three tests, so the significance level for each test is `\(\alpha/3 \approx 0.0167\)`).

The raw (uncorrected) p-values for the pairwise t-tests were `\(<0.0001\)` for friend vs. pet, 0.021 for friend vs. neither, and 0.009 for pet vs. neither.

.question[
What might we conclude?
]
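A minimal sketch of this comparison in R (treating the `\(<0.0001\)` p-value as `1e-4` purely for illustration):

```r
# Raw pairwise p-values from above ("<0.0001" stood in by 1e-4)
p_raw <- c(friend_vs_pet = 1e-4, friend_vs_neither = 0.021, pet_vs_neither = 0.009)

# Option 1: compare each raw p-value with alpha / (number of tests)
p_raw < 0.05 / 3

# Option 2 (equivalent): Bonferroni-adjust the p-values, compare with alpha
p.adjust(p_raw, method = "bonferroni") < 0.05
```

With the raw data in hand, `pairwise.t.test()` with `p.adjust.method = "bonferroni"` carries out all of the pairwise tests and the adjustment in one step.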