class: center, middle, inverse, title-slide # Comparing proportions & odds ### Dr. Maria Tackett ### 10.21.19 --- class: middle, center ### [Click for PDF of slides](15-proportions-odds.pdf) --- ### Announcements - Lab 06 due **Tuesday, Oct 22 at 11:59p** - [Reading 07](https://www2.stat.duke.edu/courses/Fall19/sta210.001/reading/reading-07.html) for Wednesday - Project Proposal **due Wed, Oct 30 at 11:59p** - Evan's Office hours: - Complete the poll: [https://doodle.com/poll/e4teb3d38cgvshwc](https://doodle.com/poll/e4teb3d38cgvshwc) - Ignore the specific dates. This is for your general availability during each time slot. --- ### Packages ```r library(tidyverse) library(knitr) library(broom) library(fivethirtyeight) ``` --- class: middle, center ### Modeling Binary Outcomes --- ### FiveThirtyEight March Madness <center> <a href=https://projects.fivethirtyeight.com/2018-march-madness-predictions/ target="_blank" class="link">2018 March Madness Predictions</a> </center> <br> *Live Win Probabilities are "derived using <font color=#9B02BD>**logistic regression analysis**</font>, which lets us plug the current state of a game into a model to produce the probability that either team will win the game."* <br> <div align="right"> <a href=https://fivethirtyeight.com/features/how-our-march-madness-predictions-work/ target="_blank">-"How Our March Madness Predictions Work"<a/> </div> --- ### 2018 Election Forecasts <center> <img src="img/13/fivethirtyeight_senate.png" width="70%" style="display: block; margin: auto;" /> <a href="https://projects.fivethirtyeight.com/2018-midterm-election-forecast/senate/?ex_cid=irpromo">FiveThirtyEight.com Senate forecast</a> <br> <br> <br> <img src="img/13/fivethirtyeight_house.png" width="70%" style="display: block; margin: auto;" /> <a href="https://projects.fivethirtyeight.com/2018-midterm-election-forecast/house/?ex_cid=irpromo">FiveThirtyEight.com House forecast</a> </center> --- class: middle, center *Our models are probabilistic in nature; we do a lot of thinking about these probabilities, and the goal is to develop <font class="vocab">probabilistic estimates</font> that hold up well under real-world conditions.* <br> <div align="right"> <a href="https://fivethirtyeight.com/methodology/how-fivethirtyeights-house-and-senate-models-work/" target="_blank">-"How FiveThirtyEight's House, Senate, and Governor Models Work"<a/> </div> --- ### Is it rude to recline your seat on a plane? ```r flying <- flying %>% filter(!is.na(recline_rude)) %>% mutate(rude = if_else(recline_rude %in% c("Somewhat", "Very"), 1, 0)) ``` <br><br> Source: [*41 Percent of Fliers Think You're Rude If You Recline Your Seat*](https://fivethirtyeight.com/features/airplane-etiquette-recline-seat/) --- ### Response Variable, `\(Y\)` - `\(Y\)` is a binary response variable + 1: yes + 0: no - `\(Mean(Y) = p\)` + `\(p\)` is the proportion of "yes" responses in the population - `\(Variance(Y) = p(1-p)\)` --- ### Sampling Distribution of Sample Proportion - `\(\color{blue}{\hat{p}}\)` : average of binary responses in the sample + Called the <font color="blue">sample proportion</font> + This is the statistic, i.e. the estimate of `\(p\)` - Given `\(\hat{p}\)` is the sample proportion based on a sample of size `\(n\)` from a population with population proportion `\(p\)`: `$$\hat{p} \sim N\bigg(p, \frac{p(1-p)}{n}\bigg)$$` ...assuming `\(n\)` is "large" (more than 5 "yes" and 5 "no") --- ### Confidence Interval for a Single Proportion - Approximate `\(C\)`% confidence interval for `\(p\)` is .alert[ `$$\begin{aligned}&\hat{p} \pm z^* SE(\hat{p})\\[15pt] = &\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\\\end{aligned}$$` ] -- where `\(z^*\)` is the critical value calculated from the `\(N(0,1)\)` distribution -- ```r # Critical value for 90% CI qnorm(0.95) ``` ``` ## [1] 1.644854 ``` --- ### Opinions about reclining: 90% CI ```r crit.val <- qnorm(0.95) ``` ```r flying %>% summarise(n = n(), p_hat = sum(rude)/n, se = sqrt(p_hat*(1-p_hat)/n), lb = p_hat - crit.val*se, ub = p_hat + crit.val*se) ``` ``` ## # A tibble: 1 x 5 ## n p_hat se lb ub ## <int> <dbl> <dbl> <dbl> <dbl> ## 1 854 0.412 0.0168 0.384 0.440 ``` We are 90% confident that the interval 0.384 to 0.44 contains the true proportion of fliers who think reclining your seat on a plane is rude. --- ### Sampling Distribution for Difference in Two Proportions - Let `\(\hat{p}_1\)` and `\(\hat{p}_2\)` be sample proportions from independent random samples of size `\(n_1\)` and `\(n_2\)`, respectively: `$$\hat{p}_1 - \hat{p}_2 \sim N\bigg(p_1 - p_2, \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}\bigg)$$` <br> ... assuming `\(n_1\)` and `\(n_2\)` are "large" (at least 5 "yes" and "no" in each sample) --- ### Confidence Interval for Difference in Proportions - Approximate `\(C\)`% confidence interval for `\(p_1 - p_2\)` is `$$\begin{aligned}&(\hat{p}_1 - \hat{p}_2) \pm z^* \times SE(\hat{p}_1 - \hat{p}_2)\\[15pt] =~ &(\hat{p}_1 - \hat{p}_2) \pm z^* \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\end{aligned}$$` <br> -- where `\(z^*\)` is the critical value calculated from the `\(N(0,1)\)` distribution --- ### Opinions about reclining by age |age | 0| 1| |:-----|---:|--:| |18-29 | 78| 94| |30-44 | 143| 79| -- **Is there a significant difference in the proportion of 18-29 year olds versus 30-44 year olds who think reclining a seat on a plane is rude?** --- ### Opinions about reclining by age: 90% CI ```r flying %>% filter(age %in% c("18-29", "30-44")) %>% group_by(age) %>% summarise(n = n(), p_hat = round(sum(rude)/n,3)) %>% kable(format="markdown") ``` |age | n| p_hat| |:-----|---:|-----:| |18-29 | 172| 0.547| |30-44 | 222| 0.356| .question[ 1. Calculate a 90% confidence interval for the difference in proportion of 18-29 year olds and 30-44 year olds who think reclining a seat on a plane is rude. Interpret the interval. 3. Based on the interval, is there evidence of a significant difference in proportions between the two groups? ] --- class: middle, center **What are some potential difficulties with reporting results using the difference in proportions? Or proportions/percentages in general?** --- ### Odds - Given `\(p\)`, the population proportion of "yes" responses, the corresponding <font class="vocab3">odds</font> of a "yes" response is `$$\omega = \frac{p}{1-p}$$` - The *sample odds* are `\(\hat{\omega} = \frac{\hat{p}}{1-\hat{p}}\)` - **Ex.** + proportion of fliers who think reclining is rude: 0.412. + odds a flier thinking reclining is rude: 0.701 to 1 --- ### Properties of the odds - `\(\text{odds } \geq 0\)` - If `\(\hat{p} = 0.5\)`, then odds `\(=1\)` - If odds of "yes" `\(=\omega\)`, then the odds of "no" `\(=\frac{1}{\omega}\)` - If odds of "yes" `\(=\omega\)`, then `\(\hat{p} = \frac{\omega}{(1+\omega)}\)` --- ### Odds ratio - Suppose we have two populations with proportions `\(p_1\)` and `\(p_2\)` and odds `\(\omega_1\)` and `\(\omega_2\)` - The <font class="vocab">odds ratio</font> is `\(\phi = \frac{\omega_1}{\omega_2}\)` + *Estimate*: `\(\hat{\phi} = \frac{\hat{\omega}_1}{\hat{\omega}_2}\)` - Good alternative to the difference in proportions - .vocab[Intepretation]: The odds of "yes" in group 1 is `\(\phi\)` times the odds of "yes" in group 2 --- ### Why use Odds Ratio? - In practice, the odds ratio is more consistent across levels of confounding variables - The odds ratio is more easily interpreted / understood - The odds ratio can be easily extended to regression analysis --- ### Sampling distribution of log(odds ratio) - Let `\(\hat{\omega}_1\)` and `\(\hat{\omega}_2\)` be sample odds from independent random samples of size `\(n_1\)` and `\(n_2\)`, respectively: `$$\log(\hat{\phi}) = \log\bigg(\frac{\hat{\omega}_1}{\hat{\omega}_2}\bigg) \approx N\bigg(\log\bigg(\frac{\omega_1}{\omega_2}\bigg), \frac{1}{n_1p_1(1-p_1)} + \frac{1}{n_2p_2(1-p_2)}\bigg)$$` <br> ... assuming `\(n_1\)` and `\(n_2\)` are "large" based on the thresholds for difference in proportions --- ### Confidence Interval for Log Odds Ratio - Approximate `\(C\)`% confidence interval for `\(\log(\phi)\)` is .alert[ `$$\begin{aligned}&\log(\hat{\phi}) \pm z^* \times SE[\log(\hat{\phi})]\\[15pt] =\hspace{2mm} &\log(\hat{\phi}) \pm z^* \times \sqrt{\frac{1}{n_1\hat{p}_1(1-\hat{p}_1)} + \frac{1}{n_2\hat{p}_2(1-\hat{p}_2)}}\\\end{aligned}$$` ] -- where `\(z^*\)` is the critical value calculated from the `\(N(0,1)\)` distribution --- ### Confidence Interval for Odds Ratio Suppose `\(LB\)` and `\(UB\)` are the lower and upper bounds of the `\(C\)`% confidence interval for the log odds ratio, `\(\log(\phi)\)` .alert[ The `\(C\)`% confidence interval for the odds ratio, `\(\phi\)` is `$$\exp\{LB\} \text{ to }\exp\{UB\}$$` ] --- ### Opinions about reclining seat ```r flying %>% filter(age %in% c("18-29", "30-44")) %>% group_by(age) %>% summarise(n = n(), p_hat = round(sum(rude)/n,3), odds = round(p_hat/(1-p_hat),3)) ``` ``` ## # A tibble: 2 x 4 ## age n p_hat odds ## <ord> <int> <dbl> <dbl> ## 1 18-29 172 0.547 1.21 ## 2 30-44 222 0.356 0.553 ``` .question[ 1. Calculate a 90% confidence interval for the odds ratio of 18-29 versus 30-44 year olds who think reclining a seat on a plane is rude. Interpret the interval. 2. Based on the interval, is there evidence of a significant difference in the odds between the two groups? ] --- ### Hypothesis Test for Odds Ratio - We want to test whether two groups have equal odds, i.e. `\(\phi = \frac{\omega_1}{\omega_2} =1\)` - <font class="vocab">Null Hypothesis: </font> `\(H_0: \log(\phi) = \log\big(\frac{\omega_1}{\omega_2}\big) = 0\)` - <font class="vocab">Test Statistic: </font> $$ z = \frac{\log(\hat{\phi}) - 0}{SE_0[\log(\hat{\phi})]} = \frac{\log(\hat{\phi}) - 0}{\sqrt{\frac{1}{n_1\hat{p}_c(1-\hat{p}_c)} + \frac{1}{n_2\hat{p}_c(1-\hat{p}_c)}}}$$ - <font class="vocab">p-value: </font>proportion of `\(N(0,1)\)` distribution as extreme or more extreme than the test statistic --- ### Standard error `\(SE_0[\log(\hat{\phi})]\)` - The null hypothesis is that odds ratio is 1, i.e. the proportions are equal - To calculate standard error, we estimate `\(\hat{\pi}_c\)`, the sample proportion from the combined data `$$SE_0[\log(\hat{\phi})] =SE_0\bigg[\log\bigg(\frac{\hat{\omega}_1}{\hat{\omega}_2}\bigg)\bigg] = \sqrt{ \frac{1}{n_1\hat{p}_c(1-\hat{p}_c)} + \frac{1}{n_2\hat{p}_c(1-\hat{p}_c)}}$$` --- ### Opinions about reclining seat **Do the odds of thinking it's rude to recline a seat on a plane differ between 18-29 and 30-44 year olds?** `$$\begin{aligned}&H_0: \log(\phi) = 0\\ &H_a: \log(\phi) \neq 0\\\end{aligned}$$` - `\(\hat{p}_c\)` = 0.439 - **18 - 29**: `\(n =\)` 172, `\(\hat{\omega} =\)` 1.208 - **30 - 44**: `\(n =\)` 222, `\(\hat{\omega} =\)` 0.553 .question[ 1. Calculate the test statistic. 2. Calculate p-value and make a conclusion. ] --- ### Exam 01: - Grades will be released after class - Notes will be returned in this week's lab - Solutions available in the Resources folder on Sakai - Attend office hours to ask about the solutions and any feedback - Topics: - Variance of `\(y\)` from ANOVA - Violation of assumption vs. robustness