class: center, middle, inverse, title-slide

# Scientific studies, confounding, and Simpson’s paradox
## Intro to Data Science
### Shawn Santo
### 02-04-20

---

## Announcements

- Eric will return on Thursday.

- Be ready to discuss the ugly graphic and your improvements.

- Keep up with the assigned readings. Try to get them done before class.

---

## Does cereal keep girls slim?

**What do you think?**

<br/>

Take a minute and write a few sentences explaining why you do or do not believe, based on the assigned news article, that "cereal keeps girls slim."

Fill out a brief survey [at this link](https://forms.gle/ZKyjR1kWCh2UCyzz5), which serves as today's application exercise.

---

## 4 possible explanations

--

1. Eating breakfast causes girls to be slimmer.

<br>

--

2. Being slim causes girls to eat breakfast.

<br>

--

3. That's just how the data ended up - there's no relationship here, but we just so happened to observe a "lucky coincidence".
<br/><br/>
If we ended up *concluding* that there was a relationship when there truly wasn't one, we would have made a **type 1 error**. We will talk about this concept in greater detail later this semester.

<br/>

--

4. A third variable is responsible for both - a confounding variable.
<br/><br/>
A *confounding* variable is an extraneous variable that affects both the explanatory and the response variable, making it seem like there is a relationship between them.

---

## "Lucky coincidences"

![](images/correlation1.png)

*Source*: [Tyler Vigen's site of spurious correlations](https://www.tylervigen.com/spurious-correlations)

---

## "Lucky coincidences"

![](images/correlation2.png)

*Source*: [Tyler Vigen's site of spurious correlations](https://www.tylervigen.com/spurious-correlations)

---

## "Lucky coincidences"

![](images/correlation3.png)

*Source*: [Tyler Vigen's site of spurious correlations](https://www.tylervigen.com/spurious-correlations)

---

## Confounding variables

**Identify the confounding variable in each of the following statements:**

- As the amount of ice cream sales increases, the number of shark attacks also increases.

--

- The higher the number of firefighters at a fire is, the greater the amount of damage caused by that fire.

--

- Taller children are better at both reading and math compared to shorter children.

---

## Correlation != causation

<br><br>

.center[
![](img/05a/xkcdcorrelation.png)
]

<br><br>

.footnote[
Randall Munroe CC BY-NC 2.5 http://xkcd.com/552/
]

---

class: center, middle, inverse

# Scientific studies

---

## Scientific studies

- Observational
  - Collect data in a way that does not interfere with how the data arise ("observe")
  - Can only establish an association
  - Data often cheaper and easier to collect

--

<br/><br/>

- Experimental
  - Randomly assign subjects to treatments
  - Can establish causal connections
  - Often more expensive
  - Sometimes it is impossible or unethical to design an experiment

---

## Random sampling vs. random assignment

![](img/05a/random_sample_assign_grid.png)

---

## Non-random samples: a cautionary tale

In 2016, the Natural Environment Research Council in England started an online competition in an effort to name a polar research ship.
People were invited to submit suggestions and/or cast a vote for their favorite choice.

--

<br/><br/>

**What type of sampling design is this?**

[What happened?](https://www.cnn.com/2016/04/18/world/boaty-mcboatface-wins-vote/index.html)

---

class: center, middle, inverse

# Conditional probability

---

## Conditional probability

**Notation**: `\(P(A | B)\)`: Probability of event A given event B

`\(A\)`: it will be unseasonably warm tomorrow

`\(B\)`: it is unseasonably warm today

- What is the probability that it will be unseasonably warm tomorrow?
  - What is `\(P(A)\)`?

--

- What is the probability that it will be unseasonably warm tomorrow, given that it is unseasonably warm today?
  - What is `\(P(A|B)\)`?

---

## Example

A January 2018 SurveyUSA poll asked 500 randomly selected Californians whether they are familiar with the DREAM act. The distribution of the responses by age category is shown below. What proportion of **<u>all respondents</u>** are very familiar with the DREAM act?

<br>

.pull-left[
|            | 18 - 49 | 50+ | Total |
|------------|---------|-----|-------|
| Very       | 90      | 32  | 122   |
| Somewhat   | 125     | 86  | 211   |
| Not very   | 56      | 33  | 89    |
| Not at all | 36      | 24  | 60    |
| Not sure   | 9       | 9   | 18    |
| Total      | 316     | 184 | 500   |

<br><br>
]

--

.pull-right[
`\(P(\text{Very}) = \frac{122}{500} = 0.244\)`
]

.footnote[
Source: [SurveyUSA News Poll 23754](http://www.surveyusa.com/client/PollReport.aspx?g=783743b0-efc1-4b67-9201-58352a8f61f1)
]

---

## Example

A January 2018 SurveyUSA poll asked 500 randomly selected Californians whether they are familiar with the DREAM act. The distribution of the responses by age category is shown below. What proportion of **<u>respondents who are 18 - 49 years old</u>** are very familiar with the DREAM act?
<br>

.pull-left[
|            | 18 - 49 | 50+ | Total |
|------------|---------|-----|-------|
| Very       | 90      | 32  | 122   |
| Somewhat   | 125     | 86  | 211   |
| Not very   | 56      | 33  | 89    |
| Not at all | 36      | 24  | 60    |
| Not sure   | 9       | 9   | 18    |
| Total      | 316     | 184 | 500   |
]

--

.pull-right[
`\(P(\text{Very}~|~18-49) = \frac{90}{316} = 0.285\)`
]

---

## Example

A January 2018 SurveyUSA poll asked 500 randomly selected Californians whether they are familiar with the DREAM act. The distribution of the responses by age category is shown below. What proportion of **<u>respondents who are 50+ years old</u>** are very familiar with the DREAM act?

<br>

.pull-left[
|            | 18 - 49 | 50+ | Total |
|------------|---------|-----|-------|
| Very       | 90      | 32  | 122   |
| Somewhat   | 125     | 86  | 211   |
| Not very   | 56      | 33  | 89    |
| Not at all | 36      | 24  | 60    |
| Not sure   | 9       | 9   | 18    |
| Total      | 316     | 184 | 500   |
]

--

.pull-right[
`\(P(\text{Very}~|~50+) = \frac{32}{184} = 0.174\)`
]

---

Given that

- `\(P(\text{Very}) = \frac{122}{500} = 0.244\)`
- `\(P(\text{Very}~|~18-49) = \frac{90}{316} = 0.285\)`
- `\(P(\text{Very}~|~50+) = \frac{32}{184} = 0.174\)`

does there appear to be a relationship between age and familiarity with the DREAM act? Explain your reasoning.

--

<br>

Could there be another variable that explains this relationship?

---

## Independence

Inspired by the previous example and how we used the conditional probabilities to draw conclusions, come up with a definition of independent events.

If easier, you can keep the context limited to the example (independence/dependence of familiarity with the DREAM act and age), but try to push yourself to make a more general statement.
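---

## Checking the example in R

The three proportions from the DREAM act example can be reproduced with a few lines of base R. This is only a minimal sketch: the counts are typed in by hand from the table on the previous slides, and the object names `dream`, `p_very`, and `p_very_given_age` are made up for this illustration.

```r
# Counts copied by hand from the SurveyUSA table on the previous slides.
# `dream` is a hypothetical name used only for this illustration.
dream <- matrix(
  c(90, 125, 56, 36, 9,    # 18 - 49 (first column)
    32,  86, 33, 24, 9),   # 50+     (second column)
  ncol = 2,
  dimnames = list(
    familiarity = c("Very", "Somewhat", "Not very", "Not at all", "Not sure"),
    age         = c("18 - 49", "50+")
  )
)

# Marginal proportion: P(Very) = row total for "Very" / grand total
p_very <- sum(dream["Very", ]) / sum(dream)

# Conditional proportions: P(Very | age) = cell count / column total
p_very_given_age <- dream["Very", ] / colSums(dream)

round(p_very, 3)
round(p_very_given_age, 3)
```

Comparing `p_very` with each entry of `p_very_given_age` is one way to start thinking about whether knowing a respondent's age changes what we expect about their familiarity.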
---

class: center, middle, inverse

# Simpson's paradox

---

## Relationships between variables

- **Bivariate relationship**: Fitness `\(\rightarrow\)` Heart health

- **Multivariate relationship**: Calories + Age + Fitness `\(\rightarrow\)` Heart health

---

## Simpson's paradox

- Not considering an important variable when studying a relationship can result in **Simpson's paradox**, a phenomenon in which the omission of one explanatory variable can change the strength, or even the direction, of the association between another explanatory variable and a response variable.

- In other words, the inclusion of a third variable in the analysis can change the apparent relationship between the other two variables.

---

## Simpson's paradox

<img src="lec-05a-confounding_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Simpson's paradox

<img src="lec-05a-confounding_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Berkeley's admission data

- The data are from a study carried out by the Graduate Division of UC Berkeley in the early 70’s to evaluate whether there was bias in graduate admissions.

- The data come from six departments. For confidentiality we'll call them A-F.

- We have information on whether the applicant was male or female and whether they were admitted or rejected.

- First, we will evaluate whether the percentage of males admitted is indeed higher than that of females, overall. Next, we will calculate the same percentage for each department.
---

## Data

```r
library(tidyverse)

# One row per applicant: admission decision, gender, and department
ucb_admit <- read_csv("data/ucb_admit.csv")
ucb_admit
```

```
#> # A tibble: 4,526 x 3
#>    Admit    Gender Dept 
#>    <chr>    <chr>  <chr>
#>  1 Admitted Male   A    
#>  2 Admitted Male   A    
#>  3 Admitted Male   A    
#>  4 Admitted Male   A    
#>  5 Admitted Male   A    
#>  6 Admitted Male   A    
#>  7 Admitted Male   A    
#>  8 Admitted Male   A    
#>  9 Admitted Male   A    
#> 10 Admitted Male   A    
#> # … with 4,516 more rows
```

If you want to follow along, you can clone a repo with the data and code here: https://classroom.github.com/a/g04U7VIr

---

## Overall gender distribution

What can you say about the overall gender distribution?

*Hint*: Calculate the following probabilities: `\(P(\text{Admit} | \text{Male})\)` and `\(P(\text{Admit} | \text{Female})\)`.

```r
# Number of applicants for each gender / decision combination
ucb_admit %>%
  count(Gender, Admit)
```

```
#> # A tibble: 4 x 3
#>   Gender Admit        n
#>   <chr>  <chr>    <int>
#> 1 Female Admitted   557
#> 2 Female Rejected  1278
#> 3 Male   Admitted  1198
#> 4 Male   Rejected  1493
```

---

## Overall gender distribution

```r
# Proportion admitted / rejected within each gender
ucb_admit %>%
  count(Gender, Admit) %>%
  group_by(Gender) %>%
  mutate(prop_admit = n / sum(n))
```

```
#> # A tibble: 4 x 4
#> # Groups:   Gender [2]
#>   Gender Admit        n prop_admit
#>   <chr>  <chr>    <int>      <dbl>
#> 1 Female Admitted   557      0.304
#> 2 Female Rejected  1278      0.696
#> 3 Male   Admitted  1198      0.445
#> 4 Male   Rejected  1493      0.555
```

--

<br/>

What type of visualization would be appropriate for representing these data?

---

## Overall gender distribution

```r
ggplot(ucb_admit, mapping = aes(x = Gender, fill = Admit)) +
  geom_bar(position = "fill") +
  labs(y = "", title = "Admission by gender")
```

<img src="lec-05a-confounding_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

## Distribution by department

What can you say about the gender distribution by department?
```r
ucb_admit %>%
  count(Dept, Gender, Admit)
```

```
#> # A tibble: 24 x 4
#>    Dept  Gender Admit        n
#>    <chr> <chr>  <chr>    <int>
#>  1 A     Female Admitted    89
#>  2 A     Female Rejected    19
#>  3 A     Male   Admitted   512
#>  4 A     Male   Rejected   313
#>  5 B     Female Admitted    17
#>  6 B     Female Rejected     8
#>  7 B     Male   Admitted   353
#>  8 B     Male   Rejected   207
#>  9 C     Female Admitted   202
#> 10 C     Female Rejected   391
#> # … with 14 more rows
```

---

.small[
```r
ucb_admit %>%
  count(Dept, Gender, Admit) %>%
  print(n = 24)
```

```
#> # A tibble: 24 x 4
#>    Dept  Gender Admit        n
#>    <chr> <chr>  <chr>    <int>
#>  1 A     Female Admitted    89
#>  2 A     Female Rejected    19
#>  3 A     Male   Admitted   512
#>  4 A     Male   Rejected   313
#>  5 B     Female Admitted    17
#>  6 B     Female Rejected     8
#>  7 B     Male   Admitted   353
#>  8 B     Male   Rejected   207
#>  9 C     Female Admitted   202
#> 10 C     Female Rejected   391
#> 11 C     Male   Admitted   120
#> 12 C     Male   Rejected   205
#> 13 D     Female Admitted   131
#> 14 D     Female Rejected   244
#> 15 D     Male   Admitted   138
#> 16 D     Male   Rejected   279
#> 17 E     Female Admitted    94
#> 18 E     Female Rejected   299
#> 19 E     Male   Admitted    53
#> 20 E     Male   Rejected   138
#> 21 F     Female Admitted    24
#> 22 F     Female Rejected   317
#> 23 F     Male   Admitted    22
#> 24 F     Male   Rejected   351
```
]

---

## Distribution by department

What type of visualization would be appropriate for representing these data?
.small[
```r
ucb_admit %>%
  count(Dept, Gender, Admit) %>%
  group_by(Dept, Gender) %>%
  mutate(Perc_Admit = n / sum(n)) %>%
  filter(Admit == "Admitted")
```

```
#> # A tibble: 12 x 5
#> # Groups:   Dept, Gender [12]
#>    Dept  Gender Admit        n Perc_Admit
#>    <chr> <chr>  <chr>    <int>      <dbl>
#>  1 A     Female Admitted    89     0.824 
#>  2 A     Male   Admitted   512     0.621 
#>  3 B     Female Admitted    17     0.68  
#>  4 B     Male   Admitted   353     0.630 
#>  5 C     Female Admitted   202     0.341 
#>  6 C     Male   Admitted   120     0.369 
#>  7 D     Female Admitted   131     0.349 
#>  8 D     Male   Admitted   138     0.331 
#>  9 E     Female Admitted    94     0.239 
#> 10 E     Male   Admitted    53     0.277 
#> 11 F     Female Admitted    24     0.0704
#> 12 F     Male   Admitted    22     0.0590
```
]

---

## Distribution by department

```r
ggplot(ucb_admit, mapping = aes(x = Gender, fill = Admit)) +
  geom_bar(position = "fill") +
  facet_grid(. ~ Dept) +
  labs(x = "Gender", y = "", fill = "Admission",
       title = "Admission by gender by department")
```

<img src="lec-05a-confounding_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

## Distribution by department

<img src="lec-05a-confounding_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

<br>

Why do you think Simpson's paradox occurred? In other words, why is the overall admissions rate much lower for females, even though the admissions rates are generally similar within each department?

---

## Revisiting cereal...

---

## References

1. https://www.cbsnews.com/news/study-cereal-keeps-girls-slim/
2. http://www.surveyusa.com/client/PollReport.aspx?g=783743b0-efc1-4b67-9201-58352a8f61f1
3. https://www.tylervigen.com/spurious-correlations
4. http://xkcd.com/552/
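
---

## Appendix: where did applicants apply?

One way to explore the question on the last "Distribution by department" slide is to look at where each gender's applications went alongside how selective each department was. The sketch below is an illustration only, written in base R so it runs without the data file: the counts are typed in by hand from the 24-row table printed earlier, and every object name (`female_applied`, `summary_by_dept`, and so on) is hypothetical.

```r
# Counts copied by hand from the 24-row table printed earlier.
# All object names are hypothetical, chosen for this illustration.
dept <- c("A", "B", "C", "D", "E", "F")
female_admitted <- c(89, 17, 202, 131, 94, 24)
female_rejected <- c(19, 8, 391, 244, 299, 317)
male_admitted   <- c(512, 353, 120, 138, 53, 22)
male_rejected   <- c(313, 207, 205, 279, 138, 351)

# Applications per department for each gender
female_applied <- female_admitted + female_rejected
male_applied   <- male_admitted + male_rejected

summary_by_dept <- data.frame(
  dept = dept,
  # Admission rate of each department, pooling both genders
  dept_admit_rate = (female_admitted + male_admitted) /
    (female_applied + male_applied),
  # Share of each gender's applications that went to the department
  share_of_female_apps = female_applied / sum(female_applied),
  share_of_male_apps   = male_applied / sum(male_applied)
)

# Round the numeric columns for display
rounded <- summary_by_dept
rounded[-1] <- round(rounded[-1], 2)
rounded
```

In these counts, departments A and B have the highest admission rates and together receive about half of the male applications but well under a tenth of the female applications, which is worth keeping in mind when comparing the overall rates with the within-department rates.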