# Scientific studies, confounding, and Simpson’s paradox
## Intro to Data Science
### Shawn Santo
### 02-04-20

---

## Announcements

- Eric will return on Thursday.

- Be ready to discuss the ugly graphic and your improvements.

- Keep up with the assigned readings. Try to get them done before class.

---

## Does cereal keep girls slim?

**What do you think?**

Take a minute and write a few sentences explaining why you do or do not 
believe the statement "cereal keeps girls slim," based on the assigned news 
article.

Fill out a brief survey 
[at this link](https://forms.gle/ZKyjR1kWCh2UCyzz5), 
which serves as today's application exercise.

---

## 4 possible explanations

1. Eating breakfast causes girls to be slimmer.

2. Being slim causes girls to eat breakfast.

3. That's just how the data turned out: there is no real relationship here,
 and we happened to observe a "lucky coincidence".
 
 If we ended up *concluding* that there was a relationship when there truly 
 wasn't one, we would have made a **type 1 error**. We 
 will talk about this concept in greater detail later this semester. 
 
--

4. A third variable is responsible for both - a confounding variable.
 
 A *confounding* variable is an extraneous variable 
 that affects both the explanatory and the response variable, and makes it 
 seem like there is a relationship between them.
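
The definition above can be illustrated with a small simulation. This is only a sketch on made-up data: the names `x`, `y`, and `z` and the coefficients are assumptions chosen for illustration. Here `z` drives both `x` and `y`, while `x` and `y` have no direct effect on each other.

```r
# Simulated illustration of confounding; all values are made up.
set.seed(1)
n <- 1000

# z is the confounding variable
z <- rnorm(n)

# x and y each depend on z, but not on each other
x <- 2 * z + rnorm(n)
y <- 3 * z + rnorm(n)

# Correlation between x and y when z is ignored
r_marginal <- cor(x, y)

# Correlation between what is left of x and y after removing
# the part of each that is explained by z
r_adjusted <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

round(c(marginal = r_marginal, adjusted = r_adjusted), 2)
```

The first number should be large and the second close to zero, which is the pattern a confounder produces.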

---

## "Lucky coincidences"

![](images/correlation1.png)

*Source*: [Tyler Vigen's site of spurious correlations](https://www.tylervigen.com/spurious-correlations)

---

## "Lucky coincidences"

![](images/correlation2.png)

*Source*: [Tyler Vigen's site of spurious correlations](https://www.tylervigen.com/spurious-correlations)

---

## "Lucky coincidences"

![](images/correlation3.png)

*Source*: [Tyler Vigen's site of spurious correlations](https://www.tylervigen.com/spurious-correlations)
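
Coincidences like these are easier to find than they might seem. The sketch below is an illustration on simulated data, not an analysis of the series in the graphics: it generates many pairs of completely unrelated variables and counts how often a fairly strong correlation appears by chance. The number of pairs, the 10 observations per variable, and the 0.6 cutoff are arbitrary assumptions.

```r
# Simulate many pairs of unrelated variables and see how often
# a "strong" correlation shows up by chance alone.
set.seed(2020)
n_pairs <- 5000   # number of unrelated variable pairs (arbitrary)
n_obs   <- 10     # observations per variable, e.g. 10 yearly values

# Each replicate draws two independent samples and records their correlation
r <- replicate(n_pairs, cor(rnorm(n_obs), rnorm(n_obs)))

# Share of pairs with |r| > 0.6 even though no relationship exists
prop_strong <- mean(abs(r) > 0.6)
prop_strong
```

Even a small share adds up when thousands of pairs are compared, so searching through enough unrelated series will turn up some that appear to move together.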

---

## Confounding variables

**Identify the confounding variable in each of the following statements:**

- As ice cream sales increase, the number of shark attacks also increases.

- The higher the number of firefighters at a fire is, the greater the amount of
damage caused by that fire.

- Taller children are better at both reading and math compared to shorter
children.

---

## Correlation != causation

---

# Scientific studies

---

## Scientific studies

- Observational
    - Collect data in a way that does not interfere with how the data arise 
      ("observe")
    - Can only establish an association
    - Data often cheaper and easier to collect

- Experimental
    - Randomly assign subjects to treatments
    - Establish causal connections
    - Often more expensive
    - Sometimes it is impossible or unethical to design an experiment

---

## Random sampling vs. random assignment

![](img/05a/random_sample_assign_grid.png)
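
The two ideas can be sketched in a few lines of R. This is a minimal illustration under stated assumptions: the population of 1,000 subject IDs, the sample size of 100, and the two group labels are all hypothetical.

```r
set.seed(42)

# Hypothetical population of 1,000 subject IDs
population <- 1:1000

# Random sampling: decides WHO is in the study
study_sample <- sample(population, size = 100)

# Random assignment: decides WHICH treatment each sampled subject gets
# (shuffle a vector holding 50 "treatment" and 50 "control" labels)
treatment <- sample(rep(c("treatment", "control"), each = 50))

study <- data.frame(id = study_sample, group = treatment)
table(study$group)
```

The first call to `sample()` selects subjects from the population; the second shuffles the group labels so that chance alone determines who receives which treatment.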

---

## Non-random samples: a cautionary tale

In 2016, the Natural Environment Research Council in England
started an online competition in an effort to name a polar research
ship. People were invited to submit suggestions and/or cast a vote for
their favorite choice.

--

**What type of sampling design is this?**

[What happened?](https://www.cnn.com/2016/04/18/world/boaty-mcboatface-wins-vote/index.html)

---

# Conditional probability

---

## Conditional probability

**Notation**: `\(P(A | B)\)`: Probability of event A given event B

`\(A\)`: it will be unseasonably warm tomorrow

`\(B\)`: it is unseasonably warm today

- What is the probability that it will be unseasonably warm tomorrow? 
    - What is `\(P(A)\)`?

- What is the probability that it will be unseasonably warm tomorrow, given that
  it is unseasonably warm today? 
    - What is `\(P(A|B)\)`?

---

## Example

A January 2018 SurveyUSA poll asked 500 randomly selected Californians whether 
they are familiar with the DREAM act. The distribution of the responses by age 
category is shown below.

What proportion of **all respondents** are very familiar with the 
DREAM act?

.pull-left[
| | 18 - 49 | 50+ | Total |
|------------|---------|-----|-------|
| Very | 90 | 32 | 122 |
| Somewhat | 125 | 86 | 211 |
| Not very | 56 | 33 | 89 |
| Not at all | 36 | 24 | 60 |
| Not sure | 9 | 9 | 18 |
| Total | 316 | 184 | 500 |

]

--
.pull-right[
`\(P(\text{Very}) = \frac{122}{500} = 0.244\)`
]

.footnote[
Source: [SurveyUSA News Poll 23754](http://www.surveyusa.com/client/PollReport.aspx?g=783743b0-efc1-4b67-9201-58352a8f61f1)
]

---

## Example

A January 2018 SurveyUSA poll asked 500 randomly selected Californians whether 
they are familiar with the DREAM act. The distribution of the responses by age 
category is shown below.

What proportion of **respondents who are 18 - 49 years old** are very 
familiar with the DREAM act?

---

## Example

A January 2018 SurveyUSA poll asked 500 randomly selected Californians whether 
they are familiar with the DREAM act. The distribution of the responses by age 
category is shown below.

What proportion of **respondents who are 50+ years old** are very 
familiar with the DREAM act?

---

Given that

- `\(P(\text{Very}) = \frac{122}{500} = 0.244\)`

- `\(P(\text{Very}~|~18-49) = \frac{90}{316} = 0.285\)`

- `\(P(\text{Very}~|~50+) = \frac{32}{184} = 0.174\)`

does there appear to be a relationship between age and familiarity with the 
DREAM act? Explain your reasoning.

Could there be another variable that explains this relationship?
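
The three probabilities can be recomputed from the survey counts. The sketch below copies the counts from the table on the earlier slide into a matrix; the object names (`dream`, `p_very`, `p_very_given_age`) are made up for illustration.

```r
# Counts copied from the SurveyUSA table (rows: familiarity, columns: age)
dream <- matrix(
  c(90, 125, 56, 36, 9,     # 18 - 49
    32,  86, 33, 24, 9),    # 50+
  ncol = 2,
  dimnames = list(
    familiarity = c("Very", "Somewhat", "Not very", "Not at all", "Not sure"),
    age = c("18 - 49", "50+")
  )
)

# Marginal probability: P(Very)
p_very <- sum(dream["Very", ]) / sum(dream)

# Conditional probabilities: P(Very | age group)
# margin = 2 divides each column by its own total
p_very_given_age <- prop.table(dream, margin = 2)["Very", ]

round(p_very, 3)
round(p_very_given_age, 3)
```

Using `margin = 2` is what makes these conditional on age: each count is divided by the total for its age group rather than by all 500 respondents.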

---

## Independence

Inspired by the previous example and how we used the conditional probabilities 
to make conclusions, come up with a definition of independent events. 
If easier, you can keep the context limited to the example 
(independence/dependence of familiarity with the DREAM act and age), but try to 
push yourself to make a more general statement.

---

# Simpson's paradox

---

## Relationships between variables

- **Bivariate relationship**: Fitness `\(\rightarrow\)` Heart health

- **Multivariate relationship**: Calories + Age + Fitness `\(\rightarrow\)` Heart 
  health

---

## Simpson's paradox

- Not considering an important variable when studying a relationship can result 
  in **Simpson's paradox**, a phenomenon in which the omission of one explanatory 
  variable can affect the measure of association between another explanatory 
  variable and a response variable.

- In other words, the inclusion of a third variable in the analysis can change 
  the apparent relationship between the other two variables.
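
A small simulation can show how this happens. The data below are synthetic and made up purely for illustration (the group labels, means, and slopes are assumptions): within each group `y` decreases as `x` increases, but the group with larger `x` values also has larger `y` values.

```r
# Synthetic data: two groups, each with a negative x-y trend,
# but the group with larger x also has larger y.
set.seed(7)
n <- 200
group <- rep(c("g1", "g2"), each = n)
x <- c(rnorm(n, mean = 2), rnorm(n, mean = 6))
y <- ifelse(group == "g1", 5, 13) - x + rnorm(2 * n, sd = 0.5)

# Slope of x when the group variable is left out of the model
slope_overall <- coef(lm(y ~ x))[["x"]]

# Slope of x once the group variable is included
slope_adjusted <- coef(lm(y ~ x + group))[["x"]]

round(c(overall = slope_overall, adjusted = slope_adjusted), 2)
```

The sign of the slope flips depending on whether `group` is in the model, which is the kind of reversal described above.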

---

## Berkeley's admission data

- The data are from a study carried out by the Graduate Division of UC 
  Berkeley in the early 1970s to evaluate whether there was bias in graduate 
  admissions.
  
- The data come from six departments. For confidentiality we'll call them A-F.

- We have information on whether the applicant was male or female and whether 
  they were admitted or rejected.
  
- First, we will evaluate whether the percentage of males admitted is indeed 
  higher than the percentage of females admitted, overall. Next, we will 
  calculate the same percentages for each department.

---

## Data

```r
library(tidyverse)
ucb_admit <- read_csv("data/ucb_admit.csv")
ucb_admit
```

```
#> # A tibble: 4,526 x 3
#> Admit Gender Dept 
#> <chr> <chr> <chr>
#> 1 Admitted Male A 
#> 2 Admitted Male A 
#> 3 Admitted Male A 
#> 4 Admitted Male A 
#> 5 Admitted Male A 
#> 6 Admitted Male A 
#> 7 Admitted Male A 
#> 8 Admitted Male A 
#> 9 Admitted Male A 
#> 10 Admitted Male A 
#> # … with 4,516 more rows
```

If you want to follow along, a repo you can clone with the data and
code is available here: https://classroom.github.com/a/g04U7VIr

---

## Overall gender distribution

What can you say about the overall gender distribution? *Hint*: 
Calculate the following probabilities: `\(P(\text{Admit} | \text{Male})\)` and 
`\(P(\text{Admit} | \text{Female})\)`.

```r
ucb_admit %>%
  count(Gender, Admit)
```

```
#> # A tibble: 4 x 3
#> Gender Admit n
#> <chr> <chr> <int>
#> 1 Female Admitted 557
#> 2 Female Rejected 1278
#> 3 Male Admitted 1198
#> 4 Male Rejected 1493
```

---

## Overall gender distribution

```r
ucb_admit %>%
  count(Gender, Admit) %>%
  group_by(Gender) %>%
  mutate(prop_admit = n / sum(n))
```

```
#> # A tibble: 4 x 4
#> # Groups: Gender [2]
#> Gender Admit n prop_admit
#> <chr> <chr> <int> <dbl>
#> 1 Female Admitted 557 0.304
#> 2 Female Rejected 1278 0.696
#> 3 Male Admitted 1198 0.445
#> 4 Male Rejected 1493 0.555
```

What type of visualization would be appropriate for representing these data?

---

## Overall gender distribution

```r
ggplot(ucb_admit, mapping = aes(x = Gender, fill = Admit)) +
  geom_bar(position = "fill") + 
  labs(y = "", title = "Admission by gender")
```

---

## Distribution by department

What can you say about the gender distribution by department?

```r
ucb_admit %>%
  count(Dept, Gender, Admit)     
```

---

```r
ucb_admit %>%
  count(Dept, Gender, Admit) %>%
  print(n = 24)
```

```
#> # A tibble: 24 x 4
#> Dept Gender Admit n
#> <chr> <chr> <chr> <int>
#> 1 A Female Admitted 89
#> 2 A Female Rejected 19
#> 3 A Male Admitted 512
#> 4 A Male Rejected 313
#> 5 B Female Admitted 17
#> 6 B Female Rejected 8
#> 7 B Male Admitted 353
#> 8 B Male Rejected 207
#> 9 C Female Admitted 202
#> 10 C Female Rejected 391
#> 11 C Male Admitted 120
#> 12 C Male Rejected 205
#> 13 D Female Admitted 131
#> 14 D Female Rejected 244
#> 15 D Male Admitted 138
#> 16 D Male Rejected 279
#> 17 E Female Admitted 94
#> 18 E Female Rejected 299
#> 19 E Male Admitted 53
#> 20 E Male Rejected 138
#> 21 F Female Admitted 24
#> 22 F Female Rejected 317
#> 23 F Male Admitted 22
#> 24 F Male Rejected 351
```

---

## Distribution by department

What type of visualization would be appropriate for representing these data?

```r
ucb_admit %>%
  count(Dept, Gender, Admit) %>%
  group_by(Dept, Gender) %>%
  mutate(Perc_Admit = n / sum(n)) %>%
  filter(Admit == "Admitted")
```

```
#> # A tibble: 12 x 5
#> # Groups: Dept, Gender [12]
#> Dept Gender Admit n Perc_Admit
#> <chr> <chr> <chr> <int> <dbl>
#> 1 A Female Admitted 89 0.824 
#> 2 A Male Admitted 512 0.621 
#> 3 B Female Admitted 17 0.68 
#> 4 B Male Admitted 353 0.630 
#> 5 C Female Admitted 202 0.341 
#> 6 C Male Admitted 120 0.369 
#> 7 D Female Admitted 131 0.349 
#> 8 D Male Admitted 138 0.331 
#> 9 E Female Admitted 94 0.239 
#> 10 E Male Admitted 53 0.277 
#> 11 F Female Admitted 24 0.0704
#> 12 F Male Admitted 22 0.0590
```

---

## Distribution by department

```r
ggplot(ucb_admit, mapping = aes(x = Gender, fill = Admit)) +
  geom_bar(position = "fill") +
  facet_grid(. ~ Dept) +
  labs(x = "Gender", y = "", fill = "Admission",
       title = "Admission by gender by department")
```

---

## Distribution by department

Why do you think Simpson's paradox occurred? In other words, why is the overall 
admissions rate much lower for females, even though the admissions rates are 
generally similar within each department?
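
One way to explore this question is to compare where each gender applied with how selective each department was. The sketch below is a starting point rather than a full analysis. It rebuilds the counts from the department-level table shown earlier so that it runs without the CSV file; the object names (`ucb_counts`, `app_share`, `dept_rate`, `summary_tbl`) are made up for illustration.

```r
# Counts copied from the department-level table shown earlier
ucb_counts <- data.frame(
  Dept   = rep(c("A", "B", "C", "D", "E", "F"), each = 4),
  Gender = rep(rep(c("Female", "Male"), each = 2), times = 6),
  Admit  = rep(c("Admitted", "Rejected"), times = 12),
  n      = c(89, 19, 512, 313,    17, 8, 353, 207,
             202, 391, 120, 205,  131, 244, 138, 279,
             94, 299, 53, 138,    24, 317, 22, 351)
)

# Number of applicants per department and gender
applicants <- xtabs(n ~ Dept + Gender, data = ucb_counts)

# Share of each gender's applications that went to each department
app_share <- prop.table(applicants, margin = 2)

# Admission rate of each department, ignoring gender
admitted  <- xtabs(n ~ Dept, data = ucb_counts, subset = Admit == "Admitted")
dept_rate <- admitted / rowSums(applicants)

summary_tbl <- data.frame(
  dept         = rownames(applicants),
  female_share = round(as.vector(app_share[, "Female"]), 3),
  male_share   = round(as.vector(app_share[, "Male"]), 3),
  dept_rate    = round(as.vector(dept_rate), 3)
)
summary_tbl
```

Reading across the rows shows, for each department, what fraction of female and male applicants applied there alongside that department's overall admission rate.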

---

## Revisiting cereal...

---

## References

1. https://www.cbsnews.com/news/study-cereal-keeps-girls-slim/

2. http://www.surveyusa.com/client/PollReport.aspx?g=783743b0-efc1-4b67-9201-58352a8f61f1

3. https://www.tylervigen.com/spurious-correlations

4. http://xkcd.com/552/