# Scientific studies, confounding, and Simpson’s paradox 😖

---

layout: true
 
<div class="my-footer">

Dr. Mine Çetinkaya-Rundel -
<a href="http://www2.stat.duke.edu/courses/Fall18/sta112.01/schedule" target="_blank">stat.duke.edu/courses/Fall18/sta112.01
</a>

</div>

---

# Announcements

---

## Announcements

- Reading 03 due Thursday
- MT 01 assigned Thursday after class, due in class on Tuesday

---

# Scientific studies

---

## Scientific studies

- Observational
    - Collect data in a way that does not interfere with how the data arise ("observe")
    - Only establish an association
- Experimental
    - Randomly assign subjects to treatments
    - Establish causal connections

.question[
Design a study comparing average energy levels of people who do and do not exercise -- both as an observational study and as an experiment.
]
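
A minimal sketch of the random-assignment step that distinguishes the experimental design, using base R only. The 20 subjects, the group labels, and the seed are hypothetical; in the observational version we would simply record who already exercises instead of assigning anyone.

```r
set.seed(112)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical pool of 20 volunteers
subjects <- data.frame(id = 1:20)

# Shuffle a balanced vector of labels: 10 subjects per treatment
subjects$group <- sample(rep(c("exercise", "no exercise"), each = 10))

# Check the assignment is balanced
table(subjects$group)
```

Because chance alone decides who exercises, any other trait of the subjects should be balanced across the two groups on average.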

---

## Study: Breakfast cereal keeps girls slim

.midi[
Girls who ate breakfast of any type had a lower average body mass index, a common obesity gauge, than those who said they didn't. The index was even lower for girls who said they ate cereal for breakfast, according to findings of the study conducted by the Maryland Medical Research Institute with funding from the National Institutes of Health (NIH) and cereal-maker General Mills.
 
[...]
 
The results were gleaned from a larger NIH survey of 2,379 girls in California, Ohio, and Maryland who were tracked between the ages of 9 and 19.
 
[...]
 
As part of the survey, the girls were asked once a year what they had eaten during the previous three days.
 
[...]
]

.footnote[
Source: [Study: Cereal Keeps Girls Slim](https://www.cbsnews.com/news/study-cereal-keeps-girls-slim/), retrieved Sep 13, 2018.
]

---

### 3 possible explanations

- Eating breakfast causes girls to be slimmer

--
- Being slim causes girls to eat breakfast

--
- A third variable is responsible for both. This is a **confounding** variable: an extraneous variable that affects both the explanatory and the response variables and makes it seem like there is a relationship between them
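
The third explanation can be illustrated with a small simulation. This is a sketch on made-up data, not the cereal study: `z` plays the role of a confounder that drives both `x` and `y`, while `x` has no effect on `y` at all. The variable names, sample size, and seed are all assumptions.

```r
set.seed(2018)  # arbitrary seed, only so the sketch is reproducible
n <- 10000

# z is the confounder: it affects both x and y; x does not affect y
z <- rnorm(n)
x <- z + rnorm(n)
y <- z + rnorm(n)

# Marginal association between x and y: clearly positive
cor_xy <- cor(x, y)

# Association between x and y after accounting for z: close to zero
coef_x <- coef(lm(y ~ x + z))[["x"]]

c(cor_xy = cor_xy, coef_x = coef_x)
```

An analysis that never measured `z` would see only the positive correlation.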

---

## Correlation != causation

---

## Studies and conclusions

![](img/random_sample_assign_grid.png)

---

# Conditional probability

---

## Conditional probability

**Notation**: `$P(A | B)$`: Probability of event A given event B

- What is the probability that it will be unseasonably warm tomorrow?
- What is the probability that it will be unseasonably warm tomorrow, given that it was unseasonably warm today?

---

.question[
A January 2018 SurveyUSA poll asked 500 randomly selected Californians whether they are familiar with the DREAM Act. The distribution of the responses by age category is shown below.

What percent of **all respondents** are very familiar with the DREAM Act?
]
 
.pull-left[
| | 18 - 49 | 50+ | Total |
|------------|---------|-----|-------|
| Very | 90 | 32 | 122 |
| Somewhat | 125 | 86 | 211 |
| Not very | 56 | 33 | 89 |
| Not at all | 36 | 24 | 60 |
| Not sure | 9 | 9 | 18 |
| Total | 316 | 184 | 500 |
]
--
.pull-right[
`$P(Very) = \frac{122}{500} = 0.244$`
]

.footnote[
Source: [SurveyUSA News Poll 23754](http://www.surveyusa.com/client/PollReport.aspx?g=783743b0-efc1-4b67-9201-58352a8f61f1)
]

---

.question[
A January 2018 SurveyUSA poll asked 500 randomly selected Californians whether they are familiar with the DREAM Act. The distribution of the responses by age category is shown below.

What percent of **respondents who are 18 - 49 years old** are very familiar with the DREAM Act?
]
 
.pull-left[
| | 18 - 49 | 50+ | Total |
|------------|---------|-----|-------|
| Very | 90 | 32 | 122 |
| Somewhat | 125 | 86 | 211 |
| Not very | 56 | 33 | 89 |
| Not at all | 36 | 24 | 60 |
| Not sure | 9 | 9 | 18 |
| Total | 316 | 184 | 500 |
]
--
.pull-right[
`$P(Very~|~18-49) = \frac{90}{316} = 0.285$`
]

---

.question[
A January 2018 SurveyUSA poll asked 500 randomly selected Californians whether they are familiar with the DREAM Act. The distribution of the responses by age category is shown below.

What percent of **respondents who are 50+ years old** are very familiar with the DREAM Act?
]
 
.pull-left[
| | 18 - 49 | 50+ | Total |
|------------|---------|-----|-------|
| Very | 90 | 32 | 122 |
| Somewhat | 125 | 86 | 211 |
| Not very | 56 | 33 | 89 |
| Not at all | 36 | 24 | 60 |
| Not sure | 9 | 9 | 18 |
| Total | 316 | 184 | 500 |
]
--
.pull-right[
`$P(Very~|~50+) = \frac{32}{184} = 0.174$`
]

---

.question[
Given that

- `$P(Very) = \frac{122}{500} = 0.244$`
- `$P(Very~|~18-49) = \frac{90}{316} = 0.285$`
- `$P(Very~|~50+) = \frac{32}{184} = 0.174$`

does there appear to be a relationship between age and familiarity with the DREAM Act? Explain your reasoning.
]
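
The three probabilities can be reproduced from the SurveyUSA table. A minimal base R sketch follows; the object names (`dream`, `p_very`, `p_very_given_age`) are made up for illustration, and the counts are copied from the table on the preceding slides.

```r
# Counts from the SurveyUSA table (rows: familiarity, columns: age group).
# matrix() fills column by column, so the first five values are 18 - 49.
dream <- matrix(
  c(90, 125, 56, 36, 9,
    32,  86, 33, 24, 9),
  ncol = 2,
  dimnames = list(
    familiarity = c("Very", "Somewhat", "Not very", "Not at all", "Not sure"),
    age = c("18 - 49", "50+")
  )
)

# Marginal probability: "Very" row total over the grand total
p_very <- sum(dream["Very", ]) / sum(dream)

# Conditional probabilities: "Very" count over each age group's column total
p_very_given_age <- dream["Very", ] / colSums(dream)

round(c(overall = p_very, p_very_given_age), 3)
```

The only change between the marginal and the conditional calculation is the denominator: the grand total versus the total of the group being conditioned on.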

---

## Independence

.question[
Inspired by the previous example and how we used the conditional probabilities to draw conclusions, come up with a definition of independent events. If easier, you can keep the context limited to the example (independence/dependence of familiarity with the DREAM Act and age), but try to push yourself to make a more general statement.
]
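
One common formalization, offered as a sketch and not necessarily the wording this exercise is aiming for: events A and B are independent when `$P(A | B) = P(A)$`, or equivalently when `$P(A \text{ and } B) = P(A) \times P(B)$`. The check below applies the product form to the poll counts; the variable names are made up for illustration.

```r
# Counts from the SurveyUSA table
n_total      <- 500
n_very       <- 122   # very familiar with the DREAM Act
n_young      <- 316   # aged 18 - 49
n_very_young <- 90    # aged 18 - 49 AND very familiar

p_very           <- n_very / n_total
p_young          <- n_young / n_total
p_very_and_young <- n_very_young / n_total

# If the two events were independent, these two numbers would agree
product_if_independent <- p_very * p_young
c(observed = p_very_and_young, if_independent = product_if_independent)

# Same comparison on the count scale: expected cell count under independence
expected_very_young <- n_total * product_if_independent
```

This compares sample proportions only; it says nothing yet about whether a gap of this size could arise from sampling variability.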

---

# Simpson's paradox

---

## Relationships between variables

- Bivariate relationship: Fitness `$\rightarrow$` Heart health
- Multivariate relationship: Calories + Age + Fitness `$\rightarrow$` Heart health

---

## Simpson's paradox

- Not considering an important variable when studying a relationship can result in **Simpson's paradox**: omitting an explanatory variable can change, or even reverse, the measured association between another explanatory variable and the response variable.
- In other words, the inclusion of a third variable in the analysis can change the apparent relationship between the other two variables.
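
A toy illustration of the reversal, in base R. The counts are invented purely for this sketch (two hypothetical treatment arms, P and Q, observed in two hypothetical strata) and do not come from any study.

```r
# Hypothetical counts, invented for illustration only
toy <- data.frame(
  stratum = c("1", "1", "2", "2"),
  arm     = c("P", "Q", "P", "Q"),
  n       = c(100, 20, 20, 100),
  success = c(80, 18, 4, 30)
)

# Success rate within each stratum: Q is ahead in both
toy$rate <- toy$success / toy$n

# Ignore the stratum: pool the counts per arm, then compute the rate
pooled <- aggregate(cbind(n, success) ~ arm, data = toy, FUN = sum)
pooled$rate <- pooled$success / pooled$n

toy
pooled
```

Within each stratum arm Q has the higher success rate, yet the pooled rate favors arm P, because most of P's cases fall in the stratum where success is common for everyone.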

---

## Berkeley admission data

- Study carried out by the Graduate Division of the University of California, Berkeley in the early 1970s to evaluate whether there was a sex bias in graduate admissions.
- The data come from six departments. For confidentiality we'll call them A-F. 
- We have information on whether the applicant was male or female and whether they were admitted or rejected. 
- First, we will evaluate whether the percentage of males admitted is indeed higher than that of females, overall. Next, we will calculate the same percentage for each department.

---

## Data

The data are in the `ucbadmit` data frame in the `dsbox` package:

```r
glimpse(ucbadmit)
```

```
## Observations: 4,526
## Variables: 3
## $ admit <fct> Admitted, Admitted, Admitted, Admitted, Admitted, Admit...
## $ gender <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, M...
## $ dept <ord> A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A...
```

---

## Overall gender distribution

.question[
What can you say about the overall gender distribution? Hint: Calculate the following probabilities: `$P(Admit | Male)$` and `$P(Admit | Female)$`.
]

```r
# Tabulate applicants by gender and admission decision
ucbadmit %>%
  count(gender, admit)
```

```
## # A tibble: 4 x 3
## gender admit n
## <fct> <fct> <int>
## 1 Female Rejected 1278
## 2 Female Admitted 557
## 3 Male Rejected 1493
## 4 Male Admitted 1198
```

---

## Overall gender distribution

```
## # A tibble: 4 x 4
## # Groups: gender [2]
## gender admit n prop_admit
## <fct> <fct> <int> <dbl>
## 1 Female Rejected 1278 0.696
## 2 Female Admitted 557 0.304
## 3 Male Rejected 1493 0.555
## 4 Male Admitted 1198 0.445
```

---

## Overall gender distribution

```r
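# Count applicants in each gender/decision cell, then divide by the
# gender total so prop_admit estimates P(decision | gender)
ucbadmit %>%
  count(gender, admit) %>%
  group_by(gender) %>%
  mutate(prop_admit = n / sum(n))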
```
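
The per-department step can be sketched with base R's built-in `UCBAdmissions` table, used here as a stand-in for `ucbadmit` on the assumption that it holds the same counts (its overall totals match the ones on these slides). This is one possible way to do the calculation, not necessarily the code used later in the deck.

```r
# UCBAdmissions ships with base R: a 2 x 2 x 6 table with
# dimensions Admit, Gender, Dept
overall_counts <- margin.table(UCBAdmissions, c(1, 2))

# P(Admitted | gender), ignoring department
overall <- prop.table(overall_counts, margin = 2)["Admitted", ]

# P(Admitted | gender, department): proportions within each
# gender-by-department cell
by_dept <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]

round(overall, 3)
round(by_dept, 3)
```

In this table the pooled admission rate is higher for males, while in department A the rate is higher for females, so conditioning on department changes the picture.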

---

## Overall gender distribution

![](u1_d08-confounding-simpsons-paradox_files/figure-html/unnamed-chunk-7-1.png)

---