Categorical Data (2)

class: center, middle, inverse, title-slide

# Categorical Data (2)
### Yue Jiang
### Duke University

---

### Pregnancy, HIV, and AIDS

<img src="img/hiv.png" width="60%" style="display: block; margin: auto;" />
**Image Credit**: Liza Gross, used under CCA 4.0 License

---

### Pregnancy, HIV, and AIDS

Researchers are interested in the special population of `\(HIV+\)` women on 
antiretroviral therapy in sub-Saharan Africa. They would like to know whether in 
this population, a new pregnancy is related to the probability of having an 
AIDS-defining event (that is, their HIV being classified as AIDS).

To test for an association, they recruit women from a large network of health
care clinics and find the following:

|                   | AIDS  | No AIDS  |
| -------------     | -----: | -----:| 
| Pregnant       | 31     |  44  | 
| Not Pregnant       | 124     |  99  |

.question[
Calculate an odds ratio and associated 95% CI using these data for AIDS-defining
events based on pregnancy status. What might you conclude?
]
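
---

### Pregnancy, HIV, and AIDS

One way to work through the question is sketched below in Python. It assumes the usual Wald interval on the log odds ratio scale; the function name `odds_ratio_wald_ci` and the `z = 1.96` default are illustrative choices, not something prescribed by the study.

```python
import math

def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald CI for a 2x2 table [[a, b], [c, d]].

    z = 1.96 is the standard normal quantile for a 95% interval (assumed).
    """
    or_hat = (a * d) / (b * c)
    # Standard error of log(OR): square root of the summed reciprocal counts
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper

# Clinic sample: rows = pregnant / not pregnant, columns = AIDS / no AIDS
clinic_or, clinic_lo, clinic_hi = odds_ratio_wald_ci(31, 44, 124, 99)
print(f"OR = {clinic_or:.2f}, 95% CI = ({clinic_lo:.2f}, {clinic_hi:.2f})")
```

Under these assumptions the estimated odds ratio is about 0.56 and the interval lies entirely below 1, which would suggest lower odds of an AIDS-defining event among pregnant women in this sample.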

---

### Pregnancy, HIV, and AIDS

Consider the following data, which cover all HIV+ women in the area, not just
those who visited a clinic:

|                   | AIDS  | No AIDS  |
| -------------     | -----: | -----:| 
| Pregnant       | 44     |  175  | 
| Not Pregnant       | 248     |  990  |

.question[
Calculate the same odds ratio as before and associated 95% CI using these data.
What might you conclude? How might you explain this?
]

---

### Pregnancy, HIV, and AIDS

The original sample was used because the data were easier to collect (the women
were already in the clinic). However, this is problematic since not all HIV+ women 
are equally likely to visit a health clinic. We have a .vocab[sampling bias]
here:

|                   | Visited clinic  | Did not visit clinic  |
| -------------     | -----: | -----:| 
| Pregnant + AIDS      | 31     |  13  | 
| Pregnant only       | 44     |  131  | 
| AIDS only         | 124 | 124 |
| Neither          | 99 | 891 |
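
---

### Pregnancy, HIV, and AIDS

To make the sampling bias concrete, the sketch below (Python, with illustrative variable names) computes the share of each group that visited a clinic from the preceding table, then recomputes the odds ratio from the clinic column alone and from the row totals.

```python
# Counts from the preceding table: group -> (visited clinic, did not visit)
groups = {
    "Pregnant + AIDS": (31, 13),
    "Pregnant only": (44, 131),
    "AIDS only": (124, 124),
    "Neither": (99, 891),
}

# Share of each group that visited a clinic
visit_rate = {g: v / (v + nv) for g, (v, nv) in groups.items()}
for g, rate in visit_rate.items():
    print(f"{g:16s} {rate:.2f}")

def odds_ratio(preg_aids, preg_only, aids_only, neither):
    """Odds ratio of AIDS for pregnant vs. non-pregnant women."""
    return (preg_aids * neither) / (preg_only * aids_only)

# Clinic visitors only vs. everyone (visitors plus non-visitors)
clinic_or = odds_ratio(*(v for v, _ in groups.values()))
population_or = odds_ratio(*(v + nv for v, nv in groups.values()))
print(f"clinic OR = {clinic_or:.2f}, population OR = {population_or:.2f}")
```

The visit rates differ sharply across the four groups (women with neither condition rarely appear in the clinic), which is what pulls the clinic odds ratio away from the population value of roughly 1.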

---

### Everyone's heard of Simpson's paradox...

The previous example illustrates .vocab[Berkson's paradox]. In these cases,
two *marginally* independent events become *dependent* conditional on the
occurrence of at least one of the two events. That is, for two independent events,
if we only consider the cases where at least one of the events occurs, then they 
become dependent (usually negatively).

Note that

`\begin{align*}
P(A) &= P(A | A \cup B) \times P(A \cup B)\\
&\le P(A | A \cup B) \times 1\\
&= P(A | A \cup B)
\end{align*}`
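
---

### Everyone's heard of Simpson's paradox...

The bound on `\(P(A)\)` and the induced dependence can be checked exactly with a small Python sketch. The probabilities 1/5 and 3/20 are hypothetical values chosen only for illustration; any two independent events with probabilities strictly between 0 and 1 would behave the same way.

```python
from fractions import Fraction

# Hypothetical probabilities for two independent events A and B
p_a = Fraction(1, 5)
p_b = Fraction(3, 20)

# Joint probabilities under independence
p_ab = p_a * p_b                # A and B
p_a_only = p_a * (1 - p_b)      # A but not B
p_b_only = (1 - p_a) * p_b      # B but not A
p_union = p_ab + p_a_only + p_b_only

# Condition on "A or B" having occurred
p_a_given_union = (p_ab + p_a_only) / p_union
# Within the selected cases, split by whether B occurred
p_a_given_b = p_ab / (p_ab + p_b_only)
p_a_given_not_b = p_a_only / p_a_only   # no B means A must have occurred

print(p_a, p_a_given_union, p_a_given_b, p_a_given_not_b)
```

Among the selected cases, A is certain when B did not occur but keeps its original probability when B did occur, so the two events are negatively associated after selection even though they are independent overall.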

---

### Berkson's paradox

HIV+ women can go to the clinic to check on their pregnancy or because they are 
experiencing a drastic lowering in health status. In our original case, we've
*excluded* every woman who was neither pregnant nor particularly worried about 
her HIV. We've conditioned on an AIDS-defining event OR pregnancy when sampling
these women.

A non-pregnant HIV+ woman who went to the clinic is more likely to have AIDS
than HIV+ women in general, because she's specifically going to the clinic for
a *non*-pregnancy-related reason (maybe AIDS). This is responsible for the 
spurious conclusion that pregnant HIV+ women have a lower AIDS risk than
non-pregnant HIV+ women.

---

### Two more examples

(see visualizations on Zoom)

---

### Combining contingency tables

Previously we talked about how to analyze data from r × c tables
to quantify a potential association between two factors. We will
continue with this concept but now concentrate on the relationship
between two factors in the presence of a third factor.

- Maybe we have a 2x2 table at each site for a multi-site study
- Perhaps we want to combine results of several published studies in a
meta-analysis
- There might be a third factor that affects our estimates of the association
between two other categorical variables

---

### Kidney stone treatment

A 1986 *British Medical Journal* article reported the results of a study
comparing open surgical treatment to percutaneous
nephrolithotomy (PN) for removal of kidney stones. They first
examined success rates of the procedure stratified by the size of
the kidney stone.

| Small stones     | Successful  | Not Successful  |
| -------------     | -----: | -----:| 
| Open              | 81     |   6  | 
| PN                | 234     |  36  |

| Large stones      | Successful  | Not Successful  |
| -------------     | -----: | -----:| 
| Open              | 192     |   71  | 
| PN                | 55     |  25  |

---

### Kidney stone treatment

| Small stones     | Successful  | Not Successful  |
| -------------     | -----: | -----:| 
| Open              | 81     |   6  | 
| PN                | 234     |  36  |

| Large stones      | Successful  | Not Successful  |
| -------------     | -----: | -----:| 
| Open              | 192     |   71  | 
| PN                | 55     |  25  |

For small stones, the odds ratio for success was 2.08 in favor of open 
procedures; for large stones, the odds ratio for success was 1.23 in favor of
open procedures.

.question[
What might we conclude?
]

---

### Kidney stone treatment

| All stones     | Successful  | Not Successful  |
| -------------     | -----: | -----:| 
| Open              | 273     | 77  | 
| PN                | 289     |  61  |

When combining all kidney stones, the odds ratio was actually 1.34 in favor of
**PN**, not open surgery.

.question[
What happened?
]

---

### Combining contingency tables

Sometimes, the presence of a third factor can affect the
relationship between the two factors of interest.

.vocab[Simpson’s paradox] occurs when the direction of an association between two
variables is reversed after stratification upon a third variable.
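
---

### Kidney stone treatment

As a quick numerical check of the reversal in the kidney stone data, the Python sketch below recomputes the odds ratio within each stratum and for the pooled table. The helper `odds_ratio` and the tuple layout are illustrative choices.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

# Rows: open surgery, PN; columns: successful, not successful
small = (81, 6, 234, 36)
large = (192, 71, 55, 25)
# Pool the strata by adding the tables cell by cell
pooled = tuple(sm + lg for sm, lg in zip(small, large))

or_small = odds_ratio(*small)    # odds of success, open vs. PN
or_large = odds_ratio(*large)
or_pooled = odds_ratio(*pooled)

results = [("small", or_small), ("large", or_large), ("pooled", or_pooled)]
for name, value in results:
    favoured = "open" if value > 1 else "PN"
    print(f"{name:6s} OR (open vs. PN) = {value:.2f}, favours {favoured}")
```

Both stratum-specific odds ratios are above 1 (favouring open surgery), while the pooled odds ratio falls below 1 (favouring PN), which is the reversal that defines Simpson's paradox.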

---

### Kidney stone treatment

| Small stones     | Successful  | Not Successful  |
| -------------     | -----: | -----:| 
| Open              | 81     |   6  | 
| PN                | 234     |  36  |

| Large stones      | Successful  | Not Successful  |
| -------------     | -----: | -----:| 
| Open              | 192     |   71  | 
| PN                | 55     |  25  |

- Group sizes were very different
- Doctors tended to give the harder cases (large stones) the better treatment
and small stones the inferior treatment
- Success rate was more strongly tied to the size of the stone than to
the treatment type
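
---

### Kidney stone treatment

The points above can be read directly off the counts. The Python sketch below (with an illustrative helper, `total`) computes how the large-stone cases were split between treatments and how success varied with stone size alone.

```python
# (successful, not successful) by stone size and treatment
counts = {
    ("small", "open"): (81, 6),
    ("small", "PN"): (234, 36),
    ("large", "open"): (192, 71),
    ("large", "PN"): (55, 25),
}

def total(size=None, treatment=None):
    """Return (successes, patients) summed over cells matching the filters."""
    succ = n = 0
    for (s, t), (ok, fail) in counts.items():
        if size in (None, s) and treatment in (None, t):
            succ += ok
            n += ok + fail
    return succ, n

# Share of each treatment's patients who had large stones
share_large = {t: total("large", t)[1] / total(None, t)[1] for t in ("open", "PN")}
# Success rate by stone size, ignoring treatment
success_by_size = {s: total(s)[0] / total(s)[1] for s in ("small", "large")}

print(share_large)
print(success_by_size)
```

About three quarters of the open-surgery patients had large stones, compared with under a quarter of the PN patients, and large stones had a noticeably lower success rate regardless of treatment.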

---

### Combining multiple tables

**Should** we combine tables (e.g., could there be any issues re: Simpson's 
paradox)? If so, how do we combine them? Once combined, how might we make 
statistical inferences?

.question[
Intuitively, how might you go about answering the above questions?
]

---

### Should we combine tables?

This is actually a pretty hard question to answer! Suppose we have two 
contingency tables.

- Calculate estimates separately for each table. If the
association is the same in the two tables, then maybe it is fine to
combine the two tables.
- If the association is different, then it is often (though not
always) not a good idea to combine tables.

We *want* to combine tables if any differences are due purely to chance, but
we don't want to combine tables if there really is a different association
across the tables scientifically.

---

### The Mantel-Haenszel approach

The .vocab[Mantel-Haenszel] approach potentially addresses the confounding
effect of explanatory variables that comprise the stratification and can provide 
increased power for detecting association.

- First, determine whether the strength of association is uniform across tables
(if not, stop and report separate odds ratios)
- If the strength of association is similar across tables, then calculate a
combined OR and test whether the overall association is significant

What if both ORs are 1.2? What if one is 0.9 and another is 1.1 (are these
"different")? We might use a test of homogeneity of odds ratios (i.e., of the null
hypothesis that all are the same), such as the Breslow-Day test.
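
---

### The Mantel-Haenszel approach

To give a feel for what a homogeneity test does, here is a minimal Python sketch. Note that it implements Woolf's test, a simpler alternative based on inverse-variance weighted log odds ratios, and **not** the Breslow-Day test itself. The function name and table layout are illustrative, and the closed-form p-value used here only applies when there are two tables (one degree of freedom).

```python
import math

def woolf_homogeneity(tables):
    """Woolf's chi-square statistic for equal odds ratios across 2x2 tables.

    Each table is (a, b, c, d). This is NOT the Breslow-Day test; it is a
    simpler statistic built from the stratum log odds ratios.
    """
    log_ors, weights = [], []
    for a, b, c, d in tables:
        log_ors.append(math.log((a * d) / (b * c)))
        # Inverse-variance weight; Var(log OR) is approx. 1/a + 1/b + 1/c + 1/d
        weights.append(1 / (1 / a + 1 / b + 1 / c + 1 / d))
    pooled = sum(w * x for w, x in zip(weights, log_ors)) / sum(weights)
    stat = sum(w * (x - pooled) ** 2 for w, x in zip(weights, log_ors))
    return stat, len(tables) - 1  # statistic and degrees of freedom

# Kidney stone strata; rows: open, PN; columns: successful, not successful
stat, df = woolf_homogeneity([(81, 6, 234, 36), (192, 71, 55, 25)])
# With df = 1, the chi-square tail probability has a closed form via erfc
p_value = math.erfc(math.sqrt(stat / 2))
print(f"X^2 = {stat:.2f} on {df} df, p = {p_value:.2f}")
```

For the kidney stone strata the statistic is small relative to the 5% critical value of 3.84 for one degree of freedom, so this check gives no strong evidence that the two odds ratios differ.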

---

### The Mantel-Haenszel approach

When combining odds ratios, we need to be careful of Simpson's paradox. Under 
the Mantel-Haenszel approach, the summary odds ratio estimate for `\(g\)` different
2x2 tables is:

`\begin{align*}
\widehat{OR} = \frac{\sum_{i = 1}^g (a_id_i)/n_i}{\sum_{i = 1}^g (b_ic_i)/n_i},
\end{align*}`

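where `\(a_i\)`, `\(b_i\)`, `\(c_i\)`, and `\(d_i\)` are the cell counts of table `\(i\)` (read row by row, so `\(a_id_i\)` is the product of the main diagonal) and `\(n_i = a_i + b_i + c_i + d_i\)` is the total number of subjects in table `\(i\)`.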
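
---

### The Mantel-Haenszel approach

As an illustration, the Python sketch below applies the summary estimate to the two kidney stone strata. It assumes each table is read row by row as `(a, b, c, d)`, so that `a * d` is the main-diagonal product; the function name `mantel_haenszel_or` is an illustrative choice.

```python
def mantel_haenszel_or(tables):
    """Mantel-Haenszel summary odds ratio for a list of 2x2 tables.

    Each table is (a, b, c, d) read row by row (assumed layout), so a*d is
    the main-diagonal product and b*c the off-diagonal product.
    """
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator

# Kidney stone strata; rows: open, PN; columns: successful, not successful
strata = [(81, 6, 234, 36), (192, 71, 55, 25)]
or_mh = mantel_haenszel_or(strata)
print(f"Mantel-Haenszel OR (open vs. PN) = {or_mh:.2f}")
```

The summary estimate lands between the two stratum-specific odds ratios and favours open surgery, unlike the odds ratio from the naively pooled table.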

---

### Case 2 considerations

1. Keep in mind that performers are instructed to select certain pieces from a pre-defined list for each round. Have you accounted for this in your analysis?
2. Once performers choose pieces, they may order them in any way they choose. Are you considering the order in which pieces get selected for their programs?
3. Performers who select certain pieces in earlier rounds may not choose them again in later rounds. How does this factor into your analysis? Are there any pieces that show up both “early” and “late”? How might you examine placement?