Missing Data

class: center, middle, inverse, title-slide

# Missing Data
### Yue Jiang
### STA 210 / Duke University / Spring 2023

---

### Missing data

Missing data occur all the time in data analyses and must be dealt with 
carefully. For instance:

- Respondents to a survey may be unwilling to answer questions in ways that are
socially undesirable (e.g., reporting high incomes or alcohol use habits).
- Participants in a trial may be lost to follow-up and have censored data.
- We may intentionally incorporate "missing" data into complex longitudinal
survey designs in cases where it's overly burdensome to collect "complete" data.

.question[
In the past, how have you dealt with missing data?
]

---

### Visualizing missing data

```r
library(mice)
data("nhanes2")
head(nhanes2)
```

```
##     age  bmi  hyp chl
## 1 20-39   NA <NA>  NA
## 2 40-59 22.7   no 187
## 3 20-39   NA   no 187
## 4 60-99   NA <NA>  NA
## 5 20-39 20.4   no 113
## 6 60-99   NA <NA> 184
```

---

### Visualizing missing data

```r
library(naniar)
vis_miss(nhanes2)
```

---

### Visualizing missing data

```r
library(naniar)  # gg_miss_upset() comes from naniar; it uses UpSetR for plotting
library(UpSetR)
gg_miss_upset(nhanes2)
```

---

### Visualizing missing data

```r
library(naniar)
gg_miss_fct(x = nhanes2, fct = hyp)
```

---

### Some terminology

Suppose `$\mathbf{Z} = (Z_1, \cdots, Z_k)$` is the full data, which may be 
completely or partially missing for some observations, and suppose 
`$\mathbf{R} = (R_1, \cdots, R_k)$` is a vector of indicators for whether each
`$Z_i$` is observed (1 if observed, 0 if missing).

- .vocab[Missing Completely at Random (MCAR)]: `$P(\mathbf{R} = \mathbf{r} | \mathbf{Z})$` 
does not depend on `$\mathbf{Z}$` (that is, `$\mathbf{R}$` and `$\mathbf{Z}$` are
independent).
- .vocab[Missing at Random (MAR)]: `$P(\mathbf{R} = \mathbf{r} | \mathbf{Z})$` 
only depends on elements of `$\mathbf{Z}$` that are observed for `$\mathbf{R} = \mathbf{r}$`.
- .vocab[Missing Not at Random (MNAR)]: `$P(\mathbf{R} = \mathbf{r} | \mathbf{Z})$`
depends on elements of `$\mathbf{Z}$` that are *not* observed for `$\mathbf{R} = \mathbf{r}$`.

---

### Examples of missingness mechanisms

Suppose you are designing a written survey examining alcohol consumption among 
students and your goal is to quantify alcohol consumption and examine potential
predictors.

.question[
What types of analyses might you choose? What types of data might you collect?
What would be examples of each type of missingness mechanism, MCAR, MAR, and
MNAR?
]

---

### MCAR example

Under MCAR, *no systematic differences exist* between those with missing data
and those with complete data - those with missing data are representative of the
entire population.

Suppose you are designing a written survey examining alcohol consumption among 
students, and suppose that some of these survey responses happened to get 
destroyed because the building they were housed in burned down. In this case,
whether a response is missing is completely unrelated to the data at hand - the
missing responses are simply those that happened to be in the (former) building.

MCAR is a very strong assumption and usually unrealistic unless the study was
deliberately designed to include missing data and account for it appropriately.

---

### MAR example

Under MAR, missingness is related to the observed data, but not to the 
unobserved data.

Taking the same survey example, suppose statistical science students recognize
the importance of having complete data and are more likely to complete the
survey than students in other departments. Also suppose whether someone
completes a survey depends solely on major, and that we fully observe
everyone's major. In this case, we would have MAR data.

---

### MNAR example

Under MNAR, missingness is related to *unobserved* data.

For instance, suppose instead that students who have higher alcohol consumption
are less likely to respond to the survey. In this case, whether a survey is
missing depends directly on the value of that unobserved response itself - the
data are missing not at random.
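
---

### Simulating the three mechanisms

The three examples can be mimicked in a small simulation. This is only a sketch under made-up assumptions: the variable names (`stat_major`, `drinks`), the sample size, and every probability below are hypothetical and do not come from a real survey.

```r
set.seed(210)
n <- 10000

# Hypothetical survey: major is always observed, drinks may be missing
stat_major <- rbinom(n, 1, 0.3)
drinks <- rpois(n, lambda = 4 + 2 * (1 - stat_major))

# Probability that `drinks` is MISSING under each mechanism (assumed values)
p_mcar <- rep(0.3, n)                        # unrelated to any data
p_mar  <- ifelse(stat_major == 1, 0.1, 0.4)  # depends only on observed major
p_mnar <- plogis(-2 + 0.3 * drinks)          # depends on drinks itself

miss_mcar <- rbinom(n, 1, p_mcar) == 1
miss_mar  <- rbinom(n, 1, p_mar) == 1
miss_mnar <- rbinom(n, 1, p_mnar) == 1

# Mean of the observed responses under each mechanism vs. the true mean
obs_means <- c(truth = mean(drinks),
               mcar  = mean(drinks[!miss_mcar]),
               mar   = mean(drinks[!miss_mar]),
               mnar  = mean(drinks[!miss_mnar]))
obs_means
```

In a setup like this, the complete-case mean under MCAR should land close to the truth, while the MAR and MNAR means should come out too low. Under MAR the bias can in principle be addressed using the observed major; under MNAR nothing in the observed data can fix it.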

---

### Are my data MAR or MNAR?

.question[
Do you think it's possible to test whether a particular missingness mechanism
holds in your data at hand? If so, how would you do so?
]

Unfortunately, it is **not** possible to test whether certain missingness
mechanisms hold given the observed data. MNAR vs. MAR cannot be tested
because the distinction depends on the very values that were never observed.

If the assumption of MAR is made, there are methods to test vs. MCAR (e.g.,
examining associations in the observed data), but a decision should be made
with the guidance of subject-matter expertise. Regardless, MAR is itself an
*unverifiable* assumption and must be justified in the context of each problem.
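
---

### Are my data MAR or MNAR?

As a rough illustration of "examining associations in the observed data", we can check whether a missingness indicator is associated with a fully observed variable. The data below are simulated, with variable names (`age_group`, `bmi`) and missingness probabilities chosen purely for illustration.

```r
set.seed(1)
n <- 500
age_group <- factor(sample(c("20-39", "40-59", "60-99"), n, replace = TRUE))
bmi <- rnorm(n, mean = 26, sd = 4)

# Assumed mechanism: missingness in bmi depends only on observed age group
p_miss <- c("20-39" = 0.1, "40-59" = 0.3, "60-99" = 0.5)[as.character(age_group)]
bmi[rbinom(n, 1, p_miss) == 1] <- NA

# Cross-tabulate the missingness indicator against the observed variable
miss_by_age <- table(bmi_missing = is.na(bmi), age_group)
miss_by_age

# Chi-squared test of association between missingness and age group
assoc_test <- chisq.test(miss_by_age)
assoc_test$p.value
```

A small p-value is evidence against MCAR. A large p-value does not establish MCAR, and no check of this kind can separate MAR from MNAR.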

---

### Complete case analysis

Have you ever done something along the lines of

```r
mean(data$var, na.rm = TRUE)
```

or fit a regression model and seen a message like

```
(_ observations deleted due to missingness)
```

In each of these cases, you are performing a .vocab[complete case analysis]:
the analysis uses only those observations with no missing values in the
variables of interest (and often treats the result "as if" it came from the
full unobserved dataset).

---

### Complete case analysis

.question[
Let's suppose we have ten variables of interest for a regression model, and
that each of these variables has only 5% missing data (suppose that
missingness is independent between the variables). What is the probability of
observing a complete case? What potential consequences does this have on our
analysis?
]

Even under MCAR, we are using many fewer observations, which would result in
lower power. If the data are MAR or MNAR, then we will generally get biased
estimates for parameters of interest.
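
---

### Complete case analysis

The probability in the question follows directly from the independence assumption. A quick calculation (the dataset size of 1000 is just an illustrative choice):

```r
p_missing <- 0.05   # missingness probability for each variable
k <- 10             # number of variables

# Independence across variables lets us multiply the "observed" probabilities
p_complete <- (1 - p_missing)^k
p_complete

# Expected number of complete cases in a hypothetical dataset of 1000 rows
1000 * p_complete
```

So only about 60% of observations are complete cases, even though each variable on its own is 95% observed.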

---

### Mean/median/mode imputation

The idea behind .vocab[imputation] is to "fill in" missing values in the data
in a reasonable way, and then carry out the analysis.

Often, for missing continuous variables, researchers simply plug in
the mean or median of the observed values. For categorical
variables, researchers may plug in the most common category.

.question[
What are the potential consequences of this approach under various missing data
mechanisms? How might this affect inference or uncertainty quantification?
]
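
---

### Mean/median/mode imputation

One consequence can be seen in a small simulation. This sketch uses simulated data with values missing completely at random; the model and the 40% missingness rate are arbitrary choices for illustration.

```r
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)

# Delete 40% of y completely at random (MCAR)
y_obs <- y
y_obs[sample(n, size = 0.4 * n)] <- NA

# Unconditional mean imputation: every missing y gets the observed mean
y_imp <- ifelse(is.na(y_obs), mean(y_obs, na.rm = TRUE), y_obs)

# Spread of y before deletion, among observed values, and after imputation
c(sd_full = sd(y),
  sd_observed = sd(y_obs, na.rm = TRUE),
  sd_mean_imputed = sd(y_imp))

# Association between x and y before deletion and after imputation
c(cor_full = cor(x, y), cor_mean_imputed = cor(x, y_imp))
```

Even in this favorable MCAR setting, the imputed variable is less variable than the observed values, and its correlation with `x` is pulled toward zero.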

---

### Imputation based on regression models

The previous approach is often termed "unconditional" mean/median/mode 
imputation, because it doesn't take into consideration the other covariates in
the dataset (which may be especially undesirable for MAR mechanisms, for 
example).

We might try instead to create a model for the missing values based on the
observed data and use predictions from these models as imputed values (for
instance, a linear model for missing continuous predictors). We might also try
.vocab[hot deck] imputation, which fills in missing data based on observed
values from other observations which "match" in some sense.

.question[
What are the potential consequences of these approaches under various missing 
data mechanisms? How might this affect inference or uncertainty quantification?
]

---

### Imputation based on regression models

Imputation based on regression models can result in consistent estimators of
certain model parameters under MAR if the model used to impute the data is
correctly specified. However, correctly specifying this relationship is often
an impossible task in practical settings (and is also unverifiable).

These methods also suffer from the same artificially decreased variability as
mean imputation - in the case of a continuous variable, all imputed values
lie exactly on the fitted regression line, which may inflate type 1 error rates.
We know that there should be some variability involved in the imputation 
process, but in these methods we in fact get *decreased* variance of regression
estimators!
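
---

### Imputation based on regression models

The shrinking standard errors can be reproduced in a few lines. This is a sketch on simulated data; the true model and the 40% MCAR missingness rate are assumptions made only for the demonstration.

```r
set.seed(42)
n <- 500
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)
dat <- data.frame(x = x, y = y)
dat$y[sample(n, size = 0.4 * n)] <- NA   # delete 40% of y at random
miss <- is.na(dat$y)

# Complete case fit (lm drops rows with missing y)
fit_cc <- lm(y ~ x, data = dat)

# Deterministic regression imputation: plug in the fitted values
dat_imp <- dat
dat_imp$y[miss] <- predict(fit_cc, newdata = dat[miss, ])
fit_imp <- lm(y ~ x, data = dat_imp)

# Imputed points sit on the line, so their residuals are (numerically) zero
max(abs(residuals(fit_imp)[miss]))

# Standard error of the slope: complete cases vs. imputed data
se_cc  <- coef(summary(fit_cc))["x", "Std. Error"]
se_imp <- coef(summary(fit_imp))["x", "Std. Error"]
c(complete_case = se_cc, regression_imputed = se_imp)
```

The imputed points add no new information about the slope, yet the reported standard error goes down because the model treats them as real observations.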

---

### The "ideal" imputation procedure

.question[
So far, we've discussed a few ad hoc imputation procedures and explored some
statistical drawbacks of such approaches. What might an "ideal" imputation 
procedure have/do?
]

As a few general considerations, such methods should...

- provide consistent estimation of parameters of interest,
- take into consideration the extra variability due to the imputation procedure, and
- allow for principled quantification of variability of estimated parameters.

---

### Brief disclaimer

A brief disclaimer before continuing is that we will be discussing some fairly
advanced models - there is no need to fully understand the procedure we'll be
discussing (just focus on the intuition).

You'll learn more about these types of methods as you get more advanced in your
statistical career!

---

### Multiple imputation

.vocab[Multiple imputation] techniques are a broad class of methods intended to
satisfy the three considerations that characterize a good missing data
analysis approach. There are three main steps in the analysis:

1. For each observation with missing data, values are imputed `$M$` times to 
create `$M$` datasets. These datasets will generally differ from each other
because the imputed values are random draws.
2. Carry out the statistical analysis of interest on each of the imputed datasets.
3. Combine/pool the results in a principled way that properly accounts for the 
imputation process.

Note that the model used for imputation should correspond to the models 
eventually used for analysis. For instance, if you include interaction terms or
variable transformations in your final intended analysis, these should also 
be included in each imputation model.
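
---

### Multiple imputation

One common choice for the pooling step is Rubin's rules, which combine the average within-imputation variance with the between-imputation variance. The sketch below applies them by hand; the estimates and standard errors are made-up numbers, and `pool_rubin()` is a hypothetical helper written for this slide, not a function from any package.

```r
# Made-up estimates and standard errors from M = 5 imputed datasets
est <- c(5.1, 4.8, 5.4, 5.0, 4.7)
se  <- c(2.0, 2.1, 1.9, 2.2, 2.0)

pool_rubin <- function(est, se) {
  M <- length(est)
  q_bar <- mean(est)                 # pooled point estimate
  u_bar <- mean(se^2)                # average within-imputation variance
  b     <- var(est)                  # between-imputation variance
  total <- u_bar + (1 + 1 / M) * b   # total variance of the pooled estimate
  c(estimate = q_bar, std.error = sqrt(total))
}

pooled <- pool_rubin(est, se)
pooled
```

The pooled standard error is larger than what we would get by averaging the within-imputation variances alone; the extra piece reflects uncertainty due to the imputation itself.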

---

### MICE

.vocab[Multiple Imputation via Chained Equations (MICE)] is an algorithm that
performs multiple imputation assuming MAR is satisfied:

1. First, fill in *all* missing values by sampling randomly from
observed data.
2. Create a predictive model for the first variable with missing values `$Z_1$` 
based on other variables in the dataset `$(Z_2, \cdots, Z_k)$` for those with 
*observed data* (that is, those with `$R_1 = 1$`). For those with `$R_1 = 0$`,
simulate draws from the posterior predictive distribution of `$Z_1$`.
3. Create a predictive model for the next variable with missing values `$Z_2$`
based on other variables in the dataset `$(Z_1, Z_3, \cdots, Z_k)$` for those with
`$R_2 = 1$`. Importantly, for variable `$Z_1$`, we will use the imputed values
wherever it was missing. For those with `$R_2 = 0$`, simulate draws from the
posterior predictive distribution of `$Z_2$`.
4. Repeat for all other variables.
5. Once all variables have been addressed, repeat steps 2-4 until "convergence"
of the imputed variables. This will correspond to a single imputed dataset.
6. Repeat steps 1-5 `$M$` times.
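
---

### MICE

To build intuition, the steps above can be sketched by hand in base R. This is a deliberately simplified toy version, not the `mice` implementation: the data are simulated, `impute_once()` is a hypothetical helper, a fixed number of iterations stands in for a convergence check, and "prediction plus normal noise" stands in for a proper posterior predictive draw (it ignores uncertainty in the regression coefficients).

```r
set.seed(123)
n <- 200
z3 <- rnorm(n)                              # fully observed variable
z1 <- 1 + 0.8 * z3 + rnorm(n)
z2 <- -1 + 0.5 * z1 + 0.5 * z3 + rnorm(n)
dat <- data.frame(z1, z2, z3)
dat$z1[sample(n, 40)] <- NA
dat$z2[sample(n, 40)] <- NA

impute_once <- function(dat, n_iter = 10) {
  miss <- is.na(dat)                        # TRUE where R_j = 0
  vars <- names(dat)[colSums(miss) > 0]     # variables that need imputation
  filled <- dat
  # Step 1: fill in missing values by sampling from the observed values
  for (v in vars) {
    filled[miss[, v], v] <- sample(dat[!miss[, v], v], sum(miss[, v]),
                                   replace = TRUE)
  }
  # Steps 2-5: cycle through the variables a fixed number of times
  for (iter in seq_len(n_iter)) {
    for (v in vars) {
      # Fit on rows where v was observed, using current values of the others
      fit <- lm(reformulate(setdiff(names(dat), v), response = v),
                data = filled[!miss[, v], ])
      pred <- predict(fit, newdata = filled[miss[, v], ])
      # Simplified draw: prediction plus residual-scale normal noise
      filled[miss[, v], v] <- pred + rnorm(length(pred), sd = summary(fit)$sigma)
    }
  }
  filled
}

# Step 6: repeat M times to get M imputed datasets
M <- 5
imputed <- lapply(seq_len(M), function(m) impute_once(dat))
sapply(imputed, function(d) mean(d$z1))
```

Each element of `imputed` is one completed dataset; observed values are left untouched, and only the originally missing entries vary from one dataset to the next.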

---

### MICE

```r
library(mice)
md.pattern(nhanes2)
```

```
##    age hyp bmi chl   
## 13   1   1   1   1  0
## 3    1   1   1   0  1
## 1    1   1   0   1  1
## 1    1   0   0   1  2
## 7    1   0   0   0  3
##      0   8   9  10 27
```

---

### MICE

```r
# age is fully observed, so its method entry is not used for imputation
nhanes2.imp <- mice(nhanes2, 
                    m = 5, 
                    method = c(age = "sample", 
                               bmi = "pmm", 
                               hyp = "logreg", 
                               chl = "pmm"), 
                    seed = 123)
```

```
## 
##  iter imp variable
##   1   1  bmi  hyp  chl
##   1   2  bmi  hyp  chl
##   1   3  bmi  hyp  chl
##   1   4  bmi  hyp  chl
##   1   5  bmi  hyp  chl
##   2   1  bmi  hyp  chl
##   2   2  bmi  hyp  chl
##   2   3  bmi  hyp  chl
##   2   4  bmi  hyp  chl
##   2   5  bmi  hyp  chl
##   3   1  bmi  hyp  chl
##   3   2  bmi  hyp  chl
##   3   3  bmi  hyp  chl
##   3   4  bmi  hyp  chl
##   3   5  bmi  hyp  chl
##   4   1  bmi  hyp  chl
##   4   2  bmi  hyp  chl
##   4   3  bmi  hyp  chl
##   4   4  bmi  hyp  chl
##   4   5  bmi  hyp  chl
##   5   1  bmi  hyp  chl
##   5   2  bmi  hyp  chl
##   5   3  bmi  hyp  chl
##   5   4  bmi  hyp  chl
##   5   5  bmi  hyp  chl
```

---

### MICE

```r
summary(nhanes2.imp)
```

```
## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##      age      bmi      hyp      chl 
##       ""    "pmm" "logreg"    "pmm" 
## PredictorMatrix:
##     age bmi hyp chl
## age   0   1   1   1
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0
```

---

### MICE

```r
nhanes2.imp$imp$bmi
```

```
##       1    2    3    4    5
## 1  27.2 26.3 28.7 30.1 27.2
## 3  33.2 22.0 22.0 30.1 30.1
## 4  22.7 22.5 30.1 22.5 27.2
## 6  27.4 25.5 22.0 22.7 22.5
## 10 22.0 22.7 20.4 20.4 35.3
## 11 35.3 27.2 22.0 35.3 27.2
## 12 30.1 22.5 26.3 35.3 22.7
## 16 28.7 27.2 26.3 33.2 30.1
## 21 20.4 30.1 22.5 22.0 27.2
```

---

### MICE

```r
head(nhanes2)
```

```
##     age  bmi  hyp chl
## 1 20-39   NA <NA>  NA
## 2 40-59 22.7   no 187
## 3 20-39   NA   no 187
## 4 60-99   NA <NA>  NA
## 5 20-39 20.4   no 113
## 6 60-99   NA <NA> 184
```

---

### MICE

```r
head(complete(nhanes2.imp, 1))
```

```
##     age  bmi hyp chl
## 1 20-39 27.2  no 187
## 2 40-59 22.7  no 187
## 3 20-39 33.2  no 187
## 4 60-99 22.7  no 187
## 5 20-39 20.4  no 113
## 6 60-99 27.4 yes 184
```

```r
head(complete(nhanes2.imp, 2))
```

```
##     age  bmi hyp chl
## 1 20-39 26.3 yes 187
## 2 40-59 22.7  no 187
## 3 20-39 22.0  no 187
## 4 60-99 22.5 yes 118
## 5 20-39 20.4  no 113
## 6 60-99 25.5  no 184
```

---

### MICE

```r
xyplot(nhanes2.imp, bmi ~ chl, pch = 19, alpha = 0.4)
```

---

### MICE

```r
densityplot(nhanes2.imp)
```

---

### MICE

```r
mod1 <- with(nhanes2.imp, lm(chl ~ age + bmi + hyp))
summary(pool(mod1))
```

```
##          term  estimate std.error  statistic        df    p.value
## 1 (Intercept) 31.244973 61.643374  0.5068667  9.807543 0.62344869
## 2    age40-59 43.936862 23.992772  1.8312541  7.194198 0.10859905
## 3    age60-99 65.006196 38.739903  1.6780165  3.034952 0.19088677
## 4         bmi  5.076024  2.112246  2.4031409 10.879346 0.03526641
## 5      hypyes -7.996066 25.682491 -0.3113431  5.880126 0.76628175
```

```r
sum(complete.cases(nhanes2))
```

```
## [1] 13
```

```r
coef(summary(lm(chl ~ age + bmi + hyp, data = nhanes2)))
```

```
##               Estimate Std. Error    t value   Pr(>|t|)
## (Intercept) -35.676683  63.245276 -0.5641004 0.58814686
## age40-59     59.543265  22.946889  2.5948295 0.03187292
## age60-99    109.457782  30.437329  3.5961690 0.00702126
## bmi           7.160119   2.200899  3.2532707 0.01164394
## hypyes       -7.692013  25.179480 -0.3054874 0.76779265
```