class: center, middle, inverse, title-slide

# Missing Data
### Yue Jiang
### STA 210 / Duke University / Spring 2023

---

### Missing data

Missing data occur all the time in data analyses and must be dealt with carefully. For instance:

- Respondents to a survey may be unwilling to answer questions in ways that are socially undesirable (e.g., reporting high incomes, alcohol use habits, etc.).
- Participants in a trial may be lost to follow-up and have censored data.
- We may intentionally incorporate "missing" data into complex longitudinal survey designs in cases where it's overly burdensome to collect "complete" data.

.question[
In the past, how have you dealt with missing data?
]

---

### Visualizing missing data

```r
library(mice)
data("nhanes2")
head(nhanes2)
```

```
##     age  bmi  hyp chl
## 1 20-39   NA <NA>  NA
## 2 40-59 22.7   no 187
## 3 20-39   NA   no 187
## 4 60-99   NA <NA>  NA
## 5 20-39 20.4   no 113
## 6 60-99   NA <NA> 184
```

---

### Visualizing missing data

```r
library(naniar)
vis_miss(nhanes2)
```

<img src="missing-data_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

### Visualizing missing data

```r
# gg_miss_upset() comes from naniar (loaded above) and draws the plot with UpSetR
library(UpSetR)
gg_miss_upset(nhanes2)
```

<img src="missing-data_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

### Visualizing missing data

```r
library(naniar)
gg_miss_fct(x = nhanes2, fct = hyp)
```

<img src="missing-data_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

### Some terminology

Suppose `\(\mathbf{Z} = (Z_1, \cdots, Z_k)\)` is the full data, which may be completely or partially missing for some observations, and suppose `\(\mathbf{R} = (R_1, \cdots, R_k)\)` is a vector of indicators for whether each `\(Z_i\)` is observed (1 if observed, 0 if missing).

- .vocab[Missing Completely at Random (MCAR)]: `\(P(\mathbf{R} = \mathbf{r} | \mathbf{Z})\)` does not depend on `\(\mathbf{Z}\)` (that is, `\(\mathbf{R}\)` and `\(\mathbf{Z}\)` are independent).
- .vocab[Missing at Random (MAR)]: `\(P(\mathbf{R} = \mathbf{r} | \mathbf{Z})\)` only depends on elements of `\(\mathbf{Z}\)` that are observed for `\(\mathbf{R} = \mathbf{r}\)`.
- .vocab[Missing Not at Random (MNAR)]: `\(P(\mathbf{R} = \mathbf{r} | \mathbf{Z})\)` depends on elements of `\(\mathbf{Z}\)` that are *not* observed for `\(\mathbf{R} = \mathbf{r}\)`.

---

### Examples of missingness mechanisms

Suppose you are designing a written survey examining alcohol consumption among students and your goal is to quantify alcohol consumption and examine potential predictors.

--

.question[
What types of analyses might you choose? What types of data might you collect? What would be examples of each type of missingness mechanism, MCAR, MAR, and MNAR?
]

- .vocab[Missing Completely at Random (MCAR)]: `\(P(\mathbf{R} = \mathbf{r} | \mathbf{Z})\)` does not depend on `\(\mathbf{Z}\)` (that is, `\(\mathbf{R}\)` and `\(\mathbf{Z}\)` are independent).
- .vocab[Missing at Random (MAR)]: `\(P(\mathbf{R} = \mathbf{r} | \mathbf{Z})\)` only depends on elements of `\(\mathbf{Z}\)` that are observed for `\(\mathbf{R} = \mathbf{r}\)`.
- .vocab[Missing Not at Random (MNAR)]: `\(P(\mathbf{R} = \mathbf{r} | \mathbf{Z})\)` depends on elements of `\(\mathbf{Z}\)` that are *not* observed for `\(\mathbf{R} = \mathbf{r}\)`.

---

### MCAR example

Under MCAR, *no systematic differences exist* between those with missing data and those with complete data - those with missing data are representative of the entire population.

Suppose you are designing a written survey examining alcohol consumption among students, and suppose that some of these survey responses happened to get destroyed because the building they were housed in burned down.

In this case, whether a response is missing is completely unrelated to the data at hand - it's only those responses that happened to be in the (former) building.

MCAR is a very strong assumption and is usually unrealistic unless the study was deliberately designed to include missing data and account for it appropriately.

---

### MAR example

Under MAR, missingness is related to the observed data, but not to the unobserved data.

Taking the same survey example, suppose statistical science students recognize the importance of having complete data and are more likely to complete the survey than students in other departments.

Also suppose whether someone completed a survey was due solely to major, and that we fully observed everyone's major. In this case, we would have MAR data.

---

### MNAR example

Under MNAR, missingness is related to *unobserved* data.

For instance, suppose instead that students who have higher alcohol consumption are less likely to respond to the survey.

In this case, whether a survey is missing depends directly on the value of that unobserved response itself - the data are missing not at random.

---

### Are my data MAR or MNAR?

.question[
Do you think it's possible to test whether a particular missingness mechanism holds in your data at hand? If so, how would you do so?
]

--

Unfortunately, it is **not** possible to test MAR against MNAR using the observed data, simply because MNAR mechanisms depend on data that were never observed.

If MAR is assumed, there are methods to test it against MCAR (e.g., examining associations between missingness and the observed data), but any decision should be made with the guidance of subject-matter expertise.

Regardless, MAR is itself an *unverifiable* assumption and must be justified in the context of each problem.

---

### Complete case analysis

Have you ever done something along the lines of

```r
mean(data$var, na.rm = TRUE)
```

or fit a regression model and seen a message like

```
(_ observations deleted due to missingness)
```

In each of these cases, you are performing a .vocab[complete case analysis].

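
As a rough illustration of what `na.rm = TRUE` reports under each mechanism, here is a minimal simulation sketch of the alcohol survey. Everything in it is an assumption made for illustration: the variable names (`stat_major`, `drinks`), the Poisson model, and the missingness probabilities are hypothetical and do not come from real data.

```r
set.seed(210)
n <- 10000

# Hypothetical population: major is always observed, weekly drinks may go missing
stat_major <- rbinom(n, 1, 0.3) == 1
drinks <- rpois(n, lambda = ifelse(stat_major, 2, 5))

# Response indicators R (TRUE = observed) under three assumed mechanisms
r_mcar <- runif(n) > 0.3                            # ignores the data entirely
r_mar  <- runif(n) > ifelse(stat_major, 0.1, 0.6)   # depends only on observed major
r_mnar <- runif(n) > plogis(-2 + 0.5 * drinks)      # depends on the response itself

# Complete case means, i.e. what mean(x, na.rm = TRUE) would report
means <- c(full = mean(drinks),
           MCAR = mean(drinks[r_mcar]),
           MAR  = mean(drinks[r_mar]),
           MNAR = mean(drinks[r_mnar]))
round(means, 2)

# Under MAR, the means within each major are still estimated well
tapply(drinks[r_mar], stat_major[r_mar], mean)
```

In this sketch the MCAR mean should sit close to the full-data mean, while the MAR and MNAR complete case means are pulled downward because heavier drinkers are under-represented among respondents. Under MAR the problem goes away once we condition on major; under MNAR, conditioning on major does not remove it.
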
A complete case analysis uses only those observations with no missing values in the variables of interest (and often analyzes them "as if" they were the full dataset).

---

### Complete case analysis

.question[
Let's suppose we have ten variables of interest for a regression model, and that each of these variables has only 5% missing data (suppose that missingness is independent between the variables). What is the probability of observing a complete case? What potential consequences does this have on our analysis?
]

--

Here the probability of a complete case is only `\(0.95^{10} \approx 0.60\)`, so even under MCAR we are using many fewer observations, which results in lower power.

If the data are MAR or MNAR, then we will generally get biased estimates for parameters of interest.

---

### Mean/median/mode imputation

The idea behind .vocab[imputation] is to "fill in" missing values in the data in a reasonable way, and then carry out the analysis.

Often, for missing continuous variables, researchers simply plug in the mean or median of the observed values. For categorical variables, researchers may plug in the most common category.

.question[
What are the potential consequences of this approach under various missing data mechanisms? How might this affect inference or uncertainty quantification?
]

---

### Imputation based on regression models

The previous approach is often termed "unconditional" mean/median/mode imputation, because it doesn't take into consideration the other covariates in the dataset (which may be especially undesirable for MAR mechanisms, for example).

We might try instead to create a model for the missing values based on the observed data and use predictions from these models as imputed values (for instance, a linear model for missing continuous predictors).

We might also try .vocab[hot deck] imputation, which fills in missing data based on observed values from other observations which "match" in some sense.

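
To make the two single-imputation strategies concrete, here is a minimal sketch using `bmi` from `nhanes2`. Using `age` as the only predictor is an assumption made for illustration (it is convenient because `age` is fully observed); it is not meant as a well-specified imputation model.

```r
library(mice)  # loaded here only for the nhanes2 data
data("nhanes2")

miss <- is.na(nhanes2$bmi)

# Unconditional mean imputation: every missing bmi gets the observed mean
bmi_mean <- nhanes2$bmi
bmi_mean[miss] <- mean(nhanes2$bmi, na.rm = TRUE)

# Regression imputation: predict bmi from age (lm() drops rows with missing bmi)
fit <- lm(bmi ~ age, data = nhanes2)
bmi_reg <- nhanes2$bmi
bmi_reg[miss] <- predict(fit, newdata = nhanes2[miss, ])

# Compare the spread of the observed values with the "completed" variables
c(observed           = sd(nhanes2$bmi, na.rm = TRUE),
  mean_imputed       = sd(bmi_mean),
  regression_imputed = sd(bmi_reg))
```

Both approaches fill in the missing `bmi` values deterministically: mean imputation uses a single number, and regression on a three-level factor uses at most three distinct fitted values. Neither adds any residual noise to the imputed values.
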
.question[
What are the potential consequences of these approaches under various missing data mechanisms? How might this affect inference or uncertainty quantification?
]

---

### Imputation based on regression models

Imputation based on regression models can result in consistent estimators of certain model parameters under MAR if the model used to impute the data is correctly specified. However, correctly specifying this relationship is often an impossible task in practical settings (and furthermore is also unverifiable).

These methods also suffer from similar issues regarding artificially decreased variability - in the case of a continuous variable, all imputed values will lie exactly on the same fitted regression line, which may inflate type 1 error rates.

We know that there should be some variability involved in the imputation process, but in these methods we in fact get *decreased* variance of regression estimators!

---

### The "ideal" imputation procedure

.question[
So far, we've discussed a few ad hoc imputation procedures and explored some statistical drawbacks of such approaches. What might an "ideal" imputation procedure have/do?
]

--

As a few general considerations, such methods should...

- provide consistent estimation of parameters of interest,
- take into consideration the extra variability due to the imputation procedure, and
- allow for principled quantification of variability of estimated parameters.

---

### Brief disclaimer

A brief disclaimer before continuing is that we will be discussing some fairly advanced models - there is no need to fully understand the procedure we'll be discussing (just focus on the intuition).

You'll learn more about these types of methods as you get more advanced in your statistical career!

---

### Multiple imputation

.vocab[Multiple imputation] techniques are a broad class of methods designed to satisfy the three considerations that characterize a good missing data analysis approach. There are three main steps in the analysis:

1. For each observation with missing data, values are imputed `\(M\)` times to create `\(M\)` datasets. Note that these datasets will generally differ from each other.
2. Carry out a statistical analysis of interest on each of the imputed datasets.
3. Combine/pool the results in a principled way that properly deals with the imputation process.

Note that the model used for imputation should correspond to the models eventually used for analysis. For instance, if you include interaction terms or variable transformations in your final intended analysis, these should also be used for each imputation model.

---

### MICE

.vocab[Multiple Imputation via Chained Equations (MICE)] is an algorithm that performs multiple imputation assuming MAR is satisfied:

1. First, fill in *all* missing values by sampling randomly from observed data.
2. Create a predictive model for the first variable with missing values `\(Z_1\)` based on other variables in the dataset `\((Z_2, \cdots, Z_k)\)` for those with *observed data* (that is, those with `\(R_1 = 1\)`). For those with `\(R_1 = 0\)`, simulate draws from the posterior predictive distribution of `\(Z_1\)`.
3. Create a predictive model for the next variable with missing values `\(Z_2\)` based on other variables in the dataset `\((Z_1, Z_3, \cdots, Z_k)\)` for those with `\(R_2 = 1\)`. Importantly, for variable `\(Z_1\)`, we will use the imputed values if any are missing. For those with `\(R_2 = 0\)`, simulate draws from the posterior predictive distribution of `\(Z_2\)`.
4. Repeat for all other variables.
5. Once all variables have been addressed, repeat steps 2-4 until "convergence" of the imputed variables. This will correspond to a single imputed dataset.
6. Repeat steps 1-5 `\(M\)` times.

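
---

### MICE

The steps above can be sketched by hand for two continuous variables. This is a deliberately simplified illustration, not how the `mice` package is implemented: the helper `impute_step()` is hypothetical, the imputation models are plain linear regressions, and the draws add normal noise around the fitted values with the coefficients held fixed, which only approximates a posterior predictive draw (a proper draw would also account for uncertainty in the regression coefficients). The fixed ten cycles stand in for a real convergence check.

```r
library(mice)  # loaded here only for the nhanes2 data
data("nhanes2")

set.seed(210)
dat <- nhanes2[, c("age", "bmi", "chl")]
miss <- is.na(dat)            # TRUE where R = 0 (value was missing)
vars <- c("bmi", "chl")       # the incomplete variables in this subset

# Step 1: fill in all missing values by sampling from the observed values
for (v in vars) {
  dat[miss[, v], v] <- sample(dat[!miss[, v], v], sum(miss[, v]), replace = TRUE)
}

# Steps 2-3 for one variable v: fit a model for v on rows where v was observed,
# using current (possibly imputed) values of the other variables, then redraw
# the originally missing values as prediction + residual noise.
impute_step <- function(dat, miss, v) {
  others <- setdiff(names(dat), v)
  fit <- lm(reformulate(others, response = v), data = dat[!miss[, v], ])
  pred <- predict(fit, newdata = dat[miss[, v], ])
  dat[miss[, v], v] <- pred + rnorm(length(pred), sd = sigma(fit))
  dat
}

# Steps 4-5: cycle through the incomplete variables repeatedly
for (iter in 1:10) {
  for (v in vars) dat <- impute_step(dat, miss, v)
}

head(dat)
```

This produces one imputed dataset; step 6 would repeat the whole procedure `\(M\)` times with different random draws.
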

---

### MICE

```r
library(mice)
md.pattern(nhanes2)
```

<img src="missing-data_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

```
##    age hyp bmi chl   
## 13   1   1   1   1  0
## 3    1   1   1   0  1
## 1    1   1   0   1  1
## 1    1   0   0   1  2
## 7    1   0   0   0  3
##      0   8   9  10 27
```

---

### MICE

```r
# One imputation method per column; age has no missing values, so it is left as-is
nhanes2.imp <- mice(nhanes2, m = 5,
                    method = c(age = "sample", bmi = "pmm",
                               hyp = "logreg", chl = "pmm"),
                    seed = 123)
```

```
## 
##  iter imp variable
##   1   1  bmi  hyp  chl
##   1   2  bmi  hyp  chl
##   1   3  bmi  hyp  chl
##   1   4  bmi  hyp  chl
##   1   5  bmi  hyp  chl
##   2   1  bmi  hyp  chl
##   2   2  bmi  hyp  chl
##   2   3  bmi  hyp  chl
##   2   4  bmi  hyp  chl
##   2   5  bmi  hyp  chl
##   3   1  bmi  hyp  chl
##   3   2  bmi  hyp  chl
##   3   3  bmi  hyp  chl
##   3   4  bmi  hyp  chl
##   3   5  bmi  hyp  chl
##   4   1  bmi  hyp  chl
##   4   2  bmi  hyp  chl
##   4   3  bmi  hyp  chl
##   4   4  bmi  hyp  chl
##   4   5  bmi  hyp  chl
##   5   1  bmi  hyp  chl
##   5   2  bmi  hyp  chl
##   5   3  bmi  hyp  chl
##   5   4  bmi  hyp  chl
##   5   5  bmi  hyp  chl
```

---

### MICE

```r
summary(nhanes2.imp)
```

```
## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##      age      bmi      hyp      chl 
##       ""    "pmm" "logreg"    "pmm" 
## PredictorMatrix:
##     age bmi hyp chl
## age   0   1   1   1
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0
```

---

### MICE

```r
nhanes2.imp$imp$bmi
```

```
##       1    2    3    4    5
## 1  27.2 26.3 28.7 30.1 27.2
## 3  33.2 22.0 22.0 30.1 30.1
## 4  22.7 22.5 30.1 22.5 27.2
## 6  27.4 25.5 22.0 22.7 22.5
## 10 22.0 22.7 20.4 20.4 35.3
## 11 35.3 27.2 22.0 35.3 27.2
## 12 30.1 22.5 26.3 35.3 22.7
## 16 28.7 27.2 26.3 33.2 30.1
## 21 20.4 30.1 22.5 22.0 27.2
```

---

### MICE

```r
head(nhanes2)
```

```
##     age  bmi  hyp chl
## 1 20-39   NA <NA>  NA
## 2 40-59 22.7   no 187
## 3 20-39   NA   no 187
## 4 60-99   NA <NA>  NA
## 5 20-39 20.4   no 113
## 6 60-99   NA <NA> 184
```

---

### MICE

```r
head(complete(nhanes2.imp, 1))
```

```
##     age  bmi hyp chl
## 1 20-39 27.2  no 187
## 2 40-59 22.7  no 187
## 3 20-39 33.2  no 187
## 4 60-99 22.7  no 187
## 5 20-39 20.4  no 113
## 6 60-99 27.4 yes 184
```

```r
head(complete(nhanes2.imp, 2))
```

```
##     age  bmi hyp chl
## 1 20-39 26.3 yes 187
## 2 40-59 22.7  no 187
## 3 20-39 22.0  no 187
## 4 60-99 22.5 yes 118
## 5 20-39 20.4  no 113
## 6 60-99 25.5  no 184
```

---

### MICE

```r
xyplot(nhanes2.imp, bmi ~ chl, pch = 19, alpha = 0.4)
```

<img src="missing-data_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

### MICE

```r
densityplot(nhanes2.imp)
```

<img src="missing-data_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

### MICE

```r
# Fit the analysis model on each imputed dataset, then pool the results
mod1 <- with(nhanes2.imp, lm(chl ~ age + bmi + hyp))
summary(pool(mod1))
```

```
##          term  estimate std.error  statistic        df    p.value
## 1 (Intercept) 31.244973 61.643374  0.5068667  9.807543 0.62344869
## 2    age40-59 43.936862 23.992772  1.8312541  7.194198 0.10859905
## 3    age60-99 65.006196 38.739903  1.6780165  3.034952 0.19088677
## 4         bmi  5.076024  2.112246  2.4031409 10.879346 0.03526641
## 5      hypyes -7.996066 25.682491 -0.3113431  5.880126 0.76628175
```

```r
# Number of complete cases available to the model below
sum(complete.cases(nhanes2))
```

```
## [1] 13
```

```r
# Complete case analysis for comparison
summary(lm(chl ~ age + bmi + hyp, data = nhanes2))$coef
```

```
##               Estimate Std. Error    t value   Pr(>|t|)
## (Intercept) -35.676683  63.245276 -0.5641004 0.58814686
## age40-59     59.543265  22.946889  2.5948295 0.03187292
## age60-99    109.457782  30.437329  3.5961690 0.00702126
## bmi           7.160119   2.200899  3.2532707 0.01164394
## hypyes       -7.692013  25.179480 -0.3054874 0.76779265
```
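
---

### MICE

The pooled standard errors reported by `pool()` combine two sources of variability. As a sketch of the standard pooling rules (Rubin's rules) for a single coefficient: with estimates `\(\hat{\theta}_m\)` and squared standard errors `\(U_m\)` from the `\(M\)` imputed datasets, the pooled estimate is the average `\(\bar{\theta}\)`, and its variance is `\(T = \bar{U} + (1 + 1/M)B\)`, where `\(\bar{U}\)` is the average within-imputation variance and `\(B\)` is the between-imputation variance of the `\(\hat{\theta}_m\)`. The code below computes this by hand for the `bmi` coefficient. It refits the imputations with the default methods of `mice()` (an assumption made to keep the block self-contained), so the numbers need not match the table above exactly, and the object names are illustrative.

```r
library(mice)
data("nhanes2")

# Impute with mice() defaults and fit the analysis model on each imputed dataset
imp <- mice(nhanes2, m = 5, seed = 123, printFlag = FALSE)
fits <- with(imp, lm(chl ~ age + bmi + hyp))

# Per-imputation estimate and squared standard error for the bmi coefficient
est <- sapply(fits$analyses, function(f) coef(f)[["bmi"]])
u   <- sapply(fits$analyses, function(f) vcov(f)["bmi", "bmi"])

M     <- length(est)
q_bar <- mean(est)                      # pooled estimate
u_bar <- mean(u)                        # within-imputation variance
b     <- var(est)                       # between-imputation variance
t_var <- u_bar + (1 + 1 / M) * b        # total variance

c(estimate = q_bar, std.error = sqrt(t_var))

# Should agree with the bmi row reported by pool()
pooled <- summary(pool(fits))
pooled[pooled$term == "bmi", c("estimate", "std.error")]
```

The between-imputation term `\(B\)` is exactly the extra variability that single imputation ignores: if all `\(M\)` imputed datasets gave the same estimate, the pooled variance would reduce to the average within-imputation variance.
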