class: center, middle, inverse, title-slide

# Dealing with Missing Data
### Dr. Maria Tackett
### 04.03.19

---

### Announcements

- Lab 08 due **today at 11:59p**

- Team feedback #2 due **today at 11:59p**

- HW 06 due Mon, Apr 8

---

## What missing data looks like

```
##      age  bmi  hyp chl
## 1  20-39   NA <NA>  NA
## 2  40-59 22.7   no 187
## 3  20-39   NA   no 187
## 4  60-99   NA <NA>  NA
## 5  20-39 20.4   no 113
## 6  60-99   NA <NA> 184
## 7  20-39 22.5   no 118
## 8  20-39 30.1   no 187
## 9  40-59 22.0   no 238
## 10 40-59   NA <NA>  NA
```

---

## Why is missing data an issue?

.question[
Do you have missingness in your data for the final project?
]

--

.question[
Why is missing data an issue when doing an analysis?
]

---

## Dealing with missingness

- Deal with missingness before doing any analysis
    - This is one of the many reasons exploratory data analysis is an important first step!

- Some things to consider if you find missing values:
    - Why are the values missing?
    - Is there a pattern of missingness? If so, what is it?
    - What is the proportion of missing values?

- The answers to these questions will help you determine how to deal with the missing data

---

### Why are the values missing?

- <font class="vocab">Missing Completely at Random (MCAR)</font>: Missingness does not depend on the observed data or the missing data, i.e. the probability of being missing is the same for each observation
    - Example: People used a die to decide whether to share their income on a survey

--

- <font class="vocab">Missing at Random (MAR)</font>: Missingness depends on other observed variables but is random after conditioning on those variables, i.e.
the probability that a variable is missing depends only on available information
    - Example: People with a college degree are less likely to share their income than people without a college degree

--

- <font class="vocab">Missing Not at Random (MNAR)</font>: Missingness depends on the variable itself
    - Example: People with higher incomes are less likely to share them on a survey

---

## How to deal with missing data?

1. Only use observations with no missingness (complete-case analysis)

2. Only use variables with no missingness

3. Impute the missing values

---

## Complete-case analysis

Use only complete observations in the analysis, i.e. those that have a value for each variable

.question[
What are potential disadvantages of dealing with missing data this way?
]

---

### Complete-case analysis

- This may be OK if there are very few observations with missing values

- R does this automatically in its regression functions

--

**Potential problems:**

- Could result in a model being built on very few observations
    - This is especially true if there are many variables included in the model

- Standard errors of model coefficients increase since you're losing information from the partially complete data

- If the observations with missingness differ systematically from the complete observations, then the resulting analysis could be biased
    - This is especially true if the missingness is not random

---

## Single Imputation

**Single Imputation:** Replace each missing value with a single number/category

- Mean imputation

- Use information from related observations

- Indicator variable for missingness

- Logical rule

---

## Mean Imputation

- Replace missing values of a variable with the mean calculated from the observed data

- **Advantage**: Easy and straightforward method

- **Disadvantages**:
    - Can distort the distribution of the variable
    - Standard deviation is underestimated
    - Results in inaccurate regression coefficients; relationships between variables become distorted

---

### Related observations

- Replace the missing values using information from another observation that is "similar" to the one with missingness

- The "similar" observation can come from within the same dataset (hot deck) or from an external dataset (cold deck)

- Examples:
    - Hot Deck: Mother's income can be used to fill in missing values for father's income
    - Cold Deck: Use respondents from the 2009 NHANES survey to fill in missing values for the 2011 NHANES survey

- **Disadvantage**: Could expand effects of measurement error

---

### Indicator variable: categorical predictor

- Make "missing" an additional category for the variable

- Use this updated variable in the regression model; "missing" becomes a term in the model

.question[
What can you conclude if the term for missing is significant in the model?
]

---

### Indicator variable: quantitative predictor

- Impute the missing values in the original variable using the mean (or some other method) and create a new indicator variable for the missingness

- Can lead to inaccurate estimates of the coefficients of other variables, since the slope is forced to be the same for the groups with and without missingness

- Reduce some of this bias by including interactions between the missing indicator and the other predictors

---

### Logical Rule

- Can use some logical rule to impute missing values

- Example: The Social Indicators Survey includes a question on the "number of months worked in the previous year", which was answered by all 1501 respondents. Of the people who didn't answer the question about total earnings in the previous year, 10 reported working 0 months during the previous year.

--

.question[
For these 10 respondents, what is a logical value to use to impute their earnings?
]

--

.question[
How would you impute the earnings for the other respondents who didn't share their earnings?
]

---

## Missing Data Exercise

- Copy the `Missing Data` project in RStudio Cloud.

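---

### Sketch: complete cases vs. mean imputation

A minimal sketch of three ideas from these slides (the proportion of missing values, complete-case analysis, and mean imputation with a missingness indicator), applied to the ten rows printed on the "What missing data looks like" slide. It is written in plain Python rather than R so that it runs without any packages. The helper names `prop_missing`, `complete_cases`, and `mean_impute` are hypothetical and made up for this illustration, and `None` stands in for `NA`.

```python
from statistics import mean, stdev

# The ten rows shown on the "What missing data looks like" slide.
# None stands in for NA (assumption made for this sketch).
rows = [
    {"age": "20-39", "bmi": None, "hyp": None, "chl": None},
    {"age": "40-59", "bmi": 22.7, "hyp": "no", "chl": 187},
    {"age": "20-39", "bmi": None, "hyp": "no", "chl": 187},
    {"age": "60-99", "bmi": None, "hyp": None, "chl": None},
    {"age": "20-39", "bmi": 20.4, "hyp": "no", "chl": 113},
    {"age": "60-99", "bmi": None, "hyp": None, "chl": 184},
    {"age": "20-39", "bmi": 22.5, "hyp": "no", "chl": 118},
    {"age": "20-39", "bmi": 30.1, "hyp": "no", "chl": 187},
    {"age": "40-59", "bmi": 22.0, "hyp": "no", "chl": 238},
    {"age": "40-59", "bmi": None, "hyp": None, "chl": None},
]


def prop_missing(data):
    """Proportion of missing values for each variable (hypothetical helper)."""
    return {k: sum(r[k] is None for r in data) / len(data) for k in data[0]}


def complete_cases(data):
    """Keep only observations that have a value for every variable."""
    return [r for r in data if all(v is not None for v in r.values())]


def mean_impute(data, var):
    """Fill missing `var` with the observed mean and add a 0/1 indicator
    column named `<var>_missing` that records which values were imputed."""
    observed_mean = mean(r[var] for r in data if r[var] is not None)
    imputed = []
    for r in data:
        new_row = dict(r)
        new_row[var + "_missing"] = int(r[var] is None)
        if r[var] is None:
            new_row[var] = observed_mean
        imputed.append(new_row)
    return imputed


missing = prop_missing(rows)
cc = complete_cases(rows)
imputed = mean_impute(rows, "bmi")

# Compare the spread of bmi before and after mean imputation.
sd_observed = stdev([r["bmi"] for r in rows if r["bmi"] is not None])
sd_imputed = stdev([r["bmi"] for r in imputed])

print("proportion missing:", missing)
print("complete cases:", len(cc), "of", len(rows))
print("sd of bmi, observed only:", round(sd_observed, 2))
print("sd of bmi, after mean imputation:", round(sd_imputed, 2))
```

On these ten rows only five observations are complete, and the standard deviation of `bmi` is smaller after mean imputation than among the observed values, which is the underestimation noted on the Mean Imputation slide. The `bmi_missing` column is the kind of indicator described on the quantitative-predictor slide.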
---

## Acknowledgements

These slides draw material from

- [Missing Data](https://web.stanford.edu/class/stats202/content/lec25-cond.pdf)

- [Handling Missing Data: An Introduction](https://www2.stat.duke.edu/courses/Fall18/sta210.001/slides/lectures/MissingData_AkandeOlanrewaju.pdf)

- *Data Analysis Using Regression and Multilevel/Hierarchical Models*, "Chapter 25: Missing-data Imputation"

---

## Homework

- [How eyewitness identification experiments work](https://www.cga.ct.gov/jud/tfs/20130901_Eyewitness%20Identification%20Task%20Force/20111019/Dr.%20Gary%20Wells%20Presentation.pdf)

- [Simultaneous vs. Sequential Lineup](https://en.wikipedia.org/wiki/Police_lineup)