Lemuroidea is a primate superfamily that contains around 100 species of primates commonly known as lemurs. Lemurs are a diverse group of animals, but unfortunately many species are endangered due to human activity. However, they are an important group of animals to research, as their taxonomic and physical characteristics can provide researchers greater understanding of human evolution.

The Duke Lemur Center, established over 50 years ago, is a renowned research institute that houses the most diverse population of lemurs outside of Madagascar. Since its inception, the Lemur Center has cared for over 4,000 animals, including 200 currently in its care. The Lemur Center is a world leader in conservation and preservation of primates, and as part of its research mission has made available detailed and verified life histories for over 3,600 animals.

The dataset for today is adapted from Zehr et al. (2014), which describes the life histories of every lemur housed at the Duke Lemur Center (through 2014). The following are variables found in the file dlc_animals.csv which is located in your lab folder:

As you will notice, there are missing data for some of the variables. The primary outcome variable of interest is the age at death of the lemur, in years (that is, the longevity of lemurs). Researchers might be interested in answering research questions such as whether there are associations between lemur taxa and lifespan, or whether there are associations between whether a lemur was born at the DLC and its lifespan, perhaps additionally accounting for other variables in the dataset.

Exercises

  1. Explore the missingness among all variables in the dataset using high quality visualizations (don’t worry so much about titles for this exercise). In doing so, characterize not only variable-wise missingness, but also potential patterns in missingness.
  2. Describe the missingness patterns from Ex. 1, paying particular attention to the outcome variable of interest.
  3. We are primarily interested in the outcome variable corresponding to lifespan. Come up with reasonable examples in context of the dataset provided of missingness mechanisms that would correspond to MCAR, MAR, and MCAR in the real world.
  4. Suppose you performed a complete case analysis using a regression model that predicted lifespan based on all variables except DLC ID. What would be your final sample size using these data?
  5. What missingness mechanism do you think is most reasonably assumed for the analysis in Ex. 4? What might be some implications on any analyses performed under this missingness mechanism?

There should only be one submission per team on Gradescope.