The Quarto template for this assignment may be found in the repository at the following link: https://classroom.github.com/a/eN5J9VbD

Sleep apnea is a common sleep disorder characterized by repeated episodes of shallow or paused breathing during sleep. In severe cases, this disruption may lead to a reduction in blood oxygen saturation, which can be serious. A quantity that physicians care about is whether oxygen saturation is below 90%. The “sleeptime” variable measures the percentage of time asleep during which oxygen saturation was below 90%.

Turan et al. (Anesthesiology, 2015) conducted a study examining physiological characteristics among patients who underwent bariatric (weight-loss) surgery and who had results available from a comprehensive sleep study at a major medical center.

The variables are as follows:

The overall goal is to predict sleeptime (the percentage of time spent under 90% arterial oxygen saturation) using the other variables in the dataset.

  1. Consider all predictors available in the dataset. Use all-subset selection utilizing \(C_p\) as the selection criterion to arrive at a “best” model. What variables ended up being included in your model? Do not consider any transformations or interaction terms in this analysis.
  2. Consider all predictors available in the dataset. Use stepwise selection utilizing \(AIC\) as the selection criterion to arrive at a “best” model (include results for forward selection, backward elimination, and stepwise in both directions). What variables ended up being included in your model? Do not consider any transformations or interaction terms in this analysis.
  3. Consider all predictors available in the dataset. Use LASSO to arrive at a “best” model, utilizing the lambda that gives the lowest mean MSE from a 10-fold cross-validation procedure aimed at evaluating lambda. What variables ended up being included in your model? Do not consider any transformations or interaction terms in this analysis.
  4. Choose a categorical predictor and a continuous predictor that were present in all models above (hint: there should be at least one of each). Interpret the coefficient estimates for both of them as provided by your “final” model as determined by backward elimination and as provided by LASSO. Provide two reasons why the estimates from the two models are different.
  5. Consider the model that uses all predictors (no transformations or interactions). What is the average root mean square error across the test sets from a 10-fold cross-validation procedure? Compare this to the same quantity from the model chosen by backward elimination (using AIC). Which is higher? At the beginning of the R chunk used to evaluate this question, set a random seed of 919 as follows: set.seed(919).
  6. Suppose a collaborator was interested in the relationship between BMI and the expected percentage of time spent under 90% arterial oxygen saturation and wanted to adjust for potential confounders. After reading Lin, Davidson, and Ancoli-Israel (2008), which notes that “number of population based studies have shown that obstructive sleep apnea is more common in men than in women and this discrepancy is often evident in the clinical setting” and explores “pathophysiological differences to suggest why men are more prone to [sleep apnea] than women,” they particularly want to explore such relationships adjusting for sex. Explain what implications using any of your models from Exercises 1-3 might have scientifically in light of this proposed analysis.
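
For reference, the criteria named in Exercises 1-3 and 5 are commonly written as shown below. This is a notational sketch rather than part of the original assignment: the symbols \(n\) (number of observations), \(p\) (number of coefficients in the candidate model, including the intercept), \(k\) (number of estimated parameters), \(\hat{L}\) (maximized likelihood), \(\hat{\sigma}^2_{\text{full}}\) (mean squared error of the model with all predictors), and \(m\) (fold index) are assumptions introduced here. Software may differ in constants and scaling, for example in whether the error variance is counted in \(k\), or in the \(1/(2n)\) factor in the LASSO objective, which is one common convention.

\[ C_p = \frac{\mathrm{SSE}_p}{\hat{\sigma}^2_{\text{full}}} - n + 2p \]

\[ \mathrm{AIC} = -2\log\hat{L} + 2k \]

\[ \hat{\beta}^{\text{LASSO}}(\lambda) = \arg\min_{\beta_0,\,\beta} \left\{ \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\sum_{j}\lvert\beta_j\rvert \right\} \]

\[ \mathrm{RMSE}_m = \sqrt{\frac{1}{n_m}\sum_{i \in \text{fold } m}\left(y_i - \hat{y}_i\right)^2}, \qquad \overline{\mathrm{RMSE}} = \frac{1}{10}\sum_{m=1}^{10}\mathrm{RMSE}_m \]

Smaller values of \(C_p\), AIC, and cross-validated error indicate a preferred model. In the LASSO objective, the penalty \(\lambda\) shrinks coefficients toward zero and sets some exactly to zero; the intercept \(\beta_0\) is not penalized.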