library(tidyverse)
library(tidymodels)
dat <- read_csv("data/durham_2020.csv")Exam 2 Solutions
STA 210 Spring 2023 (Jiang)
Exercise 2.1
MCAR - the mail sorting office messed up with some of the returned envelopes. MAR - male residents might be less inclined to answer the survey compared to non-male residents (gender was included in the dataset). NMAR - people with a more neutral opinion of Durham might be less inclined to fill out the survey.
Exercise 2.2
The outcome is ordinal, not nominal (QOL perception was ordered on a Likert scale of very low to very high). As well, the multinomial assumption of independence of irrelevant alternatives is not met. For instance, if someone really has a great impression of Durham (5), if this option were removed, they would most likely answer whatever the highest rating would be (4).
Exercise 2.3
# Code will vary. Here's an example with dplyr
dat <- dat %>%
filter(!is.na(qual_police) | !is.na(qual_fire) | !is.na(qual_ems) |
!is.na(qual_public_transit) | !is.na(qual_water) |
!is.na(qual_parks) | !is.na(qual_pub_school)) %>%
rowwise() %>%
mutate(services = mean(c(qual_police, qual_fire, qual_ems,
qual_public_transit, qual_water,
qual_parks, qual_pub_school), na.rm = T))
mean(dat$services)[1] 3.592563
Instead of calculating a single score, we could use each of the scores individually. However, one big problem is in missing data - in creating regression models, R automatically conducts a complete case analysis such that if any variables are missing, the observation gets dropped. Only about 1/3 of the observations have complete data for all seven services, which both decreases our power and also introduces bias in regression estimates if these are MAR (which is likely the case here). On the other hand, an advantage of this method would be that we can look at conditional relationships between satisfaction with individual services and any outcome of interest (here, perceived quality of life).
Exercise 3.1
dat <- dat %>%
mutate(satisfied = ifelse(qol_durham > 3, 1, 0))
m1 <- glm(satisfied ~ years_in_durham +
rent_own + as.factor(overall_image) + services,
data = dat,
family = "binomial")
summary(m1)
Call:
glm(formula = satisfied ~ years_in_durham + rent_own + as.factor(overall_image) +
services, family = "binomial", data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6808 -0.6862 0.2241 0.5121 2.6356
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.927181 0.777573 -7.623 2.48e-14 ***
years_in_durham -0.013564 0.005193 -2.612 0.00901 **
rent_ownrent 0.380590 0.208917 1.822 0.06850 .
as.factor(overall_image)2 1.448124 0.522418 2.772 0.00557 **
as.factor(overall_image)3 1.647810 0.504168 3.268 0.00108 **
as.factor(overall_image)4 4.173254 0.521696 7.999 1.25e-15 ***
as.factor(overall_image)5 5.906382 1.122257 5.263 1.42e-07 ***
services 1.117906 0.172829 6.468 9.91e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1043.37 on 775 degrees of freedom
Residual deviance: 627.65 on 768 degrees of freedom
(65 observations deleted due to missingness)
AIC: 643.65
Number of Fisher Scoring iterations: 7
Exercise 3.2
exp(-0.013564)[1] 0.9865276
In comparing two residents with a one year difference in their length of residence we predict the resident that has spent one more year to have 0.9865 times the odds of being satisfied compared to the other resident, while controlling for rental status, overall image of Durham, and average perception of city services.
Exercise 3.3
library(Stat2Data)
emplogitplot1(satisfied ~ years_in_durham,
data = dat,
ngroups = 10)
Linearity appears reasonable as given by the empirical logit plot above.
Exercise 3.4
Perhaps the mailed surveys are related somehow, for example if some addresses were in the same apartment complexes. In this way, respondents might have similar answers due to this clustering (e.g., if an apartment complex is right next to an amazing park or local transit center).
Exercise 3.5
From our regression results, we know that shorter time spent in Durham, renting instead of owning, and higher overall image and average services scores are associated with larger log-odds of satisfaction. Thus to maximize the predicted log-odds (which will in turn maximize predicted probability), we would want a hypothetical resident with 0 years in Durham who rents, has an overall image score of 5, and an average services rating of 5.
new <- data.frame(years_in_durham = 0,
rent_own = "rent",
overall_image = 5,
services = 5)
xb <- augment(m1, newdata = new) %>%
pull(.fitted)
exp(xb)/(1 + exp(xb))[1] 0.9973992
Although this person doesn’t actually exist in the dataset, it is the theoretical maximum given our model results.
Exercise 3.6
Prevalence:
table(dat$satisfied)
0 1
323 486
486 / (486 + 323) # prevalence[1] 0.6007417
Sensitivity, specificity, PPV, NPV:
preds <- augment(m1)
table(preds$satisfied, preds$.fitted > 0)
FALSE TRUE
0 251 58
1 89 378
378 / (378 + 89) # sensitivity[1] 0.8094218
251 / (251 + 58) # specificity[1] 0.8122977
378 / (378 + 58) # PPV[1] 0.8669725
251 / (251 + 89) # NPV[1] 0.7382353
Exercise 3.7
m2 <- glm(satisfied ~ years_in_durham +
rent_own + as.factor(overall_image) + services +
years_in_durham * rent_own,
data = dat,
family = "binomial")
summary(m2)
Call:
glm(formula = satisfied ~ years_in_durham + rent_own + as.factor(overall_image) +
services + years_in_durham * rent_own, family = "binomial",
data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6802 -0.6839 0.2241 0.5112 2.6408
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.9211145 0.7865030 -7.528 5.14e-14 ***
years_in_durham -0.0139085 0.0084720 -1.642 0.10065
rent_ownrent 0.3658113 0.3552467 1.030 0.30313
as.factor(overall_image)2 1.4491609 0.5228225 2.772 0.00557 **
as.factor(overall_image)3 1.6489597 0.5046722 3.267 0.00109 **
as.factor(overall_image)4 4.1746028 0.5223889 7.991 1.33e-15 ***
as.factor(overall_image)5 5.9073270 1.1224233 5.263 1.42e-07 ***
services 1.1184554 0.1731627 6.459 1.05e-10 ***
years_in_durham:rent_ownrent 0.0005485 0.0106643 0.051 0.95898
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1043.37 on 775 degrees of freedom
Residual deviance: 627.64 on 767 degrees of freedom
(65 observations deleted due to missingness)
AIC: 645.64
Number of Fisher Scoring iterations: 7
We will conduct a hypothesis test comparing the null hypothesis that the interactiont term is 0 vs. the alternative that it is 0 at the 0.05 significance level. The test statistic is 0.051, which has a standard normal distribution under the null hypothesis. This corresponds to a p-value of 0.959, and so we fail to reject the null. There is insufficient evidence to suggest that the relationship between years spent in Durham and overall satisfaction with quality of life depends on whether a resident rents or owns.
Grading note - some students did equivalent tests (e.g., likelihood ratio test comparing nested models, F tests, etc. That was fine too).
Exercise 3.8
library(MASS)
m3 <- polr(as.factor(qol_durham) ~ years_in_durham +
rent_own + as.factor(overall_image) + services,
data = dat)
summary(m3)Call:
polr(formula = as.factor(qol_durham) ~ years_in_durham + rent_own +
as.factor(overall_image) + services, data = dat)
Coefficients:
Value Std. Error t value
years_in_durham -0.01292 0.003896 -3.316
rent_ownrent 0.19664 0.156076 1.260
as.factor(overall_image)2 1.37914 0.302362 4.561
as.factor(overall_image)3 1.99684 0.290112 6.883
as.factor(overall_image)4 4.11330 0.328141 12.535
as.factor(overall_image)5 7.18170 0.467162 15.373
services 1.11251 0.124606 8.928
Intercepts:
Value Std. Error t value
1|2 1.9209 0.4541 4.2302
2|3 3.8098 0.4700 8.1053
3|4 5.9129 0.5024 11.7692
4|5 9.9445 0.5744 17.3115
Residual Deviance: 1449.41
AIC: 1471.41
(65 observations deleted due to missingness)
exp(coef(m3)) years_in_durham rent_ownrent as.factor(overall_image)2
0.9871646 1.2173067 3.9715022
as.factor(overall_image)3 as.factor(overall_image)4 as.factor(overall_image)5
7.3657320 61.1481193 1315.1378597
services
3.0419770
exp(confint(m3)) 2.5 % 97.5 %
years_in_durham 0.9796311 0.9947144
rent_ownrent 0.8965353 1.6535573
as.factor(overall_image)2 2.2079480 7.2347260
as.factor(overall_image)3 4.1990207 13.1141258
as.factor(overall_image)4 32.4510230 117.6120172
as.factor(overall_image)5 537.5897194 3369.8529588
services 2.3878593 3.8931377
For the ordered outcome of QOL in Durham going from 1 (very dissatisfied) through 5 (very satisfied), in comparing two residents with a one year difference in their length of residence we predict the resident that has spent one more year to have 0.987 times the odds of being in the next higher satisfaction category compared to the other resident, while controlling for rental status, overall image of Durham, and average perception of city services.
Our 95% confidence interval is (0.980, 0.995); we are 95% confident that the odds ratio is in these bounds (if we were to repeatedly perform this analysis in independent samples from the same population and construct confidence intervals in the same way, we would expect 95% of such intervals to contain the true odds ratio).
It appears that longer length of time spent in Durham is associated with lower perceived quality of life after adjusting for rental status, overall image of Durham, and perception of city services, but given the small magnitude of the odds ratio (i.e., close to 1), this decrease in perceived QOL through time occurs slowly.
Grading note - I used the model formulation from 3.1, but some students used the formulation from 3.7 and that was ok as well. The difficulty is that they would’ve had to additionally interpret the interaction term for the main (non-interaction) effect.