Exam 2 Solutions

STA 210 Spring 2023 (Jiang)

library(tidyverse)
library(tidymodels)
dat <- read_csv("data/durham_2020.csv")
Exercise 1

Instructions, data description, and Duke Community Standard

As noted in the instructions and course syllabus, any evidence of academic dishonesty will result in a failing grade for the course.

Exercise 2

As previously mentioned, the primary research question of interest is whether long-time residents of Durham have differential perceptions of overall quality of life (the qol_durham variable) compared to more recent transplants.

Exercise 2.1

Provide a specific example in the context of the dataset of how the data for overall quality of life might be missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR).

Exercise 2.2

Provide two reasons for why a multinomial model does not make sense for the desired analysis. Your first reason should specifically reference the nature of the outcome variable and your second should specifically reference assumptions needed for the multinomial regression model itself.

Exercise 2.3

Create a new variable corresponding to the respondent’s average perception of the quality of city services. This variable should be created by averaging their responses to quality of police services, fire department, EMS, public transit, water utilities, park system, and public schools (among non-missing values for these variables).

You may treat both this variable as a continuous variable; discard any observations that do not have any ratings at all for each of the seven city services listed (i.e., if you cannot calculate an average city services rating).

What is the average city services score among all survey respondents? Provide your answer to at least 3 decimal places. Propose an alternate way to use the seven responses to city services in a regression model and discuss the relative advantages and disadvantages compared to this single summarized score.

Exercise 2.1

MCAR - the mail sorting office messed up with some of the returned envelopes. MAR - male residents might be less inclined to answer the survey compared to non-male residents (gender was included in the dataset). NMAR - people with a more neutral opinion of Durham might be less inclined to fill out the survey.

Exercise 2.2

The outcome is ordinal, not nominal (QOL perception was ordered on a Likert scale of very low to very high). As well, the multinomial assumption of independence of irrelevant alternatives is not met. For instance, if someone really has a great impression of Durham (5), if this option were removed, they would most likely answer whatever the highest rating would be (4).

Exercise 2.3

# Code will vary. Here's an example with dplyr

dat <- dat %>% 
  filter(!is.na(qual_police) | !is.na(qual_fire) | !is.na(qual_ems) |
           !is.na(qual_public_transit) | !is.na(qual_water) |
           !is.na(qual_parks) | !is.na(qual_pub_school)) %>% 
  rowwise() %>% 
  mutate(services = mean(c(qual_police, qual_fire, qual_ems, 
                           qual_public_transit, qual_water,
                           qual_parks, qual_pub_school), na.rm = T))

mean(dat$services)
[1] 3.592563

Instead of calculating a single score, we could use each of the scores individually. However, one big problem is in missing data - in creating regression models, R automatically conducts a complete case analysis such that if any variables are missing, the observation gets dropped. Only about 1/3 of the observations have complete data for all seven services, which both decreases our power and also introduces bias in regression estimates if these are MAR (which is likely the case here). On the other hand, an advantage of this method would be that we can look at conditional relationships between satisfaction with individual services and any outcome of interest (here, perceived quality of life).

Exercise 3

For now, let’s examine a binary outcome where a score of 4 or 5 on qol_durham indicates “satisfied with quality of life in Durham” and 1, 2, or 3 indicates “not satisfied with quality of life in Durham.”

Fit a logistic regression model that examines the association between years lived in Durham and whether the respondent is satisfied with overall quality of life in Durham, while controlling for whether the respondent rents or owns their residence, the respondent’s overall image of Durham, and the respondent’s average perception of the quality of city services (the variable you created previously).

For the purposes of this exam, conduct a complete case analysis.

Exercise 3.1

Paste your summarized model output in the space provided (you may also upload an image if you prefer).

Exercise 3.2

Interpret the coefficient corresponding to years spent in Durham in your regression model as an odds ratio.

Exercise 3.3

Evaluate whether the linearity assumption for years spent in Durham might be reasonable. You may upload any supporting information in the field provided.

Exercise 3.4

The Durham resident quality of life survey data was collected by sending mailed surveys to a randomly selected number of addresses in Durham county. Explain specifically how independence could hypothetically be violated in the context of this dataset.

Exercise 3.5

What is the highest possible predicted probability of being satisfied with overall quality of life in Durham that your model would assign to a resident?

Show your work and report your answer to at least 3 decimal places. If this question cannot be answered, explain why.

Exercise 3.6

Suppose you predict residents as being satisfied with overall quality of life based on whether their predicted probability of being satisfied was greater than 0.5.

What is the prevalence of satisfied residents among the survey? What are the sensitivity, specificity, positive predictive value, and negative predictive value of your model in predicting satisfied status? Round your answers to at least 3 decimal places (fractions are fine too).

Exercise 3.7

Using a different regression model if needed, is there sufficient evidence to suggest that the association between years spent in Durham and overall satisfaction with quality of life depends on whether a resident rents or owns? Conduct a comprehensive formal hypothesis test at the \[\alpha = 0.05\] level.

Exercise 3.8

Now we will examine the original outcome variable corresponding to overall quality of life in Durham (qol_durham) using an ordinal regression model.

Fit an appropriate regression model examining whether there is an association between overall perceived quality of life and years spent in Durham, while controlling for the same predictors as above.

Interpret the coefficient corresponding to years spent in Durham in your regression model as an odds ratio, and additionally provide and interpret a 95% confidence interval for this odds ratio.

What might you conclude regarding perceptions of overall quality of life vis-�-vis time spent in Durham from this model?

Exercise 3.1

dat <- dat %>% 
  mutate(satisfied = ifelse(qol_durham > 3, 1, 0))

m1 <- glm(satisfied ~ years_in_durham + 
            rent_own + as.factor(overall_image) + services, 
          data = dat,
          family = "binomial")

summary(m1)

Call:
glm(formula = satisfied ~ years_in_durham + rent_own + as.factor(overall_image) + 
    services, family = "binomial", data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6808  -0.6862   0.2241   0.5121   2.6356  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)               -5.927181   0.777573  -7.623 2.48e-14 ***
years_in_durham           -0.013564   0.005193  -2.612  0.00901 ** 
rent_ownrent               0.380590   0.208917   1.822  0.06850 .  
as.factor(overall_image)2  1.448124   0.522418   2.772  0.00557 ** 
as.factor(overall_image)3  1.647810   0.504168   3.268  0.00108 ** 
as.factor(overall_image)4  4.173254   0.521696   7.999 1.25e-15 ***
as.factor(overall_image)5  5.906382   1.122257   5.263 1.42e-07 ***
services                   1.117906   0.172829   6.468 9.91e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1043.37  on 775  degrees of freedom
Residual deviance:  627.65  on 768  degrees of freedom
  (65 observations deleted due to missingness)
AIC: 643.65

Number of Fisher Scoring iterations: 7

Exercise 3.2

exp(-0.013564)
[1] 0.9865276

In comparing two residents with a one year difference in their length of residence we predict the resident that has spent one more year to have 0.9865 times the odds of being satisfied compared to the other resident, while controlling for rental status, overall image of Durham, and average perception of city services.

Exercise 3.3

library(Stat2Data)

emplogitplot1(satisfied ~ years_in_durham, 
              data = dat,
              ngroups = 10)

Linearity appears reasonable as given by the empirical logit plot above.

Exercise 3.4

Perhaps the mailed surveys are related somehow, for example if some addresses were in the same apartment complexes. In this way, respondents might have similar answers due to this clustering (e.g., if an apartment complex is right next to an amazing park or local transit center).

Exercise 3.5

From our regression results, we know that shorter time spent in Durham, renting instead of owning, and higher overall image and average services scores are associated with larger log-odds of satisfaction. Thus to maximize the predicted log-odds (which will in turn maximize predicted probability), we would want a hypothetical resident with 0 years in Durham who rents, has an overall image score of 5, and an average services rating of 5.

new <- data.frame(years_in_durham = 0,
                  rent_own = "rent",
                  overall_image = 5,
                  services = 5)

xb <- augment(m1, newdata = new) %>% 
  pull(.fitted)

exp(xb)/(1 + exp(xb))
[1] 0.9973992

Although this person doesn’t actually exist in the dataset, it is the theoretical maximum given our model results.

Exercise 3.6

Prevalence:

table(dat$satisfied)

  0   1 
323 486 
486 / (486 + 323) # prevalence
[1] 0.6007417

Sensitivity, specificity, PPV, NPV:

preds <- augment(m1)

table(preds$satisfied, preds$.fitted > 0)
   
    FALSE TRUE
  0   251   58
  1    89  378
378 / (378 + 89) # sensitivity
[1] 0.8094218
251 / (251 + 58) # specificity
[1] 0.8122977
378 / (378 + 58) # PPV
[1] 0.8669725
251 / (251 + 89) # NPV
[1] 0.7382353

Exercise 3.7

m2 <- glm(satisfied ~ years_in_durham + 
            rent_own + as.factor(overall_image) + services + 
            years_in_durham * rent_own, 
          data = dat,
          family = "binomial")

summary(m2)

Call:
glm(formula = satisfied ~ years_in_durham + rent_own + as.factor(overall_image) + 
    services + years_in_durham * rent_own, family = "binomial", 
    data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6802  -0.6839   0.2241   0.5112   2.6408  

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  -5.9211145  0.7865030  -7.528 5.14e-14 ***
years_in_durham              -0.0139085  0.0084720  -1.642  0.10065    
rent_ownrent                  0.3658113  0.3552467   1.030  0.30313    
as.factor(overall_image)2     1.4491609  0.5228225   2.772  0.00557 ** 
as.factor(overall_image)3     1.6489597  0.5046722   3.267  0.00109 ** 
as.factor(overall_image)4     4.1746028  0.5223889   7.991 1.33e-15 ***
as.factor(overall_image)5     5.9073270  1.1224233   5.263 1.42e-07 ***
services                      1.1184554  0.1731627   6.459 1.05e-10 ***
years_in_durham:rent_ownrent  0.0005485  0.0106643   0.051  0.95898    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1043.37  on 775  degrees of freedom
Residual deviance:  627.64  on 767  degrees of freedom
  (65 observations deleted due to missingness)
AIC: 645.64

Number of Fisher Scoring iterations: 7

We will conduct a hypothesis test comparing the null hypothesis that the interactiont term is 0 vs. the alternative that it is 0 at the 0.05 significance level. The test statistic is 0.051, which has a standard normal distribution under the null hypothesis. This corresponds to a p-value of 0.959, and so we fail to reject the null. There is insufficient evidence to suggest that the relationship between years spent in Durham and overall satisfaction with quality of life depends on whether a resident rents or owns.

Grading note - some students did equivalent tests (e.g., likelihood ratio test comparing nested models, F tests, etc. That was fine too).

Exercise 3.8

library(MASS)
m3 <- polr(as.factor(qol_durham) ~ years_in_durham + 
            rent_own + as.factor(overall_image) + services,
          data = dat)

summary(m3)
Call:
polr(formula = as.factor(qol_durham) ~ years_in_durham + rent_own + 
    as.factor(overall_image) + services, data = dat)

Coefficients:
                             Value Std. Error t value
years_in_durham           -0.01292   0.003896  -3.316
rent_ownrent               0.19664   0.156076   1.260
as.factor(overall_image)2  1.37914   0.302362   4.561
as.factor(overall_image)3  1.99684   0.290112   6.883
as.factor(overall_image)4  4.11330   0.328141  12.535
as.factor(overall_image)5  7.18170   0.467162  15.373
services                   1.11251   0.124606   8.928

Intercepts:
    Value   Std. Error t value
1|2  1.9209  0.4541     4.2302
2|3  3.8098  0.4700     8.1053
3|4  5.9129  0.5024    11.7692
4|5  9.9445  0.5744    17.3115

Residual Deviance: 1449.41 
AIC: 1471.41 
(65 observations deleted due to missingness)
exp(coef(m3))
          years_in_durham              rent_ownrent as.factor(overall_image)2 
                0.9871646                 1.2173067                 3.9715022 
as.factor(overall_image)3 as.factor(overall_image)4 as.factor(overall_image)5 
                7.3657320                61.1481193              1315.1378597 
                 services 
                3.0419770 
exp(confint(m3))
                                2.5 %       97.5 %
years_in_durham             0.9796311    0.9947144
rent_ownrent                0.8965353    1.6535573
as.factor(overall_image)2   2.2079480    7.2347260
as.factor(overall_image)3   4.1990207   13.1141258
as.factor(overall_image)4  32.4510230  117.6120172
as.factor(overall_image)5 537.5897194 3369.8529588
services                    2.3878593    3.8931377

For the ordered outcome of QOL in Durham going from 1 (very dissatisfied) through 5 (very satisfied), in comparing two residents with a one year difference in their length of residence we predict the resident that has spent one more year to have 0.987 times the odds of being in the next higher satisfaction category compared to the other resident, while controlling for rental status, overall image of Durham, and average perception of city services.

Our 95% confidence interval is (0.980, 0.995); we are 95% confident that the odds ratio is in these bounds (if we were to repeatedly perform this analysis in independent samples from the same population and construct confidence intervals in the same way, we would expect 95% of such intervals to contain the true odds ratio).

It appears that longer length of time spent in Durham is associated with lower perceived quality of life after adjusting for rental status, overall image of Durham, and perception of city services, but given the small magnitude of the odds ratio (i.e., close to 1), this decrease in perceived QOL through time occurs slowly.

Grading note - I used the model formulation from 3.1, but some students used the formulation from 3.7 and that was ok as well. The difficulty is that they would’ve had to additionally interpret the interaction term for the main (non-interaction) effect.