The Quarto template for this assignment may be found in the
repository at the following link: https://classroom.github.com/a/9yj-vFdX
In today’s homework we will revisit the NC county-level data
compiled by the Robert Wood Johnson Foundation from a variety of
sources, including the American Community Survey and public health
surveillance systems. There are eight variables in this dataset:
- county: the name of the county
- life_expect: the mean life expectancy in years for this county
- obesity: the percentage of the county that is considered obese (BMI 30+)
- phys_inactive: the percentage of adults age 20+ reporting no
leisure-time physical activity
- uninsured: the percentage of adults under age 65 that do not have
health insurance
- long_commute: among workers who commute in their car alone, the
percentage that commute more than 30 minutes.
- hhi: the median household income in the county, in thousands.
- urbanicity: a categorical variable with levels of “rural,”
“semirural,” “semiurban,” and “urban,” corresponding to percentages of
residents who live in cities. “Rural” is defined as 0-25% living in
cities, “semirural” as 25-50%, “semiurban” as 50-75%, and “urban” as
75-100% of residents living in cities.
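
The urbanicity definition above can be sketched in code. This is only an illustration, not part of the dataset or the template: the names `URBANICITY_LEVELS`, `urbanicity_from_pct`, and `level_rank` are hypothetical, and because the stated ranges share their endpoints (25, 50, 75), the sketch assumes each boundary value falls in the higher category.

```python
# Hypothetical helper names; the dataset already ships with urbanicity computed.
URBANICITY_LEVELS = ["rural", "semirural", "semiurban", "urban"]


def urbanicity_from_pct(pct_in_cities: float) -> str:
    """Map the percentage of residents living in cities to an urbanicity level."""
    if not 0 <= pct_in_cities <= 100:
        raise ValueError("percentage must be between 0 and 100")
    # Assumption: boundary values (25, 50, 75) belong to the higher category.
    if pct_in_cities < 25:
        return "rural"
    if pct_in_cities < 50:
        return "semirural"
    if pct_in_cities < 75:
        return "semiurban"
    return "urban"


def level_rank(level: str) -> int:
    """Position of a level on the ordered scale (0 = rural, 3 = urban)."""
    return URBANICITY_LEVELS.index(level)
```

The point of `level_rank` is that the four levels carry a natural order, which is what an ordinal model relies on and what a multinomial model ignores.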
- Fit an ordinal regression model predicting urbanicity with all other
variables in the model (except county ID). Interpret the slope parameter
corresponding to life expectancy. Using your model, what urbanicity is
Durham county predicted to be? (Note: luckily, in this case the levels of
the factor variable are already in order. This won’t always be the case!)
- Explain what the proportional odds assumption means in the context
of this regression problem. As part of this answer, explain what it
might mean in context if this assumption were to be violated.
- Now fit a multinomial regression model predicting urbanicity with
all other variables in the model. Interpret each of the slope parameters
corresponding to life expectancy. Using your model, what urbanicity is
Durham county predicted to be? Provide the predicted probabilities for
each of the four urbanicity categories.
- Explain what the independence of irrelevant alternatives assumption
means in the context of this regression problem. As part of this answer,
explain what it might mean in context if this assumption were to be
violated.
- Which model do you think is more appropriate to use for these data?
Explain.
- Now fit a linear model predicting life expectancy based on
the other variables in your model, as in HW 3. Which two counties have
the highest influence in this model? Are you particularly worried that
they are too influential? Explain.
- Whom are you working with for your project (if anyone)? What dataset
are you using? How many observations are there, how many variables are
there, and what is your potential research question?
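
For the ordinal and multinomial questions above, it may help to see how each model turns a linear predictor into probabilities for the four urbanicity categories. The sketch below is a minimal illustration under stated assumptions, not the course's solution: it assumes the parameterization logit P(Y <= j) = cutpoint_j - eta for the proportional odds model and a baseline-category (softmax) form for the multinomial model. All function names and every number in the usage example are made up.

```python
import math

LEVELS = ["rural", "semirural", "semiurban", "urban"]


def logistic(x: float) -> float:
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def ordinal_probs(cutpoints, eta):
    """Category probabilities under a proportional odds model.

    Assumes logit P(Y <= j) = cutpoints[j] - eta, with ONE shared eta
    (one slope per predictor) across all cutpoints.
    """
    cumulative = [logistic(c - eta) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for cum in cumulative:
        probs.append(cum - previous)  # P(Y = j) = P(Y <= j) - P(Y <= j-1)
        previous = cum
    return probs


def multinomial_probs(etas):
    """Category probabilities under a baseline-category logit model.

    The first level is the baseline (linear predictor fixed at 0); each
    other level has its OWN linear predictor, hence its own slopes.
    """
    scores = [0.0] + list(etas)
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_level(probs):
    """Return the level with the highest predicted probability."""
    return LEVELS[probs.index(max(probs))]


if __name__ == "__main__":
    # Made-up cutpoints and linear predictors, for illustration only.
    p_ord = ordinal_probs([-1.0, 0.5, 2.0], eta=0.3)
    p_mult = multinomial_probs([0.4, -0.2, 1.1])
    print(predicted_level(p_ord), [round(p, 3) for p in p_ord])
    print(predicted_level(p_mult), [round(p, 3) for p in p_mult])
```

The contrast to notice is structural: the ordinal sketch takes a single `eta` shared across all cutpoints, while the multinomial sketch takes a separate linear predictor for each non-baseline category.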