Lab 09: Multinomial Logistic Regression

due Thu, Apr 16 at 11:59p EDT

The General Social Survey (GSS) has been used to measure trends in attitudes and behaviors in American society since 1972. In addition to collecting demographic information, the survey includes questions used to gauge attitudes about government spending priorities, confidence in institutions, lifestyle, and many other topics. A full description of the survey may be found here.

In today’s lab, we will use multinomial logistic regression to understand the relationship between a person’s political views and their attitudes towards government spending on mass transportation projects. To do so, we will use data from the 2016 GSS.

Getting Started

library(usethis)
use_git_config(user.name="github username", user.email="your email")

Password caching

If you would like your git password cached for a week for this project, type the following in the Terminal:

git config --global credential.helper 'cache --timeout 604800'

You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.

Packages

You will need the following packages for today’s lab:

library(tidyverse)
library(nnet)
library(knitr)
library(broom)
# Fill in other packages as needed

Data

The data for this lab are from the 2016 General Social Survey. The original data set contains 2867 observations and 935 variables. Given the size of the dataset, we will handle it differently in our workflow than we’ve handled data in previous assignments.

Working with large files

The size of this dataset is 34.3 MB. Compare that to the Spotify dataset from last week’s lab, which was 149 KB (0.149 MB)! GitHub will not allow you to push files larger than 100 MB and will give you a warning when you push files as large as 50 MB. Though we could push the file we’re working with today to GitHub, it’s large enough that we’d still prefer not to.

You may have noticed that each repo contains a file called .gitignore. It contains a list of the files you don’t want to commit or push to GitHub. If you look at the .gitignore file for today’s lab, you will notice that gss2016.csv is listed at the bottom.

You will use the following variables in the lab:

Use the code below to read in the data.

gss <- read_csv("data/gss2016.csv",
  na = c("", "Don't know", "No answer", 
         "Not applicable"), 
         guess_max = 2867) %>%
  select(natmass, age, sex, sei10, region, polviews) %>%
  drop_na()

The argument guess_max = 2867 tells the read_csv function to use all of the observations in a column to determine its data type. Without this argument, only the first 1,000 observations would be used to make this determination. This becomes important for a variable like age; though age is coded as numeric data for most of the observations, there are some in which age is coded as "89 or older". Without the guess_max argument, you will get warnings when loading the data.
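As a rough illustration of the type-guessing behavior described above, the sketch below builds a small stand-in file rather than using gss2016.csv. The column values, the temporary file, and the object names (ages, toy_path, toy) are all made up for this example.

```r
library(readr)

# Toy stand-in for the GSS file (hypothetical values):
# 1,500 numeric-looking ages followed by a single text value.
ages <- c(as.character(rep(18:67, times = 30)), "89 or older")
toy_data <- data.frame(
  age = ages,
  natmass = rep(c("About right", "No answer", "Too little"),
                length.out = length(ages))
)
toy_path <- tempfile(fileext = ".csv")
write_csv(toy_data, toy_path)

# Setting guess_max to the number of rows lets every observation inform
# the guessed column type, so age is read as character.
toy <- suppressMessages(
  read_csv(toy_path,
           na = c("", "No answer"),
           guess_max = length(ages))
)

class(toy$age)           # character, because "89 or older" was seen
sum(is.na(toy$natmass))  # "No answer" entries were converted to NA
```

Because age comes in as character here, it still needs to be recoded before it can be used as a quantitative predictor.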

Note also that only the variables of interest are kept after the file is read, not the entire dataset. This will make for faster computation and knitting as you work on the lab.

Exercises

Part I: Exploratory Data Analysis

See Reorder factor levels by hand for documentation about fct_relevel.
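A minimal sketch of how fct_relevel behaves, using a toy factor. The level names here are invented for illustration and may not match the spelling in the GSS data.

```r
library(forcats)

# Toy factor; by default the levels are sorted alphabetically.
views <- factor(c("Moderate", "Liberal", "Conservative", "Moderate"))
levels(views)

# Naming a single level moves it to the front, which makes it the baseline.
baseline_first <- fct_relevel(views, "Moderate")
levels(baseline_first)

# Naming every level sets the full order explicitly.
ordered_views <- fct_relevel(views, "Liberal", "Moderate", "Conservative")
levels(ordered_views)
```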

  1. The variable natmass will be the response variable in the model, and you want to compare more opinionated views to the moderate position. Recode natmass so it is a factor variable with "About right" as the baseline.

  2. Recode polviews so it is a factor variable with levels in an order that is consistent with the question on the survey. Note how the categories are spelled in the data.

    Make a plot of the distribution of polviews. Which political view occurs most frequently in this data set?

  3. Make a plot displaying the relationship between natmass and polviews. Use the plot to describe the relationship between a person’s political views and their views on mass transportation spending.

  4. You want to use age as a quantitative variable in your model; however, it is currently a character data type because some observations are coded as "89 or older". Recode age so that it is a numeric variable. Note: Before making the variable numeric, you will need to replace the values "89 or older" with a single value.
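One possible pattern for the age recoding, shown on a three-row toy tibble. Replacing "89 or older" with 89 is an assumption made for this sketch; you may decide another single value is more appropriate.

```r
library(dplyr)

# Toy data with the same mixed coding as the GSS age variable.
toy <- tibble(age = c("34", "89 or older", "61"))

toy <- toy %>%
  mutate(
    # Assumption: treat "89 or older" as 89 before converting.
    age = if_else(age == "89 or older", "89", age),
    age = as.numeric(age)
  )

toy$age
```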

Part II: Multinomial Logistic Regression Model

  1. You plan to fit a model using age, sex, sei10, and region to understand variation in opinions about spending on mass transportation. Briefly explain why you should fit a multinomial logistic model.

  2. Fit the model described in the previous exercise and display the model output. Make any necessary adjustments to the variables so the intercept will have a meaningful interpretation. Be sure “About right” is the baseline level. Be sure the full model displays in the knitted document.

  3. Interpret the intercept associated with the odds of having an opinion of “Too much” versus “About right”.

  4. Consider the relationship between age and one’s opinion about spending on mass transportation. Interpret the coefficient of age in terms of the odds of having an opinion of “Too little” versus “About right”.

  5. Now that you have adjusted for some demographic factors, let’s examine whether a person’s political views have a significant impact on their attitude towards spending on mass transportation.

    Conduct the appropriate test to determine if polviews is a significant predictor of attitude towards spending on mass transportation. State the null and alternative hypotheses, display all relevant code and output, and state your conclusion in the context of the problem.

    Choose the appropriate model based on the results from the test. Use this model for the next part of the lab.
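The general workflow in this part can be sketched on simulated data. Everything below (the toy data frame, its column names, and the random responses) is invented for illustration, and a likelihood ratio test is used here as one common way to compare nested multinomial models; it is not necessarily the only acceptable approach.

```r
library(nnet)

set.seed(2016)
n <- 300

# Simulated stand-in data; the response is random, so no real effect is expected.
toy <- data.frame(
  age = round(runif(n, 18, 89)),
  views = factor(sample(c("Liberal", "Moderate", "Conservative"), n, replace = TRUE)),
  opinion = factor(sample(c("About right", "Too little", "Too much"), n, replace = TRUE))
)

# Make "About right" the baseline and mean-center age so the intercept
# refers to a person of average age.
toy$opinion <- relevel(toy$opinion, ref = "About right")
toy$age_cent <- toy$age - mean(toy$age)

reduced <- multinom(opinion ~ age_cent, data = toy, trace = FALSE)
full <- multinom(opinion ~ age_cent + views, data = toy, trace = FALSE)

# Exponentiated coefficients: one row per non-baseline category,
# each interpreted relative to "About right".
exp(coef(reduced))

# Likelihood ratio (drop-in-deviance) test computed by hand.
lr_stat <- deviance(reduced) - deviance(full)
lr_df <- full$edf - reduced$edf
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# The same comparison using the anova method for multinom fits.
anova(reduced, full, test = "Chisq")
```

With three response categories, each added term contributes one coefficient per non-baseline category, which is why a three-level predictor adds four parameters in this toy example.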

Part III: Model Fit

  1. Calculate the predicted probabilities and residuals from your model.

  2. Let’s make some of the plots and tables you use to check the linearity assumption for multinomial logistic regression. Plot the binned residuals versus the predicted probabilities for each category of natmass. You will have three plots.

You can change the size of your plots, so you can fit multiple plots on a single page. Include the arguments fig.height = and fig.width = in the header of the code chunk to change the plot size. See Using R Markdown for an example.

  3. To examine the residuals versus each categorical predictor, you will look at the average residuals for each category of the categorical variables.

    • For each category of natmass, calculate the average residuals across categories of region.
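The calculations in this part can be sketched as follows on simulated data. The toy tibble, its columns, and the choice of ten bins are assumptions made for illustration; only one response category is shown, and the same steps would be repeated for the others.

```r
library(nnet)
library(dplyr)

set.seed(9)
n <- 400

# Simulated stand-in data with one numeric and one categorical predictor.
toy <- tibble(
  x = rnorm(n),
  region = sample(c("north", "south"), n, replace = TRUE),
  y = factor(sample(c("About right", "Too little", "Too much"), n, replace = TRUE))
)

fit <- multinom(y ~ x + region, data = toy, trace = FALSE)

# Predicted probabilities: one column per response category.
probs <- predict(fit, type = "probs")

# Residual for a category = indicator of that category - predicted probability.
toy <- toy %>%
  mutate(
    p_too_much = probs[, "Too much"],
    r_too_much = as.numeric(y == "Too much") - p_too_much
  )

# Binned residuals: average residual within bins of predicted probability.
binned <- toy %>%
  mutate(bin = ntile(p_too_much, 10)) %>%
  group_by(bin) %>%
  summarise(mean_prob = mean(p_too_much), mean_resid = mean(r_too_much))

# Average residuals across the categories of a categorical predictor.
by_region <- toy %>%
  group_by(region) %>%
  summarise(mean_resid = mean(r_too_much))

binned
by_region
```

Plotting mean_resid against mean_prob from binned gives the kind of display a binned residual plot shows for that category.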

Based on the plot and table above, discuss with your group whether there are any obvious violations of the linearity assumption. Note that we haven’t examined all of the plots and tables of the residuals needed to make an assessment about the linearity assumption.

The other assumptions are randomness and independence. Discuss with your group whether these assumptions are satisfied for this analysis.

Part IV: Using the Model

  1. Use your model to describe the relationship between one’s political views and their attitude towards spending on mass transportation.

  2. Use your model to predict the category of natmass for each observation in your dataset. Display a table of the actual versus the predicted natmass. What is the misclassification rate?
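A small sketch of the actual versus predicted table and the misclassification rate, using hand-made vectors instead of model output. In the lab, the predicted categories could come from something like predict(model, type = "class") for a multinom fit; the values below are invented.

```r
lv <- c("About right", "Too little", "Too much")

# Hypothetical actual and predicted categories for five observations.
actual <- factor(
  c("Too little", "About right", "Too much", "About right", "Too little"),
  levels = lv
)
predicted <- factor(
  c("Too little", "About right", "About right", "About right", "Too much"),
  levels = lv
)

# Rows are the actual categories, columns the predicted categories.
confusion <- table(actual = actual, predicted = predicted)
confusion

# Misclassification rate: proportion of observations predicted incorrectly.
misclass_rate <- mean(predicted != actual)
misclass_rate  # 2 of the 5 toy observations are misclassified, so 0.4
```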

Submitting the Assignment

See “Submitting the Assignment” from Lab 01 for detailed instructions on how to upload the assignment on GitHub.

Grading

Labs will be graded for completion, using the following:

Lab completion (50 pts):

Formatting (possible point deductions)

Acknowledgements

The “Data” section is largely inspired by datasciencebox.org.