Lab 08: Multinomial Logistic Regression

due Wed, Apr 3 at 11:59p

The General Social Survey (GSS) has been used to measure trends in attitudes and behaviors in American society since 1972. In addition to collecting demographic information, the survey includes questions used to gauge attitudes about government spending priorities, confidence in institutions, lifestyle, and many other topics. A full description of the survey may be found here.

In today’s lab, we will use multinomial logistic regression to understand the relationship between a person’s political views and their attitudes towards government spending on mass transportation projects. To do so, we will use data from the 2010 GSS survey. Refer to the Multinomial Logistic Regression notes for help with concepts and code.

Getting Started

When configuring Git, be sure to use the email address that is associated with your GitHub account.

library(usethis)
use_git_config(user.name="your name", user.email="your email")

Password caching

If you would like your git password cached for a week for this project, type the following in the Terminal:

git config --global credential.helper 'cache --timeout 604800'

You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.

Packages

You will need the following packages for today’s lab:

library(tidyverse)
library(nnet)
library(knitr)
library(broom)
# Fill in other packages as needed

Project name:

Currently your project is called Untitled Project. Update the name of your project to the title of today’s lab.

Warm up

Before we introduce the data, let’s warm up with a simple exercise.

YAML:

Data

The data for this lab is from the 2016 General Social Survey. The original data set contains 2867 observations and 935 variables. Given the size of the dataset, we will handle it differently in our workflow than we’ve handled data in previous assignments.

Working with large files

The size of this dataset is 34.3 MB. Compare that to the Spotify dataset from last weeks’ lab which was 149 KB (0.149 MB)! GitHub will not allow you to push files larger than 100 MB and will give you a warning when you push files as large as 50 MB. Though we could push the file we’re working with today to GitHub, it’s large enough that we’d still prefer not to.

You have may noticed that each repo contains a file called .gitignore. It contains a list of the files you don’t want commit or push to GitHub. If you look at the .gitignore file for today’s lab, you will notice that gss2016.csv is listed at the bottom.

You will use the following variables in the lab:

Use the code below to read in the data.

gss <- read_csv("data/gss2016.csv",
  na = c("", "Don't know", "No answer", 
         "Not applicable"), 
         guess_max = 2867) %>%
  select(natmass, age, sex, sei10, region, polviews) %>%
  drop_na()

The argument guess_max = 2867 tells the read_csv function to use all of the observations in a column to determine its data type. Without this argument, only the first 1,000 observations would be used to make this determination. This becomes important for a variable like age; though age is coded as numeric data for most of the observations, there are some in which age is coded as "89 or older". Without the guess_max argument, you will get warnings when loading the data.

Note also that only the variables of interest will be loaded, not the entire dataset. This will make for faster computation and knitting as you work on the lab.

Exercises

Part I: Exploratory Data Analysis

See Reorder factor levels by hand for documentation about fct_relevel.

  1. The variable natmass will be the response variable in the model, and you want to compare more opinionated views to the moderate position. Recode natmass so it is a factor variable with "About right" as the baseline.

  2. Recode polviews so it is a factor variable type with levels that are in an order that is consistent with question on the survey. Note how the categories are spelled in the data.

Make a plot of the distribution of polviews. Which political view occurs most frequently in this data set?

  1. Make a plot displaying the relationship between natmass and polviews. Use the plot to describe the relationship between a person’s political views and their views on mass transportation spending.

  2. You want to use age as a quantitative variable in your model; however, it is currently a character data type because some observations are coded as "89 or older". Recode age so that is a numeric variable. Note: Before making the variable numeric, you will need to replace the values "89 or older" with a single value.

Part II: Multinomial Logistic Regression Model

  1. You plan to fit a model using age, sex, sei10, and region to understand variation in opinions about spending on mass transportation. Briefly explain why you should fit a multinomial logistic model.

  2. Fit the model described in the previous exercise and display the model output. Make any necessary adjustments to the variables so the intercept will have a meaningful interpretation. Be sure About Right is the baseline level. Be sure the full model displays in the knitted document.

  3. Interpret the intercept associated with odds of having an opinion of “Too much” versus “About right”.

  4. Consider the relationship between age and one’s opinion about spending on mass transportation.

    • Interpret the coefficient of age in terms of the log odds of having an opinion of “Too little” versus “About right”.
    • Interpret the coefficient of age in terms of the odds of having an opinion of “Too little” versus “About right”.
  5. In general, what is the relationship between an person’s age and their opinions on mass transportation spending?

Now that you have adjusted for some demographic factors, let’s examine whether a person’s political views has a significant impact on their attitude towards spending on mass transportation.

  1. Conduct the appropriate test to determine if polviews is a significant predictor of attitude towards spending on mass transportation. State the null and alternative hypothesis, display all relevant code and output, and state your conclusion in the context of the problem.

Choose the appropriate model based on the results from the test. Use this model for the next part of the lab.

Part III: Model Fit

  1. Calculate the predicted probabilities and residuals from your model.

  2. Plot the binned residuals versus the predicted probabilities for each category of natmass. You will have three plots.

You can change the size of your plots, so you can fit multiple plots on a single page. Include the arguments fig.height = and fig.width = in the header of the code chunk to change the plot size. See Using R Markdown for an example.

  1. Use binned residual plots to examine the residuals versus each of the quantitative variables.

    • Create binned plots of the residuals for each category of natmass versus age. You will have three plots.
    • Create binned plots of the residuals for each category of natmass versus sei10. You will have three plots.
  2. To examine the residuals versus each categorical predictor, you will look at the average residuals for each each category of the categorical variables.

    • For each level of natmass, calculate the average residuals across categories of sex.
    • For each category of natmass, calculate the average residuals across categories of region.
    • For each category of natmass, calculate the average residuals across categories of polviews.
  3. Based on the analysis of the residuals in Exercises 12 - 14, is the model an appropriate fit for the data? Explain.

Regardless of your asssesment of the residuals, use your model for the remainder of the lab.

Part IV: Using the Model

  1. Use your model to describe the relationship between one’s political views and their attitude towards spending on mass transportation.

  2. Use your model to predict the category of natmass for each observation in your dataset. Display a table of the actual versus the predicted natmass. What is the misclassification rate?

Acknowledgements

The “Data” section is largely inspired by datasciencebox.org.