The General Social Survey (GSS) has been used to measure trends in attitudes and behaviors in American society since 1972. In addition to collecting demographic information, the survey includes questions used to gauge attitudes about government spending priorities, confidence in institutions, lifestyle, and many other topics. A full description of the survey is available on the GSS website.
In today’s lab, we will use multinomial logistic regression to understand the relationship between a person’s political views and their attitudes towards government spending on mass transportation projects. To do so, we will use data from the 2016 GSS. Refer to the Multinomial Logistic Regression notes for help with concepts and code.
Go to the STA210-Sp19 organization on GitHub (https://github.com/STA210-Sp19). Click on the repo with the prefix lab-08-. It contains the starter documents you need to complete the lab.
Clone the repo and create a new project in RStudio Cloud.
Configure git by typing the following in the console.
When configuring Git, be sure to use the email address that is associated with your GitHub account.
library(usethis)
use_git_config(user.name="your name", user.email="your email")
If you would like your git password cached for a week for this project, type the following in the Terminal:
git config --global credential.helper 'cache --timeout 604800'
You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.
You will need the following packages for today’s lab:
library(tidyverse)
library(nnet)
library(knitr)
library(broom)
# Fill in other packages as needed
Currently your project is called Untitled Project. Update the name of your project to the title of today’s lab.
Before we introduce the data, let’s warm up with a simple exercise.
Pick one team member to update the author and date fields at the top of the R Markdown file. Knit, commit, and push all the updated documents to GitHub.
Now, the remaining team members who have not been concurrently making these changes on their projects should click on the Pull button in their Git pane and observe that the changes are now reflected on their projects as well.
The data for this lab is from the 2016 General Social Survey. The original data set contains 2867 observations and 935 variables. Given the size of the dataset, we will handle it differently in our workflow than we’ve handled data in previous assignments.
The size of this dataset is 34.3 MB. Compare that to the Spotify dataset from last week’s lab, which was 149 KB (0.149 MB)! GitHub will not allow you to push files larger than 100 MB and will give you a warning when you push files as large as 50 MB. Though we could push the file we’re working with today to GitHub, it’s large enough that we’d still prefer not to.
You may have noticed that each repo contains a file called .gitignore. It contains a list of the files you don’t want to commit or push to GitHub. If you look at the .gitignore file for today’s lab, you will notice that gss2016.csv is listed at the bottom.
Download gss2016.csv.
Upload gss2016.csv into the data folder of your project.
Notice that gss2016.csv does not appear in your Git pane. This is because it is being ignored by git, since it is listed in the .gitignore file.
You will use the following variables in the lab:
natmass: Respondent’s answer to the following prompt:
“We are faced with many problems in this country, none of which can be solved easily or inexpensively. I’m going to name some of these problems, and for each one I’d like you to tell me whether you think we’re spending too much money on it, too little money, or about the right amount…are we spending too much, too little, or about the right amount on mass transportation?”
age: Age in years.
sex: Sex recorded as male or female.
sei10: Socioeconomic index from 0 to 100.
region: Region where the interview took place.
polviews: Respondent’s answer to the following prompt:
“We hear a lot of talk these days about liberals and conservatives. I’m going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal - point 1 - to extremely conservative - point 7. Where would you place yourself on this scale?”
Use the code below to read in the data.
gss <- read_csv("data/gss2016.csv",
                na = c("", "Don't know", "No answer", "Not applicable"),
                guess_max = 2867) %>%
  select(natmass, age, sex, sei10, region, polviews) %>%
  drop_na()
The argument guess_max = 2867 tells the read_csv function to use all of the observations in a column to determine its data type. Without this argument, only the first 1,000 observations would be used to make this determination. This becomes important for a variable like age; though age is coded as numeric data for most of the observations, there are some in which age is coded as "89 or older". Without the guess_max argument, you will get warnings when loading the data.
Note also that only the variables of interest are kept once the file has been read, not all 935 variables. This will make for faster computation and knitting as you work on the lab.
See Reorder factor levels by hand for documentation about fct_relevel.
The variable natmass will be the response variable in the model, and you want to compare more opinionated views to the moderate position. Recode natmass so it is a factor variable with "About right" as the baseline.
Recode polviews so it is a factor variable whose levels are in the same order as the scale in the survey question. Note how the categories are spelled in the data.
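The two recoding steps above can be sketched on toy vectors. This is only an illustration in base R (relevel() and factor() rather than fct_relevel), and the level spellings below are made up, so check them against the actual values in gss2016.csv before adapting it.

```r
# Toy responses; these spellings are illustrative, not copied from the data
opinion <- c("Too little", "About right", "Too much", "About right", "Too little")

# relevel() moves the chosen level to the front, making it the baseline
opinion_fct <- relevel(factor(opinion), ref = "About right")
levels(opinion_fct)

# For a scale with a natural order, list the levels explicitly in that order;
# otherwise factor() sorts them alphabetically
views <- c("Moderate", "Liberal", "Conservative", "Moderate")
views_fct <- factor(views, levels = c("Liberal", "Moderate", "Conservative"))
levels(views_fct)
```

Any value that is not listed in levels becomes NA, which is why the exact spelling of each category matters.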
Make a plot of the distribution of polviews. Which political view occurs most frequently in this data set?
Make a plot displaying the relationship between natmass and polviews. Use the plot to describe the relationship between a person’s political views and their views on mass transportation spending.
You want to use age as a quantitative variable in your model; however, it is currently a character data type because some observations are coded as "89 or older". Recode age so that it is a numeric variable. Note: Before making the variable numeric, you will need to replace the values "89 or older" with a single value.
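A minimal sketch of this recoding on a toy character vector follows. Replacing "89 or older" with 89 is an assumption made for illustration; the lab leaves the choice of replacement value to you.

```r
# Toy ages stored as text, mimicking how age is read from the file
age_chr <- c("23", "45", "89 or older", "61")

# Assumption: treat "89 or older" as 89; any single value works the same way
age_chr[age_chr == "89 or older"] <- "89"

# With every entry now a plain number, the conversion produces no NAs
age_num <- as.numeric(age_chr)
age_num
```

Converting before replacing would instead turn "89 or older" into NA with a coercion warning.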
You plan to fit a model using age, sex, sei10, and region to understand variation in opinions about spending on mass transportation. Briefly explain why you should fit a multinomial logistic model.
Fit the model described in the previous exercise and display the model output. Make any necessary adjustments to the variables so the intercept will have a meaningful interpretation. Be sure "About right" is the baseline level. Be sure the full model displays in the knitted document.
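One common adjustment that gives the intercept a meaningful interpretation is mean-centering the quantitative predictors, so the intercept describes a respondent at the average value of each. The sketch below applies that idea to simulated data; the data frame toy and its columns opinion, age, and score are hypothetical stand-ins, not the GSS variables.

```r
library(nnet)

# Simulated data with no real relationship between predictors and response
set.seed(210)
n <- 300
toy <- data.frame(
  opinion = factor(sample(c("About right", "Too little", "Too much"),
                          n, replace = TRUE)),
  age     = round(runif(n, 18, 89)),
  score   = runif(n, 0, 100)
)

# Make "About right" the baseline category of the response
toy$opinion <- relevel(toy$opinion, ref = "About right")

# Mean-center the quantitative predictors
toy$age_cent   <- toy$age - mean(toy$age)
toy$score_cent <- toy$score - mean(toy$score)

# trace = FALSE suppresses the iteration log in the knitted output
toy_fit <- multinom(opinion ~ age_cent + score_cent, data = toy, trace = FALSE)

# One row of coefficients per non-baseline category
coef(toy_fit)
```

Each row holds the log-odds coefficients for one category versus the baseline, so exponentiating an entry gives the corresponding odds or odds ratio.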
Interpret the intercept associated with odds of having an opinion of “Too much” versus “About right”.
Consider the relationship between age and one’s opinion about spending on mass transportation.
In general, what is the relationship between a person’s age and their opinions on mass transportation spending?
Now that you have adjusted for some demographic factors, let’s examine whether a person’s political views have a significant impact on their attitude towards spending on mass transportation.
Conduct the appropriate test to determine whether polviews is a significant predictor of attitude towards spending on mass transportation. State the null and alternative hypotheses, display all relevant code and output, and state your conclusion in the context of the problem.
Choose the appropriate model based on the results from the test. Use this model for the next part of the lab.
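One way to compare nested multinomial models is a drop-in-deviance (likelihood ratio) test, sketched below on simulated data. Whether this is the test intended for the exercise is an assumption; the data frame toy and the columns opinion, age, and views are hypothetical.

```r
library(nnet)

# Simulated data with no real relationship between predictors and response
set.seed(210)
n <- 300
toy <- data.frame(
  opinion = relevel(
    factor(sample(c("About right", "Too little", "Too much"), n, replace = TRUE)),
    ref = "About right"
  ),
  age   = round(runif(n, 18, 89)),
  views = factor(sample(c("Liberal", "Moderate", "Conservative"), n, replace = TRUE))
)

# Reduced model omits the predictor being tested; full model includes it
reduced <- multinom(opinion ~ age, data = toy, trace = FALSE)
full    <- multinom(opinion ~ age + views, data = toy, trace = FALSE)

# Test statistic: the drop in residual deviance from reduced to full
lr_stat <- deviance(reduced) - deviance(full)

# Degrees of freedom: the number of extra coefficients in the full model
df_diff <- full$edf - reduced$edf

# Upper-tail chi-square probability
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
c(lr_stat = lr_stat, df = df_diff, p_value = p_value)
```

Here views adds two dummy variables for each of the two non-baseline response categories, which is where the four degrees of freedom come from.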
Calculate the predicted probabilities and residuals from your model.
Plot the binned residuals versus the predicted probabilities for each category of natmass.
You will have three plots.
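The sketch below shows one way to get predicted probabilities, raw residuals, and binned averages for a single category, using simulated data. The residual is defined here as an observed 0/1 indicator minus the predicted probability, and the ten quantile-based bins are an arbitrary choice; the names toy, opinion, and age are hypothetical.

```r
library(nnet)

# Simulated data with no real relationship between predictor and response
set.seed(210)
n <- 300
toy <- data.frame(
  opinion = factor(sample(c("About right", "Too little", "Too much"),
                          n, replace = TRUE)),
  age     = round(runif(n, 18, 89))
)
toy_fit <- multinom(opinion ~ age, data = toy, trace = FALSE)

# n x 3 matrix of predicted probabilities, one column per category
probs <- predict(toy_fit, type = "probs")

# n x 3 matrix of 0/1 indicators for the observed category
observed <- sapply(levels(toy$opinion), function(l) as.numeric(toy$opinion == l))

# Raw residuals: observed indicator minus predicted probability
resid_mat <- observed - probs

# Binned residuals for one category: group observations by predicted
# probability, then average the probabilities and residuals within each bin
cat_name <- "Too much"
breaks <- unique(quantile(probs[, cat_name], probs = seq(0, 1, by = 0.1)))
bins <- cut(probs[, cat_name], breaks = breaks, include.lowest = TRUE)
binned <- data.frame(
  avg_prob  = as.numeric(tapply(probs[, cat_name], bins, mean)),
  avg_resid = as.numeric(tapply(resid_mat[, cat_name], bins, mean))
)
binned
```

Plotting avg_resid against avg_prob gives the binned residual plot for that category, and repeating the binning step for the other two categories gives the remaining plots.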
You can change the size of your plots, so you can fit multiple plots on a single page. Include the arguments fig.height = and fig.width = in the header of the code chunk to change the plot size. See Using R Markdown for an example.
Use binned residual plots to examine the residuals versus each of the quantitative variables.
Plot the binned residuals for each category of natmass versus age. You will have three plots.
Plot the binned residuals for each category of natmass versus sei10. You will have three plots.
To examine the residuals versus each categorical predictor, you will look at the average residuals for each category of the categorical variables.
For each category of natmass, calculate the average residuals across categories of sex.
For each category of natmass, calculate the average residuals across categories of region.
For each category of natmass, calculate the average residuals across categories of polviews.
Based on the analysis of the residuals in Exercises 12 - 14, is the model an appropriate fit for the data? Explain.
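The per-category averages described above can be computed with tapply, as in the following sketch on simulated data. The residuals are again observed indicators minus predicted probabilities, and toy, opinion, age, and sex are hypothetical names used only for illustration.

```r
library(nnet)

# Simulated data with no real relationship between predictors and response
set.seed(210)
n <- 300
toy <- data.frame(
  opinion = factor(sample(c("About right", "Too little", "Too much"),
                          n, replace = TRUE)),
  age     = round(runif(n, 18, 89)),
  sex     = factor(sample(c("Female", "Male"), n, replace = TRUE))
)
toy_fit <- multinom(opinion ~ age + sex, data = toy, trace = FALSE)

# Raw residuals: observed 0/1 indicator minus predicted probability
probs     <- predict(toy_fit, type = "probs")
observed  <- sapply(levels(toy$opinion), function(l) as.numeric(toy$opinion == l))
resid_mat <- observed - probs

# Average residual within each level of sex, one column per response category
avg_resid <- apply(resid_mat, 2, function(r) tapply(r, toy$sex, mean))
avg_resid
```

Averages close to zero in every cell suggest the model is not systematically over- or under-predicting a response category within that group.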
Regardless of your assessment of the residuals, use your model for the remainder of the lab.
Use your model to describe the relationship between one’s political views and their attitude towards spending on mass transportation.
Use your model to predict the category of natmass for each observation in your dataset. Display a table of the actual versus the predicted natmass. What is the misclassification rate?
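As a small illustration of the table and the rate, the sketch below uses two made-up vectors of actual and predicted categories. With a fitted nnet model, the predicted categories would typically come from predict(model, type = "class").

```r
# Toy actual and predicted categories; the values are invented for illustration
lvls <- c("About right", "Too little", "Too much")
actual <- factor(c("About right", "Too little", "Too much",
                   "Too little", "About right", "Too much"), levels = lvls)
predicted <- factor(c("About right", "Too little", "Too little",
                      "Too little", "Too much", "Too much"), levels = lvls)

# Rows are actual categories, columns are predicted categories
conf_tab <- table(actual = actual, predicted = predicted)
conf_tab

# Misclassification rate: proportion of observations off the diagonal
misclass_rate <- mean(actual != predicted)
misclass_rate
```

Two of the six toy observations are misclassified, so the rate here is 2/6; the same value can be obtained as 1 - sum(diag(conf_tab)) / sum(conf_tab).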
The “Data” section is largely inspired by datasciencebox.org.