The General Social Survey (GSS) has been used to measure trends in attitudes and behaviors in American society since 1972. In addition to collecting demographic information, the survey includes questions used to gauge attitudes about government spending priorities, confidence in institutions, lifestyle, and many other topics. A full description of the survey may be found here.
In today’s lab, we will use multinomial logistic regression to understand the relationship between a person’s political views and their attitudes towards government spending on mass transportation projects. To do so, we will use data from the 2010 GSS survey.
Go to the sta210-sp20 organization on GitHub (https://github.com/sta210-sp20). Click on the repo with the prefix lab-09-
Clone the repo and create a new project in your RStudio Docker Container (https://vm-manage.oit.duke.edu).
Configure git by typing the following in the console.
If you would like your git password cached for a week for this project, type the following in the Terminal:
You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.
You will need the following packages for today’s lab:
The data for this lab are from the 2016 General Social Survey. The original data set contains 2867 observations and 935 variables. Given the size of the dataset, we will handle it differently in our workflow than we’ve handled data in previous assignments.
The size of this dataset is 34.3 MB. Compare that to the Spotify dataset from last weeks’ lab which was 149 KB (0.149 MB)! GitHub will not allow you to push files larger than 100 MB and will give you a warning when you push files as large as 50 MB. Though we could push the file we’re working with today to GitHub, it’s large enough that we’d still prefer not to.
You have may noticed that each repo contains a file called .gitignore
. It contains a list of the files you don’t want commit or push to GitHub. If you look at the .gitignore
file for today’s lab, you will notice that gss2016.csv
is listed at the bottom.
gss2016.csv
.gss2016.csv
into the data folder of your project.gss2016.csv
does not appear in your Git pane. This is because it is being ignored by git, since it is listed in the .gitignore
file.You will use the following variables in the lab:
natmass
: Respondent’s answer to the following prompt:
“We are faced with many problems in this country, none of which can be solved easily or inexpensively. I’m going to name some of these problems, and for each one I’d like you to tell me whether you think we’re spending too much money on it, too little money, or about the right amount…are we spending too much, too little, or about the right amount on mass transportation?”
age
: Age in years.
sex
: Sex recorded as male or female
sei10
: Socioeconomic index from 0 to 100
region
: Region where interview took place
polviews
: Respondent’s answer to the following prompt:
“We hear a lot of talk these days about liberals and conservatives. I’m going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal - point 1 - to extremely conservative - point 7. Where would you place yourself on this scale?”
Use the code below to read in the data.
gss <- read_csv("data/gss2016.csv",
na = c("", "Don't know", "No answer",
"Not applicable"),
guess_max = 2867) %>%
select(natmass, age, sex, sei10, region, polviews) %>%
drop_na()
The argument guess_max = 2867
tells the read_csv
function to use all of the observations in a column to determine its data type. Without this argument, only the first 1,000 observations would be used to make this determination. This becomes important for a variable like age
; though age
is coded as numeric data for most of the observations, there are some in which age
is coded as "89 or older"
. Without the guess_max
argument, you will get warnings when loading the data.
Note also that only the variables of interest will be loaded, not the entire dataset. This will make for faster computation and knitting as you work on the lab.
See Reorder factor levels by hand for documentation about fct_relevel
.
The variable natmass
will be the response variable in the model, and you want to compare more opinionated views to the moderate position. Recode natmass
so it is a factor variable with "About right"
as the baseline.
Recode polviews
so it is a factor variable type with levels that are in an order that is consistent with question on the survey. Note how the categories are spelled in the data.
Make a plot of the distribution of polviews
. Which political view occurs most frequently in this data set?
Make a plot displaying the relationship between natmass
and polviews
. Use the plot to describe the relationship between a person’s political views and their views on mass transportation spending.
You want to use age
as a quantitative variable in your model; however, it is currently a character data type because some observations are coded as "89 or older"
. Recode age
so that is a numeric variable. Note: Before making the variable numeric, you will need to replace the values "89 or older"
with a single value.
You plan to fit a model using age
, sex
, sei10
, and region
to understand variation in opinions about spending on mass transportation. Briefly explain why you should fit a multinomial logistic model.
Fit the model described in the previous exercise and display the model output. Make any necessary adjustments to the variables so the intercept will have a meaningful interpretation. Be sure About Right
is the baseline level. Be sure the full model displays in the knitted document.
Interpret the intercept associated with odds of having an opinion of “Too much” versus “About right”.
Consider the relationship between age and one’s opinion about spending on mass transportation. Interpret the coefficient of age in terms of the odds of having an opinion of “Too little” versus “About right”.
Now that you have adjusted for some demographic factors, let’s examine whether a person’s political views has a significant impact on their attitude towards spending on mass transportation.
Conduct the appropriate test to determine if polviews
is a significant predictor of attitude towards spending on mass transportation. State the null and alternative hypothesis, display all relevant code and output, and state your conclusion in the context of the problem.
Choose the appropriate model based on the results from the test. Use this model for the next part of the lab.
Calculate the predicted probabilities and residuals from your model.
Let’s make some of the plots and tables you use to check the linearity assumption for multinomial logistic regression. Plot the binned residuals versus the predicted probabilities for each category of natmass.
You will have three plots.
You can change the size of your plots, so you can fit multiple plots on a single page. Include the arguments fig.height =
and fig.width =
in the header of the code chunk to change the plot size. See Using R Markdown for an example.
To examine the residuals versus each categorical predictor, you will look at the average residuals for each each category of the categorical variables.
natmass
, calculate the average residuals across categories of region
.Based on the plot and table above, discuss with your group whether there are any obvious violations of the linearity assumption. Note that we haven’t examined all of the plots and tables of the residuals needed to make an assessment about the linearity assumption.
The other assumptions are randomness and independence. Discuss with your group whether these assumptions are satisfied for this analysis.
Use your model to describe the relationship between one’s political views and their attitude towards spending on mass transportation.
Use your model to predict the category of natmass
for each observation in your dataset. Display a table of the actual versus the predicted natmass
. What is the misclassification rate?
See “Submitting the Assignment” from Lab 01 for detailed instructions on how to upload the assignment on GitHub.
Labs will be graded for completion, using the following:
Lab completion (50 pts):
Formatting (possible point deductions)
The “Data” section is largely inspired by datasciencebox.org.