Lab 08: Logistic Regression

due Thu, Apr 9 at 11:59p

Over the past ten years, recommendation systems have become increasingly popular as more companies strive to offer customized user experiences. Amazon recommends products you may like based on your browse and purchase history, Netflix recommends movies and TV shows based on your viewing history, and music platforms like Spotify recommend songs you may like based on your listening history. While these recommendation systems are built using a variety of algorithms, they are all trying to achieve the same goal: use the characteristics of the products/movies/music a user is known to like to figure out the products/movies/music the user may like but hasn’t discovered yet.

See “How Does Spotify Know You So Well?” for more information about Spotify’s recommendation algorithms.

Today, we will continue form last week’s lab and predict which songs a Spotify user will like. Before using the model for prediction, we will check the model assumptions and assess how well the model fits the data.

Getting Started

library(usethis)
use_git_config(user.name="github username", user.email="your email")

Password caching

If you would like your git password cached for a week for this project, type the following in the Terminal:

git config --global credential.helper 'cache --timeout 604800'

You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.

Packages

You will need the following packages for today’s lab:

library(tidyverse)
library(broom)
library(pROC)
library(plotROC)
# Fill in other packages as needed

Data

The data in this lab is from the Spotify Song Attributes data set in Kaggle. This data set contains song characteristics of 2017 songs played by a single user and whether or not he liked the song. Since this dataset contains the song preferences of a single user, the scope of the analysis is limited to this particular user.

You will use data spotify.csv in the data folder for this lab.

The Spotify documentation page contains a description of the variables included in this dataset.

Exercises

Part I: Data Prep & Modeling

  1. Read through the Spotify documentation page to learn more about the variables in the dataset. The response variable for this analysis is like, where 1 indicates the user likes the song and 0 otherwise. Let’s prepare the response and some predictor variables before modeling.

    • If needed, change like so that it is factor variable type in R.
    • Change key so that it is a factor variable type in R, which takes values “D” if key==2, “D#” if key==3, and “Other” for all other values.
    • Plot the relationship between like and key. Briefly describe the relationship between the two variables.
  2. Fit a logistic regression model with like as the response variable and the following as predictors: acousticness, danceability, duration_ms, instrumentalness, loudness, speechiness, and valence. Display the model output.

  3. We consider adding key to the model. Conduct the appropriate test to determine if key should be included in the model. Display the output from the test and write your conclusion in the context of the data.


Use the model you selected in Exercise 3 for the remainder of the lab.

  1. Display the model you chose in Exercise 3. If appropriate, interpret the coefficient for keyD# in the context of the data. Otherwise, state why it’s not appropriate to interpret this coefficient.

Part II: Checking Assumptions

In the next few questions, we will do an abbreviated analysis of the residuals.

  1. Use the augment function to calculate the predicted probabilities and corresponding residuals.

  2. Create a binned plot of the residuals versus the predicted probabilities.

  3. Choose a quantitative predictor in the final model. Make the appropriate table or plot to examine the residuals versus this predictor variable.

  4. Choose a categorical predictor in the final model. Make the appropriate table or plot to examine the residuals versus this predictor variable.

In practice, you should examine plots of residuals versus every predictor variable to make a complete assessment of the model fit. For the sake of time on the lab, you will use these three plots to help make the assessment about the model fit.

  1. Based on the residuals plots from Exercises 6 - 8, is the linearity assumption satisfied? Briefly explain why or why not.

Part III: Model Assessment & Prediction

  1. Plot the ROC curve and calculate the area under the curve (AUC). Display at least 5 thresholds (n.cut = 5) on the ROC.

  2. Based on the ROC curve and AUC in the previous exercise, do you think this model effectively differentiates between the songs the user likes versus those he doesn’t?

  3. You are part of the data science team at Spotify, and your model will be used to make song recommendations to users. The goal is to recommend songs the user has a high probability of liking.

    As a group, choose a threshold value to distinguish between songs the user will like and those the user won’t like. What is your threshold value? Use the ROC curve to help justify your choice.

  4. Make the confusion matrix using the threshold chosen in the previous question.

  5. Use the confusion matrix from the previous question to answer the following:

    • What is the proportion of true positives (sensitivity)?
    • What is the proportion of false positives (1 - specificity)?
    • What is the misclassification rate?

Submitting the Assignment

See “Submitting the Assignment” from Lab 01 for detailed instructions on how to upload the assignment on GitHub.

Grading

Labs will be graded for completion, using the following:

Lab completion (50 pts):

Formatting (possible point deductions)