Over the past ten years, recommendation systems have become increasingly popular as more companies strive to offer customized user experiences. Amazon recommends products you may like based on your browse and purchase history, Netflix recommends movies and TV shows based on your viewing history, and music platforms like Spotify recommend songs you may like based on your listening history. While these recommendation systems are built using a variety of algorithms, they are all trying to achieve the same goal: use the characteristics of the products/movies/music a user is known to like to figure out the products/movies/music the user may like but hasn’t discovered yet.

In today’s lab, we will focus on using the characteristics of songs a user previously played to determine whether or not a user will like a new song. We will use logistic regression to build a model that predicts the probability a user likes a song using the relevant characteristics of that song.

Getting Started

Go to the STA210-Sp19 organization on GitHub (https://github.com/STA210-Sp19). Click on the repo with the prefix lab-07-. It contains the starter documents you need to complete the warmup exercise.
Clone the repo and create a new project in RStudio Cloud.
Configure git by typing the following in the console.

library(usethis)
use_git_config(user.name="your name", user.email="your email")

⊕When configuring Git, be sure to use the email address that is associated with your GitHub account.

Password caching

If you would like your git password cached for a week for this project, type the following in the Terminal:

git config --global credential.helper 'cache --timeout 604800'

You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.

Packages

You will need the following packages for today’s lab:

library(tidyverse)
library(broom)
# Fill in other packages as needed

Project name:

Currently your project is called Untitled Project. Update the name of your project to the title of today’s lab.

Warm up

Before we introduce the data, let’s warm up with a simple exercise.

YAML:

Pick one team member to update the author and date fields at the top of the R Markdown file. Knit, commit, and push all the updated documents to Github.
Now, the remaining team members who have not been concurrently making these changes on their projects should click on the Pull button in their Git pane and observe that the changes are now reflected on their projects as well.

Data

The data in this lab is from the Spotify Song Attributes data set in Kaggle. This data set contains song characteristics of 2017 songs played by a single user and whether or not he liked the song. Since this dataset contains the song preferences of a single user, the scope of the analysis is limited to this particular user.

You will use data spotify.csv to build the logistic regression model and test the performance of the model using the songs in test_songs.csv. Click here to download the dataset spotify.csv, and click here to download the dataset test_songs.csv. Upload both files to the to the data folder in your lab-07 project.

The Spotify documentation page contains a description of the variables included in this dataset.

Exercises

Exploratory Data Analysis

Read through the Spotify documentation page to learn more about the variables in the dataset. The response variable for this analysis is like, such that 1 indicates that the user likes the song and 0 otherwise. The remaining will be considered as predictor variables in the model. ⊕Part of the code to make x a factor. mutate(x = factor(x))
- Which potential predictor variables are categorical? You only need to include the variables that are in the dataset.
- Recode the each of the categorical predictors so they are a factor variable type.
Choose a quantitative predictor variable. Make the appropriate plot of the response versus this predictor variable. Describe the relationship between the two variables.
Choose a categorical predictor variable. Make the appropriate plot of the response versus this predictor variable. Describe the relationship between the two variables.
Let’s consider a potential interaction effect between the variables you choose in Exercises 2 and 3. Make the appropriate plots to examine the potential interaction effect. Do these plots suggest there is a significant interaction effect? Briefly explain.

In practice, you should do exploratory data analysis for all potential explanatory variables. We did an abbreviated exploratory data analysis to make the assignment more manageable.

Part II: Logistic Regression Model

Fit the full model and display the model output. The main objective for the model is to predict whether the user will like a song. Should we use this model for this objective? Briefly explain.
Use the step function to perform backward selection. Display the output for the selected model.
Briefly describe the criteria used by step to select the final model.

For the remainder of this lab, you will use the model chosen by model selection . In practice; however, you would not just stop with the results from the automated model selection procedure and would examine the model further to see if there are any significant interactions, higher-order terms, or if it could even be simplified.

Consider the variable duration_ms.
- Interpret the coefficient of duration_ms and its 95% confidence interval in terms of the odds of the user liking a song.
- Suppose instead of duration_ms, we use the variable duration_s, the duration of a song in seconds. What would be the effect of duration_s on the odds of the user liking a song? Include the updated coefficient and corresponding 95% confidence interval for duration_s. Assume all other variables in the model are unchanged.
Interpret mode and its 95% confidence interval in terms of the odds of the user liking a song.
- Based on this model, is there evidence of a significant difference in the user’s preference between songs in a major key versus those in a minor key?

Part III: Model Assessment

In the next few questions, we will do an abbreviated analysis of the residuals.

Create a binned plot of the residuals versus the predicted probabilities. You will first need to use the augment function with the options type.predict = "response" and type.residuals = "response" to get the predicted probabilities and corresponding residuals.
Choose a quantitative predictor in the final model. Make the appropriate table or plot to examine the residuals versus this predictor variable.
Choose a categorical predictor in the final model. Make the appropriate table or plot to examine the residuals versus this predictor variable.

In practice, you should examine plots of residuals versus every predictor variable to make a complete assessment of the model fit. For the sake of time on the lab, you will use these three plots to help make the assessment in Exercise 14.

Plot the ROC curve and find the area under the curve.
Based on the residual plots and the ROC curve, is this logistic model a good fit for the data? Briefly explain.

Part IV: Prediction

You are part of the data science team at Spotify, and your model will be used to make song recommendations to users. The goal is to recommend songs the user has a high probability of liking.

As a group, choose a threshold value to distinguish between songs the user will like and those the user won’t like. What is your threshold value? Use the ROC curve to help justify your choice.

Now let’s put your model and decision threshold to the test! Use your model to calculate the predicted probability that the user will like the following two songs:
- “Sign of the Times” by Harry Styles
- “Hotline Bling” by Drake

The data for the songs can be found in test_songs.csv.

Using your decision threshold from Question 15, would you recommend “Sign of the Times” to this user? Would your recommend “Hotline Bling” to this user? Briefly explain your decision.
The user likes “Hotline Bling” but doesn’t like “Sign of the Times”. How good were your recommendations based on these two songs?
- If they were good recommendations, explain how the model and threshold helped you distinguish between songs the user would like and those he wouldn’t.
- If they were not good recommendations, explain the limitations in your model and/or threshold.

Lab 07: Logistic Regression

due Wed, Mar 27 at 11:59p