Over the past ten years, recommendation systems have become increasingly popular as more companies strive to offer customized user experiences. Amazon recommends products you may like based on your browsing and purchase history, Netflix recommends movies and TV shows based on your viewing history, and music platforms like Spotify recommend songs you may like based on your listening history. While these recommendation systems are built using a variety of algorithms, they are all trying to achieve the same goal: use the characteristics of the products/movies/music a user is known to like to figure out the products/movies/music the user may like but hasn’t discovered yet.
See “How Does Spotify Know You So Well?” for more information about Spotify’s recommendation algorithms.
Today, we will continue from last week’s lab and predict which songs a Spotify user will like. Before using the model for prediction, we will check the model assumptions and assess how well the model fits the data.
Go to the sta210-sp20 organization on GitHub (https://github.com/sta210-sp20). Click on the repo with the prefix lab-08-. It contains the starter documents you need to complete the warmup exercise.
Clone the repo and create a new project in your RStudio Docker Container (https://vm-manage.oit.duke.edu).
Configure git by typing the following in the console.
If you would like your git password cached for a week for this project, type the following in the Terminal:
You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.
You will need the following packages for today’s lab:
The data in this lab come from the Spotify Song Attributes dataset on Kaggle. The dataset contains song characteristics of 2017 songs played by a single user and whether or not he liked each song. Because it contains the song preferences of a single user, the scope of the analysis is limited to this particular user.
You will use the data in spotify.csv, located in the data folder, for this lab.
The Spotify documentation page contains a description of the variables included in this dataset.
Read through the Spotify documentation page to learn more about the variables in the dataset. The response variable for this analysis is like, where 1 indicates the user likes the song and 0 otherwise. Let’s prepare the response and some predictor variables before modeling.
Convert like so that it is a factor variable in R.
Recode key so that it is a factor variable in R that takes the value “D” if key==2, “D#” if key==3, and “Other” for all other values.
Examine the relationship between like and key. Briefly describe the relationship between the two variables.
Fit a logistic regression model with like as the response variable and the following as predictors: acousticness, danceability, duration_ms, instrumentalness, loudness, speechiness, and valence. Display the model output.
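The recoding and model-fitting steps above can be sketched in base R as follows. This is a minimal illustration, not the lab solution: the data frame songs is simulated toy data standing in for spotify.csv, only two of the predictors are included, and the choice to leave the factor levels in their default order is an assumption.

```r
# Toy data standing in for spotify.csv (simulated values, not the real data).
set.seed(210)
n <- 200
songs <- data.frame(
  key          = sample(0:11, n, replace = TRUE),  # integer key codes
  danceability = runif(n),
  valence      = runif(n)
)
songs$like <- rbinom(n, 1, plogis(-1 + 2 * songs$danceability))

# Response as a factor: the first level ("0") is treated as "failure" by glm().
songs$like <- factor(songs$like)

# Collapse key into three categories. With the default level order,
# "D" becomes the baseline category in a regression model.
songs$key <- factor(
  ifelse(songs$key == 2, "D", ifelse(songs$key == 3, "D#", "Other"))
)

# Logistic regression: family = binomial models the log-odds that like == 1.
fit <- glm(like ~ danceability + valence, data = songs, family = binomial)
summary(fit)$coefficients
```

In the lab itself, the formula would list all seven predictors named in the exercise, and the data would be read from the data folder rather than simulated.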
We consider adding key to the model. Conduct the appropriate test to determine if key should be included in the model. Display the output from the test and write your conclusion in the context of the data.
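One common way to compare nested logistic regression models is a drop-in-deviance (likelihood ratio) test. The sketch below assumes that this is the test intended here and uses simulated toy data with a single quantitative predictor. The names toy, reduced, and full are hypothetical.

```r
# Simulated toy data: key has three levels, as in the recoded lab variable.
set.seed(210)
n <- 300
toy <- data.frame(
  danceability = runif(n),
  key = factor(sample(c("D", "D#", "Other"), n, replace = TRUE))
)
toy$like <- factor(rbinom(n, 1, plogis(-1 + 2 * toy$danceability)))

# Nested models: the full model adds key to the reduced model.
reduced <- glm(like ~ danceability, data = toy, family = binomial)
full    <- glm(like ~ danceability + key, data = toy, family = binomial)

# Drop-in-deviance test of H0: all coefficients for key are 0.
# The test statistic is the difference in residual deviances, compared to a
# chi-square distribution with df = number of added coefficients (here 2).
lrt <- anova(reduced, full, test = "Chisq")
lrt
```

A small p-value in the second row of the output would be evidence that at least one key coefficient differs from zero. The conclusion should still be written in terms of the songs and the user.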
Use the model you selected in Exercise 3 for the remainder of the lab.
If appropriate, interpret the coefficient of keyD# in the context of the data. Otherwise, state why it’s not appropriate to interpret this coefficient.
In the next few questions, we will do an abbreviated analysis of the residuals.
Use the augment function to calculate the predicted probabilities and corresponding residuals.
Create a binned plot of the residuals versus the predicted probabilities.
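To show what a binned residual plot computes, here is a minimal base R sketch on simulated toy data. It swaps in predict() and residuals() for augment so that it runs without extra packages, and it bins by hand. The number of bins (20) and all object names are assumptions made for illustration.

```r
# Simulated toy data and a simple logistic regression fit.
set.seed(210)
n <- 500
toy <- data.frame(x = rnorm(n))
toy$like <- rbinom(n, 1, plogis(0.5 * toy$x))
fit <- glm(like ~ x, data = toy, family = binomial)

# Predicted probabilities and raw residuals (observed minus predicted).
toy$pred_prob <- predict(fit, type = "response")
toy$resid     <- residuals(fit, type = "response")

# Order observations by predicted probability and split them into
# equal-sized bins, then average within each bin.
n_bins <- 20  # arbitrary choice for this sketch
bin <- cut(rank(toy$pred_prob, ties.method = "first"),
           breaks = n_bins, labels = FALSE)
binned <- data.frame(
  mean_pred  = as.vector(tapply(toy$pred_prob, bin, mean)),
  mean_resid = as.vector(tapply(toy$resid, bin, mean))
)

plot(binned$mean_pred, binned$mean_resid,
     xlab = "Predicted probability", ylab = "Average residual",
     main = "Binned residuals vs. predicted probabilities")
abline(h = 0, lty = 2)
```

Averaging within bins is what makes the plot readable: individual residuals from a logistic model fall on two curves, while bin averages should scatter around zero with no clear pattern if the model fits well.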
Choose a quantitative predictor in the final model. Make the appropriate table or plot to examine the residuals versus this predictor variable.
Choose a categorical predictor in the final model. Make the appropriate table or plot to examine the residuals versus this predictor variable.
In practice, you should examine plots of residuals versus every predictor variable to make a complete assessment of the model fit. In the interest of time, you will use these three plots to assess the model fit in this lab.
Plot the ROC curve and calculate the area under the curve (AUC). Display at least 5 thresholds (n.cut = 5) on the ROC curve.
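The ROC curve and AUC can also be computed by hand, which makes clear what the plotting function does behind the scenes. The sketch below is a base R illustration on simulated toy data, not the plotting approach the lab expects. All object names are hypothetical, and the five labeled thresholds are simply spaced evenly along the sorted list.

```r
# Simulated toy data and fitted probabilities.
set.seed(210)
n <- 500
x <- rnorm(n)
like <- rbinom(n, 1, plogis(1.5 * x))
fit <- glm(like ~ x, family = binomial)
pred_prob <- predict(fit, type = "response")

# Sweep thresholds from high to low; predict "like" when pred_prob >= threshold.
thresholds <- c(Inf, sort(unique(pred_prob), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) mean(pred_prob[like == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(pred_prob[like == 0] >= t))

# AUC by the trapezoidal rule over the (fpr, tpr) points.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line for a model with no discrimination

# Label 5 thresholds along the curve.
idx <- round(seq(2, length(thresholds), length.out = 5))
text(fpr[idx], tpr[idx], labels = round(thresholds[idx], 2), pos = 4)
auc
```

Each point on the curve corresponds to one threshold, so the labels show the trade-off directly: lowering the threshold raises the true positive rate and the false positive rate together.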
Based on the ROC curve and AUC in the previous exercise, do you think this model effectively differentiates between the songs the user likes versus those he doesn’t?
You are part of the data science team at Spotify, and your model will be used to make song recommendations to users. The goal is to recommend songs the user has a high probability of liking.
As a group, choose a threshold value to distinguish between songs the user will like and those the user won’t like. What is your threshold value? Use the ROC curve to help justify your choice.
Make the confusion matrix using the threshold chosen in the previous question.
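A confusion matrix at a given threshold can be built with table(). The sketch below uses simulated toy data and a placeholder threshold of 0.5. Both are assumptions for illustration, and your group's threshold should replace the placeholder. The two rates at the end are examples of quantities commonly read off such a matrix.

```r
# Simulated toy data and fitted probabilities.
set.seed(210)
n <- 500
x <- rnorm(n)
like <- rbinom(n, 1, plogis(1.5 * x))
fit <- glm(like ~ x, family = binomial)
pred_prob <- predict(fit, type = "response")

threshold <- 0.5  # placeholder; substitute the value your group chose

# Fix the levels so the table is always 2 x 2, even if one class is never predicted.
predicted <- factor(as.integer(pred_prob >= threshold), levels = c(0, 1))
actual    <- factor(like, levels = c(0, 1))

conf_mat <- table(actual = actual, predicted = predicted)
conf_mat

# Rates commonly read off the confusion matrix.
sensitivity <- conf_mat["1", "1"] / sum(conf_mat["1", ])  # true positive rate
specificity <- conf_mat["0", "0"] / sum(conf_mat["0", ])  # true negative rate
c(sensitivity = sensitivity, specificity = specificity)
```

Rows are the observed outcomes and columns are the predictions, so the diagonal counts the songs classified correctly.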
Use the confusion matrix from the previous question to answer the following:
See “Submitting the Assignment” from Lab 01 for detailed instructions on how to upload the assignment on GitHub.
Labs will be graded for completion, using the following:
Lab completion (50 pts):
Formatting (possible point deductions)