use_git_config()
function. You can also cache your password.
We will use the following packages in this assignment:
library(tidyverse)
library(broom)
library(knitr)
library(rms)
The data for this homework assignment is based on the data from Lab 05 - the Airbnb listings in Asheville, NC. See the Lab 05 instructions for more information about the original data source.
airbnb_mod <- read_csv("data/airbnb_mod.csv")
This dataset contains only Airbnb listings with a minimum night requirement \(\leq 3\). We will use the following variables:
price: Cost per night (in U.S. dollars)
cleaning_fee: Cleaning fee (in U.S. dollars)
price_3_nights: Total cost for 3 nights, calculated as 3 * price + cleaning_fee
prop_type_simp: Type of dwelling (House, Apartment, Guest suite, Other)
room_type:
number_of_reviews: Total number of reviews for the listing
review_scores_rating: Average review score (0 - 100)
This homework contains short answer questions about the concepts discussed in class. Some of these questions may also require short chunks of code to produce the output needed to answer the question. Answers should be written in complete sentences. You are only required to do what is asked for each question. You are not required to include any additional analysis.
Fit a regression model called m_logprice using the variables prop_type_simp, number_of_reviews, and review_scores_rating to predict log_price_3, the log-transformed version of price_3_nights. Display the model output.
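As a rough illustration of fitting a model with a log-transformed response, the sketch below uses R's built-in mtcars data rather than the Airbnb data. The response mpg and the predictors wt and hp are stand-ins chosen only for this example, and the code uses base R so it runs without extra packages.

```r
# Illustration on the built-in mtcars data, NOT the Airbnb data.
# mpg, wt, and hp are stand-in variables chosen for this example only.
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Coefficient table: estimate, standard error, t statistic, p-value
coef_table <- summary(fit)$coefficients
print(round(coef_table, 4))

# Slopes are on the log scale, so exp(slope) is the multiplicative change
# in the median response for a one-unit increase in that predictor.
mult_effect <- exp(coef(fit)[-1])
print(round(mult_effect, 4))
```

Because the response is logged, the exponentiated slopes are usually easier to describe in words than the raw coefficients.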
Use the augment function to create a data frame called m_logprice_aug that contains model output and statistics for each observation. Use the code below to display the top 5 rows of the data frame.
m_logprice_aug %>%
  slice(1:5)
First, we will examine the leverage for each observation.
Based on the lecture notes, what threshold will we use to determine if observations are high leverage points?
Plot the leverage vs. the observation number (you may need to create a new variable that contains the observation number). Include a line in the plot marking the threshold from the previous part.
How many observations are considered high leverage?
Based on Cook’s distance, do these points have a significant influence on the model coefficients? Briefly explain.
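The leverage and Cook's distance checks can be sketched on the built-in mtcars data as follows. The threshold 2(p + 1)/n for leverage and the cutoff of 1 for Cook's distance are common rules of thumb assumed here; the lecture notes may use different values. The base R functions hatvalues() and cooks.distance() return the same quantities that augment stores in the .hat and .cooksd columns.

```r
# Illustration on the built-in mtcars data, NOT the Airbnb data.
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

n <- nrow(mtcars)             # number of observations
p <- length(coef(fit)) - 1    # number of predictor terms (intercept excluded)

# ASSUMED rule of thumb: leverage above 2 * (p + 1) / n counts as "high".
lev_threshold <- 2 * (p + 1) / n

diag_df <- data.frame(
  obs_num  = seq_len(n),             # observation number for plotting
  leverage = hatvalues(fit),         # same values as augment()'s .hat
  cooks_d  = cooks.distance(fit)     # same values as augment()'s .cooksd
)

# Observations flagged as high leverage under the assumed threshold
high_lev <- diag_df[diag_df$leverage > lev_threshold, ]
print(nrow(high_lev))

# ASSUMED rule of thumb: Cook's distance above 1 suggests an influential point.
print(high_lev$cooks_d > 1)
```

For the plot, obs_num and leverage can be passed to ggplot with geom_point(), with geom_hline(yintercept = lev_threshold) marking the threshold.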
Next, we will examine the standardized residuals.
Plot the standardized residuals versus the predicted values. Include lines at 2 and -2 indicating the thresholds used to determine if standardized residuals have a large magnitude.
Based on our thresholds, how many observations are considered to have standardized residuals with large magnitude?
We can approximate the distribution of standardized residuals using a \(N(0,1)\) distribution. Based on this, what proportion of observations do you expect to have standardized residuals with magnitude \(> 2\)? Consider your answer from the previous part. Do you think there is a concern with the number of observations flagged as having standardized residuals with large magnitude? Briefly explain.
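A minimal sketch of the standardized-residual check, again on the built-in mtcars data rather than the Airbnb data. It counts the residuals beyond the \(\pm 2\) thresholds and compares the observed proportion with the tail area of a \(N(0,1)\) distribution. The base R function rstandard() is used here in place of the .std.resid column from augment.

```r
# Illustration on the built-in mtcars data, NOT the Airbnb data.
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

resid_df <- data.frame(
  predicted = fitted(fit),      # predicted values (x-axis of the plot)
  std_resid = rstandard(fit)    # standardized residuals (y-axis of the plot)
)

# Observed count and proportion with |standardized residual| > 2
n_large    <- sum(abs(resid_df$std_resid) > 2)
prop_large <- n_large / nrow(resid_df)

# Proportion expected beyond +/- 2 if the residuals were exactly N(0, 1)
expected_prop <- 2 * pnorm(-2)   # about 0.0455

print(c(observed = prop_large, expected = expected_prop))
```

An observed proportion far above the expected one would suggest more large residuals than the normal approximation predicts.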
Use the vif function in the rms package to find the variance inflation factor for each predictor variable in the model. Are there any obvious concerns with multicollinearity in this model? Briefly explain.
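To show what a variance inflation factor measures, the sketch below computes it by hand in base R on the built-in mtcars data: each predictor is regressed on the others, and the VIF is \(1 / (1 - R_j^2)\). This is an illustration of the definition, not the rms implementation, and the predictors wt, hp, and disp are stand-ins chosen for the example. A cutoff such as VIF > 10 is a common rule of thumb, assumed here rather than taken from the assignment.

```r
# Illustration on the built-in mtcars data, NOT the Airbnb data.
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Design matrix without the intercept column
X <- model.matrix(fit)[, -1, drop = FALSE]

# VIF for predictor j: regress column j on the remaining columns,
# then compute 1 / (1 - R^2) from that auxiliary regression.
vif_by_hand <- sapply(colnames(X), function(v) {
  others <- X[, colnames(X) != v, drop = FALSE]
  r2 <- summary(lm(X[, v] ~ others))$r.squared
  1 / (1 - r2)
})

print(round(vif_by_hand, 2))

# ASSUMED rule of thumb: VIF above 10 signals concerning multicollinearity.
print(vif_by_hand > 10)
```

For a categorical predictor, the design matrix holds one indicator column per non-baseline level, so this calculation returns one value per indicator column.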
Total | 50 |
---|---|
Questions 1 - 5 | 40 |
Documents complete and neatly organized (Markdown and knitted documents) | 5 |
Answers written in complete sentences | 3 |
Regular and informative commit messages | 2 |
The data used in this homework is from insideairbnb.com.