Getting Started

  • Go to the STA210-Sp19 organization on GitHub (https://github.com/STA210-Sp19). Click on the repo with the prefix hw-04. It contains the starter documents you need to complete the lab.
  • Clone the repo in RStudio Cloud
  • Configure git using the use_git_config() function. You can also cache your password.

Packages

We will use the following packages in this assignment:

library(tidyverse)
library(broom)
library(knitr) 
library(rms)

Data

The data for this homework assignment is based on the data from Lab 05 - the Airbnb listings in Asheville, NC. See the Lab 05 instructions for more information about the original data source.

airbnb_mod <- read_csv("data/airbnb_mod.csv")

This dataset only contains Airbnb listings with a with a minimum night \(\leq 3\). We will use the following variables

  • price: Cost per night (in U.S. dollars)
  • cleaning_fee: Cleaning fee (in U.S. dollars)
  • price_3_nights: Total cost for 3 nights, calculated as 3 * price + cleaning_fee
  • prop_type_simp: Type of dwelling (House, Apartment, Guest suite, Other)
  • room_type:
    • Entire home/apt (guests have entire place to themselves)
    • Private room (Guests have private room to sleep, all other rooms shared)
    • Shared room (Guests sleep in room shared with others)
  • number_of_reviews: Total number of reviews for the listing
  • review_scores_rating: Average review score (0 - 100)

Questions

This homework contains short answer questions about the concepts discussed in class. Some of these questions may also require short chunks of code to produce the output needed to answer the question. Answers should be written in complete sentences. You are only required to do what is asked for each question. You are not required to include any additional analysis.

  1. Fit a regression model called m_logprice using the variables prop_type_simp, number_of_reviews, and review_scores_rating to predict the log_price_3, the log-transformed version of price_3_nights. Display the model output.

  2. Use the augment function to create a data frame called m_logprice_aug that contains model output and statistics for each observation. Use the code below to display the top 5 rows of the data frame.

m_logprice_aug %>%
  slice(1:5)
  1. First, we will examine the leverage for each observation.

    • Based on the lecture notes, what threshold will we use to determine if observations are high leverage points?

    • Plot the leverage vs. the observation number (you may need to create a new variable that contains the observation number). Include a line in the plot marking the threshold from the previous part.

    • How many observations are considered high leverage?

    • Based on Cook’s distance, do these points have a significant influence on the model coefficents? Briefly explain.

  2. Next, we will examine the standardized residuals.

    • Plot the standardized residuals versus the predicted values. Include lines at 2 and -2 indicating the thresholds used to determine if standardized residuals have a large magnitude.

    • Based on our thresholds, how many observations are considered to have standarized residuals with large magnitude?

    • We can approximate the distribution of standardized residuals using a \(N(0,1)\) distribution. Based on this, what proportion of observations do you expect to have standardized residuals with magntiude \(> 2\)? Consider your answer from part the previous part. Do you think there is a concern with the number of observations flagged as having standardized residuals with large magnitude? Briefly explain.

  3. Use the vif function in the rms package to find the variance inflation factor for each predictor variable in the model. Are there any obvious concerns with multicollinearity in this model? Briefly explain.

Grading

Total 50
Questions 1 - 5 40
Documents complete and neatly organized (Markdown and knitted documents) 5
Answers written in complete sentences 3
Regular and informative commit messages 2

Acknowledgement

The data used in this homework is from insideairbnb.com