Lab 05: Airbnbs in Asheville, NC

Total: 70 points

due Tuesday, October 1 at 11:59p

When doing statistical analyses in practice, there is often a lot of time spent on cleaning and preparing the data. The goal of today’s lab is to practice cleaning messy data, so it can be used in a regression analysis. You will also practice interpreting the results from a regression model that has numeric and categorical predictors and a log-transformed response variable.

Getting Started

library(usethis)
use_git_config(user.name="github username", user.email="your email")

Password caching

If you would like your git password cached for a week for this project, type the following in the Terminal:

git config --global credential.helper 'cache --timeout 604800'

You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.

Packages

We will use the following packages in today’s lab.

library(tidyverse)
library(knitr)
library(skimr)
library(broom)

Project name:

Currently your project is called Untitled Project. Update the name of your project to be “Lab 05 - Airbnb”.

Warm up

Before we introduce the data, let’s warm up with a simple exercise.

YAML:

Data

Today’s data is about Airbnb listings in Asheville, NC. The data was obtained from http://insideairbnb.com/; it was originally scraped from airbnb.com.

You can see a visualization of some of the data used in today’s lab at http://insideairbnb.com/asheville/.

basic_info <- read_csv("data/airbnb_basic.csv")
details <- read_csv("data/airbnb_details.csv")

We will use the following variables in this lab:

Exercises

Data wrangling

  1. We would like to use variables from both the basic_info and details data frames in this analysis. Both data frames have the variable id that uniquely identifies each Airbnb listing. Because we need data from basic_info and details, we only want to include observations that are in both the basic_info and details datasets. Join the two data frames and save the result as airbnb. You may use Section 13.4 in R for Data Science to help determine the appropriate join. Hint: Both datasets have a variable called id to uniquely identify the Airbnb listings.
  1. Some Airbnb rentals have cleaning fees, and we want to include the cleaning fee when we calculate the total rental cost.
  1. Since cleaning_fee is a quantitative variable, we need to make sure it is stored as numeric data in the data frame. To do so, we will first use the extract function in the tidyr package to create a column of cleaning fees that don’t have the dollar sign. Then, we need to make cleaning_fee a numeric data type. Fill in the code below to extract the values of cleaning_fee and store them as a numeric data type.

See https://tidyr.tidyverse.org/reference/extract.html for more information about the extract function.

airbnb <- airbnb %>% 
  extract(cleaning_fee, "cleaning_fee") %>%
  mutate(cleaning_fee = ______)
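
The join and the dollar-sign cleanup described above can be tried out on a small toy example before working with the real data. The sketch below is only an illustration: the data frames toy_basic and toy_details, the column fee, and all values are made up, and the regex argument passed to extract is one possible pattern, not necessarily the one intended for the lab.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-ins for the two lab data frames (values are made up)
toy_basic <- tibble(id = c(1, 2, 3), price = c(100, 80, 120))
toy_details <- tibble(id = c(2, 3, 4), fee = c("$25.00", NA, "$40.00"))

# Keep only the ids that appear in both tables
toy <- inner_join(toy_basic, toy_details, by = "id")

# Pull the first run of digits out of the fee text, then convert it to numeric
toy <- toy %>%
  extract(fee, "fee", regex = "([[:digit:]]+)") %>%
  mutate(fee = as.numeric(fee))

toy
```

Rows whose fee was missing before the extraction stay missing afterward, so the number of NA values is unchanged by this step.
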
  1. How many observations have missing values for cleaning_fee? What do you think is the most likely reason for the missing observations of cleaning_fee? In other words, what does a missing value of cleaning_fee indicate? Show the code and output used to determine the number of observations with missing values.

  2. Impute (i.e. fill in) the missing values of cleaning_fee with the appropriate numeric value. Then, display summary statistics for cleaning_fee to confirm that there are no longer missing values. Show the code and output used to impute the missing values and determine if there are missing values. Hint: You can use the case_when or if_else functions to impute the missing values.

See https://dplyr.tidyverse.org/reference/case_when.html for more information about the case_when function.
See https://dplyr.tidyverse.org/reference/if_else.html for more information about the if_else function.


This is an example of data that is missing not at random, since there is a specific pattern/explanation to the missing data. We will talk more about dealing with missing data later in the semester.

See Section 5.6.3 of R for Data Science for more information about counting observations.
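
As a rough illustration of counting and imputing missing values, the sketch below works on a made-up data frame. The column fee and the value stored in fill_value are assumptions for the toy data only; deciding which value is appropriate for cleaning_fee is part of the exercise.

```r
library(dplyr)

# Hypothetical data with two missing fees
toy <- tibble(id = 1:4, fee = c(25, NA, 40, NA))

# Count the observations with a missing fee
n_missing_before <- toy %>%
  summarise(n_missing = sum(is.na(fee)))

# Replace missing fees with a chosen value (an assumption for this toy data)
fill_value <- 0
toy_filled <- toy %>%
  mutate(fee = if_else(is.na(fee), fill_value, fee))

# Confirm that no missing values remain
toy_filled %>%
  summarise(n_missing = sum(is.na(fee)), mean_fee = mean(fee))
```

Note that if_else requires the replacement and the original column to have the same type, so fill_value is stored as a double to match fee.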

  1. Next, we look at the variable property_type.
    • Use the count function to determine how many categories are in the variable property_type and the frequency of each category.
    • What are the top 4 most common property types? These make up what proportion of the observations?
  2. Since an overwhelming majority of the observations in the data are one of the top 4 property types, we would like to create a simplified version of the property_type variable that has 5 categories: House, Apartment, Guest suite, Bungalow, and Other. Fill in the code below to create prop_type_simp.
airbnb <- airbnb %>%
  mutate(prop_type_simp = case_when(
    property_type %in% c("House","______", "______","______") ~ property_type, 
    TRUE ~ "Other"
  ))

You can use the code below to check that prop_type_simp was correctly made.

airbnb %>%
  count(property_type, prop_type_simp) %>%
  arrange(desc(n))
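
For the counting step in the property type exercise, one possible approach is to sort the counts and attach a proportion column. The sketch below uses a made-up variable kind with made-up categories, and it sums the top 2 categories only because the toy data is small.

```r
library(dplyr)

# Hypothetical categorical variable with 5 categories
toy <- tibble(kind = c("A", "A", "A", "B", "B", "C", "D", "E"))

# Frequency of each category, most common first, with its share of all rows
kind_counts <- toy %>%
  count(kind, sort = TRUE) %>%
  mutate(prop = n / sum(n))

# Combined share of the two most common categories
top_share <- kind_counts %>%
  slice_head(n = 2) %>%
  summarise(share = sum(prop))

kind_counts
top_share
```
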
  1. Airbnb is most commonly used for travel purposes, i.e. as an alternative to traditional hotels; we only want to include Airbnb listings in our regression analysis that are intended for travel purposes. What are the 5 most common values for the variable minimum_nights? Which value in the top 5 stands out? What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights?

Filter the airbnb data so that it only includes observations with minimum_nights <= 3. You will use this filtered dataset for the remainder of the lab.

Regression Analysis

  1. For the response variable, we will use the cost to stay at an Airbnb location for 3 nights. Create a new variable called price_3_nights that uses price and cleaning_fee to calculate the total cost to stay at the Airbnb property for 3 nights. Note that the cleaning fee is only applied one time per stay.

  2. Create plots to visualize the distributions of price_3_nights and log(price_3_nights). You should be able to see the shape, approximate center, and approximate spread from the plot you choose. Based on your plots, which variable should you use for the regression model? Briefly explain.

Use this variable as the response variable for the remainder of the lab.
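
A minimal sketch of building a multi-night total and comparing its distribution on the original and log scales is shown below. The data frame and the columns nightly and one_time_fee are hypothetical, and histograms are just one of several plot types that show shape, center, and spread.

```r
library(dplyr)
library(ggplot2)

# Hypothetical nightly rates and one-time fees
toy <- tibble(
  nightly = c(50, 80, 120, 300),
  one_time_fee = c(0, 25, 40, 100)
)

# Three nights at the nightly rate plus the fee charged once per stay
toy <- toy %>%
  mutate(
    total_3 = 3 * nightly + one_time_fee,
    log_total_3 = log(total_3)
  )

# One histogram per scale; print p_raw and p_log to display them
p_raw <- ggplot(toy, aes(x = total_3)) +
  geom_histogram(bins = 10)
p_log <- ggplot(toy, aes(x = log_total_3)) +
  geom_histogram(bins = 10)
```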

  1. Fit a regression model with the response variable from the previous question and the following predictor variables: prop_type_simp, number_of_reviews, and review_scores_rating. Display the model with the inference statistics and confidence intervals for each coefficient.

  2. Interpret the coefficient of review_scores_rating in terms of price_3_nights.

  3. Interpret the coefficient of prop_type_simpGuest suite in terms of price_3_nights.

  4. Suppose you are planning to visit Asheville for Fall Break, and you want to stay in an Airbnb. You find an Airbnb that is an apartment with a private room, has 10 reviews, and an average rating of 90. Use the model to predict the total cost to stay at this Airbnb for 3 nights. Include the appropriate 95% interval with your prediction. Report the prediction and interval in terms of price_3_nights.
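
The general workflow for a model with a log-transformed response can be sketched with the mtcars data that ships with R, used here only as a stand-in for the Airbnb data. The model formula, the new observation new_car, and the choice of interval = "prediction" are assumptions made for this illustration; which interval is appropriate for the lab question is for you to decide.

```r
library(broom)

# Stand-in data: treat the number of cylinders as a categorical predictor
cars <- transform(mtcars, cyl = factor(cyl))

# Log-transformed response with one numeric and one categorical predictor
fit <- lm(log(mpg) ~ wt + cyl, data = cars)

# Coefficient table with standard errors, p-values, and 95% confidence intervals
coefs <- tidy(fit, conf.int = TRUE, conf.level = 0.95)

# Exponentiated coefficients act as multiplicative factors on the original scale
mult <- exp(coef(fit))

# Predict for a new observation on the log scale, then back-transform
new_car <- data.frame(wt = 3, cyl = factor("6", levels = levels(cars$cyl)))
pred_log <- predict(fit, newdata = new_car, interval = "prediction", level = 0.95)
pred <- exp(pred_log)

coefs
pred
```

Because the model is fit to the log of the response, both the point prediction and the interval endpoints are exponentiated to report them in the original units.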

You’re done! Commit all remaining changes, use the commit message “Done with Lab 5!”, and push. Before you wrap up the assignment, make sure the .Rmd and .md documents are updated in your GitHub repo. There is a 10% penalty if the .Rmd file has to be knitted to display graphs, i.e. the graphs are not showing in the .md file on GitHub.

Acknowledgement

The data for this lab is from insideairbnb.com.