Lab 05: Airbnbs in Asheville, NC

Data Wrangling & Regression

due Wed, Feb 20 at 11:59p

When doing statistical analyses in practice, there is often a lot of time spent on cleaning and preparing the data. The goal of today’s lab is to practice cleaning messy data, so it can be used in a regression analysis. You will also practice interpreting the results from a regression model that has numeric and categorical predictors and a log-transformed response variable.

Getting Started

library(usethis)
use_git_config(user.name="your name", user.email="your email")

Password caching

If you would like your git password cached for a week for this project, type the following in the Terminal:

git config --global credential.helper 'cache --timeout 604800'

You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.

Packages

We will use the following packages in today’s lab.

library(tidyverse)
library(knitr)
library(skimr)
library(broom)

Project name:

Currently your project is called Untitled Project. Update the name of your project to be “Lab 05 - Airbnb”.

Warm up

Before we introduce the data, let’s warm up with a simple exercise.

YAML:

Data

Today’s data is about Airbnb listings in Asheville, NC. The data was obtained from http://insideairbnb.com/; it was originally scraped from airbnb.com.

You can see a visualization of some of the data used in today’s lab at http://insideairbnb.com/asheville/.

basic_info <- read_csv("data/airbnb_basic.csv")
details <- read_csv("data/airbnb_details.csv")

We will use the following variables in this lab:

Exercises

Data wrangling

  1. We would like to use variables from both the basic_info and details data frames in this analysis. Both dataframes have the variable id that uniquely identifies each Airbnb listing. Because we need data from basic_info and details, we only want to include observations that are in both the basic_info and details datasets. Therefore, we will use an inner_join to combine the two data sets. (Note: Both data frames include a variable called id that uniquely identifies each Airbnb listing. R will use this variable to join the two data frames.)

See Section 13.4 of R for Data Science for more information about joins.

airbnb <- inner_join(basic_info,details)

How many observations are in airbnb? How many variables?

  1. Some Airbnb rentals have cleaning fees, and we want to include the cleaning fee when we calculate the total rental cost. Use the code below to see how the data in the column cleaning_fee is currently stored in the airbnb data frame.
typeof(airbnb$cleaning_fee)

The column cleaning_fee currently contains what type of data? Why do you think the data is stored this way even though cleaning_fee is a quantitative variable?

  1. Since cleaning_fee is a quantitative variable, we need to make sure it is stored as numeric data in the dataframe. To do so, we will first use the extract function in tidyr package to create a column of cleaning fees that don’t have the dollar sign. Then, we will use the as.numeric() function to make the extracted data the numeric data type double.

See [https://tidyr.tidyverse.org/reference/extract.html) for more information about the extract function.

airbnb <- airbnb %>% 
  extract(cleaning_fee, "cleaning_fee") %>%
  mutate(cleaning_fee = as.numeric(cleaning_fee))

Use the typeof function to confirm that cleaning_fee is now stored as a double data type.

  1. Use the skim function to view a summary of the cleaning_fee data. How many observations have missing values for cleaning_fee? What do you think is the most likely reason for the missing observations of cleaning_fee? In other words, what does a missing value of cleaning_fee indicate?

  2. Fill in the code below to impute the missing values of cleaning_fee with an appropriate numeric value. Then use the skim function to confirm that there are no longer missing values of cleaning_fee.

See https://dplyr.tidyverse.org/reference/case_when.html for more information about the case_when function.

airbnb <- airbnb %>%
  mutate(cleaning_fee = case_when(
    is.na(cleaning_fee) ~ ______, 
    TRUE ~ cleaning_fee
  ))

This is an example of data that is missing not at random, since there is a specific pattern/explanation to the misisng data. We will talk more about dealing with missing data later in the semester.

See Section 5.6.3 of R for Data Science for more information about counting observations.

  1. Next, we look at the variable property_type.
    • Use the count function to determine how many categories are in the variable property and the frequency of each category.
    • What are the top 4 most common property types? These make up what proportion of the observations?
  2. Since an overwhelming majority of the observations in the data are one of the top 4 property types, we would like to create a simplified version of the proprety_type variable that has 5 categories: House, Apartment, Guest suite, Bungalow, and Other. Fill in the code below to create prop_type_simp.
airbnb <- airbnb %>%
  mutate(prop_type_simp = case_when(
    property_type %in% c("House","______", "______","______") ~ property_type, 
    TRUE ~ "Other"
  ))

Use the code below to check that prop_type_simp was correctly made.

airbnb %>%
  count(property_type, prop_type_simp) %>%
  arrange(desc(n))
  1. Airbnb is most commonly used for travel purposes, i.e. as an alternative to traditional hotels. We only want to include Airbnb listings in our regression analysis that are intended for travel purposes. What are the 5 most common values for the variable minimum_nights? Which value in the top 5 stands out? What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights?

Filter the airbnb data so that it only includes observations with minimum_nights <= 3. You will use this filtered dataset for the remainder of the lab.

Regression Analysis

  1. For the response variable, will use the cost to stay at an Airbnb location for 3 nights. Create a new variable called price_3_nights that uses price and cleaning_fee to calculate the total cost to stay at the Airbnb property for 3 nights. Be sure to add this variable to your dataframe.

  2. Use histograms to examine the distributions of price_3_nights and log(price_3_nights). Based on the histograms, which variable should you use for the regression model? Briefly explain.

Use this variable as the response for the remainder of the lab.

  1. Fit a regression model called model1 with the response variable from the previous question and the following predictor variables: prop_type_simp, number_of_reviews, and review_scores_rating. Display the model output.

  2. Interpret the coefficient review_scores_rating in terms of price_3_nights.

  3. Interpret the coefficient of prop_type_simpGuest suite in terms of price_3_nights.

  4. We want to determine if room_type is a significant predictor of the cost for 3 nights, given everything else in the model. Fit a regression model called model2 that includes all of the predictor variables in model1 and room_type. Display the model output.

Use the code below to conduct a Nested F test to determine if room_type is a significant predictor of the minimum cost. What is your conclusion from the Nested F test?

anova(model1, model2)
  1. Suppose you are planning to visit Asheville over spring break, and you want to stay in an Airbnb. You find an Airbnb that is an apartment with a private room, has 10 reviews, and an average rating of 90. Use model2 to predict the total cost to stay at this Airbnb for 3 nights. Include the appropriate 95% interval with your prediction. Report the prediction and interval in terms of price_3_nights.

You’re done! Commit all remaining changes, use the commit message “Done with Lab 5!”, and push. Before you wrap up the assignment, make sure the .Rmd and .md documents are updated in your GitHub repo. There is a 10% penalty if the .Rmd file has to be knitted to display graphs, i.e. the graphs are not showing in the .md file on GitHub.

Acknowledgement

The data from this lab is from insideairbnb.com