In practice, a large share of the time in a statistical analysis is often spent cleaning and preparing the data. The goal of today’s lab is to practice cleaning messy data so it can be used in a regression analysis. You will also practice interpreting the results from a regression model that has numeric and categorical predictors and a log-transformed response variable.
Go to the STA210-Sp19 organization on GitHub (https://github.com/STA210-Sp19). Click on the repo with the prefix lab-05-. It contains the starter documents you need to complete the warmup exercise.
Clone the repo and create a new project in RStudio Cloud.
Configure git by typing the following in the console.
library(usethis)
use_git_config(user.name="your name", user.email="your email")
If you would like your git password cached for a week for this project, type the following in the Terminal:
git config --global credential.helper 'cache --timeout 604800'
You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.
We will use the following packages in today’s lab.
library(tidyverse)
library(knitr)
library(skimr)
library(broom)
Currently your project is called Untitled Project. Update the name of your project to be “Lab 05 - Airbnb”.
Before we introduce the data, let’s warm up with a simple exercise.
Pick one team member to update the author and date fields at the top of the R Markdown file. Knit, commit, and push all the updated documents to GitHub.
Now, the remaining team members who have not been concurrently making these changes on their projects should click on the Pull button in their Git pane and observe that the changes are now reflected on their projects as well.
Today’s data is about Airbnb listings in Asheville, NC. The data was obtained from http://insideairbnb.com/; it was originally scraped from airbnb.com.
You can see a visualization of some of the data used in today’s lab at http://insideairbnb.com/asheville/.
basic_info <- read_csv("data/airbnb_basic.csv")
details <- read_csv("data/airbnb_details.csv")
We will use the following variables in this lab:
price: Cost per night (in U.S. dollars)
cleaning_fee: Cleaning fee (in U.S. dollars)
property_type: Type of dwelling (House, Apartment, etc.)
room_type:
number_of_reviews: Total number of reviews for the listing
review_scores_rating: Average review score (0 - 100)
We will use the basic_info and details data frames in this analysis. Both data frames have the variable id that uniquely identifies each Airbnb listing. Because we need data from both basic_info and details, we only want to include observations that appear in both data frames. Therefore, we will use an inner_join to combine the two data sets. (Note: R will use the shared id variable to join the two data frames.)
See Section 13.4 of R for Data Science for more information about joins.
airbnb <- inner_join(basic_info,details)
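To see what an inner join keeps and drops, here is a minimal sketch on two made-up data frames. The names toy_basic, toy_details, and toy_joined and all values are hypothetical stand-ins, not the lab data, and by = "id" is written out only to make the join key explicit.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for basic_info and details (values are made up)
toy_basic <- tibble(id = c(1, 2, 3), price = c(80, 120, 95))
toy_details <- tibble(id = c(2, 3, 4), number_of_reviews = c(10, 4, 25))

# Keep only the rows whose id appears in both data frames
toy_joined <- inner_join(toy_basic, toy_details, by = "id")

# Rows = listings found in both inputs; columns = id plus the other variables
dim(toy_joined)
toy_joined
```

Listings with id 1 and id 4 each appear in only one input, so they are dropped from the result.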
How many observations are in airbnb? How many variables?
Use the code below to check how cleaning_fee is currently stored in the airbnb data frame.
typeof(airbnb$cleaning_fee)
The column cleaning_fee currently contains what type of data? Why do you think the data is stored this way even though cleaning_fee is a quantitative variable?
Since cleaning_fee is a quantitative variable, we need to make sure it is stored as numeric data in the data frame. To do so, we will first use the extract function in the tidyr package to create a column of cleaning fees that don’t have the dollar sign. Then, we will use the as.numeric() function to convert the extracted data to the numeric data type double.
See https://tidyr.tidyverse.org/reference/extract.html for more information about the extract function.
airbnb <- airbnb %>%
  extract(cleaning_fee, "cleaning_fee") %>%
  mutate(cleaning_fee = as.numeric(cleaning_fee))
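The two-step conversion above can be tried on a small made-up column first. This is only a sketch: the data frame toy and its values are hypothetical, and it relies on the default pattern of extract, which keeps the first run of letters and digits in each string.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical fees stored as text, including one missing value
toy <- tibble(id = 1:3, cleaning_fee = c("$50.00", "$125.00", NA))

toy_clean <- toy %>%
  # Default pattern keeps the first run of letters/digits, e.g. "50" from "$50.00"
  extract(cleaning_fee, "cleaning_fee") %>%
  # Convert the extracted text to numbers; NA stays NA
  mutate(cleaning_fee = as.numeric(cleaning_fee))

typeof(toy_clean$cleaning_fee)
toy_clean$cleaning_fee
```

Because the default pattern stops at the first character that is not a letter or digit, anything after a decimal point or a thousands separator is dropped. If the fees could include cents or commas, readr::parse_number() is one alternative worth considering.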
Use the typeof function to confirm that cleaning_fee is now stored as a double data type.
Use the skim function to view a summary of the cleaning_fee data. How many observations have missing values for cleaning_fee? What do you think is the most likely reason for the missing observations of cleaning_fee? In other words, what does a missing value of cleaning_fee indicate?
Fill in the code below to impute the missing values of cleaning_fee with an appropriate numeric value. Then use the skim function to confirm that there are no longer missing values of cleaning_fee.
See https://dplyr.tidyverse.org/reference/case_when.html for more information about the case_when function.
airbnb <- airbnb %>%
  mutate(cleaning_fee = case_when(
    is.na(cleaning_fee) ~ ______,
    TRUE ~ cleaning_fee
  ))
This is an example of data that is missing not at random, since there is a specific pattern or explanation for the missing data. We will talk more about dealing with missing data later in the semester.
See Section 5.6.3 of R for Data Science for more information about counting observations.
Next, consider the variable property_type.
Use the count function to determine how many categories are in the variable property_type and the frequency of each category.
Create a simplified version of the property_type variable that has 5 categories: House, Apartment, Guest suite, Bungalow, and Other. Fill in the code below to create prop_type_simp.
airbnb <- airbnb %>%
  mutate(prop_type_simp = case_when(
    property_type %in% c("House","______", "______","______") ~ property_type,
    TRUE ~ "Other"
  ))
Use the code below to check that prop_type_simp was correctly made.
airbnb %>%
  count(property_type, prop_type_simp) %>%
  arrange(desc(n))
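The collapse-then-check pattern used for prop_type_simp can be practiced on an unrelated toy variable. In this sketch the data frame toy, the column fuel, and its categories are all made up for illustration and have nothing to do with the Airbnb data.

```r
library(dplyr)
library(tibble)

# Hypothetical categorical variable with a few rare levels
toy <- tibble(fuel = c("Gas", "Gas", "Diesel", "Electric", "Hybrid", "Gas", "Hydrogen"))

toy <- toy %>%
  mutate(fuel_simp = case_when(
    fuel %in% c("Gas", "Diesel") ~ fuel,  # keep the listed levels as they are
    TRUE ~ "Other"                        # everything else becomes "Other"
  ))

# Cross-tabulate original vs. simplified levels to verify the recoding
toy %>%
  count(fuel, fuel_simp) %>%
  arrange(desc(n))
```

Each original level should map to exactly one simplified level in the resulting table; a level that shows up next to an unexpected simplified value usually signals a typo in the vector passed to %in%.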
What are the 5 most common values of minimum_nights? Which value in the top 5 stands out? What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights?
Filter the airbnb data so that it only includes observations with minimum_nights <= 3. You will use this filtered dataset for the remainder of the lab.
For the response variable, we will use the cost to stay at an Airbnb location for 3 nights. Create a new variable called price_3_nights that uses price and cleaning_fee to calculate the total cost to stay at the Airbnb property for 3 nights. Be sure to add this variable to your data frame.
Use histograms to examine the distributions of price_3_nights and log(price_3_nights). Based on the histograms, which variable should you use for the regression model? Briefly explain.
Use this variable as the response for the remainder of the lab.
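As a sketch of how to draw the two histograms side by side in code, the example below uses simulated data rather than the lab data. The data frame toy, the column cost, and the simulation settings are all assumptions chosen only to produce a positive, right-skewed variable.

```r
library(ggplot2)

# Simulated positive, right-skewed values (not the Airbnb data)
set.seed(210)
toy <- data.frame(cost = rlnorm(500, meanlog = 6, sdlog = 0.6))

# Histogram on the original scale
p_raw <- ggplot(toy, aes(x = cost)) +
  geom_histogram(bins = 30) +
  labs(title = "Simulated cost", x = "cost")

# Histogram on the log scale
p_log <- ggplot(toy, aes(x = log(cost))) +
  geom_histogram(bins = 30) +
  labs(title = "Log of simulated cost", x = "log(cost)")

p_raw
p_log
```

When comparing the two plots, look at the shape of each distribution, in particular its symmetry and any long tail.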
Fit a regression model called model1 with the response variable from the previous question and the following predictor variables: prop_type_simp, number_of_reviews, and review_scores_rating. Display the model output.
Interpret the coefficient of review_scores_rating in terms of price_3_nights.
Interpret the coefficient of prop_type_simpGuest suite in terms of price_3_nights.
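When the response is log-transformed, a coefficient describes a multiplicative change on the original scale once it is exponentiated. The sketch below illustrates this with R's built-in mtcars data as a stand-in; the model, the object names fit, b, new, and p, and the variables are unrelated to the Airbnb analysis.

```r
# Stand-in model: log-transformed response, one numeric and one categorical predictor
fit <- lm(log(mpg) ~ wt + factor(cyl), data = mtcars)
b <- coef(fit)

# Multiplicative change in the (median) response for a one-unit increase in wt,
# holding cyl fixed
exp(b["wt"])

# Multiplicative difference between cyl = 6 and the baseline level cyl = 4,
# holding wt fixed
exp(b["factor(cyl)6"])

# Check: the ratio of back-transformed predictions one unit of wt apart
# equals exp(coefficient of wt)
new <- data.frame(wt = c(3, 4), cyl = c(4, 4))
p <- exp(predict(fit, newdata = new))
unname(p[2] / p[1])
```

The same reasoning applies to an indicator coefficient: exponentiating it gives the factor by which the response for that category is expected to differ from the baseline category, with the other predictors held constant.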
We want to determine if room_type is a significant predictor of the cost for 3 nights, given everything else in the model. Fit a regression model called model2 that includes all of the predictor variables in model1 and room_type. Display the model output.
Use the code below to conduct a Nested F test to determine if room_type is a significant predictor of the minimum cost. What is your conclusion from the Nested F test?
anova(model1, model2)
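To get familiar with the layout of the output from anova() on two nested models, here is a sketch using the built-in mtcars data. The models reduced and full are hypothetical stand-ins for model1 and model2, not the lab's models.

```r
# Reduced model: numeric predictor only
reduced <- lm(log(mpg) ~ wt, data = mtcars)

# Full model: adds a categorical predictor with 3 levels (2 extra coefficients)
full <- lm(log(mpg) ~ wt + factor(cyl), data = mtcars)

# Nested F test: does the added predictor explain a significant amount of variation?
f_test <- anova(reduced, full)
f_test

# Df in the second row is the number of coefficients added by the full model
f_test$Df[2]
```

The second row of the table reports the F statistic and its p-value for the null hypothesis that all of the added coefficients are zero.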
Use model2 to predict the total cost to stay at this Airbnb for 3 nights. Include the appropriate 95% interval with your prediction. Report the prediction and interval in terms of price_3_nights.
You’re done! Commit all remaining changes, use the commit message “Done with Lab 5!”, and push. Before you wrap up the assignment, make sure the .Rmd and .md documents are updated in your GitHub repo. There is a 10% penalty if the .Rmd file has to be knitted to display graphs, i.e. the graphs are not showing in the .md file on GitHub.
The data from this lab is from insideairbnb.com