When doing statistical analyses in practice, there is often a lot of time spent on cleaning and preparing the data. The goal of today’s lab is to practice cleaning messy data, so it can be used in a regression analysis. You will also practice interpreting the results from a regression model that has numeric and categorical predictors and a log-transformed response variable.
Go to the sta210-fa19 organization on GitHub (https://github.com/sta210-fa19). Click on the repo with the prefix lab-05-. It contains the starter documents you need to complete the warmup exercise.
Clone the repo and create a new project in RStudio Cloud.
Configure git by typing the following in the console.
If you would like your git password cached for a week for this project, type the following in the Terminal:
You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.
We will use the following packages in today’s lab.
Currently your project is called Untitled Project. Update the name of your project to be “Lab 05 - Airbnb”.
Before we introduce the data, let’s warm up with a simple exercise.
Pick one team member to update the author and date fields at the top of the R Markdown file. Knit, commit, and push all the updated documents to Github.
Now, the remaining team members who have not been concurrently making these changes on their projects should click on the Pull button in their Git pane and observe that the changes are now reflected on their projects as well.
Today’s data is about Airbnb listings in Asheville, NC. The data was obtained from http://insideairbnb.com/; it was originally scraped from airbnb.com.
You can see a visualization of some of the data used in today’s lab at http://insideairbnb.com/asheville/.
We will use the following variables in this lab:
price
: Cost per night (in U.S. dollars)cleaning_fee
: Cleaning fee (in U.S. dollars)property_type
: Type of dwelling (House, Apartment, etc.)room_type
:
number_of_reviews
: Total number of reviews for the listingreview_scores_rating
: Average review score (0 - 100)basic_info
and details
data frames in this analysis. Both data frames have the variable id
that uniquely identifies each Airbnb listing. Because we need data from basic_info
and details
, we only want to include observations that are in both the basic_info
and details
datasets. You may use Section 13.4 in R for Data Science to help determine the appropriate join. Hint: Both datasets have a variable called id
to uniquly identify the Airbnb listings.airbnb
.airbnb
? How many variables?cleaning_fee
currently stored in the airbnb
data frame?cleaning_fee
is a quantitative variable?cleaning_fee
is a quantitative variable, we need to make sure it is stored as numeric data in the data frame. To do so, we will first use the extract
function in tidyr package to create a column of cleaning fees that don’t have the dollar sign. Then, we need to make cleaning_fee
a numeric data type. Fill in the code below to extract the values of cleaning fee and store them as numeric data type.See [https://tidyr.tidyverse.org/reference/extract.html) for more information about the extract
function.
How many observations have missing values for cleaning_fee
? What do you think is the most likely reason for the missing observations of cleaning_fee
? In other words, what does a missing value of cleaning_fee
indicate? Show the code and output used to determine the number of observations with missing values.
Impute (i.e. fill in) the missing values of cleaning_fee
with the appropriate numeric value. Then, display summary statistics for cleaning_fee
to confirm that there are no longer missing values. Show the code and output used to impute the missing values and determine if there are missing values. Hint: You can use the case_when
or if_else
functions can be used to impute the missing values.
See https://dplyr.tidyverse.org/reference/case_when.html for more information about the case_when
function.
See https://dplyr.tidyverse.org/reference/if_else.html for more information about the if_else
function.
This is an example of data that is missing not at random, since there is a specific pattern/explanation to the misisng data. We will talk more about dealing with missing data later in the semester.
See Section 5.6.3 of R for Data Science for more information about counting observations.
property_type
.
count
function to determine how many categories are in the variable property
and the frequency of each category.proprety_type
variable that has 5 categories: House, Apartment, Guest suite, Bungalow, and Other. Fill in the code below to create prop_type_simp
.airbnb <- airbnb %>%
mutate(prop_type_simp = case_when(
property_type %in% c("House","______", "______","______") ~ property_type,
TRUE ~ "Other"
))
You can use the code below to check that prop_type_simp
was correctly made.
minimum_nights
? Which value in the top 5 stands out? What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights
?Filter the airbnb
data so that it only includes observations with minimum_nights <= 3
. You will use this filtered dataset for the remainder of the lab.
For the response variable, will use the cost to stay at an Airbnb location for 3 nights. Create a new variable called price_3_nights
that uses price
and cleaning_fee
to calculate the total cost to stay at the Airbnb property for 3 nights. Note that the cleaning fee is only applied one time per stay.
Create plots to visualize the distributions of price_3_nights
and log(price_3_nights)
. You should be able to see the shape, approximate center, and approximate spread from the plot you choose. Based on your plots, which variable should you use for the regression model? Briefly explain.
Use this variable as the response variable for the remainder of the lab.
Fit a regression model with the response variable from the previous question and the following predictor variables: prop_type_simp
, number_of_reviews
, and review_scores_rating
. Display the model with the inference statistics and confidence intervals for each coefficient.
Interpret the coefficient review_scores_rating
in terms of price_3_nights
.
Interpret the coefficient of prop_type_simpGuest suite
in terms of price_3_nights
.
Suppose you are planning to visit Asheville for Fall Break, and you want to stay in an Airbnb. You find an Airbnb that is an apartment with a private room, has 10 reviews, and an average rating of 90. Use the model to predict the total cost to stay at this Airbnb for 3 nights. Include the appropriate 95% interval with your prediction. Report the prediction and interval in terms of price_3_nights
.
You’re done! Commit all remaining changes, use the commit message “Done with Lab 5!”, and push. Before you wrap up the assignment, make sure the .Rmd and .md documents are updated in your GitHub repo. There is a 10% penalty if the .Rmd file has to be knitted to display graphs, i.e. the graphs are not showing in the .md file on GitHub.
The data from this lab is from insideairbnb.com