In practice, a lot of the time spent on a statistical analysis goes to cleaning and preparing the data. The goal of today’s lab is to practice cleaning messy data so that it can be used in a regression analysis. You will also practice interpreting the results from a regression model that has numeric and categorical predictors and a log-transformed response variable.
Go to the sta210-fa20 organization on GitHub (https://github.com/sta210-fa20). Click on the repo with the prefix lab-05-. It contains the starter documents you need to complete the warmup exercise.
Clone the repo and create a new project in your RStudio Docker Container (https://vm-manage.oit.duke.edu).
Configure git by typing the following in the console.
If you would like your git password cached for a week for this project, type the following in the Terminal:
You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.
We will use the following packages in today’s lab. You may need to install the package skimr using the install.packages() command in your console.
Today’s data is about Airbnb listings in Nashville, TN. The data was obtained from http://insideairbnb.com/; it was originally scraped from airbnb.com.
You can see a visualization of some of the data used in today’s lab at http://insideairbnb.com/nashville/.
We will use the following variables in this lab:
price: Cost per night (in U.S. dollars)
cleaning_fee: Cleaning fee (in U.S. dollars)
property_type: Type of dwelling (House, Apartment, etc.)
room_type:
number_of_reviews: Total number of reviews for the listing
review_scores_rating: Average review score (0 - 100)
minimum_nights: Number of nights required to book rental
What type of variable is cleaning_fee (categorical / quantitative)? Is cleaning_fee stored in the data frame as this type?
Recode cleaning_fee, so it is stored correctly in the data frame. To do so, write code that completes the following tasks:
Use the str_replace function in the stringr package to remove the dollar sign, $, i.e. replace the dollar sign with an empty string. The dollar sign is a special character, so use \\$ in your code to identify the dollar sign.
Convert cleaning_fee, so it is numeric in the data frame.
Be sure to save the numeric version of cleaning_fee to use in the remainder of the lab.
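The two recoding tasks above can be sketched on a small made-up vector. This is only an illustration under assumptions: the example values are hypothetical, and the real column may contain other formatting that needs extra handling.

```r
library(stringr)

# Hypothetical fee values stored as text, as they might look after import
fee_raw <- c("$50.00", "$125.00", NA)

# "\\$" escapes the dollar sign so it is matched literally
# rather than being read as the end-of-string anchor
fee_clean <- str_replace(fee_raw, "\\$", "")

# Convert the cleaned text to numeric; missing values stay NA
fee_num <- as.numeric(fee_clean)
fee_num
```

Inside a data frame, the same two calls would typically go in a mutate() step that overwrites cleaning_fee, so the numeric version is the one saved for later questions.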
Visualize the distribution of cleaning_fee and display the appropriate summary statistics. Use the graph and summary statistics to describe the distribution of cleaning_fee.
Next, let’s examine property_type.
How many different categories of property_type are in the dataset? Show code and output to support your answer.
Since an overwhelming majority of the observations in the data are one of the top 4 property types, we would like to create a simplified version of the property_type variable that has 5 categories.
Create a new variable called prop_type_simp that has 5 categories: the four categories from the previous question and “Other” for all other property types. Be sure to save the new variable in the data frame.
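One possible way to collapse a categorical variable into its four most common levels plus “Other” is sketched below. The data frame toy and its values are made up for illustration; with the real data you would apply the same steps to airbnb.

```r
library(dplyr)

# Made-up property types; the real data has many more rows and categories
toy <- data.frame(
  property_type = c(rep("House", 6), rep("Apartment", 5), rep("Condominium", 4),
                    rep("Townhouse", 3), "Loft", "Cabin"),
  stringsAsFactors = FALSE
)

# Count listings per category, most common first
type_counts <- toy %>% count(property_type, sort = TRUE)

# Names of the four most common categories
top4 <- type_counts$property_type[1:4]

# Keep the top four as they are and collapse everything else into "Other"
toy <- toy %>%
  mutate(prop_type_simp = if_else(property_type %in% top4, property_type, "Other"))

toy %>% count(prop_type_simp, sort = TRUE)
```

Counting the new variable at the end is a quick check that exactly five categories remain.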
What are the 4 most common values of minimum_nights? Which value in the top 4 stands out? What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights? Show code and output to support your answer.
Airbnb is most commonly used for travel purposes, i.e. as an alternative to traditional hotels, so we only want to include Airbnb listings in our regression analysis that are intended for travel purposes. Filter airbnb so that it only includes observations with minimum_nights <= 3.
You will use this filtered dataset for the remainder of the lab.
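The tabulate-then-filter pattern can be sketched as follows. The values in toy are invented and are not meant to resemble the actual listings; the name toy_short is also hypothetical.

```r
library(dplyr)

# Invented listings with an id and a minimum stay
toy <- data.frame(
  id = 1:8,
  minimum_nights = c(1, 2, 1, 14, 2, 3, 14, 7)
)

# Tabulate the values of minimum_nights, most common first
toy %>% count(minimum_nights, sort = TRUE)

# Keep only listings that can be booked for 3 nights or fewer
toy_short <- toy %>% filter(minimum_nights <= 3)
nrow(toy_short)
```

Saving the filtered result under a name (or overwriting the original data frame) is what makes it available to the later questions.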
Create a new variable called price_3_nights that uses price and cleaning_fee to calculate the total cost to stay at the Airbnb property for 3 nights. Note that the cleaning fee is only applied one time per stay.
Be sure price is in the correct format before calculating the new variable.
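The arithmetic is three nightly charges plus a single cleaning fee. The sketch below assumes, as a guess about the raw data, that price is stored as text with a dollar sign and possibly a thousands separator, and that cleaning_fee has already been made numeric. All values are made up.

```r
library(dplyr)
library(stringr)

# Made-up listings: price is still text, cleaning_fee is already numeric
toy <- data.frame(
  price = c("$100.00", "$1,250.00"),
  cleaning_fee = c(50, 0),
  stringsAsFactors = FALSE
)

toy <- toy %>%
  mutate(
    # Strip every "$" and "," before converting the text to numeric
    price = as.numeric(str_replace_all(price, "[\\$,]", "")),
    # Three nightly charges plus one cleaning fee for the whole stay
    price_3_nights = 3 * price + cleaning_fee
  )

toy
```

For the first row this gives 3 * 100 + 50 = 350 dollars, and for the second 3 * 1250 + 0 = 3750 dollars.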
Fit a regression model with the response variable from the previous question and the following predictor variables: prop_type_simp, number_of_reviews, and review_scores_rating. Display the model with the inferential statistics and confidence intervals for each coefficient.
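A minimal sketch of fitting such a model and showing the estimates together with 95% confidence limits is given below. It runs on simulated data, so none of the numbers say anything about the real listings, and it assumes the response is log(price_3_nights), as the lab introduction suggests. The names sim, fit, and coef_table are hypothetical.

```r
set.seed(210)
n <- 200

# Simulated stand-in for the filtered listings; all values are made up
sim <- data.frame(
  prop_type_simp = factor(sample(
    c("Apartment", "House", "Townhouse", "Condominium", "Other"),
    n, replace = TRUE
  )),
  number_of_reviews = rpois(n, 30),
  review_scores_rating = round(runif(n, 70, 100))
)
sim$price_3_nights <- exp(
  5.5 + 0.2 * (sim$prop_type_simp == "House") -
    0.002 * sim$number_of_reviews + 0.005 * sim$review_scores_rating +
    rnorm(n, sd = 0.3)
)

# Assumed model form: log-transformed response with the three predictors
fit <- lm(log(price_3_nights) ~ prop_type_simp + number_of_reviews +
            review_scores_rating, data = sim)

# Estimates, standard errors, t statistics, p-values, and 95% confidence limits
coef_table <- cbind(coef(summary(fit)), confint(fit, level = 0.95))
round(coef_table, 3)

# With a log response, exp(coefficient) is a multiplicative change
# in the median of the response on its original scale
round(exp(coef_table[, c("Estimate", "2.5 %", "97.5 %")]), 3)
```

If you prefer a tidy data frame over a matrix, tidy() from the broom package with conf.int = TRUE returns the same quantities.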
Interpret the coefficient of number_of_reviews and its 95% confidence interval in the context of the data.
Interpret the coefficient of prop_type_simpHouse and its 95% confidence interval in the context of the data.
Interpret the intercept in the context of the data. Does the intercept have a meaningful interpretation? Briefly explain why or why not.
Suppose you are planning to visit Nashville over Spring Break, and you want to stay in an Airbnb. You find an Airbnb that is a townhouse, has 10 reviews, and an average rating of 90. Use the model to predict the total cost to stay at this Airbnb for 3 nights. Include the appropriate 95% interval with your prediction.
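Because the question is about one specific listing, a prediction interval is the natural choice. The sketch below shows the mechanics on simulated data, again assuming a log-transformed response; the names sim, fit, and new_listing are hypothetical and the resulting numbers are not an answer for the real data.

```r
set.seed(210)
n <- 200

# Simulated stand-in for the filtered listings; all values are made up
sim <- data.frame(
  prop_type_simp = factor(sample(
    c("Apartment", "House", "Townhouse", "Condominium", "Other"),
    n, replace = TRUE
  )),
  number_of_reviews = rpois(n, 30),
  review_scores_rating = round(runif(n, 70, 100))
)
sim$price_3_nights <- exp(
  5.5 - 0.002 * sim$number_of_reviews + 0.005 * sim$review_scores_rating +
    rnorm(n, sd = 0.3)
)

fit <- lm(log(price_3_nights) ~ prop_type_simp + number_of_reviews +
            review_scores_rating, data = sim)

# The listing described in the question
new_listing <- data.frame(
  prop_type_simp = factor("Townhouse", levels = levels(sim$prop_type_simp)),
  number_of_reviews = 10,
  review_scores_rating = 90
)

# 95% prediction interval on the log scale
pred_log <- predict(fit, newdata = new_listing,
                    interval = "prediction", level = 0.95)

# Back-transform the prediction and both limits to dollars
pred_dollars <- exp(pred_log)
pred_dollars
```

If your response is not log-transformed, skip the exp() step and report the output of predict() directly.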
You’re done and ready to submit your work! Knit, commit, and push all remaining changes. You can use the commit message “Done with Lab 05!”, and make sure you have pushed all the files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub. Then submit the assignment on Gradescope following the instructions below.
See “Submitting the Assignment” from Lab 01 for detailed instructions on how to upload the assignment on Gradescope.
Data Analysis & EDA | 19 |
Regression Analysis | 22 |
Lab attendance & participation | 3 |
Narrative in full sentences | 2 |
Document neatly organized with appropriate headers | 2 |
Commit messages from every member | 2 |
Total | 50 |
The data from this lab is from insideairbnb.com