Getting Started

A brief outline of getting started is shown below. See the Lab 01 Instructions for more details about the steps.

  • Go to the sta210-sp20 organization on GitHub (https://github.com/sta210-sp20). Click on the repo with the prefix hw-03. It contains the starter documents you need to complete the lab.
  • Clone the repo in RStudio Cloud

Tips

Here are some tips as you complete HW 03:

  • Periodically knit your document and commit changes (the more often the better, for example, once per each new task)
  • Push all your changes back to your GitHub repo. The Git pane in RStudio should be empty after you push. You can also check your assignment repo on github.com to make sure it has updated.
  • Once you have completed your work, you will submit your assignment in Gradescope. You are welcome to resubmit multiple times until the assignment deadline. We will grade the most recent version of the .pdf file in Gradescope.

Packages

We will use the following packages in this assignment:

library(tidyverse)
library(broom)
library(knitr) 

If you need to install any of the packages, type install.packages("package_name") in the console, where package_name is the package you need to install.

Questions

Part 1: Conceptual

The Conceptual section of homework contains short answer questions about the concepts discussed in class. Some of these questions may also require short chunks of code to produce the output needed to answer the question. Answers should be written in complete sentences.

For Questions 1- 7, we will use data on Ebay auctions for Mario Kart video for the Wii. The data was originally collected in October 2009. Click here to read more about the dataset and the variables.

Below is a regression model using cond, stock_photo, and wheels, n_bids and cond * wheels to explain variation in game_pr. The variable game_pr is calculated as total_pr - ship_pr, the difference in the total price and shipping price. The model is shown below in mathematical notation as well as in the output from R. Use the model to answer Questions 1 - 7.

\[\hat{\text{game_pr}} = \hat{\beta}_0 + \hat{\beta}_1 \text{condition_used} + \hat{\beta}_2 \text{stock_photo_yes} + \hat{\beta}_3 \text{wheels} + \hat{\beta}_4 \text{n_bids} + \hat{\beta}_5 (\text{cond_used} \times \text{wheels})\]

term estimate std.error statistic p.value conf.low conf.high
(Intercept) 36.682 2.310 15.877 0.000 32.113 41.252
condused -1.632 2.009 -0.812 0.418 -5.605 2.341
stock_photoyes 1.108 1.131 0.980 0.329 -1.128 3.344
wheels 9.235 1.079 8.559 0.000 7.101 11.368
n_bids -0.090 0.079 -1.136 0.258 -0.246 0.067
condused:wheels -3.092 1.304 -2.370 0.019 -5.672 -0.512
  1. Write the equation of the model that can used to predict game_pr for new Mario Kart games sold on Ebay. Write the equation using the estimated coefficients from R and the variable names.

  2. Write the equation of the model that can used to predict game_pr for used Mario Kart games. Write the equation using the estimated coefficients from R and the variable names.

  3. What subset of Mario Kart games is described by the intercept? In other words, $36.68 is the mean selling price for which subset of games? Is the intercept meaningful? Explain why or why not.

  4. Suppose we wish to test if there is a statistically significant difference in the mean selling price between new and used Mario Kart games with 0 wheels, a stock photo and 5 bids. What is the p-value associated with the test? State your conclusion for the test in the context of the data.

  5. Interpret the coefficient of stock_photoyes and it’s 95% confidence interval in the context of the data.

  6. Suppose we wish to test whether there is a statistically significant difference in the slope of wheels for new and used Mario Kart games. What is the p-value for this test? State your conclusion in the context of the data.

  7. Interpret the coefficient of condused:wheels in the context of the data.

Part 2: Data Analysis

The Data Analysis section of homework contains open-ended data analysis questions. Your response should be neatly organized and read as a complete narrative. This means that in addition to addressing the question there should also be exploratory data analysis and an analysis of the model assumptions. In short, these questions should be treated as “mini-projects”.

For the data analysis, we will return to the Airbnb data from Lab 05. Please see the Lab 05 instructions for a description of the dataset and variables.

You will use the dataset nashville-airbnb-mod.csv found in the data folder of your repo. This dataset already includes the following modifications you made in Lab 05:

  • Recode price and cleaning_fee to be numeric variables.
  • Create the variable prop_type_simp which takes categories Apartment, Condominium, House, Townhouse, and Other.
  • Create the varible price_3_nights calculated as price * 3 + cleaning fee.



Recall in Lab 05 that you fit a model of the following form:

\[\hat{\text{price_3_nights}} = \hat{\beta}_0 + \hat{\beta}_1 \text{prop_type_Condo} + \hat{\beta}_2 \text{prop_type_House} + \hat{\beta}_3 \text{prop_type_Townhouse} + \hat{\beta}_4 \text{numer_of_reviews} + \hat{\beta}_5 \text{review_scores_rating}\]

Below are plots of the residuals for this model:

Residual plots for model with price_3_nights as the response variable

  • Based on these residual plots, which assumption(s) for regression is(are) violated? Briefly explain your reasoning, including which plot(s) you used to draw your conclusion.

  • Let’s address the violations in the assumptions by using the log-transformed version of price_3_nights, instead of the original variable, to fit the model. As usual, we must start with exploratory data analysis. Use the plots below to write the narrative for the exploratory data analysis. Include as much detail as possible from the plots. You do not need to recreate the plots.

Univariate Data Analysis

Bivariate Data Analysis

  • Refit the model from Lab 05 with the log-transformed version of price_3_nights as the response and prop_type_simp, number_of_reviews, and review_scores_rating as the predictor variables. Show the code and output for your model.

  • Check the model assumptions for your model. Be sure to include all plots and narrative to support your conclusions.

  • Use your model to complete the following. Write all responses in terms of the natural units, i.e. the price for 3 nights:

    • Interpret the 95% confidence intervals for the coefficients of number_of_reviews and review_scores_rating in terms of price_3_nights in the context of the data.

    • Describe how the expected price for 3 nights differs based on the property type of the Airbnb. Be sure to include appropriate confidence intervals in your interpretation.

    • Use your model to describe how the median price for 3 nights changes when going from an Airbnb that’s an Apartment, has 100 reviews, and an average rating of 90 to an Airbnb that’s a Townhouse, has 200 reviews, and an average rating of 95. Show all output/calculations/formulas/etc. used to derive your answer. You do not need to include a confidence interval.

Grading

Total
Part 1: Conceptual 20
Part 2: Data analysis 25
Document neatly organized with clear headers 3
At least 3 informative commit messages 2

Submitting Your Assignment

Once your work is finalized in your GitHub repo, you will submit it to Gradescope. Your assignment must be submitted on Gradescope by the deadline to be considered on time.

See Submitting the Assignment for more details on how to submit the assignment on Gradescope.