A brief outline of getting started is shown below. See the Lab 01 Instructions for more details about the steps.
Here are some tips as you complete HW 03:
We will use the following packages in this assignment:
library(tidyverse)
library(broom)
library(knitr)
If you need to install any of the packages, type install.packages("package_name") in the console, where package_name is the package you need to install.
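If you are not sure which of the packages are already installed, one possible check is sketched below. It is a sketch rather than a required step: the object names pkgs, is_installed, and missing_pkgs are hypothetical, and the code only reports what is missing instead of installing anything.

```r
# Packages used in this assignment
pkgs <- c("tidyverse", "broom", "knitr")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

# Report what still needs to be installed (nothing is installed automatically)
if (length(missing_pkgs) > 0) {
  message("Run in the console: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  message("All required packages are installed.")
}
```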
The Conceptual section of homework contains short answer questions about the concepts discussed in class. Some of these questions may also require short chunks of code to produce the output needed to answer the question. Answers should be written in complete sentences.
For Questions 1 - 7, we will use data on eBay auctions of Mario Kart video games for the Wii. The data were originally collected in October 2009. Click here to read more about the dataset and the variables.
Below is a regression model using cond, stock_photo, wheels, n_bids, and cond * wheels to explain variation in game_pr. The variable game_pr is calculated as total_pr - ship_pr, the difference between the total price and the shipping price. The model is shown below in mathematical notation as well as in the output from R. Use the model to answer Questions 1 - 7.
\[\hat{\text{game_pr}} = \hat{\beta}_0 + \hat{\beta}_1 \text{cond_used} + \hat{\beta}_2 \text{stock_photo_yes} + \hat{\beta}_3 \text{wheels} + \hat{\beta}_4 \text{n_bids} + \hat{\beta}_5 (\text{cond_used} \times \text{wheels})\]
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 36.682 | 2.310 | 15.877 | 0.000 | 32.113 | 41.252 |
condused | -1.632 | 2.009 | -0.812 | 0.418 | -5.605 | 2.341 |
stock_photoyes | 1.108 | 1.131 | 0.980 | 0.329 | -1.128 | 3.344 |
wheels | 9.235 | 1.079 | 8.559 | 0.000 | 7.101 | 11.368 |
n_bids | -0.090 | 0.079 | -1.136 | 0.258 | -0.246 | 0.067 |
condused:wheels | -3.092 | 1.304 | -2.370 | 0.019 | -5.672 | -0.512 |
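For reference, output with these columns can be produced roughly as sketched below. This is an illustration on simulated data, not the Mario Kart data: the data frame sim and its values are made up, so the numbers will not match the table above. The sketch uses base R only; tidy() from broom with conf.int = TRUE returns the same columns.

```r
# Simulated (hypothetical) data with the same variable names as the model above
set.seed(1)
n <- 100
sim <- data.frame(
  cond = factor(sample(c("new", "used"), n, replace = TRUE)),
  stock_photo = factor(sample(c("no", "yes"), n, replace = TRUE)),
  wheels = sample(0:4, n, replace = TRUE),
  n_bids = sample(1:30, n, replace = TRUE)
)
sim$game_pr <- 35 + 8 * sim$wheels -
  3 * (sim$cond == "used") * sim$wheels + rnorm(n, sd = 5)

# Main effects plus the cond-by-wheels interaction
fit <- lm(game_pr ~ cond + stock_photo + wheels + n_bids + cond * wheels,
          data = sim)

# Estimates, standard errors, test statistics, p-values, and 95% confidence limits
coef_table <- cbind(summary(fit)$coefficients, confint(fit, level = 0.95))
colnames(coef_table) <- c("estimate", "std.error", "statistic",
                          "p.value", "conf.low", "conf.high")
round(coef_table, 3)
```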
Write the equation of the model that can be used to predict game_pr for new Mario Kart games sold on eBay. Write the equation using the estimated coefficients from R and the variable names.
Write the equation of the model that can be used to predict game_pr for used Mario Kart games. Write the equation using the estimated coefficients from R and the variable names.
What subset of Mario Kart games is described by the intercept? In other words, $36.68 is the mean selling price for which subset of games? Is the intercept meaningful? Explain why or why not.
Suppose we wish to test if there is a statistically significant difference in the mean selling price between new and used Mario Kart games with 0 wheels, a stock photo and 5 bids. What is the p-value associated with the test? State your conclusion for the test in the context of the data.
Interpret the coefficient of stock_photoyes and its 95% confidence interval in the context of the data.
Suppose we wish to test whether there is a statistically significant difference in the slope of wheels for new and used Mario Kart games. What is the p-value for this test? State your conclusion in the context of the data.
Interpret the coefficient of condused:wheels in the context of the data.
The Data Analysis section of homework contains open-ended data analysis questions. Your response should be neatly organized and read as a complete narrative. This means that in addition to addressing the question there should also be exploratory data analysis and an analysis of the model assumptions. In short, these questions should be treated as “mini-projects”.
For the data analysis, we will return to the Airbnb data from Lab 05. Please see the Lab 05 instructions for a description of the dataset and variables.
You will use the dataset nashville-airbnb-mod.csv found in the data folder of your repo. This dataset already includes the following modifications you made in Lab 05:
- price and cleaning_fee are numeric variables.
- prop_type_simp, which takes categories Apartment, Condominium, House, Townhouse, and Other.
- price_3_nights, calculated as price * 3 + cleaning_fee.

Recall that in Lab 05 you fit a model of the following form:
\[\hat{\text{price_3_nights}} = \hat{\beta}_0 + \hat{\beta}_1 \text{prop_type_Condo} + \hat{\beta}_2 \text{prop_type_House} + \hat{\beta}_3 \text{prop_type_Townhouse} + \hat{\beta}_4 \text{number_of_reviews} + \hat{\beta}_5 \text{review_scores_rating}\]
Below are plots of the residuals for this model:
[Figure: residual plots for the model with price_3_nights as the response variable]
Based on these residual plots, which assumption(s) for regression is (are) violated? Briefly explain your reasoning, including which plot(s) you used to draw your conclusion.
Let’s address the violations in the assumptions by using the log-transformed version of price_3_nights, instead of the original variable, to fit the model. As usual, we must start with exploratory data analysis. Use the plots below to write the narrative for the exploratory data analysis. Include as much detail as possible from the plots. You do not need to recreate the plots.
Refit the model from Lab 05 with the log-transformed version of price_3_nights as the response and prop_type_simp, number_of_reviews, and review_scores_rating as the predictor variables. Show the code and output for your model.
Check the model assumptions for your model. Be sure to include all plots and narrative to support your conclusions.
Use your model to complete the following. Write all responses in terms of the natural units, i.e. the price for 3 nights:
Interpret the 95% confidence intervals for the coefficients of number_of_reviews and review_scores_rating in terms of price_3_nights in the context of the data.
Describe how the expected price for 3 nights differs based on the property type of the Airbnb. Be sure to include appropriate confidence intervals in your interpretation.
Use your model to describe how the median price for 3 nights changes when going from an Airbnb that’s an Apartment, has 100 reviews, and an average rating of 90 to an Airbnb that’s a Townhouse, has 200 reviews, and an average rating of 95. Show all output/calculations/formulas/etc. used to derive your answer. You do not need to include a confidence interval.
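As a reminder of what "natural units" means for a model with a log-transformed response, the sketch below back-transforms coefficients and their confidence limits with exp(). It is an illustration on simulated data: the names sim, x, y, log_fit, and mult are hypothetical and unrelated to the Airbnb data. Under the usual interpretation of such a model, each back-transformed slope is the multiplicative change in the median of the response for a one-unit increase in the predictor.

```r
# Simulated (hypothetical) data: y is right-skewed, log(y) is linear in x
set.seed(2)
n <- 200
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- exp(4 + 0.05 * sim$x + rnorm(n, sd = 0.3))

# Model with the log-transformed response
log_fit <- lm(log(y) ~ x, data = sim)

# exp() moves estimates and 95% confidence limits from the log scale
# to multiplicative factors on the original scale of y
mult <- exp(cbind(estimate = coef(log_fit), confint(log_fit, level = 0.95)))
colnames(mult) <- c("estimate", "conf.low", "conf.high")
round(mult, 3)
```

Because exp() is increasing, the back-transformed limits still bracket the back-transformed estimate, so the interval can be reported directly as a range of multiplicative factors.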
Component | Points |
---|---|
Part 1: Conceptual | 20 |
Part 2: Data analysis | 25 |
Document neatly organized with clear headers | 3 |
At least 3 informative commit messages | 2 |
Total | 50 |
Once your work is finalized in your GitHub repo, you will submit it to Gradescope. Your assignment must be submitted on Gradescope by the deadline to be considered on time.
See Submitting the Assignment for more details on how to submit the assignment on Gradescope.