The goal of this lab is to use multiple linear regression to understand the variation in the selling price of houses in King County, Washington. You will also gain practice using special predictors, such as categorical predictors and interaction effects, in the model, and you will be introduced to variable transformations.
Go to the STA210-Sp19 organization on GitHub (https://github.com/STA210-Sp19). Click on the repo with the prefix lab-04-. It contains the starter documents you need to complete the warmup exercise.
Clone the repo and create a new project in RStudio Cloud.
Configure git by typing the following in the console.
library(usethis)
use_git_config(user.name="your name", user.email="your email")
If you would like your git password cached for a week for this project, type the following in the Terminal:
git config --global credential.helper 'cache --timeout 604800'
You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.
We will use the following packages in today’s lab.
library(tidyverse)
library(knitr)
library(broom)
Currently your project is called Untitled Project. Update the name of your project to be “Lab 04 - Multiple linear regression”.
Before we introduce the data, let’s warm up with a simple exercise.
Pick one team member to update the author and date fields at the top of the R Markdown file. Knit, commit, and push all the updated documents to GitHub.
Now, the remaining team members who have not been concurrently making these changes on their projects should click on the Pull button in their Git pane and observe that the changes are now reflected on their projects as well.
The dataset for today’s lab contains the price and other characteristics of over 20,000 houses sold in King County, Washington (the county that includes Seattle). The dataset includes the following variables:
price: selling price of the house
date: date house was sold, measured in days since January 1, 2014
bedrooms: number of bedrooms
bathrooms: number of bathrooms
sqft: interior square footage
floors: number of floors
waterfront: 1 if the house has a view of the waterfront, 0 otherwise
yr_built: year the house was built
yr_renovated: 0 if the house was never renovated, otherwise the year the house was renovated

houses <- read_csv("data/KingCountyHouses.csv")
Look at the variable bedrooms. What is the maximum value? Does this value make sense? If not, what is this an indication of, i.e. how did this value get recorded in the data? Briefly explain.
See the documentation for more information about the summarise function.
We want to remove observations that have extreme values for bedrooms, i.e. those with values for bedrooms above the 95th percentile in the data. What is the 95th percentile for bedrooms? Use the summarise function to help you calculate this value.
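As a reminder of how summarise works, here is a minimal sketch on a made-up data frame. The name toy and its values are hypothetical and are not the King County data. It computes a maximum and a 95th percentile in a single call, using the base R function quantile for the percentile.

```r
library(dplyr)

# Toy data with made-up values (NOT the King County data)
toy <- data.frame(bedrooms = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 33))

# summarise() collapses the data frame to one row of summary statistics
bedroom_summary <- toy %>%
  summarise(max_bedrooms = max(bedrooms),
            p95_bedrooms = quantile(bedrooms, probs = 0.95, names = FALSE))

bedroom_summary
```

The same pattern should carry over to houses once the column name matches the one in your data.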
Fill in the code below to filter the data so that the extreme observations are removed. How many observations are in the updated dataset?
houses <- houses %>% filter(bedrooms <= ____)
We will use this dataset for the remainder of the analysis.
Plot the histogram and Normal QQ-plot of the residuals. Based on these plots, what regression assumption appears to be violated? Briefly explain.
Candidate transformations of a response variable y include the log (log(y)), the square root (sqrt(y)), and the reciprocal (1/y). Each transformation is applied to the response variable price, and the distributions of the transformed data are shown below.
Which transformation should we use to fix the violations of the model assumptions observed in the previous exercise? Briefly explain your choice.
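For reference, each of the transformations mentioned above is a single expression in R. The sketch below applies them to a short vector of made-up, right-skewed values. The vector y is hypothetical and only stands in for a response such as price.

```r
# Made-up, right-skewed values standing in for a response variable
y <- c(100, 150, 200, 400, 1600)

log_y   <- log(y)   # natural log
sqrt_y  <- sqrt(y)  # square root
recip_y <- 1 / y    # reciprocal

# Side-by-side view of the original and transformed values
data.frame(y, log_y, sqrt_y, recip_y)
```

Comparing the columns gives a feel for how strongly each transformation pulls in the largest value relative to the rest.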
Add the variable logprice, the log-transformed version of price, to the data frame. Fit a regression model with logprice as the response and sqft as the predictor variable. Create the residual plots (residuals vs. predicted, histogram of residuals, Normal QQ-plot). Briefly comment on whether or not using the transformed variable improved on the model assumptions.
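The workflow in this exercise might look roughly like the sketch below. It runs on a tiny made-up data frame (toy, with hypothetical prices and square footage), so the numbers mean nothing; only the sequence of steps is the point. It assumes the broom function augment is used to pull out fitted values and residuals, which is one option among several.

```r
library(dplyr)
library(broom)
library(ggplot2)

# Toy data with made-up values (NOT the King County data)
toy <- data.frame(
  price = c(200000, 250000, 310000, 400000, 520000, 750000),
  sqft  = c(900, 1200, 1500, 1900, 2400, 3100)
)

# Add the log-transformed response, then fit the simple linear regression
toy <- toy %>%
  mutate(logprice = log(price))
toy_model <- lm(logprice ~ sqft, data = toy)

# augment() returns the model data plus fitted values (.fitted) and residuals (.resid)
toy_aug <- augment(toy_model)

# Residuals vs. predicted values
resid_vs_fitted <- ggplot(toy_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Predicted log(price)", y = "Residual")

# Histogram of the residuals
resid_hist <- ggplot(toy_aug, aes(x = .resid)) +
  geom_histogram(bins = 5) +
  labs(x = "Residual")

# Normal QQ-plot of the residuals
resid_qq <- ggplot(toy_aug, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()
```

Each plot is stored in an object here; typing its name (for example resid_vs_fitted) in a code chunk displays it.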
Though we can explain about 48% of the variation in house prices by the square footage, we would like to incorporate some of the other available house characteristics in the model.
Before fitting the model, use the code below to add the variable floorsCat that is the categorical version of the variable floors. Discuss with your group why it may make sense to treat floors as categorical, even though it represents a count.
houses <- houses %>%
  mutate(floorsCat = as.factor(floors))
See the documentation for more information about the count function.
Use the count function to see the number of observations at each level of floorsCat. What is the most common number of floors?
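To show what count returns, here is a small sketch on made-up data. The data frame toy and its values are hypothetical. The result has one row per level and a column n holding the number of observations at that level.

```r
library(dplyr)

# Toy data with made-up values (NOT the King County data)
toy <- data.frame(floors = c(1, 1, 2, 2, 2, 3)) %>%
  mutate(floorsCat = as.factor(floors))

# One row per level of floorsCat, with the number of observations in column n
floor_counts <- toy %>%
  count(floorsCat)

floor_counts
```

Adding sort = TRUE inside count orders the rows from most to least frequent, which can make the most common level easier to spot.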
Create mean-centered versions of sqft, bedrooms, and bathrooms and add them to the data frame.
houses <- houses %>%
  mutate(sqftCent = sqft - mean(sqft),
         bedroomsCent = bedrooms - mean(bedrooms),
         bathroomsCent = bathrooms - mean(bathrooms))
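A quick way to see what mean-centering does is to compare summaries before and after. The sketch below uses a made-up data frame (toy, with hypothetical values). The centered variable should have a mean of zero up to rounding and the same spread as the original.

```r
library(dplyr)

# Toy data with made-up values (NOT the King County data)
toy <- data.frame(sqft = c(900, 1200, 1500, 1900, 2500)) %>%
  mutate(sqftCent = sqft - mean(sqft))

# Centering shifts the mean to 0 but leaves the standard deviation unchanged
centering_check <- toy %>%
  summarise(mean_sqft     = mean(sqft),
            mean_sqftCent = mean(sqftCent),
            sd_sqft       = sd(sqft),
            sd_sqftCent   = sd(sqftCent))

centering_check
```

Because only the location changes, a centered predictor keeps the same slope in a regression. What changes is the meaning of the intercept, which then refers to a house at the average value of that predictor.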
It is not appropriate to calculate the mean-centered version of the variable waterfront. Briefly explain why it isn’t.
Fit a regression model with logprice as the response variable, and the mean-centered variables from the previous exercise along with waterfront and floorsCat as the predictor variables. Display the model output.
What is the baseline level for the variable floorsCat?
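If you are unsure how R chooses a baseline, the sketch below may help. By default, lm uses the first level of a factor as the reference level, and as.factor sorts the levels. The vector floors here is made up for illustration.

```r
# Made-up values for illustration
floors <- c(2, 1, 3, 1, 2)
floorsCat <- as.factor(floors)

# Levels are sorted; the first one is the reference (baseline) level in lm()
levels(floorsCat)
baseline <- levels(floorsCat)[1]
baseline
```

In model output, the baseline is the level that does not get its own coefficient.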
Interpret the intercept of the model in the context of the data. Write the interpretation in terms of the price.
What is the intercept of the model for the subset of houses with 3 floors that are not on the waterfront? Write the intercept in terms of log(price).
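One way to check an answer to a question like this is to add up the relevant coefficients and compare against predict. The sketch below does so on a made-up data frame (toy, with hypothetical values and a single centered predictor), so the numbers themselves carry no meaning. It also shows how exp moves an intercept from the log(price) scale back to the price scale.

```r
# Toy data with made-up values (NOT the King County data)
toy <- data.frame(
  logprice   = c(12.1, 12.4, 12.9, 12.6, 13.0, 13.4, 12.8, 13.1, 13.9, 12.3, 13.2, 13.6),
  sqftCent   = c(-800, -300, 200, -500, 100, 600, -200, 300, 900, -600, 0, 300),
  waterfront = c(0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0),
  floorsCat  = factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 2, 3))
)

toy_model <- lm(logprice ~ sqftCent + waterfront + floorsCat, data = toy)
coefs <- coef(toy_model)

# Intercept for 3-floor, non-waterfront houses:
# overall intercept plus the shift for floorsCat = 3 (waterfront = 0 adds nothing)
intercept_3floors <- unname(coefs["(Intercept)"] + coefs["floorsCat3"])

# The same value from predict(), with the centered predictor set to 0
check <- unname(predict(
  toy_model,
  newdata = data.frame(sqftCent = 0, waterfront = 0,
                       floorsCat = factor(3, levels = c(1, 2, 3)))
))

# Back-transform the baseline intercept from log(price) to the price scale
baseline_price <- unname(exp(coefs["(Intercept)"]))
```

The two values intercept_3floors and check should agree, which confirms that the subgroup intercept is just a sum of coefficients.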
We would like to consider potential interactions for the model. A significant interaction occurs when the relationship of a predictor variable with the response depends on the value of another predictor variable.
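In R's formula syntax, an interaction between two predictors can be requested with `*`, which includes both main effects and their product. The sketch below fits such a model to a made-up data frame (toy, with hypothetical values) to show how the slope for one predictor changes with the value of the other.

```r
# Toy data with made-up values (NOT the King County data)
toy <- data.frame(
  logprice   = c(12.2, 12.5, 12.7, 13.0, 13.1, 13.6, 14.2, 14.9),
  bedrooms   = c(1, 2, 3, 4, 1, 2, 3, 4),
  waterfront = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# bedrooms * waterfront expands to bedrooms + waterfront + bedrooms:waterfront
interaction_model <- lm(logprice ~ bedrooms * waterfront, data = toy)
coefs <- coef(interaction_model)

# Slope of bedrooms when waterfront = 0
slope_wf0 <- unname(coefs["bedrooms"])

# Slope of bedrooms when waterfront = 1: main effect plus the interaction term
slope_wf1 <- unname(coefs["bedrooms"] + coefs["bedrooms:waterfront"])
```

If the interaction coefficient is close to zero, the two slopes are nearly the same and the lines in a plot by group look roughly parallel.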
Fill in the code below to plot the relationship between logprice and bedrooms by waterfront. Based on this plot, do you think there is a significant interaction effect between bedrooms and waterfront? In other words, do you think the relationship between the logprice and the number of bedrooms differs based on whether or not a house is on the waterfront? Briefly explain.
ggplot(data = houses, mapping = aes(x = _____, y = _____, color = as.factor(waterfront))) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "__________________",
       x = "Number of bedrooms",
       y = "Log Price",
       color = "Waterfront")
We will talk more about interaction effects in Monday’s lecture. In HW 03, you will explore potential interaction effects using this housing data.
You’re done! Commit all remaining changes, use the commit message “Done with Lab 4!”, and push. Before you wrap up the assignment, make sure the .Rmd and .md documents are updated in your GitHub repo. There is a 10% penalty if the .Rmd file has to be knitted to display graphs, i.e. the graphs are not showing in the .md file on GitHub.
The data used in this lab was obtained from https://github.com/proback/BYSH.