The goal of this lab is to use multiple linear regression to understand the variation in the selling price of houses in King County, Washington. You will also gain practice using special predictors, such as categorical predictors and interaction effects, in the model, and you will be introduced to variable transformations.

Getting Started

Go to the STA210-Sp19 organization on GitHub (https://github.com/STA210-Sp19). Click on the repo with the prefix lab-04-. It contains the starter documents you need to complete the warmup exercise.
Clone the repo and create a new project in RStudio Cloud.
Configure git by typing the following in the console.

library(usethis)
use_git_config(user.name="your name", user.email="your email")

Password caching

If you would like your git password cached for a week for this project, type the following in the Terminal:

git config --global credential.helper 'cache --timeout 604800'

You will need to enter your GitHub username and password one more time after caching the password. After that you won’t need to enter your credentials for 604800 seconds = 7 days.

Packages

We will use the following packages in today’s lab.

library(tidyverse)
library(knitr)
library(broom)

Project name:

Currently your project is called Untitled Project. Update the name of your project to be “Lab 04 - Multiple linear regression”.

Warm up

Before we introduce the data, let’s warm up with a simple exercise.

YAML:

Pick one team member to update the author and date fields at the top of the R Markdown file. Knit, commit, and push all the updated documents to Github.
Now, the remaining team members who have not been concurrently making these changes on their projects should click on the Pull button in their Git pane and observe that the changes are now reflected on their projects as well.

Data

The for today’s lab contains the price and other characteristics of over 20,000 houses sold in King County, Washington (the county that includes Seattle). The dataset includes the following variables:

price: selling price of the house
date: date house was sold, measured in days since January 1, 2014
bedrooms: number of bedrooms
bathrooms: number of bathrooms
sqft: interior square footage
floors: number of floors
waterfront: 1 if the house has a view of the waterfront, 0 otherwise
yr_built: year the house was built
yr_renovated: 0 if the house was never renovated, the year the house was renovated if else

houses <- read_csv("data/KingCountyHouses.csv")

Exercises

Use data visualization and summary statistics to examine the distribution of bedrooms. What is the maximum value? Does this value make sense? If not, what is this an indication of, i.e. how did this value get recorded in the data? Briefly explain.

⊕See the documentation for more information about the summarise function.

We want to remove observations that have extreme values for bedrooms, i.e. those with values for bedrooms above the 95^th percentile in the data. What is the 95^th percentile for bedrooms? Use the summarise function to help you calculate this value.
Fill in the code below to filter the data so that the extreme observations are removed. How many observations are in the updated dataset?

houses <- houses %>% filter(bedrooms <= ____)

We will use this dataset for the remainder of the analysis.

Fit a regression model using square feet to explain variation in the price. Plot the residuals versus the predicted values. Based on this plot, what regression assumption appears to be violated? Briefly explain.

Plot the histogram and Normal QQ-plot of the residuals. Based on these plots, what regression assumption appears to be violated? Briefly explain.

One way to deal with violations in regression assumptions is to transform the response variable and use that transformed variable when fitting the regression model. (We will talk about this in class next week). Some common transformations used in regression are the natural log (log(y)), the square root (sqrt(y)), and the reciprocal (1/y).

Each transformation is applied to the response variable price, and the distributions of the transformed data are shown below.

Which transformation should we use to fix the violations of the model assumptions observed in the previous exercise? Briefly explain your choice.

Add the variable logprice, the log-transformed version of price, to the data frame. Fit a regression model with logprice as the response and sqft as the predictor variable. Create the residuals plots (residuals vs. predicted, histogram of residuals, Normal QQ-plot). Briefly comment on whether or not using the transformed variable improved on the model assumptions.
Though we can explain about 48% of the variation in a house prices by the square footage, we would like to incorporate some of the other available house characteristics in the model.

Before fitting the model, use the code below to add the variablefloorsCat that is the categorical version of the variable floors. Discuss with your group why it may make sense to treat floors as categorical, even though it represents a count.

houses <- houses %>%
  mutate(floorsCat = as.factor(floors))

⊕See the documentation for more information about the count function.

Use the count function to see the number of observations at each level of floorsCat. What is the most common number of floors?

Use the code below to calculate the mean-centered versions of the variables sqft, bedrooms, and bathrooms and add them to the data frame.

houses <- houses %>% 
  mutate(sqftCent = sqft - mean(sqft), 
         bedroomsCent = bedrooms - mean(bedrooms),
         bathroomsCent = bathrooms-mean(bathrooms))

It is not appropriate to calculate the mean-centered version of the variable waterfront. Briefly explain why it isn’t.

Fit a regression model with logprice as the response variable, and the mean-centered variables from the previous exercise along with waterfront and floorsCat as the predictor variables. Display the model output.
What is the baseline level for the variable floorsCat?
Interpret the intercept of the model in the context of the data. Write the interpretation in terms of the price.
What is the intercept of the model for the subset of houses with 3 floors that are not on the waterfront? Write the intercept in terms of the log(price).
We would like to consider potential interactions for the model. A significant interaction occurs when the relationship of a predictor variable with the response depends on the value of another predictor variable.

Fill in the code below to plot the relationship between logprice and bedrooms by waterfront. Based on this plot, do you think there is a significant interaction effect between bedrooms and waterfront? In other words, do you think the relationship between the logprice and the number of bedrooms differs based on whether or not a house is on the waterfront? Briefly explain.

ggplot(data=houses,mapping=aes(x=_____,y=_____,color=as.factor(waterfront))) +
  geom_smooth(method="lm", se=FALSE) +
  labs(title="__________________", 
       x="Number of bedrooms", 
       y="Log Price",
       color="Waterfront")

We will talk more about interaction effects in Monday’s lecture. In HW 03, you explore potential interaction effects using this housing data.

You’re done! Commit all remaining changes, use the commit message “Done with Lab 4!”, and push. Before you wrap up the assignment, make sure the .Rmd and .md documents are updated in your GitHub repo. There is a 10% penalty if the .Rmd file has to be knitted to display graphs, i.e. the graphs are not showing in the .md file on GitHub.

Acknowledgement

The data used in this lab was obtained from https://github.com/proback/BYSH.

Lab 04: Multiple Linear Regression

due Wed, Feb 13 at 11:59p