Getting Started

Go to the STA210-Sp19 organization on GitHub (https://github.com/STA210-Sp19). Click on the repo with the prefix hw-03. It contains the starter documents you need to complete the assignment.
Clone the repo in RStudio Cloud.
Configure git using the use_git_config() function from the usethis package. You can also cache your password so that you do not have to re-enter it on every push.
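The configuration step might look like the sketch below. The name and email are placeholders, and usethis is not in this assignment's package list, so it may need to be installed first. The credential-caching command in the comment is one common approach, not necessarily the one intended for this course.

```r
# install.packages("usethis")  # only needed if usethis is not yet installed
library(usethis)

# Placeholder values: replace with your own name and the email tied to your GitHub account
use_git_config(user.name = "Your Name", user.email = "your.email@example.com")

# One common way to cache a password for an hour (an assumption, not a course requirement).
# Run this in the RStudio Terminal, not the R console:
#   git config --global credential.helper "cache --timeout=3600"
```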

Packages

We will use the following packages in this assignment:

library(tidyverse)
library(broom)
library(knitr)

Questions

Part 1: Computations & Concepts

The Computations & Concepts section of the homework contains short-answer questions about the concepts discussed in class. Some of these questions may also require short chunks of code to produce the output needed to answer the question. Answers should be written in complete sentences.

For Questions 1 - 7, we will use data about the Kentucky Derby races from 1896 to 2017. The Kentucky Derby is a 1.25-mile horse race held annually at the Churchill Downs race track in Louisville, Kentucky. This dataset contains information about the 122 derbies held from 1896 to 2017. We will use the following variables from the data:

year: year of race recorded as number of years since 1896
- e.g.: 2017 is recorded as 2017 - 1896 = 121
condition: condition of the track, recorded as fast, good, or slow
- “good” includes the official designations “good” and “dusty”
- “slow” includes the official designations “slow”, “heavy”, “muddy”, and “sloppy”
starters: number of horses who raced
speed: average speed of the winner (in feet per second)

Below is a regression model that uses year, condition, starters, and the interaction year * condition to explain variation in speed. The model is shown in mathematical notation as well as in the output from R. Use the model to answer Questions 1 - 7.

\[\hat{\text{speed}} = \beta_0 + \beta_1 \text{year} + \beta_2 \text{starters} + \beta_3 \text{good} + \beta_4 \text{slow} + \beta_5 (\text{year} \times \text{good}) + \beta_6 (\text{year} \times \text{slow})\]

term	estimate	std.error	statistic	p.value
(Intercept)	52.3865418	0.1996821	262.349680	0.0000000
year	0.0195533	0.0025811	7.575536	0.0000000
starters	-0.0030462	0.0161077	-0.189113	0.8503376
conditiongood	-1.0695406	0.4233001	-2.526672	0.0128741
conditionslow	-2.1827826	0.2695654	-8.097414	0.0000000
year:conditiongood	0.0121714	0.0076171	1.597909	0.1128074
year:conditionslow	0.0119453	0.0041678	2.866092	0.0049431
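Output in this form can be produced with lm(), tidy(), and kable(). The sketch below is illustrative only: the Derby data file is not named in this assignment, so it fits the same model formula to simulated data (derby_sim and every number in it are made up). The estimates it prints will not match the table above.

```r
library(broom)
library(knitr)

# Simulated stand-in for the Derby data (hypothetical values, for illustration only)
set.seed(210)
n <- 122
derby_sim <- data.frame(
  year = 0:(n - 1),
  starters = sample(5:20, n, replace = TRUE),
  condition = factor(sample(c("fast", "good", "slow"), n, replace = TRUE),
                     levels = c("fast", "good", "slow"))
)
derby_sim$speed <- 52 + 0.02 * derby_sim$year -
  1.5 * (derby_sim$condition == "slow") + rnorm(n, sd = 0.5)

# Same structure as the model above: main effects plus the year-by-condition interaction
speed_model <- lm(speed ~ year + starters + condition + year:condition,
                  data = derby_sim)

# conf.int = TRUE adds conf.low and conf.high columns (95% intervals by default)
model_output <- tidy(speed_model, conf.int = TRUE)
kable(model_output, digits = 3)
```

Because fast is the first factor level, it serves as the baseline, which is why the output has terms for good and slow only.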

1. Write the model describing the relationship between speed, year, starters and condition for fast track conditions.
2. Write the model describing the relationship between speed, year, starters and condition for slow track conditions.
3. What subset is described by the intercept? In other words, 52.387 feet per second is the average speed for the winner in what race(s)? Is the intercept meaningful? Explain.
4. What is the p-value for the test that the mean winning speed is the same for fast track conditions and good track conditions (holding year and starters constant)?
5. What is the p-value for the test that the slope of year is the same for fast track conditions and good track conditions (holding year and starters constant)?
6. What is the 95% confidence interval for the amount by which the slope of year for slow track conditions exceeds the slope of year for fast track conditions?
7. Let’s conduct the nested F test to test whether the slope of year is the same for all three condition types. In other words, we wish to test the hypotheses:

\[ \begin{aligned} &H_0: \beta_5 = \beta_6 = 0 \\ &H_a: \text{ at least one }\beta_j \text{ is not 0}\end{aligned}\]

The residual sum of squares for the model shown above is 50.076 with 115 degrees of freedom. The residual sum of squares for the model that does not include the interaction terms between year and condition is 54.307 with 117 degrees of freedom. Calculate the F test statistic and p-value for this test. What is your conclusion?
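As a reminder of the mechanics, the nested F statistic divides the drop in residual sum of squares per added parameter by the residual mean square of the full model. The sketch below wraps that formula in a helper; the function name nested_f_test and the numbers passed to it are hypothetical and are not the values from this question.

```r
# Nested F test from residual sums of squares and residual degrees of freedom.
# nested_f_test is a hypothetical helper name, not a function from any package.
nested_f_test <- function(rss_reduced, df_reduced, rss_full, df_full) {
  df_num <- df_reduced - df_full  # number of parameters being tested
  f_stat <- ((rss_reduced - rss_full) / df_num) / (rss_full / df_full)
  # Upper-tail probability from the F distribution with (df_num, df_full) df
  p_value <- pf(f_stat, df_num, df_full, lower.tail = FALSE)
  list(f_stat = f_stat, df_num = df_num, df_den = df_full, p_value = p_value)
}

# Made-up numbers, chosen only to show the call
toy_result <- nested_f_test(rss_reduced = 120, df_reduced = 47,
                            rss_full = 100, df_full = 45)
toy_result$f_stat   # (20 / 2) / (100 / 45) = 4.5
toy_result$p_value
```

When both fitted lm objects are available, anova(reduced_model, full_model) carries out the same comparison directly.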

Part 2: Data Analysis

The Data Analysis section of the homework contains open-ended data analysis questions. Your response should be neatly organized and read as a complete narrative. In addition to addressing the question, include exploratory data analysis and an analysis of the model assumptions. In short, treat these questions as “mini-projects”.

For this portion, you will use the housing data you started analyzing in Lab 04. Use the code below to load and prepare the data.

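# Load the King County housing data
houses <- read_csv("data/KingCountyHouses.csv")

# Keep houses with at most 5 bedrooms, then create the model variables
houses <- houses %>%
  filter(bedrooms <= 5) %>%
  mutate(floorsCat = as.factor(floors),                  # floors as a categorical variable
         sqftCent = sqft - mean(sqft),                   # mean-centered predictors
         bedroomsCent = bedrooms - mean(bedrooms),
         bathroomsCent = bathrooms - mean(bathrooms),
         logprice = log(price))                          # natural log of price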

Fit a regression model with logprice as the response and floorsCat, sqftCent, bedroomsCent, bathroomsCent, and waterfront as predictor variables. In your analysis, include the following:

Briefly explain why we should use the log-transformed version of price instead of the original version of the variable.
Describe the relationship between a house’s price and square footage (holding all else constant), including the appropriate confidence intervals.
Describe how the expected price differs based on the number of floors in the house (holding all else constant). Include discussion about whether or not the differences are statistically significant.
Consider the interaction between waterfront and bedrooms. Is this interaction significant? If so, describe how the relationship between price and bedrooms changes based on whether the house has a waterfront view.

As usual, be sure to include exploratory data analysis and an analysis of the model assumptions.

Extra Credit (5 pts)

To best satisfy the modeling assumptions, we should log-transform both the price and the square footage. Build a simple linear regression model of logprice versus logsqft, the log-transformed version of sqft. Display the model output.

Interpret the intercept in the context of the data. Does the intercept have a meaningful interpretation? Briefly explain.
Based on this model, what is the expected change in price when the square footage is multiplied by a factor of 1.1?
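A general fact about log-log models may help with the last question. Writing the fitted model with generic coefficients \(\hat{\beta}_0\) and \(\hat{\beta}_1\) (symbols introduced here for illustration, not taken from your output), multiplying sqft by a constant \(c\) gives

\[ \begin{aligned} \log(\widehat{\text{price}}_{\text{new}}) &= \hat{\beta}_0 + \hat{\beta}_1 \log(c \times \text{sqft}) \\ &= \hat{\beta}_0 + \hat{\beta}_1 \log(\text{sqft}) + \hat{\beta}_1 \log(c) \\ &= \log(\widehat{\text{price}}_{\text{old}}) + \hat{\beta}_1 \log(c) \end{aligned}\]

so that, after exponentiating both sides,

\[ \frac{\widehat{\text{price}}_{\text{new}}}{\widehat{\text{price}}_{\text{old}}} = c^{\hat{\beta}_1} \]

The change is therefore multiplicative, not additive: the expected price is multiplied by a factor that depends on the estimated slope.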

Grading

Component	Points
Questions 1 - 7	30
Question 8	30
Documents complete and neatly organized (Markdown and knitted documents)	5
Answers written in complete sentences	3
Regular and informative commit messages	2
Total	70

Acknowledgement

The data used in this assignment was obtained from https://github.com/proback/BYSH.

HW 03: Multiple Linear Regression

due Mon, Feb 18 at 11:59p