use_git_config()
function. You can also cache your password.

We will use the following packages in this assignment:
library(tidyverse)
library(broom)
library(knitr)
The Computations & Concepts section of homework contains short answer questions about the concepts discussed in class. Some of these questions may also require short chunks of code to produce the output needed to answer the question. Answers should be written in complete sentences.
For Questions 1 - 7, we will use data about the Kentucky Derby races from 1896 to 2017. The Kentucky Derby is a 1.25-mile horse race held annually at the Churchill Downs race track in Louisville, Kentucky. This dataset contains information about the 122 derbies held from 1896 to 2017. We will use the following variables from the data:
year: year of the race, recorded as the number of years since 1896 (for example, 2017 is recorded as 2017 - 1896 = 121)
condition: condition of the track - fast, good, or slow
starters: number of horses who raced
speed: average speed of the winner (in feet per second)

Below is a regression model using year, condition, starters, and year * condition to explain variation in speed. The model is shown in mathematical notation as well as in the output from R. Use the model to answer Questions 1 - 7.
\[\hat{\text{speed}} = \beta_0 + \beta_1 \text{year} + \beta_2 \text{starters} + \beta_3 \text{good} + \beta_4 \text{slow} + \beta_5 (\text{year} \times \text{good}) + \beta_6 (\text{year} \times \text{slow})\]
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 52.3865418 | 0.1996821 | 262.349680 | 0.0000000 |
year | 0.0195533 | 0.0025811 | 7.575536 | 0.0000000 |
starters | -0.0030462 | 0.0161077 | -0.189113 | 0.8503376 |
conditiongood | -1.0695406 | 0.4233001 | -2.526672 | 0.0128741 |
conditionslow | -2.1827826 | 0.2695654 | -8.097414 | 0.0000000 |
year:conditiongood | 0.0121714 | 0.0076171 | 1.597909 | 0.1128074 |
year:conditionslow | 0.0119453 | 0.0041678 | 2.866092 | 0.0049431 |
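A table of this form can be produced by fitting the model with lm(). The sketch below is only an illustration: derby_sim is a hypothetical, simulated data frame (not the Kentucky Derby data), and the coefficients used to simulate it are made up. It uses base R to print the coefficient table. In the assignment, the same fit would presumably be passed through tidy() from broom and kable() from knitr.

```r
# Simulated stand-in for the derby data (hypothetical values, not the real dataset)
set.seed(1)
n <- 122
derby_sim <- data.frame(
  year = 0:(n - 1),                                  # years since 1896
  starters = sample(5:20, n, replace = TRUE),
  condition = factor(sample(c("fast", "good", "slow"), n, replace = TRUE),
                     levels = c("fast", "good", "slow"))  # fast is the baseline
)
derby_sim$speed <- 52 + 0.02 * derby_sim$year -
  1 * (derby_sim$condition == "good") -
  2 * (derby_sim$condition == "slow") +
  rnorm(n, sd = 0.6)

# Main effects plus the year-by-condition interaction
fit <- lm(speed ~ year + starters + condition + year:condition, data = derby_sim)

# Columns: estimate, standard error, t statistic, p-value
coef_table <- summary(fit)$coefficients
print(round(coef_table, 4))
```

Because fast is the first factor level, it serves as the baseline, so the conditiongood and conditionslow rows are differences from fast track conditions.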
Write the model describing the relationship between speed, year, starters, and condition for fast track conditions.
Write the model describing the relationship between speed, year, starters, and condition for slow track conditions.
What subset is described by the intercept? In other words, 52.387 feet per second is the average speed for the winner in what race(s)? Is the intercept meaningful? Explain.
What is the p-value for the test that the mean winning speed is the same for fast track conditions and good track conditions (holding year and starters constant)?
What is the p-value for the test that the slope of year is the same for fast track conditions and good track conditions (holding year and starters constant)?
What is the 95% confidence interval for the amount by which the slope of year for slow track conditions exceeds the slope of year for fast track conditions?
Let’s conduct the nested F test to test whether the slope of year is the same for all three condition types. In other words, we wish to test the hypotheses
\[ \begin{aligned} &H_0: \beta_5 = \beta_6 = 0 \\ &H_a: \text{ at least one }\beta_j \text{ is not 0}\end{aligned}\]
The residual sum of squares for the model shown above is 50.076 with 115 degrees of freedom. The residual sum of squares for the model that does not include the interaction terms between year and condition is 54.307 with 117 degrees of freedom. Calculate the F test statistic and p-value for this test. What is your conclusion?
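The arithmetic behind the confidence interval and the nested F test can be sketched in base R. This is a minimal sketch rather than a model answer: it assumes the usual t-based interval (estimate plus or minus a critical value times the standard error) with the 115 residual degrees of freedom stated above, and the standard nested F statistic. All variable names are illustrative.

```r
# 95% confidence interval for the year:conditionslow coefficient
estimate <- 0.0119453
std_error <- 0.0041678
df_full <- 115                                # residual df of the full model
t_crit <- qt(0.975, df = df_full)             # two-sided 95% critical value
ci <- estimate + c(-1, 1) * t_crit * std_error
print(ci)

# Nested F test: reduced model (no interactions) vs. full model
rss_full <- 50.076
rss_reduced <- 54.307
df_reduced <- 117
df_diff <- df_reduced - df_full               # number of coefficients tested

f_stat <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)
p_value <- pf(f_stat, df1 = df_diff, df2 = df_full, lower.tail = FALSE)
print(c(F = f_stat, p = p_value))
```

When both fitted models are available, anova(reduced_model, full_model) reports the same F statistic and p-value directly.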
The Data Analysis section of homework contains open-ended data analysis questions. Your response should be neatly organized and read as a complete narrative. This means that in addition to addressing the question there should also be exploratory data analysis and an analysis of the model assumptions. In short, these questions should be treated as “mini-projects”.
This section uses the housing data you started analyzing in Lab 04. Use the code below to load and prepare the data.

houses <- read_csv("data/KingCountyHouses.csv")
houses <- houses %>%
  filter(bedrooms <= 5) %>%
  mutate(floorsCat = as.factor(floors),
         sqftCent = sqft - mean(sqft),
         bedroomsCent = bedrooms - mean(bedrooms),
         bathroomsCent = bathrooms - mean(bathrooms),
         logprice = log(price))
Fit a regression model with logprice as the response and floorsCat, sqftCent, bedroomsCent, bathroomsCent, and waterfront as predictor variables. In your analysis, include the following:
Why the model uses the log-transformed price instead of the original version of the variable.
The interaction between waterfront and bedrooms. Is this interaction significant? If so, describe how the relationship between price and bedrooms changes based on whether the house has a waterfront view.

As usual, be sure to include exploratory data analysis and an analysis of the model assumptions.
To best satisfy the modeling assumptions, we should log-transform both the price and the square footage. Build a simple linear regression model of logprice versus logsqft, the log-transformed version of sqft. Display the model output.
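The log-log model can be sketched as follows. To stay self-contained, the example uses houses_sim, a hypothetical simulated data frame with made-up values. In the assignment, the same steps would be applied to the houses data frame loaded above.

```r
# Simulated stand-in for the housing data (hypothetical values, not the real dataset)
set.seed(42)
n <- 200
houses_sim <- data.frame(sqft = round(runif(n, 500, 5000)))
houses_sim$price <- exp(7 + 0.8 * log(houses_sim$sqft) + rnorm(n, sd = 0.3))

# Log-transform both the response and the predictor
houses_sim$logprice <- log(houses_sim$price)
houses_sim$logsqft <- log(houses_sim$sqft)

# Simple linear regression on the log-log scale
loglog_fit <- lm(logprice ~ logsqft, data = houses_sim)
print(summary(loglog_fit)$coefficients)
```

On the log-log scale, the slope is read as the approximate percentage change in price associated with a 1% increase in square footage.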
Total | 70 |
---|---|
Questions 1 - 7 | 30 |
Question 8 | 30 |
Documents complete and neatly organized (Markdown and knitted documents) | 5 |
Answers written in complete sentences | 3 |
Regular and informative commit messages | 2 |
The data used in this assignment was obtained from https://github.com/proback/BYSH.