You must turn in a knitted file to Gradescope from a Quarto Markdown
file in order to receive credit. Be sure to “associate”
questions appropriately on Gradescope. As a reminder, late work
is not accepted outside of the 24-hour grace period for homework
assignments.
The Quarto template for this assignment may be found in the
repository at the following link: https://classroom.github.com/a/1VA3G83W
We will use a subset of the diamonds dataset, which is
available in your directory. The variables you will be using
are described below:
- price: price in dollars
- carat: the carat weight of the diamond
- cut: the quality of the cut (in increasing order of
desirability, the categories are Fair, Good, Very Good, Premium,
Ideal)
- color: the color of the diamond, graded in letters from
J through D (letters earlier in the alphabet indicate better color, so
we would prefer a D diamond to a J diamond)
- clarity: the clarity of the diamond (in increasing
order of desirability, the categories are I1, SI2, SI1, VS2, VS1, VVS2,
VVS1, IF)
Important: Please continue to make regular commits
and follow good coding practices (e.g., do not let code run off the
page). Also, suppress warnings and messages in your R code
chunks.
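One way to suppress warnings and messages is with Quarto chunk options, sketched below. These lines are the contents of a single R chunk; the `library(tidyverse)` call is only a hypothetical example of code that would otherwise print startup messages, and your template may already load different packages.

```r
#| warning: false
#| message: false

# Hypothetical chunk body: loading a package that normally prints
# startup messages. With the options above, they are hidden in the
# rendered file.
library(tidyverse)
```

The same two options can instead be set once for the whole document under `execute:` in the YAML header, if you prefer not to repeat them in every chunk.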
Note: “log(x)” refers to natural log (base \(e\)). “log2(x)” refers to log base 2. Be
careful in the following exercises!
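As a reminder of how the two logarithms relate, the following standard identities hold for any \(x > 0\) (they are general facts, not specific to this dataset):

\[
\log_2(x) = \frac{\ln(x)}{\ln(2)}, \qquad \log_2(2x) = \log_2(x) + 1
\]

In other words, a one-unit increase in log2(x) corresponds to doubling x, whereas a one-unit increase in log(x) corresponds to multiplying x by \(e\).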
- Create a linear model with log(price) (natural log) as the response
variable and log2(carat) (base 2), color, and clarity as predictors.
Describe the relationship between carat and price (while holding color
and clarity constant) as estimated by this model. Make sure your
explanation is on the original scales for carat and price (i.e.,
un-transformed).
- Evaluate whether the linear model assumptions for your model are
satisfied. You may assume independence is reasonable.
- Fit another linear model with log(price) (natural log) as the
response variable and log2(carat) (base 2) as the only
predictor. Provide and compare the RMSEs of these two models. Which
model seems to do better in terms of RMSE? In what units is the RMSE
of each model expressed?
- Compare these two models: is there evidence that additionally
including color and clarity in the model is “helpful” somehow? Carry out
a formal test of the following hypotheses at the \(\alpha = 0.001\) significance level;
if you cannot carry out this test, explain why:
- \(H_0\): All of the slopes
corresponding to the color and clarity terms are zero (while adjusting
for log2(carat))
- \(H_1\): At least one of the slopes
corresponding to the color or clarity terms is non-zero (while
adjusting for log2(carat))
- Consider the model with log(price) as the response variable and only
log2(carat) as the predictor. Suppose you wanted to compare it to a
model with log(price) as the response and only color/clarity as the
predictors. Describe how you might carry out a formal hypothesis test to
compare these two models. If you cannot carry out this test,
explain why.
- Consider the model on page 31 of the slides from Feb. 14. What do
you predict happens to price given a one-unit change in carat size?