Main Effects Model
- Why should we use log_price instead of price as the response variable? In other words, what is an example of an earlier analysis that could have helped us determine whether to use log_price or price?
To determine if a transformation on the response variable is needed, we can examine the following:
- The distribution of the response variable to see if there is extreme skewness
- The plot of the residuals vs. predicted to check for non-constant variance
- The histogram and QQ-plot of the residuals to see if there is extreme skewness in the residuals
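As a minimal sketch of these checks, the R code below uses simulated data rather than diamonds_samp (the names carat, price, and the helper skewness are hypothetical). The price is generated on the log scale, so the example compares the skewness of the response and of the residuals with and without the log transformation. In practice the same comparison would usually be made with the plots listed above.

```r
# Simulated stand-in data (an assumption for illustration, not diamonds_samp)
set.seed(1)
n <- 300
carat <- runif(n, min = 0.2, max = 2)
price <- exp(6 + 1.8 * carat + rnorm(n, sd = 0.4))  # multiplicative errors

# Hypothetical helper: sample skewness (near 0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

fit_raw <- lm(price ~ carat)       # untransformed response
fit_log <- lm(log(price) ~ carat)  # log-transformed response

skew_response <- c(price = skewness(price), log_price = skewness(log(price)))
skew_resid <- c(price = skewness(resid(fit_raw)),
                log_price = skewness(resid(fit_log)))
print(round(rbind(response = skew_response, residuals = skew_resid), 2))

# Plot-based versions of the same checks (uncomment to draw):
# hist(price); hist(log(price))
# plot(fitted(fit_log), resid(fit_log))
# qqnorm(resid(fit_log)); qqline(resid(fit_log))
```

A strongly positive skewness for price together with a value near zero for log(price) is the kind of evidence that favors modeling the log.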
Below is the model with log_price as the response and caratCent, color, and clarity as the predictor variables.
model_orig <- lm(log_price ~ caratCent + color + clarity, data=diamonds_samp)
kable(tidy(model_orig),format="markdown")
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 6.8107909 | 0.1124919 | 60.544718 | 0.0000000 |
caratCent | 3.1800122 | 0.0379705 | 83.749467 | 0.0000000 |
colorE | -0.0631703 | 0.0325252 | -1.942192 | 0.0530990 |
colorF | -0.1414348 | 0.0306200 | -4.619027 | 0.0000058 |
colorG | -0.1994737 | 0.0316903 | -6.294479 | 0.0000000 |
colorH | -0.3297467 | 0.0342762 | -9.620286 | 0.0000000 |
colorI | -0.4220826 | 0.0394798 | -10.691092 | 0.0000000 |
colorJ | -0.4714336 | 0.0503802 | -9.357511 | 0.0000000 |
clarityIF | 1.1870339 | 0.1201021 | 9.883538 | 0.0000000 |
claritySI1 | 0.6534845 | 0.1113625 | 5.868084 | 0.0000000 |
claritySI2 | 0.4652723 | 0.1116914 | 4.165694 | 0.0000412 |
clarityVS1 | 0.8708423 | 0.1130097 | 7.705910 | 0.0000000 |
clarityVS2 | 0.8330494 | 0.1117319 | 7.455791 | 0.0000000 |
clarityVVS1 | 1.1221371 | 0.1143346 | 9.814499 | 0.0000000 |
clarityVVS2 | 0.9634392 | 0.1141949 | 8.436797 | 0.0000000 |
- What is the baseline level of color? What is the baseline level of clarity?
The baseline level of color is D. The baseline level of clarity is I1.
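A small sketch of why these are the baselines, assuming color and clarity are stored as unordered factors with R's default alphabetical level order (the vector color below is a hypothetical toy example): lm() treats the first factor level as the reference, so its indicator is the one missing from the coefficient table.

```r
# Toy vector (hypothetical); factor() sorts the levels alphabetically by default
color <- factor(c("G", "E", "D", "J", "F", "D"))
levels(color)                 # the first level is the baseline used by lm()
baseline <- levels(color)[1]

# relevel() picks a different baseline without changing the data
color_g <- relevel(color, ref = "G")
levels(color_g)[1]
```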
- Interpret the intercept in terms of price.
coef <- model_orig$coefficients
We expect the median price of diamonds with color D, clarity I1, and the mean carat weight (0.6024333) to be approximately exp(6.811) = $907.59.
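The back-transformation can be checked directly from the coefficient table. This is only a sketch with the intercept typed in by hand (b0 and median_price are hypothetical names). Reading the result as a median rather than a mean assumes the errors are roughly symmetric on the log scale.

```r
# Value copied from the coefficient table above
b0 <- 6.8107909           # intercept on the log(price) scale
median_price <- exp(b0)   # back-transform to dollars
round(median_price, 2)
```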
- Describe the difference in the typical prices of diamonds that are color E and diamonds that are color D, holding all else constant.
The difference in terms of log(price) is the coefficient of colorE, -0.0631703. Exponentiating this coefficient gives the multiplicative change in terms of price:
(diff_e_d <- exp(coef["colorE"]))
## colorE
## 0.9387836
Therefore, holding all else constant, diamonds that are color E are expected to have a median price that is 0.939 times (about 6.1% lower than) the median price of diamonds that are color D.
- Describe the difference in the typical prices of diamonds that are color E and diamonds that are color G, holding all else constant.
(diff_e_g_log <- coef["colorE"] - coef["colorG"])
## colorE
## 0.1363035
This is the difference in terms of log(price). Exponentiating it gives the ratio of median prices:
(diff_e_g <- exp(diff_e_g_log))
## colorE
## 1.14603
Therefore, holding all else constant, diamonds that are color E are expected to have a median price that is 1.146 times (about 14.6% higher than) the median price of diamonds that are color G.
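Both color comparisons follow the same pattern: subtract the two coefficients on the log scale (the baseline D has coefficient 0), then exponentiate. The sketch below reproduces the ratios from values typed in from the coefficient table; the names b, ratio_e_d, ratio_e_g, and pct are hypothetical.

```r
# Coefficients copied from the table above (log(price) scale); D is the baseline
b <- c(colorE = -0.0631703, colorG = -0.1994737)

ratio_e_d <- exp(b[["colorE"]])                   # E relative to D
ratio_e_g <- exp(b[["colorE"]] - b[["colorG"]])   # E relative to G

# Percent difference in median price implied by each ratio
pct <- 100 * (c(E_vs_D = ratio_e_d, E_vs_G = ratio_e_g) - 1)
round(pct, 1)
```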
- What is the predicted price of a single diamond that has color E, clarity VS2 and is 0.3 carats? Finish the code below to obtain the predicted value and the corresponding interval.
x0 <- data.frame(color="E", clarity="VS2", carat=0.3)
x0 <- x0 %>% mutate(
caratCent = carat - mean(diamonds_samp$carat),
caratCent_sq = caratCent^2
)
(exp(predict(model_orig,x0,interval="prediction"))) #interval to predict for single observation
## fit lwr upr
## 1 749.1419 550.5672 1019.337
Suppose we wish to find the predicted median price of the subset of all diamonds with color E, clarity VS2, and 0.3 carats. How do you expect the predicted price to change? How do you expect the corresponding interval to change?
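- The predicted price won’t change, but the interval will be narrower, because an interval for the median price of the subset does not include the variability of an individual diamond around that median.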
Write code to find the predicted price and corresponding interval for the median price for the subset of all diamonds with color E, clarity VS2 and 0.3 carats.
(exp(predict(model_orig,x0,interval="confidence"))) #interval to predict typical price for subset
## fit lwr upr
## 1 749.1419 705.9399 794.9878
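The relationship between the two intervals does not depend on the diamonds data. The sketch below uses simulated data (toy, carat_cent, and new_obs are hypothetical names, and the simulation settings are arbitrary) to show that both calls return the same fitted value while the confidence interval is narrower than the prediction interval.

```r
# Simulated stand-in data (an assumption for illustration, not diamonds_samp)
set.seed(42)
n <- 300
carat_cent <- rnorm(n, sd = 0.4)
toy <- data.frame(
  carat_cent = carat_cent,
  log_price  = 7.7 + 3.2 * carat_cent + rnorm(n, sd = 0.15)
)

fit <- lm(log_price ~ carat_cent, data = toy)
new_obs <- data.frame(carat_cent = -0.3)

# Back-transform both intervals from the log scale to the price scale
pi_price <- exp(predict(fit, new_obs, interval = "prediction"))  # single diamond
ci_price <- exp(predict(fit, new_obs, interval = "confidence"))  # typical diamond

rbind(prediction = pi_price[1, ], confidence = ci_price[1, ])
```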
Use the code below to obtain the ANOVA table for this model.
anova(model_orig)
## Analysis of Variance Table
##
## Response: log_price
## Df Sum Sq Mean Sq F value Pr(>F)
## caratCent 1 173.435 173.435 7357.819 < 2.2e-16 ***
## color 6 3.705 0.618 26.197 < 2.2e-16 ***
## clarity 7 10.332 1.476 62.620 < 2.2e-16 ***
## Residuals 285 6.718 0.024
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- What is the estimated regression variance?
The estimated regression variance is the residual mean square, 0.024 (6.718 / 285 ≈ 0.0236).
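As a quick check, the residual mean square can be recomputed from the Residuals row of the ANOVA table. The values below are typed in by hand, and ss_resid, df_resid, mse, and sigma_hat are hypothetical names.

```r
# Values copied from the Residuals row of the ANOVA table above
ss_resid <- 6.718
df_resid <- 285

mse <- ss_resid / df_resid   # estimated regression variance
sigma_hat <- sqrt(mse)       # residual standard error on the log(price) scale
round(c(mse = mse, sigma_hat = sigma_hat), 4)
```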
- Compare \(R^2\) and \(Adj. R^2\). What does this comparison tell you about the predictors in the model?
\(R^2\) is 0.9654056 and Adjusted \(R^2\) is 0.9637063. These values are very close, so the penalty for the number of predictors is small. This indicates that the predictors in the model are important for understanding variation in log(price), and that few, if any, of them contribute little to the fit.
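Both quantities can be recovered, up to rounding, from the sums of squares and degrees of freedom in the ANOVA table. The sketch below types those values in by hand; ss, df, r2, and adj_r2 are hypothetical names.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss <- c(caratCent = 173.435, color = 3.705, clarity = 10.332, Residuals = 6.718)
df <- c(caratCent = 1, color = 6, clarity = 7, Residuals = 285)

ss_total <- sum(ss)
df_total <- sum(df)   # n - 1 = 299

r2 <- 1 - ss[["Residuals"]] / ss_total
# Adjusted R^2 compares mean squares, which penalizes extra predictors
adj_r2 <- 1 - (ss[["Residuals"]] / df[["Residuals"]]) / (ss_total / df_total)

round(c(r2 = r2, adj_r2 = adj_r2), 4)
```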
- Use the code below to calculate the VIF for this model.
vif(model_orig)
## caratCent colorE colorF colorG colorH colorI
## 1.358048 1.747926 2.025269 2.120465 1.836018 1.624679
## colorJ clarityIF claritySI1 claritySI2 clarityVS1 clarityVS2
## 1.240465 6.484585 27.377826 20.977238 21.475329 26.966560
## clarityVVS1 clarityVVS2
## 15.415579 14.937186
- This model has potential problems with multicollinearity. How did we come to this conclusion? Which variables are highly collinear?
We know this model has potential problems with multicollinearity because there are multiple predictors with VIFs close to or above 10. The variables with high collinearity are the indicator variables for clarity.
- Why do you think this multicollinearity is occurring? Hint: Examine the distribution of the variable(s) that have high multicollinearity.
Let’s look at the distribution of clarity.
diamonds_samp %>%
count(clarity)
## # A tibble: 8 x 2
## clarity n
## <fct> <int>
## 1 I1 2
## 2 IF 11
## 3 SI1 67
## 4 SI2 47
## 5 VS1 47
## 6 VS2 65
## 7 VVS1 31
## 8 VVS2 30
The baseline level is I1, and only 2 of the 300 observations have this level of clarity. Because there are so few observations at the baseline level, it is almost as if we have no baseline level for the categorical predictor clarity in the model. Remember, if we have no baseline level for a categorical variable in the model and there is an intercept, then the indicator variables are linear combinations of one another. In this case, the indicator variables aren’t exact linear combinations of one another, but they are highly collinear.
This multicollinearity is reduced when the baseline level is changed to a different level of clarity. Below is the VIF for a model in which the level order of clarity is reversed, which makes VVS2 the baseline.
diamonds_samp %>%
mutate(clarity = fct_rev(clarity)) %>% # reverse the level order of clarity, so VVS2 becomes the baseline
lm(log_price ~ caratCent + color + clarity, data=.) %>%
vif()
## caratCent colorE colorF colorG colorH colorI
## 1.358048 1.747926 2.025269 2.120465 1.836018 1.624679
## colorJ clarityVVS1 clarityVS2 clarityVS1 claritySI2 claritySI1
## 1.240465 1.853290 2.604675 2.222750 2.551777 2.624684
## clarityIF clarityI1
## 1.371711 1.099082
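The effect of a sparse baseline can be reproduced with the clarity counts alone. The sketch below is an illustration under simplifying assumptions: it rebuilds a factor from the counts in the table above, ignores caratCent and color, and computes each indicator's VIF as 1 / (1 - R^2) from regressing it on the other indicators. The helper vif_dummies is hypothetical and is not the vif() function used above, so the values are close to the reported ones but do not match them exactly.

```r
# Clarity counts copied from the count() output above
counts <- c(I1 = 2, IF = 11, SI1 = 67, SI2 = 47, VS1 = 47, VS2 = 65,
            VVS1 = 31, VVS2 = 30)
clarity <- factor(rep(names(counts), times = counts), levels = names(counts))

# Hypothetical helper: VIF of each indicator column, given the other indicators
vif_dummies <- function(f) {
  X <- model.matrix(~ f)[, -1, drop = FALSE]  # drop the intercept column
  out <- sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
    1 / (1 - r2)
  })
  names(out) <- levels(f)[-1]  # first level is the baseline, so it has no column
  out
}

vif_i1   <- vif_dummies(clarity)                                        # baseline I1
vif_vvs2 <- vif_dummies(factor(clarity, levels = rev(levels(clarity)))) # baseline VVS2

round(vif_i1, 2)
round(vif_vvs2, 2)
```

With I1 as the baseline, nearly every observation has exactly one indicator equal to 1, so the indicators almost sum to the intercept column. With a well-populated baseline that near-dependence disappears.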