Load packages:
library(ggplot2)
library(dplyr)
library(forcats)
We will simplify the data slightly:
set.seed(100)
dmdsimple = diamonds %>%
  sample_n(1000) %>%
  mutate(color = fct_collapse(color,
                              colorless = c("D","E","F"),
                              near_colorless = c("G","H","I","J")),
         clarity = fct_collapse(clarity,
                                SI = c("SI1","SI2"),
                                VS = c("VS1","VS2"),
                                VVS = c("VVS1","VVS2")))
ggplot(dmdsimple, aes(x=carat, y=price, color=color)) +
  geom_point(alpha=0.3) +
  geom_smooth(se=FALSE)
Does the relationship between carats and sqrt(price) appear to vary by color?
ggplot(dmdsimple, aes(x=carat, y=sqrt(price), color=color)) +
geom_point(alpha=0.3) +
geom_smooth(method="lm", se=FALSE)
(m = lm(sqrt(price) ~ carat + color, data = dmdsimple))
##
## Call:
## lm(formula = sqrt(price) ~ carat + color, data = dmdsimple)
##
## Coefficients:
## (Intercept) carat color.L
## 9.608 58.000 -3.336
\[\widehat{\sqrt{\text{price}}} = 9.6 + 58 \times \text{carat} - 3.3 \times \text{color}\]
How does changing color change our model?
Plug in 0 for color to get the linear model for colorless diamonds:
\[\widehat{\sqrt{\text{price}}} = 9.6 + 58 \times \text{carat} - 3.3 \times 0 = 9.6 + 58 \times \text{carat} \]
Plug in 1 for color to get the linear model for near colorless diamonds:
\[\widehat{\sqrt{\text{price}}} = 9.6 + 58 \times \text{carat} - 3.3 \times 1 = 6.3 + 58 \times \text{carat} \]
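One point worth checking before relying on the 0/1 substitution: the coefficient in the output is labeled color.L, which is the label R gives to the linear polynomial contrast of an ordered factor, and color in diamonds is an ordered factor. The 0/1 substitution corresponds to treatment (dummy) coding. The sketch below compares the two codings on a small hypothetical stand-in factor, `color_demo`, rather than on the data used here. Converting with factor(color, ordered = FALSE) before fitting is one way to obtain a 0/1 indicator, though the fitted coefficients would then differ from the printed ones.

```r
# Hypothetical two-level ordered factor standing in for the collapsed color variable
color_demo <- factor(c("colorless", "near_colorless"),
                     levels = c("colorless", "near_colorless"),
                     ordered = TRUE)

# Ordered factors use polynomial contrasts by default: a single column named ".L"
poly_coding <- contrasts(color_demo)

# Dropping the ordering switches to treatment coding: a 0/1 indicator column
dummy_coding <- contrasts(factor(color_demo, ordered = FALSE))

print(poly_coding)
print(dummy_coding)
```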
ggplot(dmdsimple, aes(x=carat, y=sqrt(price), color=color)) +
geom_point(alpha=0.3) +
geom_abline(intercept=9.6, slope=58, color="#F8766D", alpha=0.75, size=1.5) +
geom_abline(intercept=6.3, slope=58, color="#00BFC4", alpha=0.75, size=1.5)
Why is our linear regression model different from what we got from geom_smooth(method='lm')?
The way we specified our model only lets color affect the intercept.
The model implicitly assumes that both colors have the same slope and only allows for different intercepts.
What seems more appropriate in this case?
Same slope and same intercept for both colors
Same slope and different intercepts for the two colors
Different slopes and different intercepts for the two colors
Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.
This implies that the regression coefficient for an explanatory variable would change as another explanatory variable changes.
This can be accomplished by adding an interaction variable, which is just the product of the two explanatory variables.
lm(sqrt(price) ~ carat * color , data = dmdsimple)
##
## Call:
## lm(formula = sqrt(price) ~ carat * color, data = dmdsimple)
##
## Coefficients:
## (Intercept) carat color.L carat:color.L
## 8.684 59.931 4.066 -9.584
\[\widehat{\sqrt{\text{price}}} = 8.68 + 59.9 \times \text{carat} + 4.07 \times \text{color} - 9.58 \times \text{carat} \times \text{color}\]
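In an R formula, `carat * color` is shorthand for `carat + color + carat:color`, which is why the output lists two main effects and one product term. A minimal sketch that inspects the expansion without fitting anything (the names `f` and `expanded` are just illustrative):

```r
# The formula operator `*` expands to both main effects plus their interaction
f <- sqrt(price) ~ carat * color

# terms() parses the formula; no data are needed to list the term labels
expanded <- attr(terms(f), "term.labels")
print(expanded)
```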
How does changing color change our model?
Plug in 0 for color to get the linear model for colorless diamonds:
\[ \begin{aligned} \widehat{\sqrt{\text{price}}} &= 8.68 + 59.9 \times \text{carat} + 4.07 \times 0 - 9.58 \times \text{carat} \times 0\\ &= 8.68 + 59.9 \times \text{carat} \end{aligned} \]
Plug in 1 for color to get the linear model for near colorless diamonds:
\[ \begin{aligned} \widehat{\sqrt{\text{price}}} &= 8.68 + 59.9 \times \text{carat} + 4.07 \times 1 - 9.58 \times \text{carat} \times 1\\ &= 12.75 + 50.32 \times \text{carat} \end{aligned} \]
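The same substitution can be written as a small calculation. This is a sketch that hard-codes the rounded coefficients from the equation above and assumes the 0/1 coding of color used in these slides; `line_for` is a hypothetical helper name.

```r
# Rounded coefficients from the interaction model equation
b0 <- 8.68; b_carat <- 59.9; b_color <- 4.07; b_inter <- -9.58

# Intercept and slope of the carat line for a given 0/1 color indicator
line_for <- function(color) {
  c(intercept = b0 + b_color * color,
    slope     = b_carat + b_inter * color)
}

colorless      <- line_for(0)
near_colorless <- line_for(1)
print(rbind(colorless, near_colorless))
```

The two rows are the intercept and slope pairs that can be passed to geom_abline().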
ggplot(dmdsimple, aes(x=carat, y=sqrt(price), color=color)) +
geom_point(alpha=0.3) +
geom_abline(intercept=8.68, slope=59.9, color="#F8766D", alpha=0.75, size=1.5) +
geom_abline(intercept=12.75, slope=50.32, color="#00BFC4", alpha=0.75, size=1.5)
summary( lm(sqrt(price) ~ carat, data = dmdsimple) )$r.squared
## [1] 0.8841681
summary( lm(sqrt(price) ~ carat + color, data = dmdsimple) )$r.squared
## [1] 0.8906609
summary( lm(sqrt(price) ~ carat * color, data = dmdsimple) )$r.squared
## [1] 0.9017037
Occam’s Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.
Model selection should follow this principle.
We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model.
In other words, we prefer the simplest model that predicts best, i.e. a parsimonious model.
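Because \(R^2\) cannot decrease when a term is added to a model, one common way to put the parsimony principle into practice is to compare adjusted \(R^2\), which penalizes extra terms. The sketch below is an assumption about how such a comparison might be run, not necessarily the approach intended here. It draws its own sample from ggplot2's diamonds with base R and keeps the original seven color levels, so its numbers will not match those above, and `compare_fits` is a hypothetical helper name.

```r
library(ggplot2)  # only needed for the diamonds data

# Hypothetical helper: R-squared and adjusted R-squared for each formula
compare_fits <- function(formulas, data) {
  t(sapply(formulas, function(f) {
    s <- summary(lm(f, data = data))
    c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared)
  }))
}

set.seed(100)
d <- diamonds[sample(nrow(diamonds), 1000), ]  # not the same sample as dmdsimple

fits <- compare_fits(
  list(carat_only  = sqrt(price) ~ carat,
       additive    = sqrt(price) ~ carat + color,
       interaction = sqrt(price) ~ carat * color),
  data = d
)
print(fits)
```

A term that raises \(R^2\) but leaves adjusted \(R^2\) flat or lower is a candidate for leaving out.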
What we have just explored is what happens when we add a categorical variable to an existing linear model.
Just adding the categorical variable only adjusts the intercept of the model.
Adding an interaction term allows for changes in slope as well.
We can continue to add other categorical variables (and interactions) and they will have the same effect.
Higher-order interactions (involving 3 or more predictors) are possible but are generally not a good idea.
lm(sqrt(price) ~ carat + table , data = dmdsimple)
##
## Call:
## lm(formula = sqrt(price) ~ carat + table, data = dmdsimple)
##
## Coefficients:
## (Intercept) carat table
## 31.634 57.357 -0.375
\[\widehat{\sqrt{\text{price}}} = 31.6 + 57.4 \times \text{carat} - 0.375 \times \text{table}\]
lm(sqrt(price) ~ carat * table , data = dmdsimple)
##
## Call:
## lm(formula = sqrt(price) ~ carat * table, data = dmdsimple)
##
## Coefficients:
## (Intercept) carat table carat:table
## -35.1574 141.0885 0.7836 -1.4466
\[\widehat{\sqrt{\text{price}}} = -35.2 + 141 \times \text{carat} + 0.78 \times \text{table} -1.45 \times \text{carat} \times \text{table}\]
summary( lm(sqrt(price) ~ carat, data = dmdsimple) )$r.squared
## [1] 0.8841681
summary( lm(sqrt(price) ~ carat + table, data = dmdsimple) )$r.squared
## [1] 0.8850684
summary( lm(sqrt(price) ~ carat * table, data = dmdsimple) )$r.squared
## [1] 0.8883356
Interpretation is the same as with SLR; we just need to add the caveat that the expected change occurs when all other variables are held constant.
\[\widehat{\sqrt{\text{price}}} = 31.6 + 57.4 \times \text{carat} - 0.375 \times \text{table}\]
\(b_0\) - For a diamond with 0 carats and 0 table size we expect it to sell for 31.6 \(\sqrt{\$}\)s.
\(b_1\) - For a unit change in carat we expect the square root of price to change by 57.4 \(\sqrt{\$}\)s on average, given the other variables are held constant.
\(b_2\) - For a unit change in table size we expect the square root of price to change by -0.375 \(\sqrt{\$}\)s on average, given the other variables are held constant.
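To see how these coefficients combine, the additive equation above can be evaluated for a single diamond. The inputs below (1 carat, table size 57) are hypothetical values chosen for illustration, and squaring the result gives only a rough figure on the dollar scale.

```r
# Rounded coefficients from the additive model equation
b0 <- 31.6; b1 <- 57.4; b2 <- -0.375

# Hypothetical diamond: 1 carat, table size 57
carat <- 1
table_size <- 57

# Predicted square root of price, then back-transformed to dollars
sqrt_price <- b0 + b1 * carat + b2 * table_size
price <- sqrt_price^2
print(c(sqrt_price = sqrt_price, price = price))
```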
\[\widehat{\sqrt{\text{price}}} = -35.2 + 141 \times \text{carat} + 0.78 \times \text{table} -1.45 \times \text{carat} \times \text{table}\]
\(b_0\) - For a diamond with 0 carats and 0 table size we expect it to sell for -35.2 \(\sqrt{\$}\)s.
\(b_1\) - For a diamond with a table size of 0, for a unit change in carat we expect the square root of price to change by 141 \(\sqrt{\$}\)s on average. Because of the interaction term, this change is different at other table sizes.
\(b_2\) - For a diamond with 0 carats, for a unit change in table size we expect the square root of price to change by 0.78 \(\sqrt{\$}\)s on average. Because of the interaction term, this change is different at other carat values.
\(b_3\) - For each unit increase in table size, the change in the square root of price that we expect for a unit change in carat itself changes by -1.45 \(\sqrt{\$}\)s on average.
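Collecting the carat terms in the interaction equation shows how the slope for carat depends on table size. The table size of 57 used in the second line is a hypothetical value chosen for illustration.

\[\widehat{\sqrt{\text{price}}} = -35.2 + 0.78 \times \text{table} + (141 - 1.45 \times \text{table}) \times \text{carat}\]
\[\text{slope for carat when table} = 57: \quad 141 - 1.45 \times 57 = 58.35\]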