Today’s agenda

Review App Ex from last time
- Recap modeling with logged response variable
Multiple linear regression
- Using additional variables to explain more of the variability in the response variable
Interaction variables
- Adding flexibility / building models that better reflect reality

Initial setup

Back to the diamonds (simplified)

Load packages:

library(ggplot2)
library(dplyr)
library(forcats)

We will simplify the data slightly:

set.seed(100)
dmdsimple = diamonds %>%
    sample_n(1000) %>%
    mutate(color = fct_collapse(color, 
                                colorless  = c("D","E","F"),
                                near_colorless = c("G","H","I","J")),
           clarity = fct_collapse(clarity, 
                                  SI = c("SI1","SI2"),
                                  VS = c("VS1","VS2"),
                                  VVS = c("VVS1","VVS2")))
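
To see what the collapsing step does, here is a minimal base-R sketch on a made-up ordered factor (the toy vector `grades` is an assumption for illustration, not part of the diamonds data). It uses `levels<-` with a named list, which merges levels in much the same way as `fct_collapse()`.

```r
# Toy ordered factor standing in for diamonds$color (hypothetical values)
grades <- factor(c("D", "G", "E", "J", "F", "H"),
                 levels = c("D", "E", "F", "G", "H", "I", "J"),
                 ordered = TRUE)

# Merge the seven original levels into two groups
collapsed <- grades
levels(collapsed) <- list(colorless      = c("D", "E", "F"),
                          near_colorless = c("G", "H", "I", "J"))

table(collapsed)        # counts per collapsed level
is.ordered(collapsed)   # the factor keeps its ordered class
```

Note that the collapsed factor is still ordered, which affects how `lm()` codes it later on.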

Adding categorical predictors

Price, Carat, and Color

ggplot(dmdsimple, aes(x=carat, y=price, color=color)) + geom_point(alpha=0.3) + geom_smooth(se=FALSE)

Let’s try transforming our outcome variable

What about linearity?

Linear model of sqrt(price), carat, and color

Does the relationship between carats and sqrt(price) appear to vary by color?

ggplot(dmdsimple, aes(x=carat, y=sqrt(price), color=color)) + 
    geom_point(alpha=0.3) +
    geom_smooth(method="lm", se=FALSE)

Color Model

(m = lm(sqrt(price) ~ carat + color, data = dmdsimple))

## 
## Call:
## lm(formula = sqrt(price) ~ carat + color, data = dmdsimple)
## 
## Coefficients:
## (Intercept)        carat      color.L  
##       9.608       58.000       -3.336

\[\widehat{\sqrt{\text{price}}} = 9.6 + 58 \times \text{carat} - 3.3 \times \text{color}\]
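
The label color.L in the output indicates that R is treating color as an ordered factor and coding it with a linear polynomial contrast rather than a 0/1 dummy. A quick check, as a sketch using only base R and the coefficients printed above (the object names `codes`, `b`, and `intercepts` are ours):

```r
# Linear contrast R uses for an ordered factor with two levels
codes <- contr.poly(2)[, ".L"]
codes   # first level (colorless), second level (near_colorless)

# Coefficients copied from the printed output above
b <- c(intercept = 9.608, carat = 58.000, color_L = -3.336)

# Intercept of each color's line: b0 + b_color * coded value
intercepts <- unname(b["intercept"] + b["color_L"] * codes)
round(intercepts, 2)
```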

Models by Color

How does changing color change our model?

Because color is an ordered factor with two levels, R codes it as -0.7071 for colorless and +0.7071 for near colorless (the color.L contrast), not as 0 and 1.

Plug in -0.7071 for color to get the linear model for colorless diamonds.

\[\widehat{\sqrt{\text{price}}} = 9.608 + 58 \times \text{carat} - 3.336 \times (-0.7071) = 11.97 + 58 \times \text{carat} \]

Plug in 0.7071 for color to get the linear model for near colorless diamonds.

\[\widehat{\sqrt{\text{price}}} = 9.608 + 58 \times \text{carat} - 3.336 \times 0.7071 = 7.25 + 58 \times \text{carat} \]

That’s not quite right

ggplot(dmdsimple, aes(x=carat, y=sqrt(price), color=color)) + 
    geom_point(alpha=0.3) +
    geom_abline(intercept=11.97, slope=58, color="#F8766D", alpha=0.75, size=1.5) + 
    geom_abline(intercept=7.25, slope=58, color="#00BFC4", alpha=0.75, size=1.5)

What went wrong?

Why is our linear regression model different from what we got from geom_smooth(method='lm')?

The way we specified our model only lets color affect the intercept.
The model implicitly assumes that both colors have the same slope and only allows for different intercepts.
What seems more appropriate in this case?
- Same slope and same intercept for both colors
- Same slope and different intercepts for the two colors
- Different slopes and different intercepts for the two colors

Interactions between explanatory variables

Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.
This implies that the regression coefficient for an explanatory variable would change as another explanatory variable changes.
This can be accomplished by adding an interaction variable which is just the product of the two explanatory variables.
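
To make the "product of the two explanatory variables" concrete, here is a minimal sketch on simulated data (the data frame `toy`, its column names, and the numbers used to generate it are all made up for illustration). It fits the same model twice, once with `*` in the formula and once with a hand-built product column.

```r
# Made-up data: a numeric predictor and a 0/1 indicator (hypothetical values)
set.seed(1)
toy <- data.frame(carat = runif(50, 0.2, 2),
                  near_colorless = rep(c(0, 1), each = 25))
toy$sqrt_price <- 10 + 60 * toy$carat + 4 * toy$near_colorless -
    9 * toy$carat * toy$near_colorless + rnorm(50, sd = 2)

# Fit 1: let the formula build the interaction
fit_star <- lm(sqrt_price ~ carat * near_colorless, data = toy)

# Fit 2: build the product column by hand and add it as a predictor
toy$carat_x_color <- toy$carat * toy$near_colorless
fit_hand <- lm(sqrt_price ~ carat + near_colorless + carat_x_color, data = toy)

# The two fits give the same coefficients (only the name of the product term differs)
all.equal(unname(coef(fit_star)), unname(coef(fit_hand)))
```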

Interaction Model

lm(sqrt(price) ~ carat * color , data = dmdsimple)

## 
## Call:
## lm(formula = sqrt(price) ~ carat * color, data = dmdsimple)
## 
## Coefficients:
##   (Intercept)          carat        color.L  carat:color.L  
##         8.684         59.931          4.066         -9.584

\[\widehat{\sqrt{\text{price}}} = 8.68 + 59.9 \times \text{carat} + 4.07 \times \text{color} - 9.58 \times \text{carat} \times \text{color}\]
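
Grouping the terms by carat shows why the interaction produces nonparallel lines. Writing $c$ for the numeric value R assigns to a given color level (our notation, not part of the output), the fitted equation rearranges to

\[\widehat{\sqrt{\text{price}}} = (8.68 + 4.07 \times c) + (59.9 - 9.58 \times c) \times \text{carat}\]

so both the intercept and the slope of the line depend on $c$.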

Interaction Models by Color

How does changing color change our model?

Again plug in -0.7071 for color to get the linear model for colorless diamonds. (Color is an ordered factor with two levels, so R codes it as -0.7071 and +0.7071 rather than 0 and 1; this is what the color.L label indicates.)

\[ \begin{aligned} \widehat{\sqrt{\text{price}}} &= 8.684 + 59.931 \times \text{carat} + 4.066 \times (-0.7071) - 9.584 \times \text{carat} \times (-0.7071)\\ &= 5.81 + 66.71 \times \text{carat} \end{aligned} \]

Again plug in 0.7071 for color to get the linear model for near colorless diamonds.

\[ \begin{aligned} \widehat{\sqrt{\text{price}}} &= 8.684 + 59.931 \times \text{carat} + 4.066 \times 0.7071 - 9.584 \times \text{carat} \times 0.7071\\ &= 11.56 + 53.15 \times \text{carat} \end{aligned} \]

That’s Better

ggplot(dmdsimple, aes(x=carat, y=sqrt(price), color=color)) + 
    geom_point(alpha=0.3) +
    geom_abline(intercept=5.81, slope=66.71, color="#F8766D", alpha=0.75, size=1.5) + 
    geom_abline(intercept=11.56, slope=53.15, color="#00BFC4", alpha=0.75, size=1.5)

Is it really better?

summary( lm(sqrt(price) ~ carat, data = dmdsimple) )$r.squared

## [1] 0.8841681

summary( lm(sqrt(price) ~ carat + color, data = dmdsimple) )$r.squared

## [1] 0.8906609

summary( lm(sqrt(price) ~ carat * color, data = dmdsimple) )$r.squared

## [1] 0.9017037
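
One caution when reading these numbers: $R^2$ can never decrease when a term is added to a linear model, so a larger value alone does not show that the extra term is worthwhile. Adjusted $R^2$, which `summary()` also reports, accounts for the number of predictors. A minimal sketch on simulated data (the data frame `toy` and the pure-noise predictor `noise` are made up for illustration):

```r
set.seed(42)
toy <- data.frame(x = rnorm(100), noise = rnorm(100))
toy$y <- 1 + 2 * toy$x + rnorm(100)

small <- summary(lm(y ~ x, data = toy))
big   <- summary(lm(y ~ x + noise, data = toy))

# R-squared of the larger model is at least as large, even though
# `noise` is unrelated to y by construction
c(small = small$r.squared, big = big$r.squared)

# Adjusted R-squared accounts for the number of predictors
c(small = small$adj.r.squared, big = big$adj.r.squared)
```

The same comparison can be run on the diamond models by replacing `$r.squared` with `$adj.r.squared`.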

In pursuit of Occam’s Razor

Occam’s Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.
Model selection should follow this principle.
We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model.
In other words, we prefer the simplest model that predicts well, i.e. the parsimonious model.

Summary

What we have just explored is what happens when we add a categorical variable to an existing linear model.

- Just adding the categorical variable only adjusts the intercept of the model
- Adding an interaction term allows for changes in slope as well
- We can continue to add other categorical variables (and interactions) and they will have the same effect
- Higher-order interactions are possible (3 or more predictors) but are generally not a good idea.

Adding numeric predictors

sqrt(price) vs. carat vs. table

Linear model?

lm(sqrt(price) ~ carat + table , data = dmdsimple)

## 
## Call:
## lm(formula = sqrt(price) ~ carat + table, data = dmdsimple)
## 
## Coefficients:
## (Intercept)        carat        table  
##      31.634       57.357       -0.375

\[\widehat{\sqrt{\text{price}}} = 31.6 + 57.4 \times \text{carat} - 0.375 \times \text{table}\]
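
Because the response is on the square-root scale, a fitted value has to be squared to get back to (approximate) dollars. A small sketch using the coefficients printed above; the helper `predict_sqrt_price()` and the example diamond (1 carat, table 57) are hypothetical:

```r
# Coefficients copied from the printed output above
b0 <- 31.634; b_carat <- 57.357; b_table <- -0.375

# Hypothetical helper: fitted sqrt(price) for given carat and table
predict_sqrt_price <- function(carat, table) {
    b0 + b_carat * carat + b_table * table
}

fit_sqrt <- predict_sqrt_price(carat = 1, table = 57)
fit_sqrt      # fitted value on the sqrt(price) scale
fit_sqrt^2    # squared to return to the price scale
```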

Fitted

Linear model + interaction

lm(sqrt(price) ~ carat * table , data = dmdsimple)

## 
## Call:
## lm(formula = sqrt(price) ~ carat * table, data = dmdsimple)
## 
## Coefficients:
## (Intercept)        carat        table  carat:table  
##    -35.1574     141.0885       0.7836      -1.4466

\[\widehat{\sqrt{\text{price}}} = -35.2 + 141 \times \text{carat} + 0.78 \times \text{table} -1.45 \times \text{carat} \times \text{table}\]
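
With the interaction, the slope for carat is no longer a single number: collecting the carat terms gives a slope of 141.0885 - 1.4466 × table. A sketch evaluating it at a few table sizes (the helper `carat_slope()` is ours, and the values 55, 57, and 60 are chosen only for illustration):

```r
# Coefficients copied from the printed output above
b_carat <- 141.0885
b_carat_table <- -1.4466

# Slope of sqrt(price) with respect to carat at a given table size
carat_slope <- function(table) b_carat + b_carat_table * table

tables <- c(55, 57, 60)
data.frame(table = tables, carat_slope = carat_slope(tables))
```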

Fitted interaction model

Which is better?

summary( lm(sqrt(price) ~ carat, data = dmdsimple) )$r.squared

## [1] 0.8841681

summary( lm(sqrt(price) ~ carat + table, data = dmdsimple) )$r.squared

## [1] 0.8850684

summary( lm(sqrt(price) ~ carat * table, data = dmdsimple) )$r.squared

## [1] 0.8883356

Interpreting slopes for MLR

Interpretation is the same as with SLR; we just need to add the caveat that the expected change occurs when all other variables are held constant.

\[\widehat{\sqrt{\text{price}}} = 31.6 + 57.4 \times \text{carat} - 0.375 \times \text{table}\]

$b_0$ - For a diamond with 0 carats and 0 table size we expect it to sell for 31.6 $\sqrt{\$}$s.
$b_1$ - For a unit change in carat we expect the square root of price to change by 57.4 $\sqrt{\$}$s on average, given the other variables are held constant.
$b_2$ - For a unit change in table size we expect the square root of price to change by -0.375 $\sqrt{\$}$s on average, given the other variables are held constant.

Interpreting slopes for MLR w/ interaction

\[\widehat{\sqrt{\text{price}}} = -35.2 + 141 \times \text{carat} + 0.78 \times \text{table} -1.45 \times \text{carat} \times \text{table}\]

$b_0$ - For a diamond with 0 carats and 0 table size we expect it to sell for -35.2 $\sqrt{\$}$s.
$b_1$ - For a diamond with 0 table size, for a unit change in carat we expect the square root of price to change by 141 $\sqrt{\$}$s on average.
$b_2$ - For a diamond with 0 carats, for a unit change in table size we expect the square root of price to change by 0.78 $\sqrt{\$}$s on average.
$b_3$ - For a unit change in table size we expect the slope of carat to change by -1.45 (and, symmetrically, for a unit change in carat we expect the slope of table size to change by -1.45). With an interaction in the model, the effect of one variable can no longer be described while ignoring the value of the other.

Multiple Regression
