September 24, 2015
Load packages:
library(ggplot2)
library(dplyr)
library(stringr)
Load data:
pp <- read.csv("paris_paintings.csv", stringsAsFactors = FALSE) %>%
tbl_df()
Fix prices:
pp <- pp %>% mutate(price = as.numeric(str_replace(price, ",", "")))
Fix shape coding:
pp <- pp %>%
mutate(shape_recode = ifelse(Shape == "", NA,
ifelse(Shape == "ovale", "oval",
ifelse(Shape == "ronde", "round",
ifelse(Shape == "octogon", "octagon", Shape)))))
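As a quick sanity check (not in the original notes), dplyr's count() can cross-tabulate the original and recoded values; the same check works for the material recoding below:
# Cross-tabulate original vs. recoded shape values
pp %>% count(Shape, shape_recode)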
Fix material coding:
pp <- pp %>%
mutate(mat_recode = ifelse(mat %in% c("a", "bc", "c"), "metal",
ifelse(mat %in% c("al", "ar", "m"), "stone",
ifelse(mat %in% c("co", "bt", "t"), "canvas",
ifelse(mat %in% c("p", "ca"), "paper",
ifelse(mat %in% c("b"), "wood",
ifelse(mat %in% c("o", "e", "v"), "other",
ifelse(mat %in% c("n/a", ""), NA,
"uncertain"))))))))
The linear model with multiple predictors
Population model: \[ \hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k \]
Sample model that we use to estimate the population model: \[ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k \]
Very few paintings with Surface >= 5000:
For simplicity let's focus on the paintings with Surface < 5000:
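The filtered data frame pp_Surf_lt_5000 used below is not created in these notes; a minimal version of that step (assuming we simply subset pp) would be:
# Restrict to paintings with Surface < 5000
pp_Surf_lt_5000 <- pp %>%
  filter(Surface < 5000)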
ggplot(data = pp_Surf_lt_5000,
aes(y = log(price), x = Surface, color = factor(artistliving))) +
geom_point(alpha = 0.3) +
stat_smooth(method = "lm", fullrange = TRUE)
(m_main <- lm(log(price) ~ Surface + factor(artistliving), data = pp_Surf_lt_5000))
## 
## Call:
## lm(formula = log(price) ~ Surface + factor(artistliving), data = pp_Surf_lt_5000)
## 
## Coefficients:
##           (Intercept)                Surface  factor(artistliving)1  
##             4.8800835              0.0002653              0.1372103
\[\widehat{log(price)} = 4.88 + 0.00027~surface + 0.14~artistliving \]
Plug in 0 for artistliving to get the linear model for paintings by non-living artists.
Plug in 1 for artistliving to get the linear model for paintings by living artists.
Non-living artist: \(\widehat{log(price)} = 4.88 + 0.00027~surface + 0.14 \times 0\) \(= 4.88 + 0.00027~surface\)
Living artist: \(\widehat{log(price)} = 4.88 + 0.00027~surface + 0.14 \times 1\) \(= 5.02 + 0.00027~surface\)
Rate of change in price as the surface area of the painting increases does not vary between paintings by living and non-living artists (same slope),
Paintings by living artists are consistently more expensive than paintings by non-living artists (different intercept).
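To see these parallel lines explicitly, the intercepts and slope can be read off the fitted coefficients (a quick check, not in the original notes):
# Coefficients of the main-effects model
b_main <- coef(m_main)
# Non-living artists (artistliving = 0): baseline intercept and slope
c(intercept = unname(b_main["(Intercept)"]),
  slope     = unname(b_main["Surface"]))
# Living artists (artistliving = 1): intercept shifts up, slope is unchanged
c(intercept = unname(b_main["(Intercept)"] + b_main["factor(artistliving)1"]),
  slope     = unname(b_main["Surface"]))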
Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.
This implies that the regression coefficient for an explanatory variable would change as another explanatory variable changes.
This can be accomplished by adding an interaction variable: the product of two explanatory variables.
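To see the product term R constructs for us (a sketch, not in the original notes), model.matrix() shows the columns of the design matrix, including the Surface × artistliving product:
# The last column is the interaction (product) term Surface:factor(artistliving)1
head(model.matrix(log(price) ~ Surface * factor(artistliving),
                  data = pp_Surf_lt_5000))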
ggplot(data = pp_Surf_lt_5000,
aes(y = log(price), x = Surface, color = factor(artistliving))) +
geom_point(alpha = 0.3) +
stat_smooth(method = "lm", fullrange = TRUE)
(m_int <- lm(log(price) ~ Surface * factor(artistliving), data = pp_Surf_lt_5000))
## 
## Call:
## lm(formula = log(price) ~ Surface * factor(artistliving), data = pp_Surf_lt_5000)
## 
## Coefficients:
##                   (Intercept)                        Surface  
##                     4.9141894                      0.0002059  
##         factor(artistliving)1  Surface:factor(artistliving)1  
##                    -0.1261225                      0.0004792
\[\widehat{log(price)} = 4.91 + 0.00021~surface - 0.126~artistliving + 0.00048~surface \times artistliving \]
Non-living artist: \(\widehat{log(price)} = 4.91 + 0.00021~surface\) \(- 0.126 \times 0 + 0.00048~surface \times 0\) \(= 4.91 + 0.00021~surface\)
Living artist: \(\widehat{log(price)} = 4.91 + 0.00021~surface\) \(- 0.126 \times 1 + 0.00048~surface \times 1\) \(= 4.79 + 0.00069~surface\)
Rate of change in price as the surface area of the painting increases does vary between paintings by living and non-living artists (different slopes),
Paintings by living artists start at a lower intercept but their prices rise more steeply with surface area, so the model predicts smaller paintings by living artists to be less expensive, and larger ones more expensive, than comparable paintings by non-living artists (different intercepts and slopes).
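As before, the two fitted lines can be recovered from the coefficients (a quick check, not in the original notes):
# Coefficients of the interaction model
b_int <- coef(m_int)
# Non-living artists (artistliving = 0)
c(intercept = unname(b_int["(Intercept)"]),
  slope     = unname(b_int["Surface"]))
# Living artists (artistliving = 1): both intercept and slope shift
c(intercept = unname(b_int["(Intercept)"] + b_int["factor(artistliving)1"]),
  slope     = unname(b_int["Surface"] + b_int["Surface:factor(artistliving)1"]))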
Can you include even more interaction terms? Yes.
Should you? Probably not, if you want to interpret these interactions in the context of the data.
summary(m_main)$r.squared
## [1] 0.01320884
summary(m_int)$r.squared
## [1] 0.0176922
\[ R^2 = \frac{ SS_{Reg} }{ SS_{Total} } = 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \right) \]
anova(m_main)
## Analysis of Variance Table
## 
## Response: log(price)
##                        Df  Sum Sq Mean Sq F value    Pr(>F)    
## Surface                 1   138.5 138.537 40.6741 2.058e-10 ***
## factor(artistliving)    1     6.8   6.810  1.9994    0.1575    
## Residuals            3188 10858.4   3.406                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ R^2_{m\_main} = \frac{138.5 + 6.8}{138.5 + 6.8 + 10858.4} = \frac{145.3}{11003.7} \approx 0.013 \]
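The same arithmetic can be done directly from the ANOVA table (a quick check, not in the original notes):
# R^2 of m_main from its sums of squares: 1 - SS_Error / SS_Total
ss_main <- anova(m_main)[["Sum Sq"]]
1 - ss_main[length(ss_main)] / sum(ss_main)   # matches summary(m_main)$r.squared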
anova(m_int)
## Analysis of Variance Table
## 
## Response: log(price)
##                                Df  Sum Sq Mean Sq F value    Pr(>F)    
## Surface                         1   138.5 138.537 40.8469 1.886e-10 ***
## factor(artistliving)            1     6.8   6.810  2.0079 0.1565836    
## Surface:factor(artistliving)    1    49.3  49.334 14.5458 0.0001394 ***
## Residuals                    3187 10809.1   3.392                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ R^2_{m\_int} = \frac{138.5 + 6.8 + 49.3}{138.5 + 6.8 + 49.3 + 10809.1} = \frac{194.6}{11003.7} \approx 0.018 \]
\[ R^2_{adj} = 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \times \frac{n - 1}{n - k - 1} \right) \] where \(n\) is the number of cases and \(k\) is the number of predictors in the model
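As a check (not in the original notes), the adjusted \(R^2\) of m_int can be computed by hand from this formula:
# Adjusted R^2 for m_int from its sums of squares
ss_int <- anova(m_int)[["Sum Sq"]]
n <- df.residual(m_int) + length(coef(m_int))   # number of cases used in the fit
k <- length(coef(m_int)) - 1                    # number of predictors (excl. intercept)
1 - (ss_int[length(ss_int)] / sum(ss_int)) * (n - 1) / (n - k - 1)   # matches summary(m_int)$adj.r.squared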
Occam's Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.
Model selection follows this principle.
We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model.
In other words, we prefer the simplest best model, i.e. the most parsimonious model.
As the output below shows, adding the interaction increased the adjusted \(R^2\), so we should indeed use the model with the interaction.
summary(m_main)$adj.r.squared
## [1] 0.01258977
summary(m_int)$adj.r.squared
## [1] 0.01676753
Backward elimination:
Start with the full model (including all candidate explanatory variables and all candidate interactions)
Remove one variable at a time, and select the model with the highest adjusted \(R^2\)
Continue until adjusted \(R^2\) does not increase
Forward selection:
Start with the empty model
Add one variable (or interaction effect) at a time, and select the model with the highest adjusted \(R^2\)
Continue until adjusted \(R^2\) does not increase
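A minimal sketch of forward selection with adjusted \(R^2\) for this example, using only the candidate predictors from above (Surface, artistliving, and their interaction); this is an illustration of the idea, not a general implementation:
# Candidate models of increasing complexity
m0 <- lm(log(price) ~ 1, data = pp_Surf_lt_5000)                    # empty model
m1 <- lm(log(price) ~ Surface, data = pp_Surf_lt_5000)              # add Surface
m2 <- lm(log(price) ~ Surface + factor(artistliving),
         data = pp_Surf_lt_5000)                                    # add artistliving
m3 <- lm(log(price) ~ Surface * factor(artistliving),
         data = pp_Surf_lt_5000)                                    # add the interaction
# Keep adding a term only while adjusted R^2 increases
sapply(list(empty = m0, surface = m1, main = m2, interaction = m3),
       function(m) summary(m)$adj.r.squared)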
If an interaction is included in the model, the main effects of both of those variables must also be in the model
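In R's formula syntax, * already respects this rule: it expands to the main effects plus their interaction, so the fit below is identical to m_int (a quick check, not in the original notes):
# Spelling out what Surface * factor(artistliving) expands to
m_int2 <- lm(log(price) ~ Surface + factor(artistliving) +
               Surface:factor(artistliving),
             data = pp_Surf_lt_5000)
all.equal(coef(m_int), coef(m_int2))   # TRUE: the two specifications match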
Adjusted \(R^2\) is one model selection criterion.
There are many others out there (many, many others!); we'll discuss some later in the course, and you may see more in another course.
See course website.