September 24, 2015
Load packages:
library(ggplot2)
library(dplyr)
library(stringr)
Load data:
pp <- read.csv("paris_paintings.csv", stringsAsFactors = FALSE) %>%
tbl_df()
Fix prices:
pp <- pp %>% mutate(price = as.numeric(str_replace(price, ",", "")))
Fix shape coding:
pp <- pp %>%
mutate(shape_recode = ifelse(Shape == "", NA,
ifelse(Shape == "ovale", "oval",
ifelse(Shape == "ronde", "round",
ifelse(Shape == "octogon", "octagon", Shape)))))
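As a quick sanity check (not in the original notes), dplyr's count() can cross-tabulate the original and recoded values; the same check works for the material recoding below:
# Cross-tabulate original vs. recoded shape values
pp %>% count(Shape, shape_recode)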
Fix material coding:
pp <- pp %>%
mutate(mat_recode = ifelse(mat %in% c("a", "bc", "c"), "metal",
ifelse(mat %in% c("al", "ar", "m"), "stone",
ifelse(mat %in% c("co", "bt", "t"), "canvas",
ifelse(mat %in% c("p", "ca"), "paper",
ifelse(mat %in% c("b"), "wood",
ifelse(mat %in% c("o", "e", "v"), "other",
ifelse(mat %in% c("n/a", ""), NA,
"uncertain"))))))))
The linear model with multiple predictors
Population model: \[ \hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k \]
Sample model that we use to estimate the population model: \[ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k \]
Very few paintings with Surface >= 5000:
For simplicity let's focus on the paintings with Surface < 5000:
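The filtered data frame pp_Surf_lt_5000 used below is not created in these notes; a minimal version of that step (assuming we simply subset pp) would be:
# Restrict to paintings with Surface < 5000
pp_Surf_lt_5000 <- pp %>%
  filter(Surface < 5000)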
ggplot(data = pp_Surf_lt_5000,
aes(y = log(price), x = Surface, color = factor(artistliving))) +
geom_point(alpha = 0.3) +
stat_smooth(method = "lm", fullrange = TRUE)
(m_main <- lm(log(price) ~ Surface + factor(artistliving), data = pp_Surf_lt_5000))
## 
## Call:
## lm(formula = log(price) ~ Surface + factor(artistliving), data = pp_Surf_lt_5000)
## 
## Coefficients:
##           (Intercept)                Surface  factor(artistliving)1  
##             4.8800835              0.0002653              0.1372103
\[\widehat{log(price)} = 4.88 + 0.00027~surface + 0.14~artistliving \]
Plug in 0 for artistliving to get the linear model for paintings by non-living artists.
Plug in 1 for artistliving to get the linear model for paintings by living artists.
Non-living artist: \(\widehat{log(price)} = 4.88 + 0.00027~surface + 0.14 \times 0\) \(= 4.88 + 0.00027~surface\)
Living artist: \(\widehat{log(price)} = 4.88 + 0.00027~surface + 0.14 \times 1\) \(= 5.02 + 0.00027~surface\)
Rate of change in price as the surface area of the painting increases does not vary between paintings by living and non-living artists (same slope),
Paintings by living artists are consistently more expensive than paintings by non-living artists (different intercept).
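To see these parallel lines explicitly, the intercepts and slope can be read off the fitted coefficients (a quick check, not in the original notes):
# Coefficients of the main-effects model
b_main <- coef(m_main)
# Non-living artists (artistliving = 0): baseline intercept and slope
c(intercept = unname(b_main["(Intercept)"]),
  slope     = unname(b_main["Surface"]))
# Living artists (artistliving = 1): intercept shifts up, slope is unchanged
c(intercept = unname(b_main["(Intercept)"] + b_main["factor(artistliving)1"]),
  slope     = unname(b_main["Surface"]))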
Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.
This implies that the regression coefficient for an explanatory variable would change as another explanatory variable changes.
This can be accomplished by adding an interaction variable: the product of two explanatory variables.
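To see the product term R constructs for us (a sketch, not in the original notes), model.matrix() shows the columns of the design matrix, including the Surface × artistliving product:
# The last column is the interaction (product) term Surface:factor(artistliving)1
head(model.matrix(log(price) ~ Surface * factor(artistliving),
                  data = pp_Surf_lt_5000))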
ggplot(data = pp_Surf_lt_5000,
aes(y = log(price), x = Surface, color = factor(artistliving))) +
geom_point(alpha = 0.3) +
stat_smooth(method = "lm", fullrange = TRUE)
(m_int <- lm(log(price) ~ Surface * factor(artistliving), data = pp_Surf_lt_5000))
## 
## Call:
## lm(formula = log(price) ~ Surface * factor(artistliving), data = pp_Surf_lt_5000)
## 
## Coefficients:
##                   (Intercept)                        Surface  
##                     4.9141894                      0.0002059  
##         factor(artistliving)1  Surface:factor(artistliving)1  
##                    -0.1261225                      0.0004792
\[\widehat{log(price)} = 4.91 + 0.00021~surface - 0.126~artistliving + 0.00048~surface \times artistliving \]
Non-living artist: \(\widehat{log(price)} = 4.91 + 0.00021~surface\) \(- 0.126 \times 0 + 0.00048~surface \times 0\) \(= 4.91 + 0.00021~surface\)
Living artist: \(\widehat{log(price)} = 4.91 + 0.00021~surface\) \(- 0.126 \times 1 + 0.00048~surface \times 1\) \(= 4.79 + 0.00069~surface\)
Rate of change in price as the surface area of the painting increases does vary between paintings by living and non-living artists (different slopes),
Paintings by living artists start at a lower intercept but their prices rise more steeply with surface area, so the model predicts smaller paintings by living artists to be less expensive, and larger ones more expensive, than comparable paintings by non-living artists (different intercepts and slopes).
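As before, the two fitted lines can be recovered from the coefficients (a quick check, not in the original notes):
# Coefficients of the interaction model
b_int <- coef(m_int)
# Non-living artists (artistliving = 0)
c(intercept = unname(b_int["(Intercept)"]),
  slope     = unname(b_int["Surface"]))
# Living artists (artistliving = 1): both intercept and slope shift
c(intercept = unname(b_int["(Intercept)"] + b_int["factor(artistliving)1"]),
  slope     = unname(b_int["Surface"] + b_int["Surface:factor(artistliving)1"]))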
Can you include even more interaction terms? Yes.
Should you? Probably not, if you want to interpret these interactions in the context of the data.
summary(m_main)$r.squared
## [1] 0.01320884
summary(m_int)$r.squared
## [1] 0.0176922
\[ R^2 = \frac{ SS_{Reg} }{ SS_{Total} } = 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \right) \]
anova(m_main)
## Analysis of Variance Table
## 
## Response: log(price)
##                        Df  Sum Sq Mean Sq F value    Pr(>F)    
## Surface                 1   138.5 138.537 40.6741 2.058e-10 ***
## factor(artistliving)    1     6.8   6.810  1.9994    0.1575    
## Residuals            3188 10858.4   3.406                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ R^2_{m\_main} = \frac{138.5 + 6.8}{138.5 + 6.8 + 10858.4} = \frac{145.3}{11003.7} \approx 0.013 \]
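The same arithmetic can be done directly from the ANOVA table (a quick check, not in the original notes):
# R^2 of m_main from its sums of squares: 1 - SS_Error / SS_Total
ss_main <- anova(m_main)[["Sum Sq"]]
1 - ss_main[length(ss_main)] / sum(ss_main)   # matches summary(m_main)$r.squared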
anova(m_int)
## Analysis of Variance Table
## 
## Response: log(price)
##                                Df  Sum Sq Mean Sq F value    Pr(>F)    
## Surface                         1   138.5 138.537 40.8469 1.886e-10 ***
## factor(artistliving)            1     6.8   6.810  2.0079 0.1565836    
## Surface:factor(artistliving)    1    49.3  49.334 14.5458 0.0001394 ***
## Residuals                    3187 10809.1   3.392                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ R^2_{m\_int} = \frac{138.5 + 6.8 + 49.3}{138.5 + 6.8 + 49.3 + 10809.1} = \frac{194.6}{11003.7} \approx 0.018 \]
\[ R^2_{adj} = 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \times \frac{n - 1}{n - k - 1} \right) \] where \(n\) is the number of cases and \(k\) is the number of predictors in the model
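As a check (not in the original notes), the adjusted \(R^2\) of m_int can be computed by hand from this formula:
# Adjusted R^2 for m_int from its sums of squares
ss_int <- anova(m_int)[["Sum Sq"]]
n <- df.residual(m_int) + length(coef(m_int))   # number of cases used in the fit
k <- length(coef(m_int)) - 1                    # number of predictors (excl. intercept)
1 - (ss_int[length(ss_int)] / sum(ss_int)) * (n - 1) / (n - k - 1)   # matches summary(m_int)$adj.r.squared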
Occam's Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.
Model selection follows this principle.
We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model.
In other words, we prefer the simplest best model, i.e. the most parsimonious model.
As the output below shows, adding the interaction increased the adjusted \(R^2\), so we should indeed use the model with the interaction.
summary(m_main)$adj.r.squared
## [1] 0.01258977
summary(m_int)$adj.r.squared
## [1] 0.01676753
Backward elimination:
Start with the full model (including all candidate explanatory variables and all candidate interactions)
Remove one variable at a time, and select the model with the highest adjusted \(R^2\)
Continue until adjusted \(R^2\) does not increase
Forward selection:
Start with the empty model
Add one variable (or interaction effect) at a time, and select the model with the highest adjusted \(R^2\)
Continue until adjusted \(R^2\) does not increase
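A minimal sketch of forward selection with adjusted \(R^2\) for this example, using only the candidate predictors from above (Surface, artistliving, and their interaction); this is an illustration of the idea, not a general implementation:
# Candidate models of increasing complexity
m0 <- lm(log(price) ~ 1, data = pp_Surf_lt_5000)                    # empty model
m1 <- lm(log(price) ~ Surface, data = pp_Surf_lt_5000)              # add Surface
m2 <- lm(log(price) ~ Surface + factor(artistliving),
         data = pp_Surf_lt_5000)                                    # add artistliving
m3 <- lm(log(price) ~ Surface * factor(artistliving),
         data = pp_Surf_lt_5000)                                    # add the interaction
# Keep adding a term only while adjusted R^2 increases
sapply(list(empty = m0, surface = m1, main = m2, interaction = m3),
       function(m) summary(m)$adj.r.squared)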
If an interaction is included in the model, the main effects of both of those variables must also be in the model
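In R's formula syntax, * already respects this rule: it expands to the main effects plus their interaction, so the fit below is identical to m_int (a quick check, not in the original notes):
# Spelling out what Surface * factor(artistliving) expands to
m_int2 <- lm(log(price) ~ Surface + factor(artistliving) +
               Surface:factor(artistliving),
             data = pp_Surf_lt_5000)
all.equal(coef(m_int), coef(m_int2))   # TRUE: the two specifications match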
Adjusted \(R^2\) is one model selection criterion.
There are many others out there (many, many others!); we'll discuss some later in the course, and you may see more in another course.
See course website.