class: center, middle, inverse, title-slide

# Multiple linear regression
### Dr. Çetinkaya-Rundel
### October 17, 2017

---

class: center, middle

# Getting started

---

## Getting started

- Any questions from last time?

- Linear models with multiple predictors and interaction effects

---

class: center, middle

# Interaction effects

---

## Data: Paris Paintings

```r
library(tidyverse) # ggplot2 + dplyr + readr + and some others
```

```
## Warning: package 'dplyr' was built under R version 3.4.2
```

```r
library(broom)
```

```r
pp <- read_csv("data/paris_paintings.csv", na = c("n/a", "", "NA"))
```

---

## New package: `forcats`

**For** dealing with **cat**egorical variable**s**

![](img/13/forcats.png)

```r
library(forcats)
```

---

## Data fixes

Collapse levels of `Shape` and `mat`erial variables.

.small[
```r
# Merge spelling variants and rare codes into a few interpretable levels
pp <- pp %>%
  mutate(
    Shape = fct_collapse(Shape,
      oval     = c("oval", "ovale"),
      round    = c("round", "ronde"),
      squ_rect = "squ_rect",
      other    = c("octogon", "octagon", "miniature")),
    mat = fct_collapse(mat,
      metal  = c("a", "br", "c"),
      canvas = c("co", "t", "ta"),
      paper  = c("p", "ca"),
      wood   = "b",
      other  = c("e", "g", "h", "mi", "o", "pa", "v", "al", "ar", "m"))
  )
```
]

---

## Review fixes

.small[
```r
pp %>%
  count(Shape)
```

```
## # A tibble: 5 x 2
##      Shape     n
##     <fctr> <int>
## 1    other    12
## 2     oval    52
## 3    round    74
## 4 squ_rect  3219
## 5     <NA>    36
```

```r
pp %>%
  count(mat)
```

```
## # A tibble: 6 x 2
##      mat     n
##   <fctr> <int>
## 1  metal   321
## 2  other    59
## 3   wood   886
## 4  paper    38
## 5 canvas  1783
## 6   <NA>   306
```
]

---

## Review: Main effects, numerical predictors

```r
(m_main_n <- lm(log(price) ~ Width_in + Height_in, data = pp))
```

```
## 
## Call:
## lm(formula = log(price) ~ Width_in + Height_in, data = pp)
## 
## Coefficients:
## (Intercept)     Width_in    Height_in  
##     4.76944      0.02694     -0.01327
```

---

## Visualizing the model
---

## Review: Main effects, numerical and categorical predictors

.small[
```r
# Restrict to paintings with surface area under 5000 square inches
pp_Surf_lt_5000 <- pp %>%
  filter(Surface < 5000)

m_main <- lm(log(price) ~ Surface + factor(artistliving), data = pp_Surf_lt_5000)
# Exponentiate to move from the log(price) scale back to the price scale
round(exp(m_main$coefficients), 4)
```

```
##           (Intercept)               Surface factor(artistliving)1 
##              131.6417                1.0003                1.1471
```
]

- All else held constant, for each additional square inch in a painting's surface area, the price of the painting is predicted, on average, to be higher by a factor of 1.0003.

- All else held constant, the price of a painting by a living artist is predicted, on average, to be higher by a factor of 1.15 compared to a painting by an artist who is no longer alive.

- Paintings that are by an artist who is not alive and that have a surface area of 0 square inches are predicted, on average, to cost 131.64 livres.

---

## What went wrong?

<div class="question">
Why is our linear regression model different from what we got from `geom_smooth(method = "lm")`?
</div>

![](13-deck_files/figure-html/unnamed-chunk-11-1.png)<!-- -->

![](13-deck_files/figure-html/unnamed-chunk-12-1.png)<!-- -->

---

## What went wrong? (cont.)

- The way we specified our model only lets `artistliving` affect the intercept.

- The model implicitly assumes that paintings by living and deceased artists have the *same slope* and only allows for *different intercepts*.

- What seems more appropriate in this case?
    * Same slope and same intercept for both colors
    * Same slope and different intercept for both colors
    * Different slope and different intercept for both colors

---

## Interacting explanatory variables

- Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.

- This implies that the regression coefficient for an explanatory variable changes as another explanatory variable changes.

- This can be accomplished by adding an interaction variable: the product of two explanatory variables.

---

## Price vs. surface and artist living interacting

.small[
```r
# geom_smooth() fits a separate line per color group, so the slopes can differ
ggplot(data = pp_Surf_lt_5000,
       mapping = aes(y = log(price), x = Surface, color = factor(artistliving))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", fullrange = TRUE)
```

![](13-deck_files/figure-html/unnamed-chunk-13-1.png)<!-- -->
]

---

## Modeling with interaction effects

.small[
```r
# Surface * factor(artistliving) includes both main effects and their interaction
(m_int <- lm(log(price) ~ Surface * factor(artistliving), data = pp_Surf_lt_5000))
```

```
## 
## Call:
## lm(formula = log(price) ~ Surface * factor(artistliving), data = pp_Surf_lt_5000)
## 
## Coefficients:
##                   (Intercept)                        Surface  
##                     4.9141894                      0.0002059  
##         factor(artistliving)1  Surface:factor(artistliving)1  
##                    -0.1261225                      0.0004792
```
]

$$ \widehat{log(price)} = 4.91 + 0.00021~surface - 0.126~artistliving + 0.00048~surface \times artistliving $$

---

## Interpretation of interaction effects

- The rate of change in price as the surface area of the painting increases does vary between paintings by living and non-living artists (different slopes).

- Some paintings by living artists are more expensive than paintings by non-living artists, and some are not (different intercepts).

.small[
.pull-left[
- Non-living artist: 
`\(\widehat{log(price)} = 4.91 + 0.00021~surface\)`
`\(- 0.126 \times 0 + 0.00048~surface \times 0\)`
`\(= 4.91 + 0.00021~surface\)`

- Living artist: 
`\(\widehat{log(price)} = 4.91 + 0.00021~surface\)`
`\(- 0.126 \times 1 + 0.00048~surface \times 1\)`
`\(= 4.91 + 0.00021~surface\)`
`\(- 0.126 + 0.00048~surface\)`
`\(= 4.784 + 0.00069~surface\)`
]
.pull-right[
![](13-deck_files/figure-html/unnamed-chunk-15-1.png)<!-- -->
]
]

---

## Third-order interactions

- Can you? Yes

- Should you? Probably not if you want to interpret these interactions in the context of the data.

---
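
## Sketch: group-specific lines from the interaction model

The arithmetic on the "Interpretation of interaction effects" slide can be reproduced in a few lines of R. This is a minimal sketch, not part of the original deck: the coefficients are copied by hand from the printed `m_int` output, and the names `b`, `group_line`, `line_deceased`, `line_living`, `predict_log_price`, and `crossover` are hypothetical helpers introduced only for illustration. Because it uses the unrounded coefficients, the living-artist intercept comes out slightly different from the 4.784 obtained from rounded values.

.small[
```r
# Coefficients copied from the printed m_int output (log(price) scale)
b <- c(
  intercept      = 4.9141894,
  surface        = 0.0002059,
  living         = -0.1261225,
  surface_living = 0.0004792
)

# Hypothetical helper: intercept and slope of the fitted line
# for a given value of artistliving (0 = not living, 1 = living)
group_line <- function(is_living) {
  c(
    intercept = unname(b["intercept"] + b["living"] * is_living),
    slope     = unname(b["surface"] + b["surface_living"] * is_living)
  )
}

line_deceased <- group_line(0)
line_living   <- group_line(1)

# Hypothetical helper: predicted log(price) at a given surface area (square inches)
predict_log_price <- function(line, surface) {
  unname(line["intercept"] + line["slope"] * surface)
}

# Surface area at which the two fitted lines cross
crossover <- unname(-b["living"] / b["surface_living"])

round(line_deceased, 5)
round(line_living, 5)
round(crossover)
```
]

Under these fitted coefficients the two lines cross at roughly 263 square inches, which is one way to read the bullet that some paintings by living artists are predicted to be more expensive and some are not: below the crossover the living-artist line is lower, above it the line is higher.

---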