class: center, middle, inverse, title-slide

# Multiple linear regression

🤹‍♀

---
layout: true

<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel -
<a href="http://www2.stat.duke.edu/courses/Fall18/sta112.01/schedule" target="_blank">stat.duke.edu/courses/Fall18/sta112.01</a>
</span>
</div>

---

## Announcements

- Peer evaluations due tonight at 11:59pm
- 6/10 of you still need to fill them out
- Project assignment is posted, due next Tuesday

---
class: center, middle

# The linear model with multiple predictors

---

## Getting started

**Data:** Paris Paintings

```r
pp <- read_csv("data/paris_paintings.csv", na = c("n/a", "", "NA"))
```

---

## Multiple predictors

- Response variable: log(price)
- Explanatory variables: Width and height

```r
m_wi_hgt <- lm(log(price) ~ Width_in + Height_in, data = pp)
tidy(m_wi_hgt)
```

```
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   4.77     0.0579      82.4  0.
## 2 Width_in      0.0269   0.00373      7.22 6.58e-13
## 3 Height_in    -0.0133   0.00395     -3.36 7.93e- 4
```

--

- Linear model:

`$$\widehat{log(price)} = 4.77 + 0.0269~width - 0.0133~height$$`

---

## Visualizing models with multiple predictors
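
With two numeric predictors the fitted model is a plane rather than a line. As a minimal sketch (not necessarily the plot originally shown on this slide), the code below rebuilds predictions from the rounded coefficients of `m_wi_hgt` and draws them as contour lines with base R. The grid of widths and heights (5 to 60 inches) is an assumption made for illustration, not a range taken from the data.

```r
# Coefficients copied from the tidy(m_wi_hgt) output (rounded as printed)
b0 <- 4.77
b_width <- 0.0269
b_height <- -0.0133

# Hypothetical grid of widths and heights in inches (chosen for illustration)
width <- seq(5, 60, by = 5)
height <- seq(5, 60, by = 5)

# Predicted log(price) at every width/height combination: a plane, not a line
pred_log_price <- outer(width, height,
                        function(w, h) b0 + b_width * w + b_height * h)

# Contour lines of a plane are straight, parallel, and evenly spaced
contour(width, height, pred_log_price,
        xlab = "Width (in)", ylab = "Height (in)",
        main = "Predicted log(price)")

# Single prediction for a 30in x 20in painting, back-transformed from the log scale
log_price_30_20 <- b0 + b_width * 30 + b_height * 20
exp(log_price_30_20)
```

Working from the printed coefficients keeps the sketch independent of the data file; with the fitted object available, `predict(m_wi_hgt, newdata = ...)` would avoid the rounding.
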
---

## Exploration 1

### Price, surface area, and living artist

- Explore the relationship between price of paintings and surface area, conditioned on whether or not the artist is still living
- First visualize and explore, then model

---

## Typical surface area

.question[
What is the typical surface area for paintings?
]

<!-- -->

--

Less than 1000 square inches (roughly a 31in x 31in painting). Very few paintings have a surface area above 5000 square inches.

---

## Narrowing the scope

For simplicity let's focus on the paintings with `Surface < 5000`:

```r
pp_Surf_lt_5000 <- pp %>%
  filter(Surface < 5000)
```

<!-- -->

---

## Two ways to model

- **Main effects:** Assuming the relationship between surface and logged price **does not vary** by whether or not the artist is living.
- **Interaction effects:** Assuming the relationship between surface and logged price **varies** by whether or not the artist is living.

.pull-left[
<!-- -->
]
.pull-right[
<!-- -->
]

---

## Fit model with main effects

- Response variable: log(price)
- Explanatory variables: Surface area and artist living (0/1 variable)

.midi[
```r
m_main <- lm(log(price) ~ Surface + factor(artistliving),
             data = pp_Surf_lt_5000)
tidy(m_main)
```

```
## # A tibble: 3 x 5
##   term                  estimate std.error statistic  p.value
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)           4.88     0.0424       115.   0.
## 2 Surface               0.000265 0.0000415      6.39 1.85e-10
## 3 factor(artistliving)1 0.137    0.0970         1.41 1.57e- 1
```
]

--

- Linear model:

$$ \widehat{log(price)} = 4.88 + 0.000265~surface + 0.137~artistliving $$

---

## Solving the model

<!-- -->

--

- Non-living artist: Plug in 0 for `artistliving`

`\(\widehat{log(price)} = 4.88 + 0.000265~surface + 0.137 \times 0\)`
`\(= 4.88 + 0.000265~surface\)`

--

- Living artist: Plug in 1 for `artistliving`

`\(\widehat{log(price)} = 4.88 + 0.000265~surface + 0.137 \times 1\)`
`\(= 5.017 + 0.000265~surface\)`

---

## Visualizing main effects

<!-- -->

- **Same slope:** The rate of change in log(price) as the surface area increases does not vary between paintings by living and non-living artists.
- **Different intercept:** Paintings by living artists are predicted to be consistently more expensive than paintings by non-living artists.

---

## Interpreting main effects

.midi[
```r
tidy(m_main) %>%
  mutate(exp_estimate = exp(estimate)) %>%
  select(term, estimate, exp_estimate)
```

```
## # A tibble: 3 x 3
##   term                  estimate exp_estimate
##   <chr>                    <dbl>        <dbl>
## 1 (Intercept)           4.88           132.
## 2 Surface               0.000265         1.00
## 3 factor(artistliving)1 0.137            1.15
```
]

- All else held constant, for each additional square inch in a painting's surface area, the price of the painting is predicted, on average, to be higher by a factor of about 1.0003 (displayed as 1.00 after rounding).
- All else held constant, paintings by a living artist are predicted, on average, to be priced higher by a factor of 1.15 compared to paintings by an artist who is no longer alive.
- Paintings that are by an artist who is not alive and that have a surface area of 0 square inches are predicted, on average, to cost 132 livres.

---

## What went wrong?

.question[
Why is our linear regression model different from what we got from `geom_smooth(method = "lm")`?
]

.pull-left[
<!-- -->
]
.pull-right[
<!-- -->
]

---

## What went wrong? (cont.)

- The way we specified our model only lets `artistliving` affect the intercept.
- The model implicitly assumes that paintings by living and deceased artists have the *same slope* and only allows for *different intercepts*.
- What seems more appropriate in this case?
  + Same slope and same intercept for both colors
  + Same slope and different intercept for both colors
  + Different slope and different intercept for both colors

---

## Interacting explanatory variables

- Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.
- This implies that the regression coefficient for an explanatory variable changes as another explanatory variable changes.
- This can be accomplished by adding an interaction variable: the product of two explanatory variables.

---

## Interaction: surface * artist living

.small[
```r
ggplot(data = pp_Surf_lt_5000,
       mapping = aes(y = log(price), x = Surface, color = factor(artistliving))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  labs(x = "Surface", y = "Log(price)", color = "Living artist")
```

<!-- -->
]

---

## Fit model with interaction effects

- Response variable: log(price)
- Explanatory variables: Surface area, artist living (0/1 variable), and their interaction

.midi[
```r
# `Surface * factor(artistliving)` already expands to both main effects plus
# their interaction, so listing the main effects separately is redundant but harmless
m_int <- lm(log(price) ~ Surface + factor(artistliving) +
              Surface * factor(artistliving),
            data = pp_Surf_lt_5000)
tidy(m_int)
```

```
## # A tibble: 4 x 5
##   term                           estimate std.error statistic    p.value
##   <chr>                             <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)                    4.91     0.0432       114.   0
## 2 Surface                        0.000206 0.0000442      4.65 0.00000337
## 3 factor(artistliving)1         -0.126    0.119         -1.06 0.289
## 4 Surface:factor(artistliving)1  0.000479 0.000126       3.81 0.000139
```
]

- Linear model:

$$ \widehat{log(price)} = 4.91 + 0.00021~surface - 0.126~artistliving $$
$$+ ~ 0.00048~surface \times artistliving $$

---

## Interpretation of interaction effects

- The rate of change in log(price) as the surface area of the painting increases does vary between paintings by living and non-living artists (different slopes).
- Some paintings by living artists are predicted to be more expensive than paintings by non-living artists, and some are not (different intercepts, and the fitted lines cross at a surface area of roughly 260 square inches).

.small[
.pull-left[
- Non-living artist:

`\(\widehat{log(price)} = 4.91 + 0.00021~surface\)`
`\(- 0.126 \times 0 + 0.00048~surface \times 0\)`
`\(= 4.91 + 0.00021~surface\)`

- Living artist:

`\(\widehat{log(price)} = 4.91 + 0.00021~surface\)`
`\(- 0.126 \times 1 + 0.00048~surface \times 1\)`
`\(= 4.91 + 0.00021~surface\)`
`\(- 0.126 + 0.00048~surface\)`
`\(= 4.784 + 0.00069~surface\)`
]
.pull-right[
<!-- -->
]
]

---

## Third order interactions

- Can you? Yes
- Should you? Probably not if you want to interpret these interactions in context of the data.
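
One way to see why interpretation gets harder is to count the terms. The sketch below expands a three-way interaction formula with base R's `terms()`, without fitting anything. Using `Height_in` as the third predictor is a hypothetical choice for illustration only; it is not a model from these slides.

```r
# Hypothetical three-way interaction; Height_in as the third predictor
# is an assumption for illustration, not a model fitted in these slides
f_three_way <- log(price) ~ Surface * factor(artistliving) * Height_in

# terms() expands the formula symbolically, so no data is needed
tt <- terms(f_three_way)
term_labels <- attr(tt, "term.labels")
term_order <- attr(tt, "order")

# One row per term with its interaction order
# (1 = main effect, 2 = two-way, 3 = three-way)
data.frame(term = term_labels, order = term_order)
```

The expansion contains three main effects, three two-way interactions, and one three-way interaction: seven terms besides the intercept, and the lower-order coefficients can no longer be read on their own.
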