class: center, middle, inverse, title-slide # Multiple linear regression ## Intro to Data Science ### Shawn Santo ### 02-25-20 --- ## Announcements - Homework 3 assigned Thursday - Lab 5 due Friday, Feb 28 at 11:59pm --- class: center, middle, inverse # Linear model with multiple predictors --- ## Data and packages ```r library(tidyverse) library(broom) ``` ```r paris_paintings <- read_csv("data/paris_paintings.csv", na = c("n/a", "", "NA")) ``` <br/> - [Paris Paintings Codebook](http://www2.stat.duke.edu/courses/Spring20/sta199.001/data/code_books/paris_codebook.html) - Source: Printed catalogues of 28 auction sales in Paris, 1764- 1780 - 3,393 paintings, their prices, and descriptive details from sales catalogues over 60 variables --- ## Three variables What is the typical surface area for paintings? <img src="lec-08a-mlr_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> -- Less than 1000 square inches (which is roughly a painting that is 31in x 31in). There are very few paintings that have surface area above 5000 square inches. --- For simplicity, let's focus on the paintings with `Surface < 5000`. <img src="lec-08a-mlr_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> --- ## Price, surface, and living artist Does the relationship between surface and logged price vary by whether or not the artist is living? .tiny[ ```r ggplot(data = pp_surf_5000, mapping = aes(y = log(price), x = Surface, color = factor(artistliving))) + geom_point(alpha = 0.3) + labs(color = "Living artist", y = "Price (logarithm)") + theme_bw(base_size = 16) ``` <img src="lec-08a-mlr_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> ] --- ## Modeling with main effects ```r m_main <- lm(log(price) ~ Surface + factor(artistliving), data = pp_surf_5000) m_main %>% tidy() %>% select(term, estimate) ``` ``` #> # A tibble: 3 x 2 #> term estimate #> <chr> <dbl> #> 1 (Intercept) 4.88 #> 2 Surface 0.000265 #> 3 factor(artistliving)1 0.137 ``` -- Linear model: $$ \widehat{\log(price)} = 4.88 + 0.00027~surface + 0.14~artistliving $$ -- - Plug in 0 for `artistliving` to get the linear model for paintings by non-living artists. - Plug in 1 for `artistliving` to get the linear model for paintings by living artists. --- ## Interpretation of main effects <img src="lec-08a-mlr_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> --- - Non-living artist: `$$\begin{align}\widehat{\log(price)} &= 4.88 + 0.00027~surface + 0.14 \times 0\\ &= 4.88 + 0.00027~surface\\\end{align}$$` - Living artist: `$$\begin{align}\widehat{\log(price)} &= 4.88 + 0.00027~surface + 0.14 \times 1\\ &= 5.02 + 0.00027~surface\\\end{align}$$` -- - Rate of change in log price as the surface area of the painting increases does not vary between paintings by living and non-living artists (same slope) - Paintings by living artists are consistently more expensive than paintings by non-living artists (different intercept) --- ## Main effects, numerical and categorical predictors ``` #> # A tibble: 3 x 3 #> term estimate exp_estimate #> <chr> <dbl> <dbl> #> 1 (Intercept) 4.88 132. #> 2 Surface 0.000265 1.00 #> 3 factor(artistliving)1 0.137 1.15 ``` - **All else held constant**, for each additional square inch in a painting's surface area, the log price of the painting is predicted, on average, to increase by 0.0003 (the price is expected to be higher by a multiplicative factor of 1.0003. - **All else held constant**, paintings by a living artist are predicted, on average, to have a log price that is 0.137 greater compared to paintings by artists who are no longer alive (the price is expected to be higher by a multiplicative factor of 1.1471) - Paintings that are by an artist who is not alive and that have a surface area of 0 square inches are predicted, on average, to be 131.6417 livres. --- ## What went wrong? Why is our linear regression model different from what we got from `geom_smooth(method = "lm", se = FALSE)`? <img src="lec-08a-mlr_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- ## What went wrong? Why is our linear regression model different from what we got from `geom_smooth(method = "lm", se = FALSE)`? <img src="lec-08a-mlr_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> --- ## What went wrong? (cont.) - The way we specified our model only lets `artistliving` affect the intercept. - Model implicitly assumes that paintings with living and deceased artists have the *same slope* and only allows for *different intercepts*. - What seems more appropriate in this case? * Same slope and same intercept for both colors? * Same slope and different intercept for both colors? * Different slope and different intercept for both colors? --- ## Interacting explanatory variables - Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines. - This implies that the regression coefficient for an explanatory variable would change as another explanatory variable changes. - This can be accomplished by adding an **interaction variable**: the product of two explanatory variables. --- ## Price vs. surface and artist living interacting .tiny[ ```r ggplot(data = pp_surf_5000, mapping = aes(y = log(price), x = Surface, color = factor(artistliving))) + geom_point(alpha = 0.3) + geom_smooth(method = "lm", se = FALSE) + labs(x = "Surface", y = "Log(price)", color = "Living artist") + theme_bw(base_size = 16) ``` <img src="lec-08a-mlr_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> ] --- ## Modeling with interaction effects ```r *m_int <- lm(log(price) ~ Surface * factor(artistliving), data = pp_surf_5000) tidy(m_int) %>% select(estimate) ``` ``` #> # A tibble: 4 x 1 #> estimate #> <dbl> #> 1 4.91 #> 2 0.000206 #> 3 -0.126 #> 4 0.000479 ``` `$$\widehat{log(price)} = 4.91 + 0.00021~surface - 0.126~artistliving + 0.00048~surface \times artistliving$$` --- ## Interpretation of interaction effects **Non-living artist:** `$$\begin{align}\widehat{\log(price)} &= 4.91 + 0.00021~surface - 0.126 \times 0 + 0.00048~surface \times 0\\\\ &= 4.91 + 0.00021~surface\end{align}$$` -- **Living artist:** `$$\begin{align}\widehat{\log(price)} &= 4.91 + 0.00021~surface - 0.126 \times 1 + 0.00048~surface \times 1\\\\ &= 4.784 + 0.00069~surface\end{align}$$` --- ## Interpretation of interaction effects **Non-living artist:** `$$\widehat{log(price)} = 4.91 + 0.00021~surface$$` **Living artist:** `$$\widehat{log(price)} = 4.784 + 0.00069~surface$$` - Rate of change in price as the surface area of the painting increases does vary between paintings by living and non-living artists (<font class="vocab">different slopes</font>). - Some paintings by living artists are more expensive than paintings by non-living artists, and some are not (<font class="vocab">different intercept</font>). --- ## Interpretation of interaction effects - Non-living artist: `$$\widehat{log(price)} = 4.91 + 0.00021~surface$$` - Living artist: `$$\widehat{log(price)} = 4.784 + 0.00069~surface$$` <img src="lec-08a-mlr_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> --- ## Continuous by continuous interactions - Interpretation becomes trickier. - Slopes conditional on values of explanatory variables. - Come see me if you encounter these. --- ## Third order interactions - Can you? Yes - Should you? Probably not if you want to interpret these interactions in context of the data. --- class: center, middle, inverse # Your turn --- ## Fitting and interpreting linear models Clone the repo at https://classroom.github.com/a/0sPhWFKl and create an RStudio Cloud project. Don't forget to commit and push your changes. Our goal is to predict the log price of paintings in this auction. **Task 1**: Fit a multiple linear regression model with the following main effects: `lrgfont`, `Height_in`, `Width_in`, and `relig`. Interpret each of the estimated model coefficients. **Task 2**: Fit a multiple linear regression model with main effects for `Height_in` and `lrgfont`, as well as their interaction. What is the model equation for paintings where the dealer devotes an additional paragraph? How about for paintings where the dealer does NOT devote an additional paragraph? --- class: center, middle, inverse # Code: cleaning up categorical data --- ## Shape and material Collapse levels of `Shape` and `mat`erial variables with **`forcats::fct_collapse`**: .tiny[ ```r paris_paintings <- paris_paintings %>% mutate( Shape = fct_collapse(Shape, oval = c("oval", "ovale"), round = c("round", "ronde"), squ_rect = "squ_rect", other = c("octogon", "octagon", "miniature")), mat = fct_collapse(mat, metal = c("a", "br", "c"), canvas = c("co", "t", "ta"), paper = c("p", "ca"), wood = "b", other = c("e", "g", "h", "mi", "o", "pa", "v", "al", "ar", "m")) ) ``` ] --- ## Review fixes .pull-left[ ```r paris_paintings %>% count(Shape) ``` ``` #> # A tibble: 5 x 2 #> Shape n #> <fct> <int> #> 1 other 12 #> 2 oval 52 #> 3 round 74 #> 4 squ_rect 3219 #> 5 <NA> 36 ``` ] .pull-right[ ```r paris_paintings %>% count(mat) ``` ``` #> # A tibble: 6 x 2 #> mat n #> <fct> <int> #> 1 metal 321 #> 2 other 59 #> 3 wood 886 #> 4 paper 38 #> 5 canvas 1783 #> 6 <NA> 306 ``` ]