class: center, middle, inverse, title-slide # The Language of Models ## Intro to Data Science ### Shawn Santo ### 02-18-20 --- ## Announcements - Roadmap for the next few weeks - Midterm grades --- class: middle, center, inverse # The language of models --- ## Modeling - Use models to explain the relationship between variables and to make predictions - For now we focus on **linear** models (but remember there are other types of models too!) --- class: center, middle, inverse # Packages --- ## Packages .pull-left[ ![](img/08/tidyverse.png) ![](img/08/broom.png) ] .pull-right[ - You're familiar with the tidyverse: ```r library(tidyverse) ``` <br/><br/><br/> - The broom package takes the messy output of built-in functions in R, such as `lm`, and turns them into tidy data frames. ```r library(broom) ``` ] --- class: center, middle, inverse # Data: Paris paintings --- ## Paris paintings ```r paris_paint <- read_csv("data/paris_paintings.csv", na = c("n/a", "", "NA")) ``` - [Paris Paintings Codebook](http://www2.stat.duke.edu/courses/Spring20/sta199.001/data/code_books/paris_codebook.html) .center[ ![](img/08/sandra-van-ginhoven.png) ![](img/08/hilary-coe-cronheim.png) Sandra van Ginhoven Hilary Coe Cronheim PhD students in the Duke Art, Law, and Markets Initiative in 2013 ] - Source: Printed catalogues of 28 auction sales in Paris, 1764- 1780 - 3,393 paintings, their prices, and descriptive details from sales catalogues over 60 variables --- ## Auctions today <center> <iframe width="750" height="400" src="https://www.youtube.com/embed/apaE1Q7r4so" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> </center> <br/> *Source:* https://www.youtube.com/watch?v=apaE1Q7r4so --- ## Auctions back in the day ![](img/08/old-auction.png) <small> de Machy, Public Sale at the Hôtel Bullion, Musée Carnavalet, Paris (18th century) </small> --- ## Paris auction market ![](img/08/auction-trend-paris.png) --- ## Depart pour la chasse <center> <img src="img/08/depart-pour-la-chasse.png" width="600" height="500"> </center> --- ## Auction catalogue text .pull-left[ <img src="img/08/auction-catalogue.png" height="500"> ] .pull-right[ .small[ Two paintings very rich in composition, of a beautiful execution, and whose merit is very remarkable, each 17 inches 3 lines high, 23 inches wide; the first, painted on wood, comes from the Cabinet of Madame la Comtesse de Verrue; it represents a departure for the hunt: it shows in the front a child on a white horse, a man who gives the horn to gather the dogs, a falconer and other figures nicely distributed across the width of the painting; two horses drinking from a fountain; on the right in the corner a lovely country house topped by a terrace, on which people are at the table, others who play instruments; trees and fabriques pleasantly enrich the background. ] ] --- ```r paris_paint %>% filter(name == "R1777-89a") %>% select(name:endbuyer) %>% glimpse() ``` ``` #> Observations: 1 #> Variables: 21 #> $ name <chr> "R1777-89a" #> $ sale <chr> "R1777" #> $ lot <chr> "89" #> $ position <dbl> 0.3755274 #> $ dealer <chr> "R" #> $ year <dbl> 1777 #> $ origin_author <chr> "D/FL" #> $ origin_cat <chr> "D/FL" #> $ school_pntg <chr> "D/FL" #> $ diff_origin <dbl> 0 #> $ logprice <dbl> 8.575462 #> $ price <dbl> 5300 #> $ count <dbl> 1 #> $ subject <chr> "D\u008epart pour la chasse" #> $ authorstandard <chr> "Wouwerman, Philips" #> $ artistliving <dbl> 0 #> $ authorstyle <chr> NA #> $ author <chr> "Philippe Wouwermans" #> $ winningbidder <chr> "Langlier, Jacques for Poullain, Antoine" #> $ winningbiddertype <chr> "DC" #> $ endbuyer <chr> "C" ``` --- ```r paris_paint %>% filter(name == "R1777-89a") %>% select(Interm:finished) %>% glimpse() ``` ``` #> Observations: 1 #> Variables: 21 #> $ Interm <dbl> 1 #> $ type_intermed <chr> "D" #> $ Height_in <dbl> 17.25 #> $ Width_in <dbl> 23 #> $ Surface_Rect <dbl> 396.75 #> $ Diam_in <dbl> NA #> $ Surface_Rnd <dbl> NA #> $ Shape <chr> "squ_rect" #> $ Surface <dbl> 396.75 #> $ material <chr> "bois" #> $ mat <chr> "b" #> $ materialCat <chr> "wood" #> $ quantity <dbl> 1 #> $ nfigures <dbl> 0 #> $ engraved <dbl> 0 #> $ original <dbl> 0 #> $ prevcoll <dbl> 1 #> $ othartist <dbl> 0 #> $ paired <dbl> 1 #> $ figures <dbl> 0 #> $ finished <dbl> 0 ``` --- ```r paris_paint %>% filter(name == "R1777-89a") %>% select(lrgfont:other) %>% glimpse() ``` ``` #> Observations: 1 #> Variables: 19 #> $ lrgfont <dbl> 0 #> $ relig <dbl> 0 #> $ landsALL <dbl> 1 #> $ lands_sc <dbl> 0 #> $ lands_elem <dbl> 1 #> $ lands_figs <dbl> 1 #> $ lands_ment <dbl> 0 #> $ arch <dbl> 1 #> $ mytho <dbl> 0 #> $ peasant <dbl> 0 #> $ othgenre <dbl> 0 #> $ singlefig <dbl> 0 #> $ portrait <dbl> 0 #> $ still_life <dbl> 0 #> $ discauth <dbl> 0 #> $ history <dbl> 0 #> $ allegory <dbl> 0 #> $ pastorale <dbl> 0 #> $ other <dbl> 0 ``` --- class: center, middle, inverse ## Modeling the relationship between variables --- ## Prices **Describe the distribution of prices of paintings.** .tiny[ ```r ggplot(data = paris_paint, aes(x = price)) + geom_histogram(binwidth = 1000) + labs(title="Distribution of Price (in Livres)") + theme_minimal() ``` <img src="lec-07a-language-of-models_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> ] --- ## Height .tiny[ ```r ggplot(data = paris_paint, aes(x = Height_in)) + geom_histogram() + theme_minimal() ``` <img src="lec-07a-language-of-models_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> ] --- ## Width .tiny[ ```r ggplot(data = paris_paint, aes(x = Width_in)) + geom_histogram() + theme_minimal() ``` <img src="lec-07a-language-of-models_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> ] --- ## Models as functions - We can represent relationships between variables using **functions**. - A **function** in the *mathematical* sense is the relationship between one or more inputs and an output created from that (those) input(s). - Plug in the inputs and receive back the output. -- - Example: the formula `\(y = 3x + 7\)` is a function with input `\(x\)` and output `\(y\)`, when `\(x\)` is `\(5\)`, the output `\(y\)` is `\(22\)` ``` y = 3 * 5 + 7 = 22 ``` --- ## Consider height as a function of width What is the input? What is the output? **Describe the relationship between height and width of paintings.** <img src="lec-07a-language-of-models_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- ## Visualizing the linear model .tiny[ ```r ggplot(data = paris_paint, aes(x = Width_in, y = Height_in)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + # lm for linear model theme_minimal() ``` <img src="lec-07a-language-of-models_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> ] --- ## Visualizing the linear model .tiny[ ```r ggplot(data = paris_paint, aes(x = Width_in, y = Height_in)) + geom_point() + geom_smooth(method = "lm", se = FALSE, # color #line type #line weight col = "pink", lty = 2, lwd = 3) + theme_minimal() ``` <img src="lec-07a-language-of-models_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> ] --- ## Vocabulary - **Response** variable: variable whose behavior or variation you are trying to understand, on the y-axis (dependent variable) -- - **Explanatory** variables: other variables that you want to use to explain the variation in the response, on the x-axis (independent variables); these are also referred to as predictors or features -- - **Predicted** value: output of the **model function** - The model function gives the typical value of the response variable *conditioning* on the explanatory variables (what does this mean?) -- - **Residuals:** show how far each case is from its model value - **Residual = Observed value - Predicted value** - Tells how far above/below the model function each case is --- ## Residuals What does a negative residual mean? Which paintings on the plot have have negative residuals, those below or above the line? <img src="lec-07a-language-of-models_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> --- The plot below displays the relationship between height and width of paintings. It uses a lower alpha level for the points than the previous plots we looked at. What feature is apparent in this plot that was not (as) apparent in the previous plots? What might be the reason for this feature? <img src="lec-07a-language-of-models_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> --- ## Landscape paintings - **Landscape painting** is the depiction in art of landscapes – natural scenery such as mountains, valleys, trees, rivers, and forests, especially where the main subject is a wide view – with its elements arranged into a coherent composition.<sup>1</sup> - Landscape paintings tend to be wider than longer. - **Portrait painting** is a genre in painting, where the intent is to depict a human subject.<sup>2</sup> - Portrait paintings tend to be longer than wider. .footnote[ [1] Source: Wikipedia, [Landscape painting](https://en.wikipedia.org/wiki/Landscape_painting) [2] Source: Wikipedia, [Portait painting](https://en.wikipedia.org/wiki/Portrait_painting) ] --- ## Multiple explanatory variables How, if at all, do the relationship between width and height of paintings vary by whether or not they have any landscape elements? .tiny[ ```r ggplot(data = paris_paint, aes(x = Width_in, y = Height_in, color = factor(landsALL))) + geom_point(alpha = 0.2) + geom_smooth(method = "lm", se = FALSE) + labs(color = "landscape") + theme_minimal() ``` <img src="lec-07a-language-of-models_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> ] .small-text[ We will talk about multiple explanatory variables in the following weeks. ] --- ## Models - upsides and downsides - Models can sometimes reveal patterns that are not evident in a graph of the data. This is a great advantage of modelling over simple visual inspection of data. - There is a real risk, however, that a model is imposing structure that is not really there on the scatter of data, just as people imagine animal shapes in the stars. A skeptical approach is always warranted. --- ## Variation around the model... is just as important as the model, if not more! .alert[ *Statistics is the explanation of variation in the context of what remains unexplained.* ] - The scatter suggests that there might be other factors that account for large parts of painting-to-painting variability, or perhaps just that randomness plays a big role. - Adding more explanatory variables to a model can sometimes usefully reduce the size of the scatter around the model. (We'll talk more about this later.) --- ## How do we use models? 1. **Explanation**: Characterize the relationship between `\(y\)` and `\(x\)` via *slopes* for numerical explanatory variables or *differences* for categorical explanatory variables. 2. **Prediction**: Plug in `\(x\)`, get the predicted `\(y\)` --- class: middle, center, inverse # Interpreting Models --- ## Follow along Link at https://classroom.github.com/a/kG7JU7jq --- ## Height & width ```r m_ht_wt <- lm(Height_in ~ Width_in, data = paris_paint) tidy(m_ht_wt) ``` ``` #> # A tibble: 2 x 5 #> term estimate std.error statistic p.value #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 (Intercept) 3.62 0.254 14.3 8.82e-45 #> 2 Width_in 0.781 0.00950 82.1 0. ``` -- `$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$` -- - **Slope:** For each additional inch the painting is wider, the height is expected to be higher, on average, by 0.78 inches. -- - **Intercept:** Paintings that are 0 inches wide are expected to be 3.62 inches high, on average. - Does this make sense? --- ## Package `broom` .pull-left[ .middle[ ![](img/08/broom.png) ] ] .pull-right[ - **`tidy`**: Constructs a tidy data frame summarizing model's statistical findings - **`glance`**: Constructs a concise one-row summary of the model - **`augment`**: Adds columns (e.g. predictions, residuals) to the original data that was modeled ] [https://broom.tidyverse.org/](https://broom.tidyverse.org/) --- ## The linear model with a single predictor - We're interested in the `\(\beta_0\)` (population parameter for the intercept) and the `\(\beta_1\)` (population parameter for the slope) in the following model: $$ \hat{y} = \beta_0 + \beta_1~x $$ -- - Unfortunately, we can't get these values -- - So we use sample statistics to estimate them: $$ \hat{y} = b_0 + b_1~x $$ --- ## Least squares regression The regression line minimizes the sum of squared residuals. -- - <font class="vocab">Residuals: </font> `\(e_i = y_i - \hat{y}_i\)`, - The regression line minimizes `\(\sum_{i = 1}^n e_i^2\)`. - Equivalently, minimizing `\(\sum_{i = 1}^n [y_i - (b_0 + b_1~x_i)]^2\)` .question[ Why do we minimize the *squares* of the residuals? ] --- ## Visualizing residuals <img src="lec-07a-language-of-models_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> --- ## Visualizing residuals (cont.) <img src="lec-07a-language-of-models_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> --- ## Visualizing residuals (cont.) <img src="lec-07a-language-of-models_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> --- ## Properties of the least squares regression line - The estimate for the slope, `\(b_1\)`, has the same sign as the correlation between the two variables. - The regression line goes through the center of mass point, the coordinates corresponding to average `\(x\)` and average `\(y\)`: `\((\bar{x}, \bar{y})\)` - The sum of the residuals is zero: `$$\sum_{i = 1}^n e_i = 0$$` - The residuals and `\(x\)` values are uncorrelated. --- ## What about non-continuous predictors? Height & landscape features ```r m_ht_lands <- lm(Height_in ~ factor(landsALL), data = paris_paint) tidy(m_ht_lands) ``` ``` #> # A tibble: 2 x 5 #> term estimate std.error statistic p.value #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 (Intercept) 22.7 0.328 69.1 0. #> 2 factor(landsALL)1 -5.65 0.532 -10.6 7.97e-26 ``` -- <br> `$$\widehat{Height_{in}} = 22.68 - 5.65~landsALL$$` --- ## (cont.) - **Slope:** Paintings with landscape features are expected, on average, to be 5.65 inches shorter than paintings that without landscape features. - Compares baseline level (`landsALL = 0`) to other level (`landsALL = 1`). - **Intercept:** Paintings that don't have landscape features are expected, on average, to be 22.68 inches tall. --- ## Categorical predictor with 2 levels ``` #> # A tibble: 8 x 3 #> name price landsALL #> <chr> <dbl> <dbl> #> 1 L1764-2 360 0 #> 2 L1764-3 6 0 #> 3 L1764-4 12 1 #> 4 L1764-5a 6 1 #> 5 L1764-5b 6 1 #> 6 L1764-6 9 0 #> 7 L1764-7a 12 0 #> 8 L1764-7b 12 0 ``` --- ## What about categorical variables with more than two levels? Relationship between height and school ```r paris_paint %>% lm(Height_in ~ school_pntg, data = .) %>% tidy() ``` ``` #> # A tibble: 7 x 5 #> term estimate std.error statistic p.value #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 (Intercept) 14. 10.0 1.40 0.162 #> 2 school_pntgD/FL 2.33 10.0 0.232 0.816 #> 3 school_pntgF 10.2 10.0 1.02 0.309 #> 4 school_pntgG 1.65 11.9 0.139 0.889 #> 5 school_pntgI 10.3 10.0 1.02 0.306 #> 6 school_pntgS 30.4 11.4 2.68 0.00744 #> 7 school_pntgX 2.87 10.3 0.279 0.780 ``` What do these rows correspond to? Why are there only six schools listed, but seven schools total (what happened to the Austrian school?) --- ## What about categorical variables with more than two levels? Relationship between height and school ```r paris_paint %>% lm(Height_in ~ school_pntg, data = .) %>% tidy() ``` ``` #> # A tibble: 7 x 5 #> term estimate std.error statistic p.value #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 (Intercept) 14. 10.0 1.40 0.162 #> 2 school_pntgD/FL 2.33 10.0 0.232 0.816 #> 3 school_pntgF 10.2 10.0 1.02 0.309 #> 4 school_pntgG 1.65 11.9 0.139 0.889 #> 5 school_pntgI 10.3 10.0 1.02 0.306 #> 6 school_pntgS 30.4 11.4 2.68 0.00744 #> 7 school_pntgX 2.87 10.3 0.279 0.780 ``` - When the categorical explanatory variable has many levels, the levels are encoded to **dummy variables**. - Each coefficient describes the expected difference between heights in that particular school compared to the baseline level. --- ## How dummy variables are made .small[ ``` #> # A tibble: 7 x 7 #> # Groups: school_pntg [7] #> school_pntg D_FL F G I S X #> <chr> <int> <int> <int> <int> <int> <int> #> 1 A 0 0 0 0 0 0 #> 2 D/FL 1 0 0 0 0 0 #> 3 F 0 1 0 0 0 0 #> 4 G 0 0 1 0 0 0 #> 5 I 0 0 0 1 0 0 #> 6 S 0 0 0 0 1 0 #> 7 X 0 0 0 0 0 1 ``` ] --- ## The linear model with multiple predictors - Population model: $$ \hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k $$ -- - Sample model that we use to estimate the population model: $$ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k $$ --- ## Correlation does not imply causation! Remember this when interpreting model coefficients --- ## References 1. Convert Statistical Objects into Tidy Tibbles. (2020). Broom.tidyverse.org. Retrieved 18 February 2020, from https://broom.tidyverse.org/