class: center, middle, inverse, title-slide # The language of models ### Dr. Çetinkaya-Rundel ### 2018-02-19 --- ## Announcements - Reading assigned for Wed and next Mon - Lab 05 due on Thursday at noon - OH this week (last change for a while!): + Tue 11-12:30 (as usual) + Wed 9-10:30 (instead of Thur) - Any questions from last time? --- class: center, middle # From last time: Factors --- ## Factors Factor objects are how R stores data for categorical variables (fixed numbers of discrete values). ```r (x = factor(c("BS", "MS", "PhD", "MS"))) ``` ``` ## [1] BS MS PhD MS ## Levels: BS MS PhD ``` ```r glimpse(x) ``` ``` ## Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2 ``` ```r typeof(x) ``` ``` ## [1] "integer" ``` --- ## Read data in as character strings ```r cat_lovers <- read_csv("data/cat-lovers.csv") glimpse(cat_lovers) ``` ``` ## Observations: 60 ## Variables: 3 ## $ name <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass... ## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", "0", "0... ## $ handedness <chr> "left", "left", "left", "left", "left", "left",... ``` --- ## But coerce when plotting ```r ggplot(cat_lovers, mapping = aes(x = handedness)) + geom_bar() ``` <!-- --> --- ## Use forcats to manipulate factors .midi[ ```r cat_lovers <- cat_lovers %>% mutate(handedness = fct_relevel(handedness, "right", "left", "ambidextrous")) ggplot(cat_lovers, mapping = aes(x = handedness)) + geom_bar() ``` <!-- --> ] --- ## Come for the functionality .pull-left[ ... stay for the logo ] .pull-right[  ] .midi[ - R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. - Historically, factors were much easier to work with than character vectors, so many base R functions automatically convert character vectors to factors. But they can be a pain to work with. - However, factors are useful when you have true categorical data, and when you want to override the ordering of character vectors to improve display. The forcats package provides a suite of useful tools that solve common problems with factors.<sup>1</sup> ] .footnote[ [1] Source: [forcats.tidyverse.org](http://forcats.tidyverse.org/) ] --- ## Recap - Always best to think of data as part of a tibble + This plays nicely with the `tidyverse` as well + Rows are observations, columns are variables - Be careful about data types / classes + Sometimes `R` makes silly assumptions about your data class + Using `tibble`s help, but it might not solve all issues + Think about your data in context, e.g. 0/1 variable is most likely a `factor` + If a plot/output is not behaving the way you expect, first investigate the data class + If you are absolutely sure of a data class, over-write it in your tibble so that you don't need to keep having to keep track of it + `mutate` the variable with the correct class --- class: center, middle # Today: The language of models --- ## Modelling - Use models to explain the relationship between variables and to make predictions - For now we focus on **linear** models (but remember there are other types of models too!) --- class: center, middle # Packages --- ## Packages .pull-left[   ] .pull-right[ - You're familiar with the tidyverse: ```r library(tidyverse) ``` - The broom package takes the messy output of built-in functions in R, such as `lm`, and turns them into tidy data frames. ```r library(broom) ``` ] --- class: center, middle # Data: Paris Paintings --- ## Paris Paintings ```r pp <- read_csv("data/paris_paintings.csv", na = c("n/a", "", "NA")) ``` .question[ What does the `data/` mean in the code above? Hint: Where is the data file located? ] --- ## Meet the data curators .center[   Sandra van Ginhoven Hilary Coe Cronheim PhD students in the Duke Art, Law, and Markets Initiative in 2013 ] - Source: Printed catalogues of 28 auction sales in Paris, 1764- 1780 - 3,393 paintings, their prices, and descriptive details from sales catalogues over 60 variables --- ## Auctions today  http://www.sothebys.com/en/news-video/videos/2014/07/Old-master-british-paintings-evening-sale-soars-over-estimate.html --- ## Auctions back in the day  Pierre-Antoine de Machy, Public Sale at the Hôtel Bullion, Musée Carnavalet, Paris (18th century) --- ## Paris auction market  --- ## Depart pour la chasse  --- ## Auction catalogue text .pull-left[  ] .pull-right[ .small[ Two paintings very rich in composition, of a beautiful execution, and whose merit is very remarkable, each 17 inches 3 lines high, 23 inches wide; the first, painted on wood, comes from the Cabinet of Madame la Comtesse de Verrue; it represents a departure for the hunt: it shows in the front a child on a white horse, a man who gives the horn to gather the dogs, a falconer and other figures nicely distributed across the width of the painting; two horses drinking from a fountain; on the right in the corner a lovely country house topped by a terrace, on which people are at the table, others who play instruments; trees and fabriques pleasantly enrich the background. ] ] --- ```r pp %>% filter(name == "R1777-89a") %>% select(name:endbuyer) %>% t() ``` ``` ## [,1] ## name "R1777-89a" ## sale "R1777" ## lot "89" ## position "0.3755274" ## dealer "R" ## year "1777" ## origin_author "D/FL" ## origin_cat "D/FL" ## school_pntg "D/FL" ## diff_origin "0" ## logprice "8.575462" ## price "5300" ## count "1" ## subject "D\x8epart pour la chasse" ## authorstandard "Wouwerman, Philips" ## artistliving "0" ## authorstyle NA ## author "Philippe Wouwermans" ## winningbidder "Langlier, Jacques for Poullain, Antoine" ## winningbiddertype "DC" ## endbuyer "C" ``` --- ```r pp %>% filter(name == "R1777-89a") %>% select(Interm:finished) %>% t() ``` ``` ## [,1] ## Interm "1" ## type_intermed "D" ## Height_in "17.25" ## Width_in "23" ## Surface_Rect "396.75" ## Diam_in NA ## Surface_Rnd NA ## Shape "squ_rect" ## Surface "396.75" ## material "bois" ## mat "b" ## materialCat "wood" ## quantity "1" ## nfigures "0" ## engraved "0" ## original "0" ## prevcoll "1" ## othartist "0" ## paired "1" ## figures "0" ## finished "0" ``` --- ```r pp %>% filter(name == "R1777-89a") %>% select(lrgfont:other) %>% t() ``` ``` ## [,1] ## lrgfont 0 ## relig 0 ## landsALL 1 ## lands_sc 0 ## lands_elem 1 ## lands_figs 1 ## lands_ment 0 ## arch 1 ## mytho 0 ## peasant 0 ## othgenre 0 ## singlefig 0 ## portrait 0 ## still_life 0 ## discauth 0 ## history 0 ## allegory 0 ## pastorale 0 ## other 0 ``` --- class: center, middle # Modelling the relationship between variables --- ## Prices .question[ Describe the distribution of prices of paintings. ] ```r ggplot(data = pp, aes(x = price)) + geom_histogram(binwidth = 1000) ``` <!-- --> --- ## Models as functions - We can represent relationships between variables using **functions** - A function is a mathematical concept: the relationship between an output and one or more inputs. - Plug in the inputs and receive back the output - Example: the formula `\(y = 3x + 7\)` is a function with input `\(x\)` and output `\(y\)`, when `\(x\)` is `\(5\)`, the output `\(y\)` is `\(22\)` ``` y = 3 * 5 + 7 = 22 ``` --- ## Height as a function of width .question[ Describe the relationship between height and width of paintings. ] <!-- --> --- ## Visualizing the linear model ```r ggplot(data = pp, aes(x = Width_in, y = Height_in)) + geom_point() + geom_smooth(method = "lm") # lm for linear model ``` <!-- --> --- ## Visualizing the linear model ... without the measure of uncertainty around the line ```r ggplot(data = pp, aes(x = Width_in, y = Height_in)) + geom_point() + geom_smooth(method = "lm", se = FALSE) # lm for linear model ``` <!-- --> --- ## Visualizing the linear model ... with different cosmetic choices for the line ```r ggplot(data = pp, aes(x = Width_in, y = Height_in)) + geom_point() + geom_smooth(method = "lm", se = FALSE, col = "pink", # color lty = 2, # line type lwd = 3) # line weight ``` <!-- --> --- ## Vocabulary - **Response variable:** Variable whose behavior or variation you are trying to understand, on the y-axis (dependent variable) -- - **Explanatory variables:** Other variables that you want to use to explain the variation in the response, on the x-axis (independent variables) -- - **Predicted value:** Output of the function **model function** - The model function gives the typical value of the response variable *conditioning* on the explanatory variables -- - **Residuals:** Show how far each case is from its model value - Residual = Observed value - Predicted value - Tells how far above/below the model function each case is --- ## Residuals .question[ What does a negative residual mean? Which paintings on the plot have have negative residuals, those below or above the line? ] <!-- --> --- .question[ The plot below displays the relationship between height and width of paintings. It uses a lower alpha level for the points than the previous plots we looked at. What feature is apparent in this plot that was not (as) apparent in the previous plots? What might be the reason for this feature? ] <!-- --> --- ## Landscape paintings - Landscape painting is the depiction in art of landscapes – natural scenery such as mountains, valleys, trees, rivers, and forests, especially where the main subject is a wide view – with its elements arranged into a coherent composition.<sup>1</sup> - Landscape paintings tend to be wider than longer. - Portrait painting is a genre in painting, where the intent is to depict a human subject.<sup>2</sup> - Portrait paintings tend to be longer than wider. .footnote[ [1] Source: Wikipedia, [Landscape painting](https://en.wikipedia.org/wiki/Landscape_painting) [2] Source: Wikipedia, [Portait painting](https://en.wikipedia.org/wiki/Portrait_painting) ] --- ## Multiple explanatory variables .question[ How, if at all, the relatonship between width and height of paintings vary by whether or not they have any landscape elements? ] .small[ ```r ggplot(data = pp, aes(x = Width_in, y = Height_in, color = factor(landsALL))) + geom_point(alpha = 0.4) + geom_smooth(method = "lm", se = FALSE) + labs(color = "landscape") ``` <!-- --> ] --- ## Models - upsides and downsides - Models can sometimes reveal patterns that are not evident in a graph of the data. This is a great advantage of modelling over simple visual inspection of data. - There is a real risk, however, that a model is imposing structure that is not really there on the scatter of data, just as people imagine animal shapes in the stars. A skeptical approach is always warranted. --- ## Variation around the model... is just as important as the model, if not more! *Statistics is the explanation of variation in the context of what remains unexplained.* - The scatter suggests that there might be other factors that account for large parts of painting-to-painting variability, or perhaps just that randomness plays a big role. - Adding more explanatory variables to a model can sometimes usefully reduce the size of the scatter around the model. (We'll talk more about this later.) --- ## How do we use models? 1. Explanation: Characterize the relationship between `\(y\)` and `\(x\)` via *slopes* for numerical explanatory variables or *differences* for categorical explanatory variables 2. Prediction: Plug in `\(x\)`, get the predicted `\(y\)`