class: center, middle, inverse, title-slide

# Formalizing linear models

### Dr. Çetinkaya-Rundel

### 2018-02-21

---

## Announcements

- Reading assigned for Mon
- Future HW assignment due dates posted
- Any questions from last time?

---

class: center, middle

# Characterizing relationships with models

---

## Data & packages

```r
library(tidyverse)
library(broom)
```

```r
pp <- read_csv("data/paris_paintings.csv", na = c("n/a", "", "NA"))
```

---

## Want to follow along?

Go to RStudio Cloud -> make a copy of "Modelling Paris Paintings"

---

## Height & width

```r
(m_ht_wt <- lm(Height_in ~ Width_in, data = pp))
```

```
## 
## Call:
## lm(formula = Height_in ~ Width_in, data = pp)
## 
## Coefficients:
## (Intercept)     Width_in  
##      3.6214       0.7808
```

--

<br>

`$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$`

--

- **Slope:** For each additional inch the painting is wider, the height is expected to be higher, on average, by 0.78 inches.

--

- **Intercept:** Paintings that are 0 inches wide are expected to be 3.62 inches high, on average.
    - Does this make sense?

---

## The linear model with a single predictor

- We're interested in `\(\beta_0\)` (population parameter for the intercept) and `\(\beta_1\)` (population parameter for the slope) in the following model:

$$ \hat{y} = \beta_0 + \beta_1~x $$

--

- Tough luck, you can't have them...

--

- So we use sample statistics to estimate them:

$$ \hat{y} = b_0 + b_1~x $$

---

## Least squares regression

The regression line minimizes the sum of squared residuals.

--

If `\(e_i = y_i - \hat{y}_i\)`, then the regression line minimizes `\(\sum_{i = 1}^n e_i^2\)`.

---

## Visualizing residuals

<!-- -->

---

## Visualizing residuals (cont.)

<!-- -->

---

## Visualizing residuals (cont.)

<!-- -->

---

## Properties of the least squares regression line

- The regression line goes through the center of mass point, the point whose coordinates are the average `\(x\)` and average `\(y\)`, `\((\bar{x}, \bar{y})\)`:

`$$\bar{y} = b_0 + b_1 \bar{x} ~ \rightarrow ~ b_0 = \bar{y} - b_1 \bar{x}$$`

- The slope has the same sign as the correlation coefficient:

`$$b_1 = r \frac{s_y}{s_x}$$`

- The sum of the residuals is zero:

`$$\sum_{i = 1}^n e_i = 0$$`

- The residuals and `\(x\)` values are uncorrelated.

---

## Height & landscape features

```r
(m_ht_lands <- lm(Height_in ~ factor(landsALL), data = pp))
```

```
## 
## Call:
## lm(formula = Height_in ~ factor(landsALL), data = pp)
## 
## Coefficients:
##       (Intercept)  factor(landsALL)1  
##            22.680             -5.645
```

--

<br>

`$$\widehat{Height_{in}} = 22.68 - 5.65~landsALL$$`

---

## Height & landscape features (cont.)

- **Slope:** Paintings with landscape features are expected, on average, to be 5.65 inches shorter than paintings without landscape features.
    - Compares the baseline level (`landsALL = 0`) to the other level (`landsALL = 1`).
- **Intercept:** Paintings that don't have landscape features are expected, on average, to be 22.68 inches tall.

---

## Categorical predictor with 2 levels
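To see the encoding directly, we can peek at the design matrix R builds for the model (a quick sketch using base R's `model.matrix()`; this code is not from the original deck):

```r
# Design matrix behind m_ht_lands: the factor(landsALL)1 column is the
# 0/1 dummy variable -- it equals 1 when a painting has landscape features
head(model.matrix(m_ht_lands))
```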
---

## Relationship between height and school

```r
(m_ht_sch <- lm(Height_in ~ school_pntg, data = pp))
```

```
## 
## Call:
## lm(formula = Height_in ~ school_pntg, data = pp)
## 
## Coefficients:
##     (Intercept)  school_pntgD/FL     school_pntgF     school_pntgG  
##          14.000            2.329           10.197            1.650  
##    school_pntgI     school_pntgS     school_pntgX  
##          10.287           30.429            2.869
```

--

- When the categorical explanatory variable has many levels, the levels are encoded as **dummy variables**.
- Each coefficient describes the expected difference in height between that particular school and the baseline level (here, school `A`).

---

## Categorical predictor with >2 levels

.small[

| school_pntg | D/FL | F | G | I | S | X |
|:------------|:----:|:-:|:-:|:-:|:-:|:-:|
| A           | 0    | 0 | 0 | 0 | 0 | 0 |
| D/FL        | 1    | 0 | 0 | 0 | 0 | 0 |
| F           | 0    | 1 | 0 | 0 | 0 | 0 |
| G           | 0    | 0 | 1 | 0 | 0 | 0 |
| I           | 0    | 0 | 0 | 1 | 0 | 0 |
| S           | 0    | 0 | 0 | 0 | 1 | 0 |
| X           | 0    | 0 | 0 | 0 | 0 | 1 |

]
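As a sketch (not output from the original deck), broom's `tidy()` lays the model out with one row per dummy variable, mirroring the encoding above:

```r
# One row per term: the intercept is the baseline school (A);
# each school_pntg coefficient is that school's expected difference
# in height (inches) from the baseline
tidy(m_ht_sch)
```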
---

## The linear model with multiple predictors

- Population model:

$$ \hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k $$

--

- Sample model that we use to estimate the population model:

$$ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k $$

---

## Correlation does not imply causation!

Remember this when interpreting model coefficients.

---

class: center, middle

# Prediction with models

---

## Predict height from width

.question[
On average, how tall are paintings that are 60 inches wide?

`$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$`
]

--

```r
3.62 + 0.78 * 60
```

```
## [1] 50.42
```

"On average, we expect paintings that are 60 inches wide to be 50.42 inches high."

**Warning:** We "expect" this to happen, but there will be some variability. (We'll learn about measuring the variability around the prediction later.)

---

## Prediction vs. extrapolation

.question[
On average, how tall are paintings that are 400 inches wide?

`$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$`
]

<!-- -->

---

## Watch out for extrapolation!

> "When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February 6th it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on."<sup>1</sup> <br> Stephen Colbert, April 6th, 2010

.footnote[
[1] OpenIntro Statistics, "Extrapolation is treacherous."
]

---

class: center, middle

# Measuring model fit

---

## Measuring the strength of the fit

- The strength of the fit of a linear model is most commonly evaluated using `\(R^2\)`.
- It tells us what percent of the variability in the response variable is explained by the model.
- The remainder of the variability is due to variables not included in the model.
- `\(R^2\)` is sometimes called the coefficient of determination.

---

## Obtaining `\(R^2\)` in R

- Height vs. width

.small[
```r
glance(m_ht_wt)
```

```
##   r.squared adj.r.squared   sigma statistic p.value df    logLik      AIC
## 1 0.6829468     0.6828456 8.30427  6748.621       0  2 -11083.45 22172.89
##        BIC deviance df.residual
## 1 22191.04 216054.5        3133
```

```r
glance(m_ht_wt)$r.squared # extract R-squared
```

```
## [1] 0.6829468
```
]

Roughly 68% of the variability in heights of paintings can be explained by their widths.

- Height vs. landscape features

.small[
```r
glance(m_ht_lands)$r.squared
```

```
## [1] 0.03456724
```
]
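To put the two fits side by side, here is a small sketch (assuming the models fit earlier in the deck) that maps `glance()` over both:

```r
# Compare R^2 for the two models: width explains roughly 68% of the
# variability in height, landscape features only about 3.5%
list(width = m_ht_wt, landsALL = m_ht_lands) %>%
  map_df(glance, .id = "model") %>%
  select(model, r.squared)
```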