class: center, middle, inverse, title-slide # Simple linear regression
🍡 --- layout: true <div class="my-footer"> <span> Dr. Mine Çetinkaya-Rundel - <a href="http://www2.stat.duke.edu/courses/Fall18/sta112.01/schedule" target="_blank">stat.duke.edu/courses/Fall18/sta112.01 </a> </span> </div> --- ## Announcements - Midterm evaluations due 5pm tomorrow (Wednesday) - Is there a better time for my office hours? - Monday mornings? - Thursday afternoons? - Reading + application exercise due Thursday --- class: center, middle # Prediction with models --- ## Getting started - **Data:** Paris Paintings ```r pp <- read_csv("data/paris_paintings.csv", na = c("n/a", "", "NA")) ``` - **Model:** Predicting height from width ```r m_ht_wt <- lm(Height_in ~ Width_in, data = pp) tidy(m_ht_wt) ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0. ``` --- ## Predict height from width .question[ On average, how tall are paintings that are 60 inches wide? `\(\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}\)` ] -- ```r 3.62 + 0.78 * 60 ``` ``` ## [1] 50.42 ``` -- "On average, we expect paintings that are 60 inches wide to be 50.42 inches high." -- **Warning:** We "expect" this to happen, but there will be some variability. (We'll learn about measuring the variability around the prediction later.) --- ## Prediction in R Use the `predict()` function to make a prediction: - First argument: model - Second argument: data frame representing the new observation(s) for which you want to make a prediction, with variables with the same names as those to build the model, except for the response variables --- ## Prediction in R (cont.) - Making a prediction for a single new observation: ```r newpainting <- tibble(Width_in = 60) predict(m_ht_wt, newdata = newpainting) ``` ``` ## 1 ## 50.46919 ``` - Making a prediction for multiple new observations: ```r newpaintings <- tibble(Width_in = c(50, 60, 70)) predict(m_ht_wt, newdata = newpaintings) ``` ``` ## 1 2 3 ## 42.66123 50.46919 58.27715 ``` --- ## Prediction vs. extrapolation .question[ On average, how tall are paintings that are 400 inches wide? `$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$` ] <!-- --> --- ## Watch out for extrapolation! > "When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February 6th it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on."<sup>1</sup> <br> Stephen Colbert, April 6th, 2010 .footnote[ [1] OpenIntro Statistics. "Extrapolation is treacherous." OpenIntro Statistics. ] --- class: center, middle # Measuring model fit --- ## Measuring the strength of the fit - The strength of the fit of a linear model is most commonly evaluated using `\(R^2\)`. - It tells us what percent of variability in the response variable is explained by the model. - The remainder of the variability is explained by variables not included in the model. --- ## Obtaining `\(R^2\)` in R #### Height vs. width .small[ ```r glance(m_ht_wt) ``` ``` ## # A tibble: 1 x 11 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC ## * <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> ## 1 0.683 0.683 8.30 6749. 0 2 -11083. 22173. ## # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int> ``` ```r glance(m_ht_wt)$r.squared # extract R-squared ``` ``` ## [1] 0.6829468 ``` ] Roughly 68% of the variability in heights of paintings can be explained by their widths. #### Height vs. lanscape features .small[ ```r m_ht_lands <- lm(Height_in ~ factor(landsALL), data = pp) glance(m_ht_lands)$r.squared ``` ``` ## [1] 0.03456724 ``` ] --- class: center, middle # Tidy regression output --- ## Not-so-tidy regression output - You might come accross these in your googling adventures, but we'll try to stay away from them - Not because they are wrong, but because they don't result in tidy data frames as results. --- ## Not-so-tidy regression output (1) #### Option 1: ```r m_ht_wt ``` ``` ## ## Call: ## lm(formula = Height_in ~ Width_in, data = pp) ## ## Coefficients: ## (Intercept) Width_in ## 3.6214 0.7808 ``` --- ## Not-so-tidy regression output (2) #### Option 2: ```r summary(m_ht_wt) ``` ``` ## ## Call: ## lm(formula = Height_in ~ Width_in, data = pp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -86.714 -4.384 -2.422 3.169 85.084 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.621406 0.253860 14.27 <2e-16 ## Width_in 0.780796 0.009505 82.15 <2e-16 ## ## Residual standard error: 8.304 on 3133 degrees of freedom ## (258 observations deleted due to missingness) ## Multiple R-squared: 0.6829, Adjusted R-squared: 0.6828 ## F-statistic: 6749 on 1 and 3133 DF, p-value: < 2.2e-16 ``` --- ## Review .question[ What makes a data frame tidy? ] -- - Each variable forms a column. - Each observation forms a row. - Each type of observational unit forms a table. --- ## Tidy regression output Achieved with functions from the broom package: - `tidy`: Constructs a data frame that summarizes the model's statistical findings: coefficient estimates, *standard errors, test statistics, p-values*. - `glance`: Constructs a concise one-row summary of the model. This typically contains values such as `\(R^2\)`, adjusted `\(R^2\)`, *and residual standard error that are computed once for the entire model*. - `augment`: Adds columns to the original data that was modeled. This includes predictions and residuals. --- ## Tidy your model's statistical findings ```r tidy(m_ht_wt) ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.62 0.254 14.3 8.82e-45 ## 2 Width_in 0.781 0.00950 82.1 0. ``` --- ## Glance to assess model fit ```r glance(m_ht_wt) ``` ``` ## # A tibble: 1 x 11 ## r.squared adj.r.squared sigma statistic p.value df logLik AIC ## * <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> ## 1 0.683 0.683 8.30 6749. 0 2 -11083. 22173. ## # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int> ``` -- ```r glance(m_ht_wt)$r.squared ``` ``` ## [1] 0.6829468 ``` -- ``` The $R^2$ is 68.29%. ``` The `\(R^2\)` is 68.29%. -- ```r glance(m_ht_wt)$adj.r.squared ``` ``` ## [1] 0.6828456 ``` --- ## Augment data with model results New variables of note (for now): - `.fitted`: Predicted value of the response variable - `.resid`: Residuals .small[ ```r augment(m_ht_wt) %>% slice(1:5) ``` ``` ## # A tibble: 5 x 10 ## .rownames Height_in Width_in .fitted .se.fit .resid .hat .sigma ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 37 29.5 26.7 0.166 10.3 3.99e-4 8.30 ## 2 2 18 14 14.6 0.165 3.45 3.96e-4 8.31 ## 3 3 13 16 16.1 0.158 -3.11 3.61e-4 8.31 ## 4 4 14 18 17.7 0.152 -3.68 3.37e-4 8.31 ## 5 5 14 18 17.7 0.152 -3.68 3.37e-4 8.31 ## # ... with 2 more variables: .cooksd <dbl>, .std.resid <dbl> ``` ] -- .question[ Why might we be interested in these new variables? ] --- ## Residuals plot .small[ ```r m_ht_wt_aug <- augment(m_ht_wt) ggplot(m_ht_wt_aug, mapping = aes(x = .fitted, y = .resid)) + geom_point(alpha = 0.5) + geom_hline(yintercept = 0, color = "gray", lty = 2) + labs(x = "Predicted height", y = "Residuals") ``` <!-- --> ] -- .question[ What does this plot tell us about the fit of the linear model? ] --- class: center, middle # Exploring linearity --- ## Data: Paris Paintings <!-- --> --- ## Price vs. width .question[ Describe the relationship between price and width of painting. ] ``` ## Warning: Removed 256 rows containing missing values (geom_point). ``` <!-- --> --- ## Let's focus on paintings with `Width_in < 100` ```r pp_wt_lt_100 <- pp %>% filter(Width_in < 100) ``` --- ## Price vs. width .question[ Which plot shows a more linear relationship? ] .small[ .pull-left[ <!-- --> ] .pull-right[ <!-- --> ] ] --- ## Price vs. width, residuals .question[ Which plot shows a residuals that are uncorrelated with predicted values from the model? ] .pull-left[ <!-- --> ] .pull-right[ <!-- --> ] -- <br> .question[ What is the unit of residuals? ] --- ## Transforming the data - We saw that `price` has a right-skewed distribution, and the relationship between price and width of painting is non-linear. -- - In these situations a transformation applied to the response variable may be useful. -- - In order to decide which transformation to use, we should examine the distribution of the response variable. -- - The extremely right skewed distribution suggests that a log transformation may be useful. - log = natural log, `\(ln\)` - Default base of the `log` function in R is the natural log: <br> `log(x, base = exp(1))` --- ## Logged price vs. width .question[ How do we interpret the slope of this model? ] <!-- --> --- ## Interpreting models with log transformation ```r m_lprice_wt <- lm(log(price) ~ Width_in, data = pp_wt_lt_100) tidy(m_lprice_wt) ``` ``` ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 4.67 0.0585 79.9 0. ## 2 Width_in 0.0192 0.00226 8.48 3.36e-17 ``` --- ## Linear model with log transformation $$ \widehat{log(price)} = 4.67 + 0.02 Width $$ -- - For each additional inch the painting is wider, the log price of the painting is expected to be higher, on average, by 0.02 livres. -- - which is not a very useful statement... --- ## Working with logs - Subtraction and logs: `\(log(a) − log(b) = log(a / b)\)` -- - Natural logarithm: `\(e^{log(x)} = x\)` -- - We can use these identities to "undo" the log transformation --- ## Interpreting models with log transformation The slope coefficient for the log transformed model is 0.02, meaning the log price difference between paintings whose widths are one inch apart is predicted to be 0.02 log livres. -- $$ log(\text{price for width x+1}) - log(\text{price for width x}) = 0.02 $$ -- $$ log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right) = 0.02 $$ -- $$ e^{log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right)} = e^{0.02} $$ -- $$ \frac{\text{price for width x+1}}{\text{price for width x}} \approx 1.02 $$ -- For each additional inch the painting is wider, the price of the painting is expected to be higher, on average, by a factor of 1.02. --- ## Shortcuts in R ```r m_lprice_wt %>% tidy() %>% select(term, estimate) ``` ``` ## # A tibble: 2 x 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 4.67 ## 2 Width_in 0.0192 ``` ```r m_lprice_wt %>% tidy() %>% select(term, estimate) ``` ``` ## # A tibble: 2 x 2 ## term estimate ## <chr> <dbl> ## 1 (Intercept) 4.67 ## 2 Width_in 0.0192 ``` --- ## Recap - Non-constant variance is one of the most common model violations, however it is usually fixable by transforming the response (y) variable. -- - The most common transformation when the response variable is right skewed is the log transform: `\(log(y)\)`, especially useful when the response variable is (extremely) right skewed. -- - This transformation is also useful for variance stabilization. -- - When using a log transformation on the response variable the interpretation of the slope changes: *"For each unit increase in x, y is expected on average to be higher/lower <br> by a factor of `\(e^{b_1}\)`."* -- - Another useful transformation is the square root: `\(\sqrt{y}\)`, especially useful when the response variable is counts. --- ## Transform, or learn more? - Data transformations may also be useful when the relationship is non-linear - However in those cases a polynomial regression may be more appropriate + This is beyond the scope of this course, but you’re welcomed to try it for your final project, and I’d be happy to provide further guidance --- ## Aside: when `\(y = 0\)` In some cases the value of the response variable might be 0, and ```r log(0) ``` ``` ## [1] -Inf ``` -- The trick is to add a very small number to the value of the response variable for these cases so that the `log` function can still be applied: ```r log(0 + 0.00001) ``` ``` ## [1] -11.51293 ```