class: center, middle, inverse, title-slide

# Linear Model Assumptions

### Yue Jiang

### STA 210 / Duke University / Spring 2024

---

### Review: Multiple regression

The model is given by

`\begin{align*}
y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \cdots + \beta_px_{ip} + \epsilon_i
\end{align*}`

- `\(p\)` is the total number of predictor or explanatory variables
- `\(y_i\)` is the outcome (dependent variable) of interest
- `\(\beta_0\)` is the intercept parameter
- `\(\beta_1, \beta_2, \cdots, \beta_p\)` are the slope parameters
- `\(x_{i1}, x_{i2}, \cdots, x_{ip}\)` are predictor variables
- `\(\epsilon_i\)` is the error (like the `\(\beta\)`s, it is not observed)

.vocab[
So far, we haven't said anything about `\(\epsilon_i\)`. What if we were to make some statements about them?
]

---

### Well actually...

`\begin{align*}
y_i &= \beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip} + \epsilon_i,\\
\epsilon_i &\stackrel{i.i.d.}{\sim} N(0, \sigma^2)
\end{align*}`

.question[
We've made a pretty strong statement about `\(\epsilon_i\)`. What are we saying? Does this imply anything about `\(y_i\)` given the predictors?
]

---

### Well actually...

`\begin{align*}
y_i|x_{i1}, \cdots, x_{ip} &\stackrel{i.i.d.}{\sim} N(\beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip}, \sigma^2)
\end{align*}`

Thinking about conditional .vocab[means] and .vocab[variances]:

`\begin{align*}
E(y_i | x_{i1}, x_{i2}, \cdots, x_{ip}) &= \beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip}\\
Var(y_i | x_{i1}, x_{i2}, \cdots, x_{ip}) &= \sigma^2
\end{align*}`

---

### Model assumptions

<!-- -->

---

### Model assumptions

<!-- -->

---

### Model assumptions

Let's unpack this.
`\begin{align*}
y_i &= \beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip} + \epsilon_i,\\
\epsilon_i &\stackrel{i.i.d.}{\sim} N(0, \sigma^2)
\end{align*}`

**Quick caveat**: this is a fairly strict set of assumptions, but if the model formulation above is satisfied, then the linear regression model is guaranteed to have certain "nice" properties. We won't discuss those properties in STA 210, but in other courses (notably STA 211), we will talk about what they are, as well as specific implications of their being violated, what we can do when they *are* violated, and some other technical conditions that may be useful to keep in mind.

Long story short, for the purposes of STA 210, the following **four conditions** will be considered the required assumptions of the linear regression model.

---

### Model assumptions

Let's unpack this.

`\begin{align*}
y_i &= \beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip} + \epsilon_i,\\
\epsilon_i &\stackrel{ {\color{red}{i.i.d.}}}{\sim} N(0, \sigma^2)
\end{align*}`

**Independence**: The errors `\(\epsilon_i\)` are .vocab[independent]. Specifically, this also implies that the outcomes `\(y_i\)` are independent.

.question[
How might independence be violated, and what might some of the consequences be?
]

---

### Model assumptions

Let's unpack this.

`\begin{align*}
y_i &= \color{red}{\beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip}} + \epsilon_i,\\
\epsilon_i &\stackrel{i.i.d.}{\sim} N(0, \sigma^2)
\end{align*}`

**Linearity**: The regression model is .vocab[linear in the parameters]. Specifically, this is a linearity assumption on the *parameters* (the `\(\beta\)` terms), NOT the predictors themselves (the observed `\(x\)` terms).

.question[
How might linearity be violated, and what might some of the consequences be?
]

---

### Model assumptions

Let's unpack this.
`\begin{align*}
y_i &= \beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip} + \epsilon_i,\\
\color{red}{\epsilon_i} &\stackrel{i.i.d.}{\sim} N(0, \color{red}{\sigma^2})
\end{align*}`

**Constant variance** (homoscedasticity of errors): The .vocab[variance of the errors is constant]: `\(\sigma^2\)`, regardless of what the predictor values are. Specifically, this also implies that the variance of the errors is independent of the predictors.

.question[
How might constant variance be violated, and what might some of the consequences be?
]

---

### Model assumptions

Let's unpack this.

`\begin{align*}
y_i &= \beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip} + \epsilon_i,\\
\epsilon_i &\stackrel{i.i.d.}{\sim} \color{red}{N(0}, \sigma^2\color{red}{)}
\end{align*}`

**Normality**: The errors follow a .vocab[normal (Gaussian) distribution]. Specifically, one centered at zero (so the expectation of the errors is also zero).

---

### Diamonds...again

<!-- -->

---

### Diagnostics for model assumptions

`\begin{align*}
y_i &= \beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip} + \epsilon_i,\\
\epsilon_i &\stackrel{ {\color{red}{i.i.d.}}}{\sim} N(0, \sigma^2)
\end{align*}`

**Independence**: We usually have to think through this one, unfortunately. In some situations there might be plots that specifically give us an indication that independence is violated, but it's best practice to think about the data generating mechanism and how the observations actually came about.

---

### Residual plots

Many of the assumptions deal with the error term, which we estimate using .vocab[residuals]. We can often look at the behavior of the .vocab[predicted values] or .vocab[fitted values], especially as they relate to the residuals, to evaluate our assumptions.

---

### Residual plots

```r
m1 <- lm(price ~ carat, data = diamonds)
m1_aug <- augment(m1) # part of the tidymodels package!
head(m1_aug)
```

```
## # A tibble: 6 x 8
##   price carat .fitted .resid    .hat .sigma    .cooksd .std.resid
##   <int> <dbl>   <dbl>  <dbl>   <dbl>  <dbl>      <dbl>      <dbl>
## 1  2397  0.73   3403. -1006. 0.00102  1530. 0.000221      -0.658
## 2  3300  0.7    3162.   138. 0.00104  1530. 0.00000426     0.0903
## 3   713  0.31     30.7  682. 0.00213  1530. 0.000213       0.447
## 4   707  0.31     30.7  676. 0.00213  1530. 0.000209       0.443
## 5   987  0.31     30.7  956. 0.00213  1530. 0.000418       0.626
## 6  3250  0.83   4206.  -956. 0.00101  1530. 0.000197      -0.625
```

---

### Residual plots

<img src="assumptions_files/figure-html/unnamed-chunk-6-1.png" width="90%" style="display: block; margin: auto;" />

---

### Residual plots

<img src="assumptions_files/figure-html/unnamed-chunk-7-1.png" width="90%" style="display: block; margin: auto;" />

---

### Residual plots (code)

```r
ggplot(m1_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "darkred") +
  labs(x = "Fitted (predicted) value", y = "Residual") +
  theme_bw()
```

---

### Diagnostics for model assumptions

`\begin{align*}
y_i &= \color{red}{\beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip}} + \epsilon_i,\\
\epsilon_i &\stackrel{i.i.d.}{\sim} N(0, \sigma^2)
\end{align*}`

**Linearity**: The regression model is .vocab[linear in the parameters]. Specifically, this is a linearity assumption on the *parameters* (the `\(\beta\)` terms), NOT the predictors themselves (the observed `\(x\)` terms).

In the residual plot, we expect to see observations symmetrically distributed around the horizontal axis. Was that the case for the linear model we fit?

---

### Diagnostics for model assumptions

`\begin{align*}
y_i &= \beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip} + \epsilon_i,\\
\color{red}{\epsilon_i} &\stackrel{i.i.d.}{\sim} N(0, \color{red}{\sigma^2})
\end{align*}`

**Constant variance** (homoscedasticity of errors): The .vocab[variance of the errors is constant]: `\(\sigma^2\)`, regardless of what the predictor values are.
Specifically, this also implies that the variance of the errors is independent of the predictors.

In the residual plot, we expect to see "evenly-spaced" dots (in the vertical sense) along the `\(y\)` axis, *regardless of what the predicted values are*. Was that the case for the linear model we fit?

---

### The distribution of the residuals

We can take a look at a histogram of the residuals to get an overall feel for their distribution, or we can use a .vocab[quantile-quantile plot] (Q-Q plot) to get a more accurate picture.

The Q-Q plot compares the quantiles of some sample to the theoretical quantiles from a pre-specified distribution (such as the normal distribution).

---

### Histogram of residuals

<img src="assumptions_files/figure-html/unnamed-chunk-9-1.png" width="90%" style="display: block; margin: auto;" />

---

### Histogram of residuals (code)

```r
ggplot(m1_aug, aes(x = .resid)) +
  geom_histogram(aes(y = after_stat(density)),
                 fill = "deepskyblue", color = "darkblue") +
  stat_function(fun = dnorm,
                args = list(mean = mean(m1_aug$.resid),
                            sd = sd(m1_aug$.resid)),
                color = "darkred", lwd = 2) +
  labs(x = "Residual", y = "Density") +
  theme_bw()
```

---

### Q-Q plot of residuals

<img src="assumptions_files/figure-html/unnamed-chunk-11-1.png" width="90%" style="display: block; margin: auto;" />

---

### Q-Q plot of residuals (code)

```r
ggplot(m1_aug, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  theme_bw() +
  labs(x = "Theoretical quantiles", y = "Sample quantiles")
```

---

### Diagnostics for model assumptions

`\begin{align*}
y_i &= \beta_0 + \beta_1x_{i1} + \cdots + \beta_px_{ip} + \epsilon_i,\\
\epsilon_i &\stackrel{i.i.d.}{\sim} \color{red}{N(0}, \sigma^2\color{red}{)}
\end{align*}`

**Normality**: The errors follow a .vocab[normal (Gaussian) distribution]. Specifically, one centered at zero (so the expectation of the errors is also zero).

We expect to see normally distributed residuals. Was that the case here?
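As a quick illustration (a sketch, not part of the original output; the seed, sample size, and standard deviation are arbitrary choices), even data simulated from an exactly normal distribution show a little wiggle in the tails of a Q-Q plot:

```r
# Hypothetical example: Q-Q plot of draws from a true normal distribution.
# sim_resid is simulated, NOT the residuals from the diamonds fit.
library(ggplot2)
set.seed(210)
sim_resid <- data.frame(r = rnorm(500, mean = 0, sd = 1530))

ggplot(sim_resid, aes(sample = r)) +
  stat_qq() +         # sample quantiles vs. theoretical normal quantiles
  stat_qq_line() +    # reference line; points should hug it if normal
  theme_bw() +
  labs(x = "Theoretical quantiles", y = "Sample quantiles")
```

Even here, the extreme points typically sit slightly off the reference line, simply because tail quantiles are estimated from very few observations.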
Note that in the Q-Q plot, it's ok if we see "some degree" of deviation (especially in the tails), but if there is a clear pattern or many observations that appear to be "off," then that's a clue that normality might be violated.

---

### A note on the diagnostic plots

We fit a simple linear regression model with only one predictor. However, the same model assumption diagnostics we saw (residual plots, histograms of residuals, Q-Q plots) also apply in the multiple regression setting, since the residuals themselves are a function of all of the predictors.

---

### What happens when assumptions are violated?

It's pretty bad, but there are some rays of hope.

**Independence**: Often a threat to the validity of the entire analysis, and inferential procedures will generally be completely inappropriate (in the sense of not actually being what you expect).

There are some ways around certain types of independence violations (clustering, time-dependence, etc.), but those are beyond the scope of STA 210.

---

### What happens when assumptions are violated?

It's pretty bad, but there are some rays of hope.

**Linearity**: Also pretty bad :( - we *are* trying to interpret *linear models* after all (these are what our slope/intercept interpretations are based on!), and so if linearity is violated, then the *model itself* is inappropriate.

However, the linearity condition is only on the parameters, NOT the predictors themselves. The predictors can have non-linear relationships with the response, and that's totally alright. We'll talk about some ways to deal with these situations later this semester.

---

### What happens when assumptions are violated?

It's pretty bad, but there are some rays of hope.

**Constant variance**: Inference will generally be affected (so things like p-values, confidence intervals, etc.) if results based on the sampling distribution of the `\(\widehat{\beta}\)` terms are used.
However, we still might be able to use bootstrap methods to come up with valid confidence intervals, or a randomization-based test for valid hypothesis testing.

Even better news is that the *estimates themselves* (of slopes, intercepts, etc., not of their standard errors) will still be estimated accurately!

Finally, there are also some more advanced modeling techniques that "get around" non-constant variance, but these are also beyond the scope of STA 210.

---

### What happens when assumptions are violated?

It's pretty bad, but there are some rays of hope.

**Normality**: Probably the "least important" assumption (in some sense - don't go around saying that normality isn't important!) in that once again, if the errors have mean zero (but are not normally distributed), our estimates of the model parameters will still be accurate.

Similarly to violations of constant variance, inference will be affected, but if our sample is large enough, the Central Limit Theorem will still reassure us that things will be ok for the `\(\widehat{\beta}\)` terms. Once again, we can use bootstrap confidence intervals or randomization-based hypothesis tests to conduct valid inference without worrying about whether the sample size is large enough for the CLT to kick in.

(small caveat - violation of normality results in invalid *prediction intervals*, but more on these later)
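---

### Aside: a bootstrap interval for the slope (sketch)

The bootstrap idea mentioned above can be sketched in a few lines of base R. This is an illustration, not part of the original deck; the number of resamples (1000) and the seed are arbitrary choices, and it assumes the `diamonds` data from ggplot2 is loaded:

```r
# Percentile bootstrap CI for the slope of price ~ carat (sketch)
library(ggplot2)   # provides the diamonds data
set.seed(210)      # arbitrary seed for reproducibility

boot_slopes <- replicate(1000, {
  # resample rows with replacement, then refit the model
  idx <- sample(nrow(diamonds), replace = TRUE)
  coef(lm(price ~ carat, data = diamonds[idx, ]))["carat"]
})

# middle 95% of the bootstrap distribution of the slope
quantile(boot_slopes, c(0.025, 0.975))
```

The resulting interval does not lean on the constant-variance or normality assumptions the way the usual `\(t\)`-based interval does, which is why it can remain valid when those assumptions are shaky.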