class: center, middle, inverse, title-slide

# Linear Regression (2)
### Yue Jiang
### Duke University

---

### Back to Lab 01...

<img src="lec-18_files/figure-html/unnamed-chunk-2-1.png" width="90%" style="display: block; margin: auto;" />

Is this the only factor that can help us predict obesity percentage?

---

### Multiple regression

The multiple regression model extends the simple linear regression model by incorporating more than one explanatory variable. The assumptions are similar to those of the simple linear regression model. This type of model is often called a .vocab[multivariable] (**not multivariate**) model.

Multiple regression is often used to control for confounders or predictors that explain variability in the response:

- Knowing a state has above average exercise percentage might tell you something about the obesity percentage
- If you also knew that state's HDI category, you might be able to do even better!

---

### Multiple regression

Importantly, accounting for multiple predictors allows us to address potential confounding. We are looking at relationships **while holding others constant** - for instance, we know that exercise percentage and HDI category might be correlated. Perhaps any associations between obesity and exercise percentage we see are actually driven by HDI category.

By fitting a multiple linear regression model, we can look at associations between obesity and exercise percentage **while holding HDI constant** (that is, at each possible value of HDI, what is the "remaining relationship" between exercise and obesity percentage?).

Similarly, we can also look at associations between HDI and obesity **while holding exercise constant** (that is, at each possible value of exercise %, what is the "remaining relationship" between HDI and obesity percentage?).

---

### Multiple regression

The model is given by

`\begin{align*}
y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \cdots + \beta_px_{pi} + \epsilon_i
\end{align*}`

- `\(p\)` is the total number of predictor or explanatory variables
- `\(y_i\)` is the outcome (dependent variable) of interest
- `\(\beta_0\)` is the intercept parameter
- `\(\beta_1, \beta_2, \cdots, \beta_p\)` are the slope parameters
- `\(x_{1i}, x_{2i}, \cdots, x_{pi}\)` are predictor variables
- `\(\epsilon_i\)` is the error (like the `\(\beta\)`s, it is not observed)

Assumptions are essentially the same as in simple linear regression.

---

### Multiple regression

Consider the model

`\begin{align*}
Obesity_i = \beta_0 + \beta_1 Exercise_i + \beta_2 Smoking_i + \epsilon_i
\end{align*}`

How might you interpret the parameters `\(\beta_0\)`, `\(\beta_1\)`, and `\(\beta_2\)`?

---

### Multiple regression

.question[
How might we interpret these parameter estimates (watch out for the scale!)
]

---

### Hypotheses of interest

Hypotheses of interest may include hypotheses for single parameters:

- For instance, `\(H_0: \beta_1 = 0\)` vs. `\(H_1: \beta_1 \neq 0\)`. In our previous model, this would test whether there is a linear association between exercise % and obesity %, **while controlling for smoking %**

This is tested using a t-test with `\(n-k\)` degrees of freedom, where `\(n\)` is the number of observations in the model and `\(k\)` is the number of estimated model parameters (including the intercept and all slope terms).

.question[
Say we were to test this hypothesis for our model. What might we conclude?
]
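---

### Hypotheses of interest

As a rough sketch, a model like this can be fit in R with `lm()`, and the single-parameter t-tests can be read off the coefficient table of `summary()`. The smoking column name `Smoking` is an assumption here:

```r
# Fit the multiple regression model with both predictors
fit <- lm(Obesity ~ Exercise + Smoking, data = dat)

# The coefficient table reports, for each slope, the estimate,
# its standard error, the t statistic, and the p-value
summary(fit)
```

The row for `Exercise` gives the test of `\(H_0: \beta_1 = 0\)` while controlling for smoking %.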
---

### Hypotheses of interest

In this case, the p-value was lower than the pre-specified significance cut-off. There is sufficient evidence to suggest that, **controlling for smoking %**, there is a non-zero linear association between exercise % and obesity % by state.

---

### Hypotheses of interest

We might also test multiple parameters at once (for instance, all the slopes):

- `\(H_0: \beta_1 = \beta_2 = 0\)` vs. `\(H_1:\)` at least one of the `\(\beta_k\)` is not `\(0\)` (that is, at least one predictor has a linear association with obesity %)

This is given by an F test (much like ANOVA!) with numerator df equal to the number of parameters being tested (here, the number of slopes) and denominator df equal to `\(n-k\)` (number of observations minus number of estimated model parameters, *including* the intercept).

.question[
What might we conclude for the overall F test for our model?
]

---

### Hypotheses of interest

In this case, the p-value was lower than the pre-specified significance cut-off. There is sufficient evidence to suggest that at least one predictor has a non-zero linear association with obesity %.

---

### Confidence intervals

For individual predictors, we can also use the standard error estimate to construct confidence intervals. These intervals are constructed using a critical value from a t distribution:

`\begin{align*}
\left( \hat{\beta}_k - t^\star_{1 - \alpha/2; n-k} \times SE(\hat{\beta}_k), \hat{\beta}_k + t^\star_{1 - \alpha/2; n-k} \times SE(\hat{\beta}_k) \right)
\end{align*}`

For instance, `\(SE(\hat{\beta}_1) = 0.06\)`. Thus, we have that a 95% confidence interval for `\(\beta_1\)` is (-0.43, -0.19). If we were to interpret this interval, it would have to be conditional on the other variables in the model.

---

### R-squared

In our model, `\(R^2 = 0.7546\)`, suggesting that about 75% of the variability in obesity percentage can be explained by our model. However, `\(R^2\)` can never decrease when variables are added to a model, even if they are useless.

Thus, we can use *adjusted* `\(R^2 \le R^2\)`, where the adjustment is made to account for the number of predictors. The adjusted `\(R^2\)` incorporates a penalty for each additional variable in a model, so that the adjusted `\(R^2\)` will go down if a new variable does not improve prediction much, and it will go up if the new variable does improve prediction, conditional on the other variables already in the model.

With that said, `\(R^2\)` or adj. `\(R^2\)` should never be used as the only reason to select variables for your model - you must rely on scientific knowledge and context!

---

### Categorical predictors

We often have categorical predictors in modeling settings (for instance, here we have HDI). However, it might not make sense to think about a "one unit increase" in a categorical variable (how would that even work?)

In regression settings, we can account for categorical variables by creating .vocab[dummy variables], which are indicator variables for certain conditions happening. For instance, there are three categories of HDI in the dataset: bottom ten, middle, and top ten.

When considering categorical variables, one category is taken to be the .vocab[baseline] or .vocab[reference] value. All other categories will be compared to it.
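---

### Categorical predictors

In R, `lm()` expands a factor into dummy variables automatically, using its first level as the reference. A minimal sketch, assuming `dat$HDI` has levels "Bottom ten", "Middle", and "Top ten" (level names inferred from the output shown later), with `HDI_top_ref` as a hypothetical new column:

```r
# lm() creates the dummy variables for a factor automatically, using
# the first level (alphabetically, "Bottom ten") as the reference
dat$HDI <- factor(dat$HDI)
levels(dat$HDI)

# To compare against "Top ten" instead, relevel into a new column
# (so the original coding is left untouched)
dat$HDI_top_ref <- relevel(dat$HDI, ref = "Top ten")
```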
---

### Categorical predictors

Suppose the "top ten" category is taken to be the reference value. Then we can create two dummy variables:

- HDI == Middle: 1 if this condition is true; 0 otherwise
- HDI == Bottom Ten: 1 if this condition is true; 0 otherwise

---

### Interpretation of dummy variables

Consider the model

`\begin{align*}
Obesity_i &= \beta_0 + \beta_1(HDI == Middle)_i + \\
&\mathrel{\phantom{=}} \beta_2(HDI == Bottom Ten)_i + \epsilon_i
\end{align*}`

The parameter interpretations are below.

- `\(\beta_0\)` represents the expected obesity percentage for a state with 0 for the two dummy variables. That is, a state in the top ten HDI category
- `\(\beta_1\)` represents the expected difference in obesity percentage for a state in the middle HDI category, compared to the top ten
- `\(\beta_2\)` represents the expected difference in obesity percentage for a state in the bottom ten HDI category, compared to the top ten

Note that we had to estimate *multiple* "slopes" for this one variable - one corresponding to each non-reference level.

---

### Interpretation of dummy variables

```
## 
## Call:
## lm(formula = Obesity ~ HDI, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1300 -2.0808  0.3183  1.8417  5.5500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  34.4300     0.8160  42.195  < 2e-16
## HDIMiddle    -4.7967     0.9422  -5.091 6.18e-06
## HDITop ten   -8.0800     1.1540  -7.002 8.12e-09
## 
## Residual standard error: 2.58 on 47 degrees of freedom
## Multiple R-squared:  0.5158, Adjusted R-squared:  0.4952 
## F-statistic: 25.03 on 2 and 47 DF,  p-value: 3.971e-08
```

.question[
How would you interpret the following estimates?
]

---

### Interpretation of dummy variables

```
## 
## Call:
## lm(formula = Obesity ~ Exercise + HDI, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9801 -1.4340  0.1757  1.1585  4.2260 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  50.0221     2.8838  17.346  < 2e-16
## Exercise     -0.3477     0.0627  -5.545 1.39e-06
## HDIMiddle    -2.3724     0.8572  -2.767  0.00811
## HDITop ten   -5.1250     1.0486  -4.887 1.28e-05
## 
## Residual standard error: 2.019 on 46 degrees of freedom
## Multiple R-squared:  0.7097, Adjusted R-squared:  0.6908 
## F-statistic: 37.49 on 3 and 46 DF,  p-value: 2.059e-12
```

.question[
How would you interpret these estimates?
]

---

### Interactions

Sometimes, the relationship between one predictor and the outcome depends on the value of another predictor variable. For example, the association between exercise % and obesity % may be different at different levels of smoking %.

To model such a relationship, we create an .vocab[interaction term]. This is created simply by multiplying two predictors `\(x_1\)` and `\(x_2\)` to create a new predictor, `\(x_1x_2\)`.

When interaction terms are in a model, interpretations can become tricky.
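---

### Interactions

In R's formula syntax, `Exercise:Smoking` adds only the interaction term, while `Exercise * Smoking` is shorthand for both main effects plus their interaction. A minimal sketch, again assuming a `Smoking` column in `dat`:

```r
# Main effects of Exercise and Smoking plus their interaction;
# equivalent to Obesity ~ Exercise + Smoking + Exercise:Smoking
fit_int <- lm(Obesity ~ Exercise * Smoking, data = dat)

# The row labeled Exercise:Smoking gives the interaction slope
summary(fit_int)
```

The coefficient labeled `Exercise:Smoking` corresponds to `\(\beta_3\)` in the model on the next slide.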
---

### Interactions

Let's consider a model with .vocab[main effects] of exercise and smoking and an .vocab[interaction term] between them:

`\begin{align*}
Obesity_i &= \beta_0 + \beta_1Exercise_i + \beta_2Smoking_i + \\
&\mathrel{\phantom{=}} \beta_3Exercise_iSmoking_i + \epsilon_i
\end{align*}`

.question[
What is the expected change in obesity % given a one percentage point increase in exercise %?
]

---

### Interactions

For a given exercise (ex) and smoking (sm) percentage, our predicted obesity (ob) percentage is

`\begin{align*}
\widehat{ob}_i = \hat{\beta}_0 + \hat{\beta}_1ex_i + \hat{\beta}_2sm_i + \hat{\beta}_3ex_ism_i
\end{align*}`

and for a state at the same smoking % but 1 percentage point higher in exercise %, the predicted obesity percentage is

`\begin{align*}
\widehat{ob}_{i^\prime} &= \hat{\beta}_0 + \hat{\beta}_1(ex_i + 1) + \hat{\beta}_2(sm_i) + \hat{\beta}_3(ex_i + 1)sm_i\\
&= \hat{\beta}_0 + \hat{\beta}_1ex_i + \hat{\beta}_1 + \hat{\beta}_2sm_i + \hat{\beta}_3ex_ism_i + \hat{\beta}_3sm_i.
\end{align*}`

Subtracting, we have `\(\widehat{ob}_{i^\prime} - \widehat{ob}_i = \hat{\beta}_1 + \hat{\beta}_3sm_i\)`, which is the expected change in obesity % for a one percentage point increase in exercise %. We see that the relationship of exercise and obesity depends on the level of smoking in that state.

---

### Interactions

Luckily, interpretation of interaction terms with categorical predictors is easier than with continuous predictors. Since categorical predictors are based on dummy variables, they can only take on the values of 0 or 1 in the model.

Again, an interaction effect implies that the regression coefficient for an explanatory variable would change depending on the value of another predictor (for instance, the relationship between exercise and obesity might depend on whether a state is in the top ten, middle, or bottom ten HDI group).

---

### Collinearity

One common problem in multiple regression is .vocab[collinearity], which occurs when multiple highly correlated variables are used as predictors. In this case, the model can become unstable (often seen as inflated standard errors that lead to very wide confidence interval estimates), and it can be hard to assess the relationships of the predictors.

---

### Diagnosing collinearity

If nothing is significant when you "expect" something to be, we have some clues:

- Individual predictors are significant in simple linear regression models,
- but standard errors and interval estimates are huge,
- and the overall F test is significant

A significant overall F test with no significant individual variable test is a typical sign of collinearity.

We can check out the correlations among the three predictors, as in the sketch on the next slide. There is no fixed criterion for correlation to exclude a variable for collinearity. It is possible to construct examples where the correlation is very high, but collinearity is not a problem because the information about the outcome in the two variables is different.
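---

### Diagnosing collinearity

A minimal sketch of two common checks in R: pairwise correlations among continuous predictors, and variance inflation factors (VIFs) from the `car` package. The predictor columns shown are assumptions (the third predictor referenced on the previous slide is not listed here):

```r
# Pairwise correlations among (assumed) continuous predictor columns
cor(dat[, c("Exercise", "Smoking")])

# Variance inflation factors for a fitted model; large values
# (common rules of thumb: > 5 or > 10) suggest problematic collinearity
library(car)
vif(lm(Obesity ~ Exercise + Smoking, data = dat))
```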