class: center, middle, inverse, title-slide

# Multiple Predictors

### Yue Jiang

### STA 210 / Duke University / Spring 2024

---

### Multiple predictors

...so I know today's lecture is titled "multiple predictors," but I wanted to make sure that we were all coming from the same place in terms of familiarity with fundamental issues re: interval estimation and hypothesis testing.

We'll be reviewing these two concepts for the bulk of class today; luckily, going from a single predictor to multiple predictors is straightforward enough not to need a ton of time in class today.

Any questions before we begin?

---

### Jeans

<img src="img/jeans.jpg" width="80%" style="display: block; margin: auto;" />

---

### Review: Statistical inference

The .vocab[population] is the group we'd like to learn something about. If we had data from every unit in the population, we could just calculate what we wanted and be done!

Unfortunately, we (usually) have to settle for a .vocab[sample] from the population. Ideally, the sample is .vocab[representative], allowing us to use .vocab[probability and statistical inference] to make conclusions that are .vocab[generalizable] to the broader population of interest.

We want to make inferences about population .vocab[parameters], which we do using .vocab[sample statistics].

---

### Review: The hypothesis testing framework

1. Start with two hypotheses about the population: the .vocab[null hypothesis] and the .vocab[alternative hypothesis]

2. Choose a sample, collect data, and analyze the data

3. Figure out how likely it would be to see data like what we observed, IF the null hypothesis were true

4. If our data would have been extremely unlikely under the null claim, we reject the null and deem the alternative claim worthy of further study. Otherwise, we cannot reject the null claim

---

### Review: Pricy jeans, different pockets?

<img src="multiple_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

### Review: Setting up the hypotheses

First, what parameter(s) do we *actually* care about in the linear model?

`\(Height_i = \beta_0 + \beta_1(Price)_i + \epsilon_i\)`

- `\(H_0: \beta_1 = 0\)`
- `\(H_1: \beta_1 \neq 0\)`

.question[
What are these null and alternative hypotheses saying *in words*?
]

---

### Review: Collecting our data

```r
jeans |> 
  filter(menWomen == "women") |> 
  select(price, maxHeightFront)
```

```
##     price maxHeightFront
## 1   42.00           14.5
## 2   42.00           14.5
## 3   89.50           13.0
## 4   89.50           13.0
## 5   39.90           13.0
## 6   39.90           15.5
## 7   79.50           12.0
## 8   69.50           14.0
## 9   99.00           13.0
## 10  79.50           15.0
## 11  54.50           12.5
## 12  54.50           12.2
## 13  19.94           13.7
## 14  19.94           13.5
## 15 199.00           13.0
## 16 159.00           14.0
## 17 179.00           14.0
## 18 249.00           14.5
## 19  98.00           11.5
## 20  89.00           12.5
## 21  69.50           12.5
## 22  69.50           15.0
## 23  79.90           15.7
## 24  79.90           15.5
## 25  49.99           14.5
## 26   9.99           19.0
## 27  29.99           12.0
## 28  29.99           14.5
## 29 125.00           14.0
## 30 110.00           11.5
## 31  69.95           14.0
## 32  69.95           13.0
## 33  39.95           19.0
## 34  39.95           16.0
## 35  88.00           21.5
## 36  78.00           18.0
## 37  99.00           14.0
## 38  99.00           15.0
## 39  89.95           15.0
## 40  92.95           14.5
```

---

### Review: Collecting our data

<img src="multiple_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
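---

### Review: Collecting our data

The scatterplot on the previous slide can be reproduced with code along these lines. This is a sketch, not necessarily the code used for the original figure; it assumes the tidyverse is loaded, and the axis labels are illustrative:

```r
jeans |> 
  filter(menWomen == "women") |> 
  ggplot(aes(x = price, y = maxHeightFront)) +
  geom_point() +
  labs(x = "Price", y = "Maximum front pocket height")
```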
---

### Review: What is a "test statistic"?

`\begin{align*}
t_{n-2} = \frac{\hat{\beta}_1 - \beta_{1, H_0}}{SE(\hat{\beta}_1)}
\end{align*}`

---

### Review: Making a conclusion

```r
options(digits = 3)
linear_reg() |> 
  set_engine("lm") |> 
  fit(maxHeightFront ~ price, 
      data = jeans |> filter(menWomen == "women")) |> 
  tidy()
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 14.8       0.646      23.0   6.69e-24
## 2 price       -0.00650   0.00690    -0.943 3.52e- 1
```

---

### Review: Statistical inference

The .vocab[population] is the group we'd like to learn something about. If we had data from every unit in the population, we could just calculate what we wanted and be done!

Unfortunately, we (usually) have to settle for a .vocab[sample] from the population. Ideally, the sample is .vocab[representative], allowing us to use .vocab[probability and statistical inference] to make conclusions that are .vocab[generalizable] to the broader population of interest.

We want to make inferences about population .vocab[parameters], which we do using .vocab[sample statistics].

---

### What is a confidence interval, anyway?

<img src="img/confidence.png" width="80%" style="display: block; margin: auto;" />

A .copper[confidence interval] gives a range of values that is intended to cover the parameter of interest with a certain degree of "confidence"

Confidence interval = .copper[point estimate] `\(\pm\)` .copper[margin of error]

---

### How do you interpret a confidence interval?

As it turns out, a 95% confidence interval for the slope in our jeans example is from -11.07 to 4.03.

.question[
How do you interpret this interval? (more on this very soon)
]

---

### Brief caveat

As before, when we talked about the Central Limit Theorem, let's assume that we know `\(\sigma\)` (this rarely happens in practice, since it is a population parameter)

---

### Two-sided confidence intervals

Given a random variable `\(X\)` with mean `\(\mu\)` and standard deviation `\(\sigma\)`, the CLT tells us that the random variable `\(Z\)`, defined as

`\begin{align*}
Z = \frac{\bar{X} - \mu}{SE}, \quad SE = \frac{\sigma}{\sqrt{n}},
\end{align*}`

has a standard normal distribution if `\(X\)` is normal (and is approximately normal if `\(X\)` is not normal, but `\(n\)` is large enough).
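---

### Two-sided confidence intervals

Where do the -1.96 and 1.96 in the derivation on the following slides come from? A minimal sketch in base R (the comments note the approximate values these calls return):

```r
# 97.5th percentile of the standard normal: approximately 1.96
qnorm(0.975)

# probability a standard normal falls between -1.96 and 1.96: approximately 0.95
pnorm(1.96) - pnorm(-1.96)
```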
---

### Deriving the two-sided interval

<img src="multiple_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

### Deriving the two-sided interval

For a standard normal random variable `\(Z \sim N(0, 1)\)`, 95% of the observations lie between -1.96 and 1.96, so

`\begin{align*}
0.95 &= P(-1.96 \le Z \le 1.96)\\
&= P\left(-1.96 \le \frac{\bar{X} - \mu}{SE} \le 1.96\right)\\
&= P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right)
\end{align*}`

So, a 95% CI is given by

`\begin{align*}
\left( \bar{X} - 1.96\frac{\sigma}{\sqrt{n}}, \bar{X} + 1.96\frac{\sigma}{\sqrt{n}} \right)
\end{align*}`

---

### Confidence intervals for the mean

`\begin{align*}
\left( \bar{X} - z^\star_{1 - \alpha/2}\frac{\sigma}{\sqrt{n}}, \bar{X} + z^\star_{1 - \alpha/2}\frac{\sigma}{\sqrt{n}} \right)
\end{align*}`

Point estimate `\(\pm\)` .vocab[margin of error]

Point estimate `\(\pm\)` .vocab[confidence multiplier] `\(\times\)` .vocab[SE]

---

### Other coverage probabilities

The .copper[confidence multiplier], `\(z^\star_{1-\alpha/2}\)`, is the z-score that cuts off the upper `\(100 \times \alpha/2\)`% of the distribution (that is, the `\(1 - \alpha/2\)` quantile)

For `\(\alpha = 0.05\)`, we have `\(1 - \alpha/2 = 0.975\)`, and so `\(z^\star\)` is the 97.5th percentile of the standard normal distribution (calculated using software packages)

Compromising on the confidence level to obtain narrower CIs is highly frowned upon!

---

### CI interpretation

`\begin{align*}
\left( \bar{X} - 1.96\frac{\sigma}{\sqrt{n}}, \bar{X} + 1.96\frac{\sigma}{\sqrt{n}} \right)
\end{align*}`

Suppose we select `\(M\)` different random samples of size `\(n\)` from the population, and use them to calculate `\(M\)` different 95% CIs in the same way as above. We expect 95% of these intervals to cover the true `\(\mu\)` and 5% not to

[Interactive activity](https://digitalfirst.bfwpub.com/stats_applet/stats_applet_4_ci.html)

---

### CI interpretation

`\begin{align*}
\left( \bar{X} - 1.96\frac{\sigma}{\sqrt{n}}, \bar{X} + 1.96\frac{\sigma}{\sqrt{n}} \right)
\end{align*}`

.pull-center[
.question[
<b><strike>There is a 95% chance that `\(\mu\)` lies in the interval</strike></b>.
]
]

---

### CI interpretation

`\begin{align*}
\left( \bar{X} - 1.96\frac{\sigma}{\sqrt{n}}, \bar{X} + 1.96\frac{\sigma}{\sqrt{n}} \right)
\end{align*}`

**Important**: we do not know whether any particular interval is among the 95% of intervals that cover the mean or the 5% that don't

Since `\(\mu\)` is a parameter, it's either in our confidence interval or it isn't

---

### When can we use this CI?

`\begin{align*}
\left( \bar{X} - z^\star\frac{\sigma}{\sqrt{n}}, \bar{X} + z^\star\frac{\sigma}{\sqrt{n}} \right)
\end{align*}`

Remember, this is only OK to use when `\(\sigma\)` is known, and either `\(X\)` is normal or `\(X\)` is *not* normal but `\(n\)` is sufficiently large.

In the real world, assuming `\(\sigma\)` is known is often not reasonable, so once again we use the t distribution (reflecting the extra uncertainty from having to estimate `\(\sigma\)`):

`\begin{align*}
\left( \bar{X} - t^\star\frac{s}{\sqrt{n}}, \bar{X} + t^\star\frac{s}{\sqrt{n}} \right)
\end{align*}`

(and of course, remember that this is just for the mean; we'd have different CIs for other parameters of interest!)

---

### Back to regression

As it turns out, a 95% confidence interval for the slope in our jeans example is from -11.07 to 4.03.

`\begin{align*}
\left( \hat{\beta}_1 - t^\star \times SE(\hat{\beta}_1), \hat{\beta}_1 + t^\star \times SE(\hat{\beta}_1) \right)
\end{align*}`

.question[
How do you interpret this interval?
]
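---

### Back to regression

In practice we let software compute these limits. A minimal sketch of one way to do so in R (output not shown), refitting the same simple regression with `lm()` and asking for 95% confidence limits:

```r
fit <- lm(maxHeightFront ~ price, 
          data = jeans |> filter(menWomen == "women"))

# 95% confidence intervals for the intercept and the slope
confint(fit, level = 0.95)

# equivalently, with broom: tidy(fit, conf.int = TRUE, conf.level = 0.95)
```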
---

### Multiple predictors

Ok, we're finally getting to multiple predictors (phew).

.question[
Why would we want to include more than one predictor in the model?
]

--

One important reason is that accounting for multiple predictors allows us to address potential confounding. We are looking at the relationship between each predictor and the outcome **while holding the other predictors constant**.

---

### Multiple regression

The model is given by

`\begin{align*}
y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \cdots + \beta_px_{ip} + \epsilon_i
\end{align*}`

- `\(p\)` is the total number of predictor or explanatory variables
- `\(y_i\)` is the outcome (dependent variable) of interest
- `\(\beta_0\)` is the intercept parameter
- `\(\beta_1, \beta_2, \cdots, \beta_p\)` are the slope parameters
- `\(x_{i1}, x_{i2}, \cdots, x_{ip}\)` are the predictor variables
- `\(\epsilon_i\)` is the error (like the `\(\beta\)`s, it is not observed)

---

### Multiple regression

Consider the model

`\begin{align*}
y = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \beta_3 x_{3}
\end{align*}`

What happens to `\(y\)` when `\(x_{1}\)` changes by 1, while `\(x_{2}\)` and `\(x_{3}\)` stay the same?

(we're slightly abusing notation here by getting rid of the `\(\epsilon\)` and ignoring the index of observation `\(i\)`, but bear with me)

---

### Multiple regression

Consider the model

`\begin{align*}
z &= \beta_0 + \beta_1(x_1 + 1) &+ \beta_2 x_2 + \beta_3 x_3\\
&= \beta_0 + \beta_1 x_1 + \beta_1 &+ \beta_2 x_2 + \beta_3 x_3\\
y &= \beta_0 + \beta_1 x_{1} &+ \beta_2 x_2 + \beta_3 x_3
\end{align*}`

.question[
What is the difference between `\(z\)` and `\(y\)`? How might you use this intuition to interpret `\(\beta_1\)`, `\(\beta_2\)`, and `\(\beta_3\)`? What about `\(\beta_0\)`?
]

---

### Multiple regression

```r
m2 <- lm(maxHeightFront ~ price + minHeightFront + maxHeightBack, 
         data = jeans)
summary(m2)
```

```
## 
## Call:
## lm(formula = maxHeightFront ~ price + minHeightFront + maxHeightBack, 
##     data = jeans)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.567 -2.645 -0.492  2.612 10.281 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -15.40475    6.55444   -2.35   0.0214
## price           -0.00318    0.00857   -0.37   0.7116
## minHeightFront   0.82819    0.11722    7.07  6.6e-10
## maxHeightBack    1.37654    0.44514    3.09   0.0028
## 
## Residual standard error: 3.38 on 76 degrees of freedom
## Multiple R-squared:  0.539, Adjusted R-squared:  0.52 
## F-statistic: 29.6 on 3 and 76 DF,  p-value: 8.97e-13
```

.question[
How might we interpret these parameter estimates (watch out for the scale!)?
]

---

### Hypotheses of interest

Hypotheses of interest may include hypotheses for single parameters:

- `\(H_0: \beta_1 = 0\)`
- `\(H_1: \beta_1 \neq 0\)`

In our previous model, this would test whether there is a linear association between price and maximum front pocket height, **while controlling for minimum front pocket height and maximum back pocket height**.

This is tested using a t-test with `\(n-k\)` degrees of freedom, where `\(n\)` is the number of observations used to fit the model and `\(k\)` is the number of estimated model parameters (including the intercept and all slope terms).

.question[
Say we were to formally test this hypothesis for our model at `\(\alpha = 0.05\)`. What might we conclude?
]

---

### Hypotheses of interest

We might also test multiple parameters at once (for instance, all the slopes):

- `\(H_0: \beta_1 = \beta_2 = \beta_3 = 0\)`
- `\(H_1:\)` at least one of `\(\beta_1, \beta_2, \beta_3\)` is not `\(0\)`

This is done with an .vocab[F test] (don't worry too much about this).

.question[
What might we conclude here?
]
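---

### Hypotheses of interest

To make the F test slightly more concrete: one way to carry it out in R is to compare the full model to an intercept-only model with `anova()`. This is a sketch (output not shown), using the `m2` fit from a few slides back and assuming both models use the same rows of data; the resulting F statistic and p-value should match those reported at the bottom of `summary(m2)`:

```r
# intercept-only ("null") model with no predictors
m0 <- lm(maxHeightFront ~ 1, data = jeans)

# F test comparing the null model to the full model m2
anova(m0, m2)
```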
---

### Multiple predictors

```r
options(digits = 3)
linear_reg() |> 
  set_engine("lm") |> 
  fit(maxHeightFront ~ price + minHeightFront + maxHeightBack, 
      data = jeans |> filter(menWomen == "women")) |> 
  tidy()
```

```
## # A tibble: 4 x 5
##   term           estimate std.error statistic p.value
##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)     1.42      6.23        0.228 0.821  
## 2 price          -0.00272   0.00639    -0.425 0.673  
## 3 minHeightFront  0.552     0.193       2.85  0.00714
## 4 maxHeightBack   0.385     0.391       0.987 0.330  
```
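---

### Multiple predictors

To connect this output back to the "holding the other predictors constant" interpretation, here is a minimal sketch (output not shown). It predicts the maximum front pocket height for two hypothetical pairs of women's jeans that are identical except for a one-unit difference in `minHeightFront`; the difference between the two predictions equals the estimated slope for `minHeightFront`. The specific predictor values below are made up purely for illustration:

```r
# same model as above, fit directly with lm() for convenience
m_women <- lm(maxHeightFront ~ price + minHeightFront + maxHeightBack,
              data = jeans |> filter(menWomen == "women"))

# two hypothetical pairs of jeans, identical except for minHeightFront
new_jeans <- data.frame(price = c(80, 80),
                        minHeightFront = c(13, 14),
                        maxHeightBack = c(14, 14))

# difference in predicted maxHeightFront: the minHeightFront slope
diff(predict(m_women, newdata = new_jeans))
```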