class: center, middle, inverse, title-slide

# Fitting linear models in R (2)

### Yue Jiang

### Duke University

---

### Estimating bike crashes in NC counties

<img src="img/bike_crash.jpg" width="100%" style="display: block; margin: auto;" />

<!-- .center[Image credit: Petar Milošević, [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0), via Wikimedia Commons] -->

---

### Newton-Raphson in higher dimensions

.vocab[Score vector] and .vocab[Hessian] for `\(\log \mathcal{L}(\boldsymbol\theta | \mathbf{X})\)` with `\(\boldsymbol\theta = (\theta_1, \cdots, \theta_p)^T\)`:

`$$\nabla \log \mathcal{L} = \begin{pmatrix} \frac{\partial \log \mathcal{L}}{\partial \theta_1}\\ \vdots\\ \frac{\partial \log \mathcal{L}}{\partial \theta_p} \end{pmatrix}$$`

`$$\nabla^2 \log \mathcal{L} = \begin{pmatrix} \frac{\partial^2 \log\mathcal{L}}{\partial \theta_1^2} & \frac{\partial^2 \log\mathcal{L}}{\partial \theta_1 \partial \theta_2} & \cdots & \frac{\partial^2 \log\mathcal{L}}{\partial \theta_1 \partial \theta_p}\\ \frac{\partial^2 \log\mathcal{L}}{\partial \theta_2 \partial \theta_1} & \frac{\partial^2 \log\mathcal{L}}{\partial \theta_2^2} & \cdots & \frac{\partial^2 \log\mathcal{L}}{\partial \theta_2 \partial \theta_p} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 \log\mathcal{L}}{\partial \theta_p \partial \theta_1} & \frac{\partial^2 \log\mathcal{L}}{\partial \theta_p \partial \theta_2} & \cdots & \frac{\partial^2 \log\mathcal{L}}{\partial \theta_p^2} \end{pmatrix}$$`

---

### Newton-Raphson in higher dimensions

We can modify the Newton-Raphson algorithm for higher dimensions:

- Start with an initial guess `\(\boldsymbol\theta^{(0)}\)`
- Iterate `\(\boldsymbol\theta^{(t + 1)} = \boldsymbol\theta^{(t)} - \left(\nabla^2\log\mathcal{L}(\boldsymbol\theta^{(t)} | \mathbf{X}) \right)^{-1} \left( \nabla \log\mathcal{L}(\boldsymbol\theta^{(t)} | \mathbf{X}) \right)\)`
- Stop when a convergence criterion is satisfied

Under certain conditions, a unique global maximum exists and Newton-Raphson converges to it; this is again guaranteed in many common applications. Computing the Hessian can be computationally demanding (and annoying), but there are ways around it in practice.

---

### Poisson regression

`\begin{align*} \log \mathcal{L} &= \sum_{i = 1}^n \left( y_i\mathbf{X}_i\boldsymbol\beta - e^{\mathbf{X}_i\boldsymbol\beta} - \log y_i! \right)\\ \nabla \log \mathcal{L} &= \sum_{i = 1}^n \left(y_i - e^{\mathbf{X}_i\boldsymbol\beta}\right)\mathbf{X}_i^T\\ \nabla^2 \log \mathcal{L} &= -\sum_{i = 1}^n e^{\mathbf{X}_i\boldsymbol\beta}\mathbf{X}_i\mathbf{X}_i^T \end{align*}`

Newton-Raphson update steps for Poisson regression:

`\begin{align*} \boldsymbol\beta^{(t+1)} &= \boldsymbol\beta^{(t)} - \left(-\sum_{i = 1}^n e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\mathbf{X}_i\mathbf{X}_i^T \right)^{-1}\left(\sum_{i = 1}^n \left(y_i - e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\right)\mathbf{X}_i^T \right) \end{align*}`
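---

### Poisson regression

As a sanity check on the derivations above, the log-likelihood itself is straightforward to evaluate at any candidate `\(\boldsymbol\beta\)`. Here is a minimal sketch (the function name `loglik_pois` is ours, not part of the fitting code that follows); evaluating it after each Newton-Raphson step should show the log-likelihood increasing toward the maximum:

```r
# Poisson regression log-likelihood at a candidate beta:
# sum_i [ y_i * X_i beta - exp(X_i beta) - log(y_i!) ]
loglik_pois <- function(beta, X, y){
  eta <- drop(X %*% beta)                   # linear predictors X_i beta
  sum(y * eta - exp(eta) - lfactorial(y))   # lfactorial(y) = log(y!)
}
```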
---

### Back to bike crashes

```
## # A tibble: 100 x 6
##    county       pop med_hh_income traffic_vol pct_rural crashes
##    <chr>      <dbl>         <dbl>       <dbl>     <dbl>   <dbl>
##  1 Alamance  166436          50.5         182        29      77
##  2 Alexander  37353          49.1          13        73       1
##  3 Alleghany  11161          39.7          28       100       1
##  4 Anson      24877          38            79        79       7
##  5 Ashe       27109          41.9          18        85       4
##  6 Avery      17505          41.7          35        89       5
##  7 Beaufort   47079          46.4          53        66      37
##  8 Bertie     19026          35.4          24        83      10
##  9 Bladen     33190          37            19        91       9
## 10 Brunswick 136744          60.2          43        43      88
## # ... with 90 more rows
```

- `pop`: county population
- `med_hh_income`: median household income in thousands
- `traffic_vol`: mean traffic volume per meter of major roadways
- `pct_rural`: percentage of county population living in rural areas

---

### Back to bike crashes

Let's fit a model with `traffic_vol` and `pct_rural` for now:

```r
m1 <- glm(crashes ~ traffic_vol + pct_rural, data = bike, 
          family = "poisson")
round(summary(m1)$coef, 6)
```

```
##              Estimate Std. Error    z value Pr(>|z|)
## (Intercept)  5.982181   0.053749 111.298625        0
## traffic_vol  0.001541   0.000166   9.262671        0
## pct_rural   -0.044558   0.000875 -50.919036        0
```

.question[
What might we conclude / interpret from this model?
]

---

### Newton-Raphson (rough) implementation

Newton-Raphson update steps for Poisson regression:

`\begin{align*} \boldsymbol\beta^{(t+1)} &= \boldsymbol\beta^{(t)} - \left(-\sum_{i = 1}^n e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\mathbf{X}_i\mathbf{X}_i^T \right)^{-1}\left(\sum_{i = 1}^n \left(y_i - e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\right)\mathbf{X}_i^T \right) \end{align*}`

Function for the score vector, given vector `beta`, matrix `X`, and vector `y`:

```r
d1func <- function(beta, X, y){
  # accumulate the score one observation at a time
  d1 <- rep(0, length(beta))
  for(i in 1:length(y)){
    d1 <- d1 + (y[i] - exp(X[i,] %*% beta)) %*% X[i,]
  }
  return(colSums(d1))
}
```

---

### Newton-Raphson (rough) implementation

Newton-Raphson update steps for Poisson regression:

`\begin{align*} \boldsymbol\beta^{(t+1)} &= \boldsymbol\beta^{(t)} - \left(-\sum_{i = 1}^n e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\mathbf{X}_i\mathbf{X}_i^T \right)^{-1}\left(\sum_{i = 1}^n \left(y_i - e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\right)\mathbf{X}_i^T \right) \end{align*}`

Function for the Hessian, given vector `beta`, matrix `X`, and vector `y`:

```r
d2func <- function(beta, X, y){
  # accumulate the Hessian one observation at a time
  d2 <- matrix(0, nrow = length(beta), ncol = length(beta))
  for(i in 1:length(y)){
    d2 <- d2 - t((exp(X[i,] %*% beta)) %*% X[i,]) %*% (X[i,])
  }
  return(d2)
}
```

---

### Newton-Raphson (rough) implementation

Some bookkeeping:

```r
beta <- c(mean(log(bike$crashes)), 0, 0)   # initial guess
X <- cbind(1, bike$traffic_vol, bike$pct_rural)
y <- bike$crashes
iter <- 1
delta <- 1
temp <- matrix(0, nrow = 500, ncol = 3)    # store the iterates
```

---

### Newton-Raphson (rough) implementation

Actual Newton-Raphson implementation:

`\begin{align*} \boldsymbol\beta^{(t+1)} &= \boldsymbol\beta^{(t)} - \left(-\sum_{i = 1}^n e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\mathbf{X}_i\mathbf{X}_i^T \right)^{-1}\left(\sum_{i = 1}^n \left(y_i - e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\right)\mathbf{X}_i^T \right) \end{align*}`

```r
while(delta > 0.000001 & iter < 500){
  old <- beta
  beta <- old - solve(d2func(beta = beta, X = X, y = y)) %*% 
    d1func(beta = beta, X = X, y = y)
  temp[iter,] <- beta
  delta <- sqrt(sum((beta - old)^2))
  iter <- iter + 1
}
```

---

### Newton-Raphson (rough) implementation

```r
iter
```

```
## [1] 22
```

```r
delta
```

```
## [1] 3.911961e-07
```

```r
beta
```

```
##             [,1]
## [1,]  5.98218054
## [2,]  0.00154064
## [3,] -0.04455809
```

```r
m1$coefficients
```

```
## (Intercept) traffic_vol   pct_rural 
##  5.98218054  0.00154064 -0.04455809
```
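---

### Newton-Raphson (rough) implementation

The observation-by-observation loops above mirror the math, but R is much faster with matrix operations. Below is a minimal loop-free sketch of the same score and Hessian (the names `d1vec` and `d2vec` are ours); both should agree with `d1func()` and `d2func()` up to numerical error:

```r
# Score: t(X) %*% (y - mu), where mu = exp(X beta)
d1vec <- function(beta, X, y){
  drop(t(X) %*% (y - exp(drop(X %*% beta))))
}

# Hessian: -t(X) %*% diag(mu) %*% X, computed by scaling the rows of X
d2vec <- function(beta, X, y){
  mu <- exp(drop(X %*% beta))
  -t(X) %*% (mu * X)
}
```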
---

### Back to bike crashes

<img src="glm-2_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /><img src="glm-2_files/figure-html/unnamed-chunk-11-2.png" style="display: block; margin: auto;" />

---

### Back to bike crashes

`\begin{align*} \log\left(E(Y | \mathbf{X})\right) &= \beta_0 + \beta_1(traffic) + \beta_2(rural) + \beta_3(pop) \end{align*}`

```r
m2 <- glm(crashes ~ traffic_vol + pct_rural + pop, data = bike, 
          family = "poisson")
```

`\begin{align*} \log\left( \frac{E(Y | \mathbf{X})}{pop} \right) &= \beta_0 + \beta_1(traffic) + \beta_2(rural) \end{align*}`

```r
m3 <- glm(crashes ~ traffic_vol + pct_rural, offset = log(pop), 
          data = bike, family = "poisson")
```

.question[
What are the differences in the two models above?
]

---

### Back to bike crashes

`\begin{align*} \log\left( \frac{E(Y | \mathbf{X})}{pop} \right) &= \beta_0 + \beta_1(traffic) + \beta_2(rural) \\ \log\left(E(Y | \mathbf{X})\right) &= \beta_0 + \beta_1(traffic) + \beta_2(rural) + \log(pop) \end{align*}`

Here, we are using `log(pop)` as an .vocab[offset]: a term added to the linear predictor with its coefficient fixed at 1. This means that our dependent variable is effectively a *rate* (crashes per unit population), even though we supply counts, and we can look at covariate effects directly on this rate.

If we instead use `pop` as a covariate, the response is no longer a rate of bike crashes per unit population. However, we would then be able to assess the association between population and the number of bike crashes (conditional on traffic volume and urbanicity).

---

### Back to bike crashes

```r
round(summary(m1)$coef, 6)
```

```
##              Estimate Std. Error    z value Pr(>|z|)
## (Intercept)  5.982181   0.053749 111.298625        0
## traffic_vol  0.001541   0.000166   9.262671        0
## pct_rural   -0.044558   0.000875 -50.919036        0
```

```r
round(summary(m2)$coef, 6)
```

```
##              Estimate Std. Error    z value Pr(>|z|)
## (Intercept)  5.655725   0.054837 103.136325 0.000000
## traffic_vol -0.000093   0.000179  -0.518756 0.603931
## pct_rural   -0.037761   0.000878 -43.015409 0.000000
## pop          0.000001   0.000000  30.215337 0.000000
```

```r
round(summary(m3)$coef, 6)
```

```
##              Estimate Std. Error     z value Pr(>|z|)
## (Intercept) -6.916803   0.054480 -126.961100 0.000000
## traffic_vol -0.000047   0.000171   -0.272118 0.785531
## pct_rural   -0.010936   0.000857  -12.766690 0.000000
```

---

### Poisson regression with offset

.question[
Can we simply use `bike$crashes/bike$pop` as our outcome variable in the code we've already written?
]

```r
beta <- c(mean(log(bike$crashes)), 0, 0)
X <- cbind(1, bike$traffic_vol, bike$pct_rural)
y <- bike$crashes / bike$pop
iter <- 1
delta <- 1
temp <- matrix(0, nrow = 500, ncol = 3)

while(delta > 0.000001 & iter < 500){
  old <- beta
  beta <- old - solve(d2func(beta = beta, X = X, y = y)) %*% 
    d1func(beta = beta, X = X, y = y)
  temp[iter,] <- beta
  delta <- sqrt(sum((beta - old)^2))
  iter <- iter + 1
}
```

---

### Poisson regression with offset

```r
round(beta, 6)
```

```
##           [,1]
## [1,] -6.810266
## [2,]  0.000314
## [3,] -0.011783
```

```r
round(m3$coefficients, 6)
```

```
## (Intercept) traffic_vol   pct_rural 
##   -6.916803   -0.000047   -0.010936
```

.question[
They're close, but not quite right. Did something go wrong?
]

---

### Poisson regression with offset

```r
m3_wrong <- glm(crashes/pop ~ traffic_vol + pct_rural, data = bike, 
                family = "poisson")
round(m3_wrong$coefficients, 6)
```

```
## (Intercept) traffic_vol   pct_rural 
##   -6.810266    0.000314   -0.011783
```

```r
round(beta, 6)
```

```
##           [,1]
## [1,] -6.810266
## [2,]  0.000314
## [3,] -0.011783
```

.question[
What's happening? (keep in mind, all output on this page is **wrong**)
]

---

### Poisson regression with offset

Let's denote the offset term by `\(\omega\)`. If we directly use `crashes/pop` as the outcome in our Poisson regression likelihood, we would have a log-likelihood along the lines of

`\begin{align*} \log \mathcal{L} &\propto \sum_{i = 1}^n \left( \frac{y_i}{\omega_i}\mathbf{X}_i\boldsymbol\beta - e^{\mathbf{X}_i\boldsymbol\beta} \right) \end{align*}`

This is incorrect. We cannot assume `crashes/pop` has a Poisson distribution; a Poisson random variable takes non-negative integer values, and these rescaled outcomes are not counts.
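---

### Poisson regression with offset

As an aside, `glm()` also accepts an offset inside the model formula via `offset()`. A minimal sketch (the object name `m3_alt` is ours); this specification should reproduce `m3` exactly:

```r
# Same model as m3: the offset is supplied in the formula rather
# than through the offset argument
m3_alt <- glm(crashes ~ traffic_vol + pct_rural + offset(log(pop)),
              data = bike, family = "poisson")
```

Either way, the model is a Poisson model for the *counts* with a population offset, not a model for the rescaled outcome `crashes/pop`.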
---

### Poisson regression with offset

If we write the model and log-likelihood for a Poisson regression with offset correctly, we have:

`\begin{align*} \log\left(E(Y | \mathbf{X})\right) &= \beta_0 + \mathbf{X}^T\boldsymbol\beta + \log \omega \\ \log \mathcal{L} &\propto \sum_{i = 1}^n \left( y_i\mathbf{X}_i\boldsymbol\beta - \omega_i e^{\mathbf{X}_i\boldsymbol\beta} \right) \end{align*}`

Thus, we must use this *correct* log-likelihood to determine the score vector and Hessian for our Newton-Raphson implementation:

`\begin{align*} \nabla \log \mathcal{L} &= \sum_{i = 1}^n \left(y_i - \omega_i e^{\mathbf{X}_i\boldsymbol\beta}\right)\mathbf{X}_i^T\\ \nabla^2 \log \mathcal{L} &= -\sum_{i = 1}^n \omega_i e^{\mathbf{X}_i\boldsymbol\beta}\mathbf{X}_i\mathbf{X}_i^T \end{align*}`

---

### Poisson regression with offset

Newton-Raphson update steps for Poisson regression with offset:

`\begin{align*} \boldsymbol\beta^{(t+1)} &= \boldsymbol\beta^{(t)} - \left(-\sum_{i = 1}^n \omega_i e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\mathbf{X}_i\mathbf{X}_i^T \right)^{-1}\left(\sum_{i = 1}^n \left(y_i - \omega_i e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\right)\mathbf{X}_i^T \right) \end{align*}`

In writing code, we must now specify the offset `\(\omega\)` in addition to the observed data `\(\mathbf{X}\)` and `\(y\)` and the candidate `\(\boldsymbol\beta\)`s.

---

### Poisson regression with offset

Functions for the score vector and Hessian (including the offset term):

```r
d1ofs <- function(beta, X, y, offset){
  d1 <- rep(0, length(beta))
  for(i in 1:length(y)){
    d1 <- d1 + (y[i] - offset[i] * exp(X[i,] %*% beta)) %*% X[i,]
  }
  return(colSums(d1))
}

d2ofs <- function(beta, X, y, offset){
  d2 <- matrix(0, nrow = length(beta), ncol = length(beta))
  for(i in 1:length(y)){
    d2 <- d2 - offset[i] * t((exp(X[i,] %*% beta)) %*% X[i,]) %*% (X[i,])
  }
  return(d2)
}
```

---

### Poisson regression with offset

Implementing Newton-Raphson:

```r
beta <- c(mean(log(bike$crashes)), 0, 0)
X <- cbind(1, bike$traffic_vol, bike$pct_rural)
y <- bike$crashes
offset <- bike$pop
iter <- 1
delta <- 1
temp <- matrix(0, nrow = 500, ncol = 3)

while(delta > 0.000001 & iter < 500){
  old <- beta
  beta <- old - solve(d2ofs(beta = beta, X = X, y = y, offset = offset)) %*% 
    d1ofs(beta = beta, X = X, y = y, offset = offset)
  temp[iter,] <- beta
  delta <- sqrt(sum((beta - old)^2))
  iter <- iter + 1
}
```

---

### Poisson regression with offset

```r
round(beta, 6)
```

```
##           [,1]
## [1,] -6.916803
## [2,] -0.000047
## [3,] -0.010936
```

```r
round(m3$coefficients, 6)
```

```
## (Intercept) traffic_vol   pct_rural 
##   -6.916803   -0.000047   -0.010936
```

Our manual Newton-Raphson code lines up with `glm()`, as expected.

---

### Fisher Scoring

<img src="img/fisher-scoring.png" width="80%" style="display: block; margin: auto;" />

---

### Fisher Scoring

Fisher scoring replaces the negative Hessian `\(-\nabla^2 \log \mathcal{L}\)` (the observed information) with the expected Fisher information

`$$\mathcal{I}(\boldsymbol\theta) = E\left[\left(\nabla\log\mathcal{L} \right)\left(\nabla\log\mathcal{L} \right)^T \right],$$`

which equals `\(-E\left[\nabla^2 \log \mathcal{L}\right]\)` under standard regularity conditions. It's often easier to implement because we don't need to take second derivatives.

Fisher scoring is usually a bit more stable than vanilla Newton-Raphson, and its initial steps usually "get to where we want" faster.

Newton-Raphson and Fisher scoring update steps are identical for GLMs under canonical links.
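---

### Fisher Scoring

Because the updates coincide for the canonical log link, the Poisson fit can also be organized as iteratively reweighted least squares (IRLS), which is how `glm()` fits models internally. A minimal sketch under that canonical-link setting, with no offset for simplicity (the function name `irls_pois` is ours):

```r
# Iteratively reweighted least squares for Poisson regression
# (canonical log link): each step solves a weighted least squares
# problem with weights mu and working response z
irls_pois <- function(X, y, tol = 0.000001, maxit = 500){
  beta <- rep(0, ncol(X))
  for(iter in 1:maxit){
    eta <- drop(X %*% beta)        # linear predictor
    mu  <- exp(eta)                # fitted means
    z   <- eta + (y - mu) / mu     # working response
    beta_new <- drop(solve(t(X) %*% (mu * X), t(X) %*% (mu * z)))
    delta <- sqrt(sum((beta_new - beta)^2))
    beta <- beta_new
    if(delta < tol) break
  }
  beta
}
```

With the design matrix and outcome used for `m1` earlier, `irls_pois(X, y)` should reproduce `m1$coefficients`.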