class: center, middle, inverse, title-slide

# Fitting linear models in R (2)

### Yue Jiang

### Duke University

---

### Estimating bike crashes in NC counties

<img src="img/bike_crash.jpg" width="100%" style="display: block; margin: auto;" />

<!-- .center[Image credit: Petar Milošević, [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0), via Wikimedia Commons] -->

---

### Newton-Raphson in higher dimensions

.vocab[Score vector] and .vocab[Hessian] for `\(\log \mathcal{L}(\boldsymbol\theta | \mathbf{X})\)` with `\(\boldsymbol\theta = (\theta_1, \cdots, \theta_p)^T\)`:

`$$\nabla \log \mathcal{L} = \begin{pmatrix} \frac{\partial \log \mathcal{L}}{\partial \theta_1}\\ \vdots\\ \frac{\partial \log \mathcal{L}}{\partial \theta_p} \end{pmatrix}$$`

`$$\nabla^2 \log \mathcal{L} = \begin{pmatrix} \frac{\partial^2 \log\mathcal{L}}{\partial \theta_1^2} & \frac{\partial^2 \log\mathcal{L}}{\partial \theta_1 \partial \theta_2} & \cdots & \frac{\partial^2 \log\mathcal{L}}{\partial \theta_1 \partial \theta_p}\\ \frac{\partial^2 \log\mathcal{L}}{\partial \theta_2 \partial \theta_1} & \frac{\partial^2 \log\mathcal{L}}{\partial \theta_2^2} & \cdots & \frac{\partial^2 \log\mathcal{L}}{\partial \theta_2 \partial \theta_p} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 \log\mathcal{L}}{\partial \theta_p \partial \theta_1} & \frac{\partial^2 \log\mathcal{L}}{\partial \theta_p \partial \theta_2} & \cdots & \frac{\partial^2 \log\mathcal{L}}{\partial \theta_p^2} \end{pmatrix}$$`

---

### Newton-Raphson in higher dimensions

We can modify the Newton-Raphson algorithm for higher dimensions:

- Start with an initial guess `\(\boldsymbol\theta^{(0)}\)`
- Iterate `\(\boldsymbol\theta^{(t + 1)} = \boldsymbol\theta^{(t)} - \left(\nabla^2\log\mathcal{L}(\boldsymbol\theta^{(t)} | \mathbf{X}) \right)^{-1} \left( \nabla \log\mathcal{L}(\boldsymbol\theta^{(t)} | \mathbf{X}) \right)\)`
- Stop when a convergence criterion is satisfied

Under certain conditions, a unique global maximum exists and Newton-Raphson converges to it; this is again guaranteed in many common applications. Computing the Hessian can be computationally demanding (and annoying), but there are ways around it in practice.

---

### Poisson regression

`\begin{align*} \log \mathcal{L} &= \sum_{i = 1}^n \left( y_i\mathbf{X}_i\boldsymbol\beta - e^{\mathbf{X}_i\boldsymbol\beta} - \log y_i! \right)\\ \nabla \log \mathcal{L} &= \sum_{i = 1}^n \left(y_i - e^{\mathbf{X}_i\boldsymbol\beta}\right)\mathbf{X}_i^T\\ \nabla^2 \log \mathcal{L} &= -\sum_{i = 1}^n e^{\mathbf{X}_i\boldsymbol\beta}\mathbf{X}_i\mathbf{X}_i^T \end{align*}`

Newton-Raphson update steps for Poisson regression:

`\begin{align*} \boldsymbol\beta^{(t+1)} &= \boldsymbol\beta^{(t)} - \left(-\sum_{i = 1}^n e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\mathbf{X}_i\mathbf{X}_i^T \right)^{-1}\left(\sum_{i = 1}^n \left(y_i - e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\right)\mathbf{X}_i^T \right) \end{align*}`
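---

### Poisson regression

As a sanity check on the derivations above, the log-likelihood itself is straightforward to evaluate at any candidate `\(\boldsymbol\beta\)`. Here is a minimal sketch (the function name `loglik_pois` is ours, not part of the fitting code that follows); evaluating it after each Newton-Raphson step should show the log-likelihood increasing toward the maximum:

```r
# Poisson regression log-likelihood at a candidate beta:
# sum_i [ y_i * X_i beta - exp(X_i beta) - log(y_i!) ]
loglik_pois <- function(beta, X, y){
  eta <- drop(X %*% beta)                   # linear predictors X_i beta
  sum(y * eta - exp(eta) - lfactorial(y))   # lfactorial(y) = log(y!)
}
```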
---

### Back to bike crashes

```
## # A tibble: 100 x 6
##    county       pop med_hh_income traffic_vol pct_rural crashes
##    <chr>      <dbl>         <dbl>       <dbl>     <dbl>   <dbl>
##  1 Alamance  166436          50.5         182        29      77
##  2 Alexander  37353          49.1          13        73       1
##  3 Alleghany  11161          39.7          28       100       1
##  4 Anson      24877          38            79        79       7
##  5 Ashe       27109          41.9          18        85       4
##  6 Avery      17505          41.7          35        89       5
##  7 Beaufort   47079          46.4          53        66      37
##  8 Bertie     19026          35.4          24        83      10
##  9 Bladen     33190          37            19        91       9
## 10 Brunswick 136744          60.2          43        43      88
## # ... with 90 more rows
```

- `pop`: county population
- `med_hh_income`: median household income in thousands
- `traffic_vol`: mean traffic volume per meter of major roadways
- `pct_rural`: percentage of county population living in rural areas

---

### Back to bike crashes

Let's fit a model with `traffic_vol` and `pct_rural` for now:

```r
m1 <- glm(crashes ~ traffic_vol + pct_rural, data = bike, 
          family = "poisson")
round(summary(m1)$coef, 6)
```

```
##              Estimate Std. Error    z value Pr(>|z|)
## (Intercept)  5.982181   0.053749 111.298625        0
## traffic_vol  0.001541   0.000166   9.262671        0
## pct_rural   -0.044558   0.000875 -50.919036        0
```

.question[
What might we conclude / interpret from this model?
]

---

### Newton-Raphson (rough) implementation

Newton-Raphson update steps for Poisson regression:

`\begin{align*} \boldsymbol\beta^{(t+1)} &= \boldsymbol\beta^{(t)} - \left(-\sum_{i = 1}^n e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\mathbf{X}_i\mathbf{X}_i^T \right)^{-1}\left(\sum_{i = 1}^n \left(y_i - e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\right)\mathbf{X}_i^T \right) \end{align*}`

Function for the score vector, given vector `beta`, matrix `X`, and vector `y`:

```r
d1func <- function(beta, X, y){
  # accumulate the score one observation at a time
  d1 <- rep(0, length(beta))
  for(i in 1:length(y)){
    d1 <- d1 + (y[i] - exp(X[i,] %*% beta)) %*% X[i,]
  }
  return(colSums(d1))
}
```

---

### Newton-Raphson (rough) implementation

Newton-Raphson update steps for Poisson regression:

`\begin{align*} \boldsymbol\beta^{(t+1)} &= \boldsymbol\beta^{(t)} - \left(-\sum_{i = 1}^n e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\mathbf{X}_i\mathbf{X}_i^T \right)^{-1}\left(\sum_{i = 1}^n \left(y_i - e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\right)\mathbf{X}_i^T \right) \end{align*}`

Function for the Hessian, given vector `beta`, matrix `X`, and vector `y`:

```r
d2func <- function(beta, X, y){
  # accumulate the Hessian one observation at a time
  d2 <- matrix(0, nrow = length(beta), ncol = length(beta))
  for(i in 1:length(y)){
    d2 <- d2 - t((exp(X[i,] %*% beta)) %*% X[i,]) %*% (X[i,])
  }
  return(d2)
}
```

---

### Newton-Raphson (rough) implementation

Some bookkeeping:

```r
beta <- c(mean(log(bike$crashes)), 0, 0)   # initial guess
X <- cbind(1, bike$traffic_vol, bike$pct_rural)
y <- bike$crashes
iter <- 1
delta <- 1
temp <- matrix(0, nrow = 500, ncol = 3)    # store the iterates
```

---

### Newton-Raphson (rough) implementation

Actual Newton-Raphson implementation:

`\begin{align*} \boldsymbol\beta^{(t+1)} &= \boldsymbol\beta^{(t)} - \left(-\sum_{i = 1}^n e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\mathbf{X}_i\mathbf{X}_i^T \right)^{-1}\left(\sum_{i = 1}^n \left(y_i - e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\right)\mathbf{X}_i^T \right) \end{align*}`

```r
while(delta > 0.000001 & iter < 500){
  old <- beta
  beta <- old - solve(d2func(beta = beta, X = X, y = y)) %*% 
    d1func(beta = beta, X = X, y = y)
  temp[iter,] <- beta
  delta <- sqrt(sum((beta - old)^2))
  iter <- iter + 1
}
```

---

### Newton-Raphson (rough) implementation

```r
iter
```

```
## [1] 22
```

```r
delta
```

```
## [1] 3.911961e-07
```

```r
beta
```

```
##             [,1]
## [1,]  5.98218054
## [2,]  0.00154064
## [3,] -0.04455809
```

```r
m1$coefficients
```

```
## (Intercept) traffic_vol   pct_rural 
##  5.98218054  0.00154064 -0.04455809
```
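---

### Newton-Raphson (rough) implementation

The observation-by-observation loops above mirror the math, but R is much faster with matrix operations. Below is a minimal loop-free sketch of the same score and Hessian (the names `d1vec` and `d2vec` are ours); both should agree with `d1func()` and `d2func()` up to numerical error:

```r
# Score: t(X) %*% (y - mu), where mu = exp(X beta)
d1vec <- function(beta, X, y){
  drop(t(X) %*% (y - exp(drop(X %*% beta))))
}

# Hessian: -t(X) %*% diag(mu) %*% X, computed by scaling the rows of X
d2vec <- function(beta, X, y){
  mu <- exp(drop(X %*% beta))
  -t(X) %*% (mu * X)
}
```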
---

### Back to bike crashes

<img src="glm-2_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /><img src="glm-2_files/figure-html/unnamed-chunk-11-2.png" style="display: block; margin: auto;" />

---

### Back to bike crashes

`\begin{align*} \log\left(E(Y | \mathbf{X})\right) &= \beta_0 + \beta_1(traffic) + \beta_2(rural) + \beta_3(pop) \end{align*}`

```r
m2 <- glm(crashes ~ traffic_vol + pct_rural + pop, data = bike, 
          family = "poisson")
```

`\begin{align*} \log\left( \frac{E(Y | \mathbf{X})}{pop} \right) &= \beta_0 + \beta_1(traffic) + \beta_2(rural) \end{align*}`

```r
m3 <- glm(crashes ~ traffic_vol + pct_rural, offset = log(pop), 
          data = bike, family = "poisson")
```

.question[
What are the differences in the two models above?
]

---

### Back to bike crashes

`\begin{align*} \log\left( \frac{E(Y | \mathbf{X})}{pop} \right) &= \beta_0 + \beta_1(traffic) + \beta_2(rural) \\ \log\left(E(Y | \mathbf{X})\right) &= \beta_0 + \beta_1(traffic) + \beta_2(rural) + \log(pop) \end{align*}`

Here, we are using `log(pop)` as an .vocab[offset]: a term added to the linear predictor with its coefficient fixed at 1. This means that our dependent variable is effectively a *rate* (crashes per unit population), even though we supply counts, and we can look at covariate effects directly on this rate.

If we instead use `pop` as a covariate, the response is no longer a rate of bike crashes per unit population. However, we would then be able to assess the association between population and the number of bike crashes (conditional on traffic volume and urbanicity).

---

### Back to bike crashes

```r
round(summary(m1)$coef, 6)
```

```
##              Estimate Std. Error    z value Pr(>|z|)
## (Intercept)  5.982181   0.053749 111.298625        0
## traffic_vol  0.001541   0.000166   9.262671        0
## pct_rural   -0.044558   0.000875 -50.919036        0
```

```r
round(summary(m2)$coef, 6)
```

```
##              Estimate Std. Error    z value Pr(>|z|)
## (Intercept)  5.655725   0.054837 103.136325 0.000000
## traffic_vol -0.000093   0.000179  -0.518756 0.603931
## pct_rural   -0.037761   0.000878 -43.015409 0.000000
## pop          0.000001   0.000000  30.215337 0.000000
```

```r
round(summary(m3)$coef, 6)
```

```
##              Estimate Std. Error     z value Pr(>|z|)
## (Intercept) -6.916803   0.054480 -126.961100 0.000000
## traffic_vol -0.000047   0.000171   -0.272118 0.785531
## pct_rural   -0.010936   0.000857  -12.766690 0.000000
```

---

### Poisson regression with offset

.question[
Can we simply use `bike$crashes/bike$pop` as our outcome variable in the code we've already written?
]

```r
beta <- c(mean(log(bike$crashes)), 0, 0)
X <- cbind(1, bike$traffic_vol, bike$pct_rural)
y <- bike$crashes / bike$pop
iter <- 1
delta <- 1
temp <- matrix(0, nrow = 500, ncol = 3)

while(delta > 0.000001 & iter < 500){
  old <- beta
  beta <- old - solve(d2func(beta = beta, X = X, y = y)) %*% 
    d1func(beta = beta, X = X, y = y)
  temp[iter,] <- beta
  delta <- sqrt(sum((beta - old)^2))
  iter <- iter + 1
}
```

---

### Poisson regression with offset

```r
round(beta, 6)
```

```
##           [,1]
## [1,] -6.810266
## [2,]  0.000314
## [3,] -0.011783
```

```r
round(m3$coefficients, 6)
```

```
## (Intercept) traffic_vol   pct_rural 
##   -6.916803   -0.000047   -0.010936
```

.question[
They're close, but not quite right. Did something go wrong?
]

---

### Poisson regression with offset

```r
m3_wrong <- glm(crashes/pop ~ traffic_vol + pct_rural, data = bike, 
                family = "poisson")
round(m3_wrong$coefficients, 6)
```

```
## (Intercept) traffic_vol   pct_rural 
##   -6.810266    0.000314   -0.011783
```

```r
round(beta, 6)
```

```
##           [,1]
## [1,] -6.810266
## [2,]  0.000314
## [3,] -0.011783
```

.question[
What's happening? (keep in mind, all output on this page is **wrong**)
]

---

### Poisson regression with offset

Let's denote the offset term by `\(\omega\)`. If we directly use `crashes/pop` as the outcome in our Poisson regression likelihood, we would have a log-likelihood along the lines of

`\begin{align*} \log \mathcal{L} &\propto \sum_{i = 1}^n \left( \frac{y_i}{\omega_i}\mathbf{X}_i\boldsymbol\beta - e^{\mathbf{X}_i\boldsymbol\beta} \right) \end{align*}`

This is incorrect. We cannot assume `crashes/pop` has a Poisson distribution; a Poisson random variable takes non-negative integer values, and these rescaled outcomes are not counts.
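---

### Poisson regression with offset

As an aside, `glm()` also accepts an offset inside the model formula via `offset()`. A minimal sketch (the object name `m3_alt` is ours); this specification should reproduce `m3` exactly:

```r
# Same model as m3: the offset is supplied in the formula rather
# than through the offset argument
m3_alt <- glm(crashes ~ traffic_vol + pct_rural + offset(log(pop)),
              data = bike, family = "poisson")
```

Either way, the model is a Poisson model for the *counts* with a population offset, not a model for the rescaled outcome `crashes/pop`.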
---

### Poisson regression with offset

If we write the model and log-likelihood for a Poisson regression with offset correctly, we have:

`\begin{align*} \log\left(E(Y | \mathbf{X})\right) &= \beta_0 + \mathbf{X}^T\boldsymbol\beta + \log \omega \\ \log \mathcal{L} &\propto \sum_{i = 1}^n \left( y_i\mathbf{X}_i\boldsymbol\beta - \omega_i e^{\mathbf{X}_i\boldsymbol\beta} \right) \end{align*}`

Thus, we must use this *correct* log-likelihood to determine the score vector and Hessian for our Newton-Raphson implementation:

`\begin{align*} \nabla \log \mathcal{L} &= \sum_{i = 1}^n \left(y_i - \omega_i e^{\mathbf{X}_i\boldsymbol\beta}\right)\mathbf{X}_i^T\\ \nabla^2 \log \mathcal{L} &= -\sum_{i = 1}^n \omega_i e^{\mathbf{X}_i\boldsymbol\beta}\mathbf{X}_i\mathbf{X}_i^T \end{align*}`

---

### Poisson regression with offset

Newton-Raphson update steps for Poisson regression with offset:

`\begin{align*} \boldsymbol\beta^{(t+1)} &= \boldsymbol\beta^{(t)} - \left(-\sum_{i = 1}^n \omega_i e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\mathbf{X}_i\mathbf{X}_i^T \right)^{-1}\left(\sum_{i = 1}^n \left(y_i - \omega_i e^{\mathbf{X}_i\boldsymbol\beta^{(t)}}\right)\mathbf{X}_i^T \right) \end{align*}`

In writing code, we must now specify the offset `\(\omega\)` in addition to the observed data `\(\mathbf{X}\)` and `\(y\)` and the candidate `\(\boldsymbol\beta\)`s.

---

### Poisson regression with offset

Functions for the score vector and Hessian (including the offset term):

```r
d1ofs <- function(beta, X, y, offset){
  d1 <- rep(0, length(beta))
  for(i in 1:length(y)){
    d1 <- d1 + (y[i] - offset[i] * exp(X[i,] %*% beta)) %*% X[i,]
  }
  return(colSums(d1))
}

d2ofs <- function(beta, X, y, offset){
  d2 <- matrix(0, nrow = length(beta), ncol = length(beta))
  for(i in 1:length(y)){
    d2 <- d2 - offset[i] * t((exp(X[i,] %*% beta)) %*% X[i,]) %*% (X[i,])
  }
  return(d2)
}
```

---

### Poisson regression with offset

Implementing Newton-Raphson:

```r
beta <- c(mean(log(bike$crashes)), 0, 0)
X <- cbind(1, bike$traffic_vol, bike$pct_rural)
y <- bike$crashes
offset <- bike$pop
iter <- 1
delta <- 1
temp <- matrix(0, nrow = 500, ncol = 3)

while(delta > 0.000001 & iter < 500){
  old <- beta
  beta <- old - solve(d2ofs(beta = beta, X = X, y = y, offset = offset)) %*% 
    d1ofs(beta = beta, X = X, y = y, offset = offset)
  temp[iter,] <- beta
  delta <- sqrt(sum((beta - old)^2))
  iter <- iter + 1
}
```

---

### Poisson regression with offset

```r
round(beta, 6)
```

```
##           [,1]
## [1,] -6.916803
## [2,] -0.000047
## [3,] -0.010936
```

```r
round(m3$coefficients, 6)
```

```
## (Intercept) traffic_vol   pct_rural 
##   -6.916803   -0.000047   -0.010936
```

Our manual Newton-Raphson code lines up with `glm()`, as expected.

---

### Fisher Scoring

<img src="img/fisher-scoring.png" width="80%" style="display: block; margin: auto;" />

---

### Fisher Scoring

Fisher scoring replaces the negative Hessian `\(-\nabla^2 \log \mathcal{L}\)` (the observed information) with the expected Fisher information

`$$\mathcal{I}(\boldsymbol\theta) = E\left[\left(\nabla\log\mathcal{L} \right)\left(\nabla\log\mathcal{L} \right)^T \right],$$`

which equals `\(-E\left[\nabla^2 \log \mathcal{L}\right]\)` under standard regularity conditions. It's often easier to implement because we don't need to take second derivatives.

Fisher scoring is usually a bit more stable than vanilla Newton-Raphson, and its initial steps usually "get to where we want" faster.

Newton-Raphson and Fisher scoring update steps are identical for GLMs under canonical links.
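---

### Fisher Scoring

Because the updates coincide for the canonical log link, the Poisson fit can also be organized as iteratively reweighted least squares (IRLS), which is how `glm()` fits models internally. A minimal sketch under that canonical-link setting, with no offset for simplicity (the function name `irls_pois` is ours):

```r
# Iteratively reweighted least squares for Poisson regression
# (canonical log link): each step solves a weighted least squares
# problem with weights mu and working response z
irls_pois <- function(X, y, tol = 0.000001, maxit = 500){
  beta <- rep(0, ncol(X))
  for(iter in 1:maxit){
    eta <- drop(X %*% beta)        # linear predictor
    mu  <- exp(eta)                # fitted means
    z   <- eta + (y - mu) / mu     # working response
    beta_new <- drop(solve(t(X) %*% (mu * X), t(X) %*% (mu * z)))
    delta <- sqrt(sum((beta_new - beta)^2))
    beta <- beta_new
    if(delta < tol) break
  }
  beta
}
```

With the design matrix and outcome used for `m1` earlier, `irls_pois(X, y)` should reproduce `m1$coefficients`.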