Poisson Regression

class: center, middle, inverse, title-slide

# Poisson Regression
### Dr. Maria Tackett
### 04.17.19

---

### Announcements

- Fill out team feedback #3

- Project presentations and write up due May 1 at 2p

- Friday's Lab: 
    - Every team member should be at lab this week
    - Use part of this time to schedule times you will work on the project together
    
- Exam 2 extra credit: If the response rate for the course evaluations is 90%, everyone gets +1 pt on their Exam 2 score (out of 40 pts)

---

## Poisson response variables

The following are examples of scenarios with Poisson response variables:

- Are the **number of motorcycle deaths** in a given year related to a state’s helmet laws?

- Does the **number of employers** conducting on-campus interviews during a year differ for public and private colleges?

- Does the **daily number of asthma-related visits** to an Emergency Room differ depending on air pollution indices?

- Has the **number of deformed fish** in randomly selected Minnesota lakes been affected by changes in trace minerals in the water over the last decade?

---

### Poisson Distribution

- If `$Y$` follows a Poisson distribution, then

`$$P(Y=y) = \frac{\exp\{-\lambda\}\lambda^y}{y!} \hspace{10mm} y=0,1,2,\ldots$$`
<br>

- Features of the Poisson distribution:
  + Mean and variance are equal `$(\lambda)$`
  + Distribution tends to be skewed right, especially when the mean is small
  + If the mean is larger, it can be approximated by a Normal distribution

---

### Simulated Poisson distributions

---

### Simulated Poisson distributions

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> Mean </th>
   <th style="text-align:right;"> Variance </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> lambda=2 </td>
   <td style="text-align:right;"> 2.00740 </td>
   <td style="text-align:right;"> 2.015245 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> lambda=5 </td>
   <td style="text-align:right;"> 4.99130 </td>
   <td style="text-align:right;"> 4.968734 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> lambda=20 </td>
   <td style="text-align:right;"> 19.99546 </td>
   <td style="text-align:right;"> 19.836958 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> lambda=100 </td>
   <td style="text-align:right;"> 100.02276 </td>
   <td style="text-align:right;"> 100.527647 </td>
  </tr>
</tbody>
</table>

---

### Poisson Regression

- We want `$\lambda$` to be a function of predictor variables `$X_1, \ldots, X_p$`

.question[
Why is a multiple linear regression model not appropriate?
]

- `$\lambda$` must be greater than or equal to 0 for any combination of predictor variables
- Constant variance assumption will be violated!

---

### MLR vs. Poisson

![](img/22/poisson_ols.png)

---

### Poisson Regression

- If the observed values `$Y_i$` are Poisson, then we can model using a <font class="vocab">Poisson regression model</font> of the form

`$$\log(\lambda_i) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_p X_{pi}$$`
<br>

- Note that we don't have an error term, since `$\lambda$` determines the mean and variance of the Poisson random variable

---
 
### Interpreting Model Coefficients

- <font class="vocab">Slope, `$\beta_j$`: </font>
    - **Quantitative Predictor**:  When `$X_j$` increases by one unit, the expected count of `$Y$` changes by a multiplicative factor of `$\exp\{\beta_j\}$`, holding all else constant
    
    - **Categorical Predictor**: The expected count for category `$k$` is `$\exp\{\beta_j\}$` times the expected count for the baseline category, holding all else constant

- <font class="vocab">Intercept, `$\beta_0$`: </font>: When `$X$` is 0, the expected count of `$Y$` is `$\exp\{\beta_0\}$`

---

### Example: Age, Gender, Pulse Rate

- <font class="vocab">Goal:</font> We want to use `age` and `gender` to understand variation in pulse rate
- We will use adults age 20 to 39 in the `NHANES` data set to answer this question

- **Reponse**
  + <font class="vocab">`Pulse`: </font>Number of heartbeats in 60 seconds

- **Explanatory**
  + <font class="vocab">`Age`: </font>Age in years. Subjects 80 years or older recorded as 80
  + <font class="vocab">`Gender`: </font>male/female

---

### Example: Age, Gender, Pulse Rate

- We calculate the Poisson regression model using the mean-centered Age

```r
model1 <- glm(Pulse ~ ageCent + Gender, data = nhanes,
              family = "poisson")
kable(tidy(model1, conf.int = T),format="html")
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
   <th style="text-align:right;"> conf.low </th>
   <th style="text-align:right;"> conf.high </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 4.3416799 </td>
   <td style="text-align:right;"> 0.0031800 </td>
   <td style="text-align:right;"> 1365.30794 </td>
   <td style="text-align:right;"> 0.0000000 </td>
   <td style="text-align:right;"> 4.3354407 </td>
   <td style="text-align:right;"> 4.3479061 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ageCent </td>
   <td style="text-align:right;"> -0.0007360 </td>
   <td style="text-align:right;"> 0.0003933 </td>
   <td style="text-align:right;"> -1.87118 </td>
   <td style="text-align:right;"> 0.0613201 </td>
   <td style="text-align:right;"> -0.0015069 </td>
   <td style="text-align:right;"> 0.0000349 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Gendermale </td>
   <td style="text-align:right;"> -0.0595673 </td>
   <td style="text-align:right;"> 0.0045620 </td>
   <td style="text-align:right;"> -13.05723 </td>
   <td style="text-align:right;"> 0.0000000 </td>
   <td style="text-align:right;"> -0.0685091 </td>
   <td style="text-align:right;"> -0.0506263 </td>
  </tr>
</tbody>
</table>

.question[

1. Write the model equation.

2. Interpret the intercept in the context of the problem.

3. Interpret the coefficient of `ageCent` in the context of the problem.

]

---

### Drop In Deviance Test

- We would like to test if there is a significant interaction between `Age` and `Gender`

- Since we have a generalized linear model, we can use the Drop In Deviance Test (similar test to logistic regression)

```r
model1 <- glm(Pulse ~ ageCent + Gender, data = nhanes,
              family = "poisson")
model2 <- glm(Pulse ~ ageCent + Gender + ageCent*Gender,
              data = nhanes, family = "poisson")

kable(anova(model1,model2,test="Chisq"),format="html")
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Resid. Df </th>
   <th style="text-align:right;"> Resid. Dev </th>
   <th style="text-align:right;"> Df </th>
   <th style="text-align:right;"> Deviance </th>
   <th style="text-align:right;"> Pr(&gt;Chi) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 2575 </td>
   <td style="text-align:right;"> 4536.813 </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 2574 </td>
   <td style="text-align:right;"> 4536.345 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0.4686061 </td>
   <td style="text-align:right;"> 0.4936291 </td>
  </tr>
</tbody>
</table>
--

- There is not sufficient evidence that the interaction is significant, so we won't include it in the model.

---

### Model Assumptions

1. **Poisson Response**: Response variable is a count per unit of time or space

2. **Independence**: The observations are independent of one another

3. **Mean = Variance**

4. **Linearity**: `$\log(\lambda)$` is a linear function of the predictors

---

### Model Diagnostics

- The raw residual for the `$i^{th}$` observation, `$Y_i - \hat{\lambda}_i$`, is difficult to interpret since the variance is equal to the mean in the Poisson distribution

- Instead, we can analyze a standardized residual called the <font class="vocab">Pearson residual</font>
`$$r_i = \frac{Y_i - \hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}$$`

- Examine a plot of the Pearson residuals versus the predicted values and versus each predictor variable
  + A distinguishable trend in any of the plots indicates that the model is not an appropriate fit for the data

---

### Example: Age, Gender, Pulse Rate

- Let's examine the Pearson residuals for the model that includes the main effects for `Age` and `Gender`

```r
nhanes_aug <- augment(model1, type.predict = "response", 
                      type.residuals = "pearson")
```

---

### Poisson Regression in R

- Use the <font class="vocab">`glm()`</font> function

```r
# poisson regression model
my.model <- glm(Y ~ X, data = my.data, family = poisson)
```
<br>

```r
# predicted values and Pearson residuals 
mydata_aug <- augment(my.model,
                      type.predict = "response", 
                      type.residuals = "pearson")
```

---

## Physician Visits

What factors influence the number of itmes some visits a physcians office? We will use the variables `chronic`, `health`, and `insurance` to predict `visits`.

```r
library(AER)
data(NMES1988)
```

```r
visits_model <- glm(visits ~ chronic + health + insurance, data = NMES1988, 
                    family = "poisson")
kable(tidy(visits_model, conf.int = T), format = "html", digits = 3)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
   <th style="text-align:right;"> conf.low </th>
   <th style="text-align:right;"> conf.high </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 1.217 </td>
   <td style="text-align:right;"> 0.017 </td>
   <td style="text-align:right;"> 71.069 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 1.184 </td>
   <td style="text-align:right;"> 1.251 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> chronic </td>
   <td style="text-align:right;"> 0.167 </td>
   <td style="text-align:right;"> 0.004 </td>
   <td style="text-align:right;"> 37.504 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0.159 </td>
   <td style="text-align:right;"> 0.176 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> healthpoor </td>
   <td style="text-align:right;"> 0.290 </td>
   <td style="text-align:right;"> 0.017 </td>
   <td style="text-align:right;"> 16.749 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0.256 </td>
   <td style="text-align:right;"> 0.324 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> healthexcellent </td>
   <td style="text-align:right;"> -0.360 </td>
   <td style="text-align:right;"> 0.030 </td>
   <td style="text-align:right;"> -11.889 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> -0.419 </td>
   <td style="text-align:right;"> -0.301 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> insuranceyes </td>
   <td style="text-align:right;"> 0.279 </td>
   <td style="text-align:right;"> 0.016 </td>
   <td style="text-align:right;"> 17.270 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0.247 </td>
   <td style="text-align:right;"> 0.310 </td>
  </tr>
</tbody>
</table>

---

## Physician Visits

Let's compare the fitted values versus the actual values:

.question[
Does this model fit the data well? 
]

---

### Zero-inflated Poisson

- In the original data, there are far more respondents who had zero visits to the physicians office than what's predicted by the Poisson regression model 
    - This is called **zero-inflated data**
    
- To deal with this, we will fit a model that has 2 parts: 
    1. Poisson regression for the number of doctor's visits of those who went to the physician at least one time (parameter = `$\lambda$`)
    2. Logistic regression to find the probability someone goes to the physican at least once (parameter = `$\alpha$`)
    
- We will use the `zeroinfl` model in the **pscl** package.

---

### Zero-inflated Poisson Regression

- We will use `chronic`, `health`, and `insurance` in the Poisson portion of the model

- We will use `chronic` and `insurance` in the logistic part of the model

```r
zi_model <- zeroinfl(visits ~ chronic + health + insurance | chronic + insurance, dist = "poisson",  data = NMES1988)

summary(zi_model)
```

```
## 
## Call:
## zeroinfl(formula = visits ~ chronic + health + insurance | chronic + 
##     insurance, data = NMES1988, dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9221 -1.2195 -0.4316  0.5598 24.1031 
## 
## Count model coefficients (poisson with log link):
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.55878    0.01762  88.448   <2e-16 ***
## chronic          0.11868    0.00462  25.691   <2e-16 ***
## healthpoor       0.29470    0.01729  17.043   <2e-16 ***
## healthexcellent -0.30482    0.03115  -9.786   <2e-16 ***
## insuranceyes     0.14467    0.01631   8.870   <2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.37426    0.09213  -4.062 4.86e-05 ***
## chronic      -0.56112    0.04334 -12.948  < 2e-16 ***
## insuranceyes -0.88314    0.09464  -9.332  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 14 
## Log-likelihood: -1.651e+04 on 8 Df
```

---

## Predicted

```r
new_data <- NMES1988 %>% distinct(insurance, chronic, health) 
  
new_data <- new_data %>%
  mutate(pred_count = predict(zi_model, newdata = new_data, type = "count"),
  pred_zero = predict(zi_model, newdata = new_data, type = "zero")
  )
```

---

## References

These slides draw material from [*Broadening Your Statistical Horizons*](https://bookdown.org/roback/bookdown-bysh/ch-poissonreg.html)