class: center, middle, inverse, title-slide # Poisson Regression ### Dr. Maria Tackett ### 04.17.19 --- ### Announcements - Fill out team feedback #3 - Project presentations and write up due May 1 at 2p - Friday's Lab: - Every team member should be at lab this week - Use part of this time to schedule times you will work on the project together - Exam 2 extra credit: If the response rate for the course evaluations is 90%, everyone gets +1 pt on their Exam 2 score (out of 40 pts) --- ## Poisson response variables The following are examples of scenarios with Poisson response variables: - Are the **number of motorcycle deaths** in a given year related to a state’s helmet laws? - Does the **number of employers** conducting on-campus interviews during a year differ for public and private colleges? - Does the **daily number of asthma-related visits** to an Emergency Room differ depending on air pollution indices? - Has the **number of deformed fish** in randomly selected Minnesota lakes been affected by changes in trace minerals in the water over the last decade? --- ### Poisson Distribution - If `\(Y\)` follows a Poisson distribution, then `$$P(Y=y) = \frac{\exp\{-\lambda\}\lambda^y}{y!} \hspace{10mm} y=0,1,2,\ldots$$` <br> - Features of the Poisson distribution: + Mean and variance are equal `\((\lambda)\)` + Distribution tends to be skewed right, especially when the mean is small + If the mean is larger, it can be approximated by a Normal distribution --- ### Simulated Poisson distributions <img src="22-poisson-regression_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> --- ### Simulated Poisson distributions <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> Variance </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> lambda=2 </td> <td style="text-align:right;"> 2.00740 </td> <td style="text-align:right;"> 2.015245 </td> </tr> <tr> <td style="text-align:left;"> lambda=5 </td> <td style="text-align:right;"> 4.99130 </td> <td style="text-align:right;"> 4.968734 </td> </tr> <tr> <td style="text-align:left;"> lambda=20 </td> <td style="text-align:right;"> 19.99546 </td> <td style="text-align:right;"> 19.836958 </td> </tr> <tr> <td style="text-align:left;"> lambda=100 </td> <td style="text-align:right;"> 100.02276 </td> <td style="text-align:right;"> 100.527647 </td> </tr> </tbody> </table> --- ### Poisson Regression - We want `\(\lambda\)` to be a function of predictor variables `\(X_1, \ldots, X_p\)` -- .question[ Why is a multiple linear regression model not appropriate? ] -- - `\(\lambda\)` must be greater than or equal to 0 for any combination of predictor variables - Constant variance assumption will be violated! --- ### MLR vs. Poisson  --- ### Poisson Regression - If the observed values `\(Y_i\)` are Poisson, then we can model using a <font class="vocab">Poisson regression model</font> of the form `$$\log(\lambda_i) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_p X_{pi}$$` <br> - Note that we don't have an error term, since `\(\lambda\)` determines the mean and variance of the Poisson random variable --- ### Interpreting Model Coefficients - <font class="vocab">Slope, `\(\beta_j\)`: </font> - **Quantitative Predictor**: When `\(X_j\)` increases by one unit, the expected count of `\(Y\)` changes by a multiplicative factor of `\(\exp\{\beta_j\}\)`, holding all else constant - **Categorical Predictor**: The expected count for category `\(k\)` is `\(\exp\{\beta_j\}\)` times the expected count for the baseline category, holding all else constant -- - <font class="vocab">Intercept, `\(\beta_0\)`: </font>: When `\(X\)` is 0, the expected count of `\(Y\)` is `\(\exp\{\beta_0\}\)` --- ### Example: Age, Gender, Pulse Rate - <font class="vocab">Goal:</font> We want to use `age` and `gender` to understand variation in pulse rate - We will use adults age 20 to 39 in the `NHANES` data set to answer this question - **Reponse** + <font class="vocab">`Pulse`: </font>Number of heartbeats in 60 seconds - **Explanatory** + <font class="vocab">`Age`: </font>Age in years. Subjects 80 years or older recorded as 80 + <font class="vocab">`Gender`: </font>male/female --- ### Example: Age, Gender, Pulse Rate - We calculate the Poisson regression model using the mean-centered Age ```r model1 <- glm(Pulse ~ ageCent + Gender, data = nhanes, family = "poisson") kable(tidy(model1, conf.int = T),format="html") ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 4.3416799 </td> <td style="text-align:right;"> 0.0031800 </td> <td style="text-align:right;"> 1365.30794 </td> <td style="text-align:right;"> 0.0000000 </td> <td style="text-align:right;"> 4.3354407 </td> <td style="text-align:right;"> 4.3479061 </td> </tr> <tr> <td style="text-align:left;"> ageCent </td> <td style="text-align:right;"> -0.0007360 </td> <td style="text-align:right;"> 0.0003933 </td> <td style="text-align:right;"> -1.87118 </td> <td style="text-align:right;"> 0.0613201 </td> <td style="text-align:right;"> -0.0015069 </td> <td style="text-align:right;"> 0.0000349 </td> </tr> <tr> <td style="text-align:left;"> Gendermale </td> <td style="text-align:right;"> -0.0595673 </td> <td style="text-align:right;"> 0.0045620 </td> <td style="text-align:right;"> -13.05723 </td> <td style="text-align:right;"> 0.0000000 </td> <td style="text-align:right;"> -0.0685091 </td> <td style="text-align:right;"> -0.0506263 </td> </tr> </tbody> </table> .question[ 1. Write the model equation. 2. Interpret the intercept in the context of the problem. 3. Interpret the coefficient of `ageCent` in the context of the problem. ] --- ### Drop In Deviance Test - We would like to test if there is a significant interaction between `Age` and `Gender` - Since we have a generalized linear model, we can use the Drop In Deviance Test (similar test to logistic regression) ```r model1 <- glm(Pulse ~ ageCent + Gender, data = nhanes, family = "poisson") model2 <- glm(Pulse ~ ageCent + Gender + ageCent*Gender, data = nhanes, family = "poisson") kable(anova(model1,model2,test="Chisq"),format="html") ``` <table> <thead> <tr> <th style="text-align:right;"> Resid. Df </th> <th style="text-align:right;"> Resid. Dev </th> <th style="text-align:right;"> Df </th> <th style="text-align:right;"> Deviance </th> <th style="text-align:right;"> Pr(>Chi) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2575 </td> <td style="text-align:right;"> 4536.813 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:right;"> 2574 </td> <td style="text-align:right;"> 4536.345 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0.4686061 </td> <td style="text-align:right;"> 0.4936291 </td> </tr> </tbody> </table> -- - There is not sufficient evidence that the interaction is significant, so we won't include it in the model. --- ### Model Assumptions 1. **Poisson Response**: Response variable is a count per unit of time or space 2. **Independence**: The observations are independent of one another 3. **Mean = Variance** 4. **Linearity**: `\(\log(\lambda)\)` is a linear function of the predictors --- ### Model Diagnostics - The raw residual for the `\(i^{th}\)` observation, `\(Y_i - \hat{\lambda}_i\)`, is difficult to interpret since the variance is equal to the mean in the Poisson distribution - Instead, we can analyze a standardized residual called the <font class="vocab">Pearson residual</font> `$$r_i = \frac{Y_i - \hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}$$` - Examine a plot of the Pearson residuals versus the predicted values and versus each predictor variable + A distinguishable trend in any of the plots indicates that the model is not an appropriate fit for the data --- ### Example: Age, Gender, Pulse Rate - Let's examine the Pearson residuals for the model that includes the main effects for `Age` and `Gender` ```r nhanes_aug <- augment(model1, type.predict = "response", type.residuals = "pearson") ``` <img src="22-poisson-regression_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> --- ### Poisson Regression in R - Use the <font class="vocab">`glm()`</font> function ```r # poisson regression model my.model <- glm(Y ~ X, data = my.data, family = poisson) ``` <br> ```r # predicted values and Pearson residuals mydata_aug <- augment(my.model, type.predict = "response", type.residuals = "pearson") ``` --- ## Physician Visits What factors influence the number of itmes some visits a physcians office? We will use the variables `chronic`, `health`, and `insurance` to predict `visits`. ```r library(AER) data(NMES1988) ``` ```r visits_model <- glm(visits ~ chronic + health + insurance, data = NMES1988, family = "poisson") kable(tidy(visits_model, conf.int = T), format = "html", digits = 3) ``` <table> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 1.217 </td> <td style="text-align:right;"> 0.017 </td> <td style="text-align:right;"> 71.069 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1.184 </td> <td style="text-align:right;"> 1.251 </td> </tr> <tr> <td style="text-align:left;"> chronic </td> <td style="text-align:right;"> 0.167 </td> <td style="text-align:right;"> 0.004 </td> <td style="text-align:right;"> 37.504 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.159 </td> <td style="text-align:right;"> 0.176 </td> </tr> <tr> <td style="text-align:left;"> healthpoor </td> <td style="text-align:right;"> 0.290 </td> <td style="text-align:right;"> 0.017 </td> <td style="text-align:right;"> 16.749 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.256 </td> <td style="text-align:right;"> 0.324 </td> </tr> <tr> <td style="text-align:left;"> healthexcellent </td> <td style="text-align:right;"> -0.360 </td> <td style="text-align:right;"> 0.030 </td> <td style="text-align:right;"> -11.889 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -0.419 </td> <td style="text-align:right;"> -0.301 </td> </tr> <tr> <td style="text-align:left;"> insuranceyes </td> <td style="text-align:right;"> 0.279 </td> <td style="text-align:right;"> 0.016 </td> <td style="text-align:right;"> 17.270 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0.247 </td> <td style="text-align:right;"> 0.310 </td> </tr> </tbody> </table> --- ## Physician Visits Let's compare the fitted values versus the actual values: <img src="22-poisson-regression_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> .question[ Does this model fit the data well? ] --- ### Zero-inflated Poisson - In the original data, there are far more respondents who had zero visits to the physicians office than what's predicted by the Poisson regression model - This is called **zero-inflated data** - To deal with this, we will fit a model that has 2 parts: 1. Poisson regression for the number of doctor's visits of those who went to the physician at least one time (parameter = `\(\lambda\)`) 2. Logistic regression to find the probability someone goes to the physican at least once (parameter = `\(\alpha\)`) - We will use the `zeroinfl` model in the **pscl** package. --- ### Zero-inflated Poisson Regression - We will use `chronic`, `health`, and `insurance` in the Poisson portion of the model - We will use `chronic` and `insurance` in the logistic part of the model ```r zi_model <- zeroinfl(visits ~ chronic + health + insurance | chronic + insurance, dist = "poisson", data = NMES1988) summary(zi_model) ``` ``` ## ## Call: ## zeroinfl(formula = visits ~ chronic + health + insurance | chronic + ## insurance, data = NMES1988, dist = "poisson") ## ## Pearson residuals: ## Min 1Q Median 3Q Max ## -3.9221 -1.2195 -0.4316 0.5598 24.1031 ## ## Count model coefficients (poisson with log link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 1.55878 0.01762 88.448 <2e-16 *** ## chronic 0.11868 0.00462 25.691 <2e-16 *** ## healthpoor 0.29470 0.01729 17.043 <2e-16 *** ## healthexcellent -0.30482 0.03115 -9.786 <2e-16 *** ## insuranceyes 0.14467 0.01631 8.870 <2e-16 *** ## ## Zero-inflation model coefficients (binomial with logit link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -0.37426 0.09213 -4.062 4.86e-05 *** ## chronic -0.56112 0.04334 -12.948 < 2e-16 *** ## insuranceyes -0.88314 0.09464 -9.332 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Number of iterations in BFGS optimization: 14 ## Log-likelihood: -1.651e+04 on 8 Df ``` --- ## Predicted ```r new_data <- NMES1988 %>% distinct(insurance, chronic, health) new_data <- new_data %>% mutate(pred_count = predict(zi_model, newdata = new_data, type = "count"), pred_zero = predict(zi_model, newdata = new_data, type = "zero") ) ``` --- ## References These slides draw material from [*Broadening Your Statistical Horizons*](https://bookdown.org/roback/bookdown-bysh/ch-poissonreg.html)