November 13, 2014

From last time

Application Exercise 17

Fit models that include interactions and do model selection. Recommendation: use backwards selection, and do it by hand (as opposed to automated). Talk to Hilary and Sandra about what your focus should be; don't try a full model that has everything, but pick a focus for your set of explanatory variables. You can also get creative with composite variables. Once you settle on a model, interpret the coefficients and create at least one visualization that supports your narrative.
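If it helps to see the mechanics, here is a minimal sketch of one backwards-elimination step; the predictors are placeholders, not your required focus:

m_full = lm(log(price) ~ dealer * school_pntg, data = pp)  # candidate model with an interaction
summary(m_full)        # inspect coefficients and p-values
m_drop = lm(log(price) ~ dealer + school_pntg, data = pp)  # drop the interaction term
anova(m_drop, m_full)  # nested F-test: does the interaction earn its keep?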

Generalized linear models

Regression so far

At this point we have covered:

  • Simple linear regression: Relationship between numerical response and a numerical or categorical predictor
  • Multiple regression: Relationship between numerical response and multiple numerical and/or categorical predictors

What we haven't seen is what to do when the predictors are weird (nonlinear, complicated dependence structure, etc.) or when the response is weird (categorical, count data, etc.).

Odds

Odds are another way of quantifying the probability of an event, commonly used in gambling (and logistic regression).

For some event \(E\), \[\text{odds}(E) = \frac{P(E)}{P(E^c)} = \frac{P(E)}{1-P(E)}\] Similarly, if we are told the odds of \(E\) are \(x\) to \(y\), then \[\text{odds}(E) = \frac{x}{y} = \frac{x/(x+y)}{y/(x+y)} \] which implies \[P(E) = x/(x+y),\quad P(E^c) = y/(x+y)\]
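A quick numerical check of these identities, sketched in R:

# Odds of 3 to 1 in favor of E correspond to P(E) = 3/4
x = 3; y = 1
p = x / (x + y)  # P(E) = 0.75
p / (1 - p)      # back to odds = 3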

French paintings, or not

# Collapse school_pntg into French ("F") vs. everything else; the factor's
# first level is "F", so glm() later models the probability of "not F"
pp$french = as.factor(ifelse(pp$school_pntg == "F", "F", "not F"))
table(pp$french, useNA = "ifany")
## 
##     F not F  <NA> 
##  1386  1938    69
library(ggplot2)  # provides qplot() and ggplot()
qplot(french, log(price), data = pp, geom = "boxplot")

[Figure: side-by-side boxplots of log(price) for French vs. not French paintings]

# Counts of paintings by dealer, colored by French vs. not French
ggplot(pp, aes(dealer, ..count..)) + geom_bar(aes(fill = french))

[Figure: stacked bar chart of painting counts by dealer, filled by French status]

Framework

We can treat "F" and "not F" as successes and failures arising from a binomial distribution, where the probability of a success is given by a transformation of a linear model of the predictors.

Generalized linear models

It turns out that this is a very general way of addressing this type of problem in regression, and the resulting models are called generalized linear models (GLMs). Logistic regression is just one example of this type of model.

All generalized linear models have the following three characteristics:

  1. A probability distribution describing the outcome variable
  2. A linear model
    • \(\eta = \beta_0+\beta_1 X_1 + \cdots + \beta_n X_n\)
  3. A link function that relates the linear model to the parameter of the outcome distribution
    • \(g(p) = \eta\) or \(p = g^{-1}(\eta)\)
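In R, glm() encodes all three pieces: the formula supplies the linear model, and the family argument supplies the outcome distribution together with its link function. A sketch, assuming a data frame d with a 0/1 response y and a predictor x:

glm(y ~ x, data = d, family = binomial(link = "logit"))  # binomial outcome, logit link
glm(y ~ x, data = d, family = poisson(link = "log"))     # Poisson outcome for counts, log link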

Logistic regression

Logistic regression is a GLM used to model a binary categorical variable using numerical and categorical predictors.

We assume a binomial distribution produced the outcome variable, and we therefore want to model \(p\), the probability of success for a given set of predictors.

To finish specifying the logistic model we just need to establish a reasonable link function that connects \(\eta\) to \(p\). There are a variety of options, but the most commonly used is the logit function.

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right),\text{ for } 0 < p < 1\]

Properties of the Logit

The logit function takes a value between 0 and 1 and maps it to a value between \(-\infty\) and \(\infty\).

Inverse logit (logistic) function: \[g^{-1}(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}\]

The inverse logit function takes a value between \(-\infty\) and \(\infty\) and maps it to a value between 0 and 1.
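Both functions ship with R: qlogis() is the logit, and plogis() is its inverse (the logistic CDF). A quick check that they undo each other:

qlogis(0.75)          # log(0.75/0.25) = log(3) ~ 1.0986
plogis(qlogis(0.75))  # back to 0.75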

This formulation is also useful when it comes to interpreting the model, since the logit can be read as the log odds of a success. More on this later…

The logistic regression model

The three GLM criteria give us:

\[ \begin{align*} y_i &\sim \text{Binom}(p_i) \\ \eta_i &= \beta_0+\beta_1 x_{1,i} + \cdots + \beta_n x_{n,i} \\ \text{logit}(p_i) &= \eta_i \end{align*} \]

From which we arrive at,

\[ p_i = \frac{\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_n x_{n,i})}{1+\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_n x_{n,i})} \]

Example - predicting "French-ness" from price

# family = binomial fits a logistic regression; with a factor response, R
# treats the first level ("F") as failure and models P("not F")
mod_fr = glm(french ~ price, data = pp, family = binomial)
print(summary(mod_fr), digits = 1, signif.stars = FALSE)
## 
## Call:
## glm(formula = french ~ price, family = binomial, data = pp)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##   -2.7    -1.3     0.9     1.1     1.1  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.18169    0.03924       5    4e-06
## price        0.00024    0.00003       8    3e-14
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4515.9  on 3323  degrees of freedom
## Residual deviance: 4429.2  on 3322  degrees of freedom
##   (69 observations deleted due to missingness)
## AIC: 4433
## 
## Number of Fisher Scoring iterations: 4

Interpreting the model

Model: \[\log\left(\frac{p}{1-p}\right) = 0.18169 + 0.00024\times \text{price}\] where \(p\) is the probability that a painting is not French, since R models the second level of the french factor ("not F") as the success.
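On the odds scale, exponentiating a coefficient gives its multiplicative effect, which we can pull straight from the fitted object:

# e^beta: each additional livre multiplies the odds of "not F" by e^0.00024 ~ 1.00024
exp(coef(mod_fr))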

Odds / probability of not being French for a painting that costs 0 livres: \[ \begin{align*} \log\left(\frac{p}{1-p}\right) &= 0.18169 + 0.00024 \times 0\\ \frac{p}{1-p} &= e^{0.18169} = 1.2 \\ p &= 1.2 / 2.2 = 0.55 \end{align*} \]
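The same value can be recovered from the fitted model, using plogis() as the inverse logit:

plogis(coef(mod_fr)[1])  # inverse logit of the intercept, ~ 0.545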

Prediction

Model: \[\log\left(\frac{p}{1-p}\right) = 0.18169 + 0.00024\times \text{price}\]

Odds / probability of not being French for a painting that costs 1000 livres: \[ \begin{align*} \log\left(\frac{p}{1-p}\right) &= 0.18169 + 0.00024 \times 1000\\ \frac{p}{1-p} &= e^{0.18169 + 0.24} = 1.52 \\ p &= 1.52 / 2.52 = 0.60 \end{align*} \]
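Equivalently, predict() with type = "response" applies the inverse logit for us:

predict(mod_fr, data.frame(price = 1000), type = "response")  # ~ 0.60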

Prediction for a new observation

# Paintings with an unknown school: predict from price alone
missing_school = which(is.na(pp$school_pntg))
newdata = data.frame(ID = pp$name[missing_school], price = pp$price[missing_school])
newdata$p = predict(mod_fr, newdata, type = "response")  # p = P(not French)
head(newdata)
##          ID price      p
## 1 L1764-10a     1 0.5454
## 2 L1764-10b     1 0.5454
## 3 L1764-10c     1 0.5454
## 4 L1764-27c    16 0.5462
## 5  L1764-42     9 0.5458
## 6 L1764-45a    24 0.5467
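To see the fitted relationship as a whole, one option (a sketch, assuming a grid of prices up to 10,000 livres) is to plot the predicted probability against price:

price_grid = data.frame(price = seq(0, 10000, length.out = 200))
price_grid$p = predict(mod_fr, price_grid, type = "response")
qplot(price, p, data = price_grid, geom = "line", ylab = "P(not French)")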