Lecture 22 - Paris Paintings, Pt 4

Fit models that include interactions and do model selection. Recommendation: use backwards selection, and do it by hand (as opposed to automated). Talk to Hilary and Sandra about what your focus should be: don’t try a full model that has everything; pick a focus for your set of explanatory variables. You can also get creative with composite variables. Once you settle on a model, interpret the coefficients and create at least one visualization that supports your narrative.
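As a sketch, one step of backwards elimination by hand might look like the following. The specific predictors (`Shape`, `mat`, `dealer`, `Height_in`) are illustrative choices, not a recommended focus:

```
# Hypothetical "full" model for your chosen focus of explanatory variables
m_full = lm(logprice ~ Shape + mat + dealer + Height_in, data = pp)

# Drop one term at a time and compare adjusted R^2 (or AIC);
# remove the term whose removal most improves the criterion, then repeat
summary(m_full)$adj.r.squared
summary(update(m_full, . ~ . - Shape))$adj.r.squared
summary(update(m_full, . ~ . - mat))$adj.r.squared
```

Stop when dropping any remaining term makes the criterion worse.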

At this point we have covered:

- Simple linear regression: Relationship between numerical response and a numerical or categorical predictor

- Multiple regression: Relationship between numerical response and multiple numerical and/or categorical predictors

What we haven’t seen yet is what to do when the predictors are weird (nonlinear, complicated dependence structure, etc.) or when the response is weird (categorical, count data, etc.).

Odds are another way of quantifying the probability of an event, commonly used in gambling (and logistic regression).

For some event \(E\), \[\text{odds}(E) = \frac{P(E)}{P(E^c)} = \frac{P(E)}{1-P(E)}\] Similarly, if we are told the odds of E are \(x\) to \(y\) then \[\text{odds}(E) = \frac{x}{y} = \frac{x/(x+y)}{y/(x+y)} \] which implies \[P(E) = x/(x+y),\quad P(E^c) = y/(x+y)\]
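For example, odds of 3 to 1 correspond to \(P(E) = 3/4\). A quick check of the conversion in R:

```
# odds from a probability, and back again
p = 0.75
odds = p / (1 - p)          # 3, i.e. odds of 3 to 1
p_back = odds / (1 + odds)  # recovers 0.75
```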

```
pp$french = as.factor(ifelse(pp$school_pntg == "F", "F", "not F"))
table(pp$french, useNA = "ifany")
```

```
## 
##     F not F  <NA> 
##  1386  1938    69 
```

`qplot(french, log(price), data = pp, geom = "boxplot")`

`ggplot(pp, aes(dealer, ..count..)) + geom_bar(aes(fill = french))`

We can treat `F` and `not F` as successes and failures arising from a binomial distribution, where the probability of a success is given by a transformation of a linear model of the predictors.

It turns out that this is a very general way of addressing this type of problem in regression, and the resulting models are called generalized linear models (GLMs). Logistic regression is just one example of this type of model.

All generalized linear models have the following three characteristics:

- A probability distribution describing the outcome variable
- A linear model:
  \(\eta = \beta_0+\beta_1 X_1 + \cdots + \beta_n X_n\)
- A link function that relates the linear model to the parameter of the outcome distribution:
  \(g(p) = \eta\) or \(p = g^{-1}(\eta)\)

Logistic regression is a GLM used to model a binary categorical variable using numerical and categorical predictors.

We assume a binomial distribution produced the outcome variable and we therefore want to model \(p\) the probability of success for a given set of predictors.

To finish specifying the logistic model we just need to establish a reasonable link function that connects \(\eta\) to \(p\). There are a variety of options, but the most commonly used is the logit function.

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right),\text{ for $0 < p < 1$}\]

The logit function takes a value between 0 and 1 and maps it to a value between \(-\infty\) and \(\infty\).

**Inverse logit (logistic) function:** \[g^{-1}(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}\]

The inverse logit function takes a value between \(-\infty\) and \(\infty\) and maps it to a value between 0 and 1.
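R provides these two functions as `qlogis()` and `plogis()` (the quantile and distribution functions of the standard logistic distribution), so there is no need to hand-code them:

```
# qlogis() is the logit, plogis() is the inverse logit
eta = qlogis(0.55)       # maps (0, 1) to (-Inf, Inf)
p   = plogis(eta)        # recovers 0.55
plogis(c(-Inf, 0, Inf))  # endpoints map to 0, 0.5, 1
```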

This formulation also has some use when it comes to interpreting the model, as logit can be interpreted as the log odds of a success. More on this later…

The three GLM criteria give us:

\[ \begin{align*} y_i &\sim \text{Binom}(p_i) \\ \eta_i &= \beta_0+\beta_1 x_{1,i} + \cdots + \beta_n x_{n,i} \\ \text{logit}(p_i) &= \eta_i \end{align*} \]

From which we arrive at,

\[ p_i = \frac{\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_n x_{n,i})}{1+\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_n x_{n,i})} \]

```
mod_fr = glm(french ~ price, data = pp, family = binomial)
print(summary(mod_fr), digits = 1, signif.stars = FALSE)
```

```
##
## Call:
## glm(formula = french ~ price, family = binomial, data = pp)
##
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##   -2.7    -1.3     0.9     1.1     1.1  
## 
## Coefficients: 
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.18169    0.03924       5    4e-06
## price        0.00024    0.00003       8    3e-14
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4515.9 on 3323 degrees of freedom
## Residual deviance: 4429.2 on 3322 degrees of freedom
## (69 observations deleted due to missingness)
## AIC: 4433
##
## Number of Fisher Scoring iterations: 4
```

Model: \[\log\left(\frac{p}{1-p}\right) = 0.18169 + 0.00024\times \text{price}\]

Odds / Probability of being French for a painting that costs 0 livres (`price = 0`): \[
\begin{align*}
\log\left(\frac{p}{1-p}\right) &= 0.18169 + 0.00024 \times 0\\
\frac{p}{1-p} &= e^{0.18169} = 1.2 \\
p &= 1.2 / 2.2 = 0.55
\end{align*}
\]

Model: \[\log\left(\frac{p}{1-p}\right) = 0.18169 + 0.00024\times \text{price}\]

Odds / Probability of being French for a painting that costs 1000 livres (`price = 1000`): \[
\begin{align*}
\log\left(\frac{p}{1-p}\right) &= 0.18169 + 0.00024 \times 1000\\
\frac{p}{1-p} &= e^{0.18169 + 0.24} = 1.52 \\
p &= 1.52 / 2.52 = 0.60
\end{align*}
\]
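The hand calculations above can be checked against the fitted model, since `predict()` with `type = "response"` applies the inverse logit for us (the results should match up to the rounding of the printed coefficients):

```
# predicted probabilities at price = 0 and price = 1000 livres
predict(mod_fr, data.frame(price = c(0, 1000)), type = "response")
```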

```
missing_school = which(is.na(pp$school_pntg))
newdata = data.frame(ID = pp$name[missing_school], price = pp$price[missing_school])
newdata$p = predict(mod_fr, newdata, type = "response")
head(newdata)
```

```
##          ID price      p
## 1 L1764-10a     1 0.5454
## 2 L1764-10b     1 0.5454
## 3 L1764-10c     1 0.5454
## 4 L1764-27c    16 0.5462
## 5  L1764-42     9 0.5458
## 6 L1764-45a    24 0.5467
```