# Logistic Regression
### Prof. Maria Tackett
### 03.25.20

---

### [Click for PDF of slides](13-logistic-pt1.pdf)

---

### Part I: Categorical Response Variables

---

### Quantitative vs. Categorical Response Variables

.vocab[Quantitative response variable]: 
- Sales price of a house in Levittown, NY
- **Model**: variation in the mean sales price given values of the predictor variables (`bedrooms`, `lot_size`, `year_built`, etc.)

.vocab[Categorical response variable]: 
- Patient at risk of coronary heart disease (Yes/No)
- **Model**: variation in the probability a patient is at risk of coronary heart disease given values of the predictor variables (`age`, `currentSmoker`, `totChol`, etc.)

---

### Models for categorical response variables

.pull-left[
2 Outcomes

Agree/Disagree
]

.pull-right[
3+ Outcomes

Strongly Agree, Agree, Disagree, Strongly Disagree
]

---

### FiveThirtyEight Live Win Probabilities

.pull-left[
<img src="img/13/live-win-prob.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
[FiveThirtyEight: 2019 March Madness Live Win Probabilities](https://projects.fivethirtyeight.com/2019-march-madness-predictions/)
]

*"These probabilities are derived using .vocab[logistic regression analysis], which lets us plug the current state of a game into a model to produce the probability that either team will win the game."*
 
<div align="right">
- <a href="https://fivethirtyeight.com/features/how-our-march-madness-predictions-work-2/" target="_blank">"How Our March Madness Predictions Work"</a>
</div>

---

### 2018 Election Forecasts

<center>
<img src="img/13/fivethirtyeight_senate.png" width="70%" style="display: block; margin: auto;" />
<a href="https://projects.fivethirtyeight.com/2018-midterm-election-forecast/senate/?ex_cid=irpromo">FiveThirtyEight.com Senate forecast</a>

<img src="img/13/fivethirtyeight_house.png" width="70%" style="display: block; margin: auto;" />
<a href="https://projects.fivethirtyeight.com/2018-midterm-election-forecast/house/?ex_cid=irpromo">FiveThirtyEight.com House forecast</a>
</center>

---
class: middle, center

*Our models are probabilistic in nature; we do a lot of thinking about these probabilities, and the goal is to develop probabilistic estimates that hold up well under real-world conditions.*
 
<div align="right">
<a href="https://fivethirtyeight.com/methodology/how-fivethirtyeights-house-and-senate-models-work/" target="_blank">-"How FiveThirtyEight's House, Senate, and Governor Models Work"</a>
</div>

---

### Response Variable, `$Y$`

- `$Y$` is a binary response variable 
  + 1: yes (success)
  + 0: no (failure)

- `$\text{Mean}(Y) = \pi$`
  + `$\pi$` is the proportion of "yes" responses in the population
  + `$\hat{\pi}$` is the proportion of "yes" responses in the sample

- `$\text{Variance}(Y) = \pi(1-\pi)$`
  + Sample variance: `$\hat{\pi}(1-\hat{\pi})$`

- `$\text{Odds(Y=1)} = \frac{\pi}{1-\pi}$`
  + Sample odds: `$\frac{\hat{\pi}}{1-\hat{\pi}}$`

---

### Odds

- Given `$\pi$`, the population proportion of "yes" responses (i.e. "success"), the corresponding odds of a "yes" response are

`$$\omega = \frac{\pi}{1-\pi}$$`

- The *sample odds* are `$\hat{\omega} = \frac{\hat{\pi}}{1-\hat{\pi}}$`

- Ex: Suppose the sample proportion `$\hat{\pi} = 0.3$`. Then, the sample odds are 
`$$\hat{\omega} = \frac{0.3}{1-0.3} = \frac{0.3}{0.7} \approx 0.4286 \hspace{3mm} \text{(odds of 3 to 7)}$$`

---

### Properties of the odds

- `$\text{odds} \geq 0$`

- If `$\pi = 0.5$`, then odds `$= 1$`

- If odds of "yes" `$=\omega$`, then the odds of "no" `$=\frac{1}{\omega}$`

- If odds of "yes" `$=\omega$`, then `$\pi = \frac{\omega}{(1+\omega)}$`
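
The properties above can be checked numerically. This is a minimal sketch in plain Python; the helper names `odds_from_prob` and `prob_from_odds` are illustrative and not part of the course materials.

```python
def odds_from_prob(p):
    """Odds of "yes" given the probability p of "yes" (requires 0 <= p < 1)."""
    return p / (1 - p)


def prob_from_odds(omega):
    """Probability of "yes" given the odds omega of "yes" (requires omega >= 0)."""
    return omega / (1 + omega)


p = 0.3
omega = odds_from_prob(p)                # 0.3 / 0.7

print(round(omega, 4))                   # odds of "yes": 0.4286
print(round(1 / omega, 4))               # odds of "no" = 1 / omega: 2.3333
print(odds_from_prob(0.5))               # pi = 0.5 gives odds of 1.0
print(round(prob_from_odds(omega), 4))   # back to the probability: 0.3
```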

---

### Risk of coronary heart disease

This dataset is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to predict if a patient has a high risk of getting coronary heart disease in the next 10 years.

**Response**:

.vocab[`TenYearCHD`]: 
- 0 = Patient is not high risk of having coronary heart disease in the next 10 years 
- 1 = Patient is high risk of having coronary heart disease in the next 10 years

**Predictors**:

- .vocab[`age`]: Age at exam time.
- .vocab[`currentSmoker`]: 0 = nonsmoker; 1 = smoker
- .vocab[`totChol`]: total cholesterol (mg/dL)

---

### Response Variable, `TenYearCHD`

```
## # A tibble: 2 x 3
## TenYearCHD n proportion
## <fct> <int> <dbl>
## 1 0 3101 0.848
## 2 1 557 0.152
```

- `$\hat{\pi}$` = 0.152

- Sample variance = 0.152 * (1- 0.152) =  0.128896

- Odds(Y = 1) = 0.152/(1 - 0.152) =  0.1792453

- Odds(Y = 0) = 1 / 0.1792453 = 5.5789474
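
The same quantities can be computed directly from the counts in the table. The sketch below (plain Python, variable names are illustrative) uses the unrounded proportion, so its results differ slightly in the later decimal places from the values above, which start from the rounded 0.152.

```python
# Counts taken from the TenYearCHD table on this slide.
n_no, n_yes = 3101, 557
n = n_no + n_yes

p_hat = n_yes / n                  # sample proportion of "yes"
var_hat = p_hat * (1 - p_hat)      # sample variance of a 0/1 variable
odds_yes = p_hat / (1 - p_hat)     # algebraically the same as n_yes / n_no
odds_no = 1 / odds_yes             # algebraically the same as n_no / n_yes

print(f"p_hat      = {p_hat:.3f}")
print(f"variance   = {var_hat:.4f}")
print(f"odds(Y=1)  = {odds_yes:.4f}")
print(f"odds(Y=0)  = {odds_no:.4f}")
```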

---

### Let's incorporate more variables

- We want to use information about a patient's age, cholesterol, and whether or not they are a smoker to understand the probability they're high risk of having coronary heart disease.

- To do this, we need to fit a model!

---

### Consider possible models

- `$y$`: Whether a patient in the sample is high risk of having coronary heart disease.
--

- `$\pi_i = P(y_i = 1 | \text{age}_i, \text{currentSmoker}_i, \text{totChol}_i)$`: probability that patient `$i$` is high risk for coronary heart disease given their age, smoking status, and total cholesterol

.question[
.small[
Let's consider fitting a multiple linear regression model. Below are 3 possible response variables. For each response variable, briefly explain why a multiple linear regression model is **not** appropriate.

**Model 1**: `$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{age} + \hat{\beta}_2 \text{currentSmoker} + \hat{\beta}_3 \text{totChol}$`

**Model 2**: `$\hat{\pi}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{age} + \hat{\beta}_2 \text{currentSmoker} + \hat{\beta}_3 \text{totChol}$`

**Model 3**: `$\widehat{\log(\pi)}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{age} + \hat{\beta}_2 \text{currentSmoker} + \hat{\beta}_3 \text{totChol}$`
]
]

---

### Part II: Basics of logistic regression

---

### Logistic Regression Model

- Suppose `$P(y_i = 1|x_i) = \pi_i$` and `$P(y_i = 0|x_i) = 1 - \pi_i$`

- The logistic regression model is

`$$\log\Big(\frac{\pi_i}{1-\pi_i}\Big) = \beta_0 + \beta_1 x_i$$`

- `$\log\Big(\frac{\pi_i}{1-\pi_i}\Big)$` is called the logit function

---

### Logit function

`$$0 < \pi < 1 \hspace{5mm} \Rightarrow \hspace{5mm} -\infty < 
\log\Big(\frac{\pi}{1-\pi}\Big) < \infty$$`

<div class="figure" style="text-align: center">
<img src="img/13/logit.png" alt="OpenIntro Statistics, 4th ed (pg. 373)" width="90%" />
OpenIntro Statistics, 4th ed (pg. 373)
</div>

---

### Estimating the coefficients

- Estimate coefficients using **maximum likelihood estimation**

- Basic Idea: 
 + Find values of `$\hat{\beta}_0$` and `$\hat{\beta}_1$` that give the observed data the maximum probability of occurring
 + More details on pp. 156-157 of the textbook

- We will fit logistic regression models using R
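
To make the basic idea concrete, here is a minimal sketch in plain Python rather than R. It maximizes the log-likelihood with Newton-Raphson steps, which is one common way to do it and an assumption here, not necessarily what the textbook or any particular R function does. The data are a hypothetical toy example, not the heart disease data.

```python
import math

# Hypothetical toy data (not from the slides): one predictor x, binary response y.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]


def prob(b0, b1, xi):
    """P(y = 1 | x = xi) under the logistic model."""
    return 1 / (1 + math.exp(-(b0 + b1 * xi)))


def log_likelihood(b0, b1):
    """Log of the probability of the observed y's given the coefficients."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = b0 + b1 * xi
        total += yi * eta - math.log1p(math.exp(eta))
    return total


def gradient(b0, b1):
    """Partial derivatives of the log-likelihood with respect to b0 and b1."""
    g0 = sum(yi - prob(b0, b1, xi) for xi, yi in zip(x, y))
    g1 = sum((yi - prob(b0, b1, xi)) * xi for xi, yi in zip(x, y))
    return g0, g1


def fit(max_iter=25, tol=1e-10):
    """Newton-Raphson: repeatedly solve the 2x2 system (information) * step = gradient."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0, g1 = gradient(b0, b1)
        w = [prob(b0, b1, xi) * (1 - prob(b0, b1, xi)) for xi in x]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 ** 2
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1


b0_hat, b1_hat = fit()
print(f"b0_hat = {b0_hat:.3f}, b1_hat = {b1_hat:.3f}")
print(f"log-likelihood at the start (0, 0): {log_likelihood(0.0, 0.0):.3f}")
print(f"log-likelihood at the estimates:    {log_likelihood(b0_hat, b1_hat):.3f}")
```

The log-likelihood at the estimates is higher than at the starting values, and the gradient there is essentially zero, which is what "maximum likelihood" means in practice.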

---

### Interpreting the intercept: `$\beta_0$`

- When `$x=0$`, log-odds of `$y$` are `$\beta_0$`
    - Won't use this interpretation in practice

- **When `$x=0$`, odds of `$y$` are `$\exp\{\beta_0\}$`**

---

### Interpreting slope coefficient `$\beta_1$`

If `$x$` is a quantitative predictor

- As `$x_i$` increases by 1 unit, we expect the log-odds of `$y$` to increase by `$\beta_1$`

- **As `$x_i$` increases by 1 unit, the odds of `$y$` multiply by a factor of `$\exp\{\beta_1\}$`**

If `$x$` is a categorical predictor, suppose `$x_i = k$`

- The difference in the log-odds between group `$k$` and the baseline is `$\beta_1$`
- **The odds of `$y$` for group `$k$` are `$\exp\{\beta_1\}$` times the odds of `$y$` for the baseline group.**
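
A quick numerical check of both interpretations, as a sketch in plain Python. The coefficients `b0` and `b1` are hypothetical values chosen only for illustration.

```python
import math

# Hypothetical coefficients for a model log(odds) = b0 + b1 * x.
b0, b1 = -1.0, 0.4


def odds(x):
    """Odds of y = 1 at predictor value x under the model above."""
    return math.exp(b0 + b1 * x)


# Intercept: at x = 0 the odds equal exp(b0).
print(odds(0), math.exp(b0))

# Slope: a 1-unit increase in x multiplies the odds by exp(b1),
# no matter which value of x we start from.
for x in (0, 1, 2, 10):
    ratio = odds(x + 1) / odds(x)
    print(f"x = {x:>2}: odds ratio = {ratio:.6f}, exp(b1) = {math.exp(b1):.6f}")
```

For a 0/1 indicator, the same calculation with `x = 0` (baseline) and `x = 1` (group `k`) gives the odds ratio between the two groups.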

---

### Inference for coefficients

- The standard error is the estimated standard deviation of the sampling distribution of `$\hat{\beta}_1$`

- We can calculate the `$\color{blue}{C\%}$` confidence interval based on the large-sample Normal approximation

- **CI for `$\boldsymbol{\beta}_1$`**: `$$\hat{\beta}_1 \pm z^* SE(\hat{\beta}_1)$$`

.alert[
**CI for `$\exp\{\boldsymbol{\beta}_1\}$`**: `$$\exp\{\hat{\beta}_1 \pm z^* SE(\hat{\beta}_1)\}$$`
  ]
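
The two intervals can be computed as follows. This is a sketch in plain Python; the estimate and standard error are hypothetical numbers, and `wald_ci` is an illustrative helper name, not a function from any package.

```python
import math
from statistics import NormalDist


def wald_ci(estimate, se, level=0.95):
    """Large-sample Normal interval: estimate +/- z* x SE."""
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return estimate - z_star * se, estimate + z_star * se


# Hypothetical slope estimate and standard error.
b1_hat, se_b1 = 0.5, 0.1

lower, upper = wald_ci(b1_hat, se_b1)
print(f"95% CI for beta_1:      ({lower:.3f}, {upper:.3f})")

# Exponentiate the endpoints to get the interval for the odds ratio exp(beta_1).
print(f"95% CI for exp(beta_1): ({math.exp(lower):.3f}, {math.exp(upper):.3f})")
```

Because the interval is built on the log-odds scale and then exponentiated, the interval for `$\exp\{\beta_1\}$` is not symmetric around `$\exp\{\hat{\beta}_1\}$`.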

---

### Modeling risk of coronary heart disease

Let's use the mean-centered variables for `age` and `totChol`.

|term           | estimate| std.error| statistic| p.value| conf.low| conf.high|
|:--------------|--------:|---------:|---------:|-------:|--------:|---------:|
|(Intercept)    |   -2.111|     0.077|   -27.519|   0.000|   -2.264|    -1.963|
|ageCent        |    0.081|     0.006|    13.477|   0.000|    0.070|     0.093|
|currentSmoker1 |    0.447|     0.099|     4.537|   0.000|    0.255|     0.641|
|totCholCent    |    0.003|     0.001|     2.339|   0.019|    0.000|     0.005|

`$$\small{\log\Big(\frac{\hat{\pi}}{1-\hat{\pi}}\Big) = -2.111 + 0.081 \text{ageCent} + 0.447 \text{currentSmoker} + 0.003 \text{totCholCent}}$$`
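
The fitted equation can be turned into a predicted probability by computing the log-odds, exponentiating to get the odds, and converting the odds to a probability. The sketch below is plain Python with the estimates copied from the table; the patient is hypothetical and `predict` is an illustrative helper name.

```python
import math

# Estimates copied from the coefficient table on this slide.
coef = {
    "(Intercept)": -2.111,
    "ageCent": 0.081,
    "currentSmoker1": 0.447,
    "totCholCent": 0.003,
}


def predict(age_cent, current_smoker, tot_chol_cent):
    """Return (log-odds, odds, probability) of being high risk."""
    log_odds = (
        coef["(Intercept)"]
        + coef["ageCent"] * age_cent
        + coef["currentSmoker1"] * current_smoker
        + coef["totCholCent"] * tot_chol_cent
    )
    odds = math.exp(log_odds)
    return log_odds, odds, odds / (1 + odds)


# Hypothetical patient: 10 years older than the mean age, current smoker,
# total cholesterol equal to the mean (so the centered value is 0).
log_odds, odds, p = predict(age_cent=10, current_smoker=1, tot_chol_cent=0)
print(f"log-odds = {log_odds:.3f}, odds = {odds:.3f}, probability = {p:.3f}")
```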

---

### Modeling risk of coronary heart disease

`$$\small{\log\Big(\frac{\hat{\pi}}{1-\hat{\pi}}\Big) = -2.111 + 0.081 \text{ageCent} + 0.447 \text{currentSmoker} + 0.003 \text{totCholCent}}$$`

.question[
Use the model to interpret the following. Write all interpretations in terms of the odds of a patient being high risk for coronary heart disease.

1. Interpret the intercept.
2. Interpret `ageCent` and its 95% confidence interval.
3. Interpret `currentSmoker1` and its 95% confidence interval.
]