class: center, middle, inverse, title-slide

# Categorical Predictors

### Yue Jiang

### STA 210 / Duke University / Spring 2023

---

### Coffee

<img src="img/coffee.jpg" width="80%" style="display: block; margin: auto;" />

---

### Coffee

<img src="categorical_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

### Coffee

<img src="categorical_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

### Coffee

<img src="categorical_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

### Review: Multiple regression

The model is given by

`\begin{align*}
y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \cdots + \beta_px_{ip} + \epsilon_i
\end{align*}`

- `\(p\)` is the total number of predictor or explanatory variables
- `\(y_i\)` is the outcome (dependent variable) of interest
- `\(\beta_0\)` is the intercept parameter
- `\(\beta_1, \beta_2, \cdots, \beta_p\)` are the slope parameters
- `\(x_{i1}, x_{i2}, \cdots, x_{ip}\)` are predictor variables
- `\(\epsilon_i\)` is the error (like the `\(\beta\)`s, it is not observed)

---

### Categorical predictors

We often have categorical predictors in modeling settings (for instance, the type of process used in coffee production). However, it might not make sense to think about a "one unit increase" in a categorical variable (how would that even work?).

In regression settings, we can account for categorical variables by creating .vocab[dummy variables]: indicator variables that equal 1 when a given condition holds and 0 otherwise.

When working with a categorical variable, one category is taken to be the .vocab[baseline] or .vocab[reference] level. All other categories are compared to it.

---

### Categorical predictors

Suppose the dry process is taken to be the reference level. Then we can create two .vocab[dummy variables]:

- other process: 1 if this condition is true; 0 otherwise
- wet process: 1 if this condition is true; 0 otherwise

Let's suppose we represent the data as follows. What coffee process was used for each of the following observations? Is it possible for an observation to have a 1 in both variables?

| Other | Wet |
| ----- | --- |
| 0     | 1   |
| 0     | 0   |
| 1     | 0   |
| 0     | 1   |
| 1     | 0   |

---

### Dummy variables

Suppose the dry process is taken to be the reference level. Then we can create two dummy variables:

- other process: 1 if this condition is true; 0 otherwise
- wet process: 1 if this condition is true; 0 otherwise

Keep in mind that we originally had one variable in the underlying data, but are using *two* variables to represent it in the dataset.

.question[
In general, how many dummy variables are needed to represent a categorical variable with `\(k\)` categories?
]

---

### Interpretation of dummy variables

Consider the model

`\begin{align*}
Score_i &= \beta_0 + \beta_1(I(other)_i) + \beta_2(I(wet)_i) + \epsilon_i
\end{align*}`

The parameter interpretations are below.

- `\(\beta_0\)` represents the expected score for a coffee with 0 for both dummy variables; that is, for a dry process coffee
- `\(\beta_1\)` represents the expected **difference** in score for a coffee with the "other" process, *compared* to the baseline level (dry process)
- `\(\beta_2\)` represents the expected difference in score for wet process coffee compared to dry process coffee

Again, we had to estimate *multiple* "slopes" for this one variable - one corresponding to each non-reference level.
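---

### Dummy variables in R

`lm()` creates these indicator columns automatically from a categorical (factor) variable, but it can help to look at them directly. Below is a minimal sketch using the same `coffee` data and `process` variable that are modeled on the following slides; the reference level is simply the first factor level, whatever its label is in your data.


```r
# the category labels; the first one listed is the reference level
levels(factor(coffee$process))

# model.matrix() shows the design matrix lm() actually uses:
# an intercept column plus one 0/1 indicator per non-reference level
head(model.matrix(~ process, data = coffee))

# relevel() changes which category serves as the baseline
# (replace "Dry" with the actual label of the level you want as reference)
# coffee$process <- relevel(factor(coffee$process), ref = "Dry")
```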
---

### More variables

Of course, we can include other variables as well. Take a look at the following dataset. What does each row and column represent?

| Score | Flavor | Other | Wet |
| ----- | ------ | ----- | --- |
| 93    | 92     | 0     | 1   |
| 91    | 91     | 0     | 0   |
| 92    | 92     | 1     | 0   |
| 92    | 88     | 0     | 1   |
| 89    | 85     | 1     | 0   |
| 90    | 90     | 0     | 0   |

`\begin{align*}
Score_i &= \beta_0 + \beta_1flavor_i + \beta_2(I(other)_i) + \beta_3(I(wet)_i) + \epsilon_i
\end{align*}`

.question[
How would you interpret each of those coefficients?
]

---

### Fitting the model


```r
m1 <- lm(score ~ flavor + process, data = coffee)
summary(m1)
```

```
## 
## Call:
## lm(formula = score ~ flavor + process, data = coffee)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4419 -0.3507  0.1730  0.7289  5.2934 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)             34.4155     1.0581  32.526  < 2e-16
## flavor                   6.3137     0.1385  45.586  < 2e-16
## processOther / Missing   0.3966     0.1414   2.805  0.00514
## processWashed / Wet      0.2339     0.1153   2.028  0.04283
## 
## Residual standard error: 1.382 on 919 degrees of freedom
## Multiple R-squared:  0.7025, Adjusted R-squared:  0.7015 
## F-statistic: 723.3 on 3 and 919 DF,  p-value: < 2.2e-16
```

---

### Fitting the model


```r
options(digits = 3)
m2 <- linear_reg() |>
  set_engine("lm") |>
  fit(score ~ flavor + process, data = coffee)
tidy(m2)
```

```
## # A tibble: 4 x 5
##   term                   estimate std.error statistic   p.value
##   <chr>                     <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)              34.4       1.06      32.5  4.89e-155
## 2 flavor                    6.31      0.139     45.6  4.00e-238
## 3 processOther / Missing    0.397     0.141      2.81 5.14e-  3
## 4 processWashed / Wet       0.234     0.115      2.03 4.28e-  2
```

---

### Explained variation

The strength of the fit of a linear model can be evaluated in many ways. One common statistic is `\(R^2\)`, the percentage of the variability in the response variable that is explained by the model. The remainder of the variability is considered "unexplained" by the model and associated with the residuals (errors).

`\(R^2\)` is sometimes called the .vocab[coefficient of determination] (especially in older sources).

.question[
What does "explained variability in the response variable" mean?
]

---

### Obtaining R-squared in R

`\(R^2\)` is reported by the `summary()` function in base `R`. Alternatively, in the tidymodels ecosystem, you can find a quick summary of many model metrics with the `glance()` function (we'll talk about some of the others later!)


```r
glance(m2)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squ~1 sigma stati~2   p.value    df logLik   AIC   BIC devia~3
##       <dbl>       <dbl> <dbl>   <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
## 1     0.702       0.702  1.38    723. 2.44e-241     3 -1607. 3223. 3247.   1756.
## # ... with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## #   variable names 1: adj.r.squared, 2: statistic, 3: deviance
```

Roughly 70% of the variability in the "total score" of cups of coffee can be explained by flavor rating and bean processing method.

---

### R-squared - basic principles

We can write explained variation using the following ratio of sums of squares:

`$$R^2 = 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \right)$$`

where `\(SS_{Error}\)` is the sum of squared residuals and `\(SS_{Total}\)` is the total sum of squares of the response variable (its total variability around its mean).

.question[
Why does this expression make sense?
]

---

### Adjusted R-squared

Note that `\(R^2\)` never decreases when **any** variable is added to the model, even if it's completely unrelated (like what shirt color you have on). To account for this, we can instead use the .vocab[adjusted] `\(R^2\)`:
`$$R^2_{adj} = 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \times \frac{n - 1}{n - k - 1} \right),$$`

where `\(n\)` is the number of observations and `\(k\)` is the number of predictors in the model.

Unlike `\(R^2\)`, the adjusted `\(R^2\)` does not increase when a new variable provides no new information or is completely unrelated to the outcome. However, we can no longer interpret it as the percentage of variation in the outcome explained by the model.

---

### R-squared recap

In our model, `\(R^2 \approx 0.7\)`, suggesting that about 70% of the variability in total score can be explained by our model. However, `\(R^2\)` can never decrease when variables are added to a model, even if they are useless. Thus, we can use *adjusted* `\(R^2\)`, where the adjustment is made to account for the number of predictors.

The adjusted `\(R^2\)` incorporates a penalty for each additional variable in a model, so that the adjusted `\(R^2\)` will go down if a new variable does not improve prediction much, and it will go up if the new variable does improve prediction, conditional on the other variables already in the model. (A quick by-hand check of both quantities is sketched on the final slide.)

With that said, `\(R^2\)` or adj. `\(R^2\)` should never be used as the only reason to select variables for your model - you must rely on scientific knowledge and context. More on variable selection later this semester!

---

### Your turn

.question[
Fit a linear regression model that aims to predict the total score of a cup of coffee based on the process type, the flavor rating, the aroma rating, and the country of origin.
]

1. Adjusting for aroma, flavor, and process, which country has the highest expected score? The lowest?
2. After adjusting for aroma, flavor, and country of origin, is there sufficient evidence to suggest a difference in mean score between dry vs. wet process beans? Explain, using a formal hypothesis test (be careful re: the distribution of the test statistic under `\(H_0\)`!).
3. What percentage of the variability in the total score is explained by a linear relationship with the predictors in your model?

<p style="text-align: center;">
<a href = "https://classroom.github.com/a/2qnnbNbT">https://classroom.github.com/a/2qnnbNbT</a>
</p>
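---

### Aside: computing R-squared by hand

To see where the numbers in the `summary()` and `glance()` output come from, here is a minimal sketch that recomputes `\(R^2\)` and adjusted `\(R^2\)` directly from the fitted model `m1` from earlier in these slides, using the formulas above; the results should match the model output up to rounding.


```r
# data actually used in the fit (lm() drops rows with missing values)
d <- model.frame(m1)
y <- model.response(d)

# sum of squared residuals and total sum of squares of the response
ss_error <- sum(residuals(m1)^2)
ss_total <- sum((y - mean(y))^2)

# R-squared: proportion of variability in score explained by the model
r2 <- 1 - ss_error / ss_total

# adjusted R-squared: same ratio, penalized by the number of predictors k
n <- length(y)
k <- length(coef(m1)) - 1   # predictor columns, not counting the intercept
r2_adj <- 1 - (ss_error / ss_total) * (n - 1) / (n - k - 1)

c(r.squared = r2, adj.r.squared = r2_adj)
```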