class: center, middle, inverse, title-slide

# Categorical Predictors

### Yue Jiang

### STA 210 / Duke University / Spring 2023

---

### Coffee

<img src="img/coffee.jpg" width="80%" style="display: block; margin: auto;" />

---

### Coffee

<img src="categorical_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

### Coffee

<img src="categorical_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

### Coffee

<img src="categorical_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

### Review: Multiple regression

The model is given by

`\begin{align*}
y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \cdots + \beta_px_{ip} + \epsilon_i
\end{align*}`

- `\(p\)` is the total number of predictor or explanatory variables
- `\(y_i\)` is the outcome (dependent variable) of interest
- `\(\beta_0\)` is the intercept parameter
- `\(\beta_1, \beta_2, \cdots, \beta_p\)` are the slope parameters
- `\(x_{i1}, x_{i2}, \cdots, x_{ip}\)` are predictor variables
- `\(\epsilon_i\)` is the error (like the `\(\beta\)`s, it is not observed)

---

### Categorical predictors

We often have categorical predictors in modeling settings (for instance, the type of process used in coffee production). However, it might not make sense to think about a "one unit increase" in a categorical variable (how would that even work?).

In regression settings, we can account for categorical variables by creating .vocab[dummy variables]: indicator variables that equal 1 when a given condition holds and 0 otherwise.

When working with a categorical variable, one category is taken to be the .vocab[baseline] or .vocab[reference] level. All other categories are compared to it.

---

### Categorical predictors

Suppose the dry process is taken to be the reference level. Then we can create two .vocab[dummy variables]:

- other process: 1 if this condition is true; 0 otherwise
- wet process: 1 if this condition is true; 0 otherwise

Let's suppose we represent the data as follows. What coffee process was used for each of the following observations? Is it possible for an observation to have a 1 in both variables?

| Other | Wet |
| ----- | --- |
| 0     | 1   |
| 0     | 0   |
| 1     | 0   |
| 0     | 1   |
| 1     | 0   |

---

### Dummy variables

Suppose the dry process is taken to be the reference level. Then we can create two dummy variables:

- other process: 1 if this condition is true; 0 otherwise
- wet process: 1 if this condition is true; 0 otherwise

Keep in mind that we originally had one variable in the underlying data, but are using *two* variables to represent it in the dataset.

.question[
In general, how many dummy variables are needed to represent a categorical variable with `\(k\)` categories?
]

---

### Interpretation of dummy variables

Consider the model

`\begin{align*}
Score_i &= \beta_0 + \beta_1(I(other)_i) + \beta_2(I(wet)_i) + \epsilon_i
\end{align*}`

The parameter interpretations are below.

- `\(\beta_0\)` represents the expected score for a coffee with 0 for both dummy variables; that is, for a dry process coffee
- `\(\beta_1\)` represents the expected **difference** in score for a coffee with the "other" process, *compared* to the baseline level (dry process)
- `\(\beta_2\)` represents the expected difference in score for wet process coffee compared to dry process coffee

Again, we had to estimate *multiple* "slopes" for this one variable - one corresponding to each non-reference level.
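---

### Dummy variables in R

`lm()` creates these indicator columns automatically from a categorical (factor) variable, but it can help to look at them directly. Below is a minimal sketch using the same `coffee` data and `process` variable that are modeled on the following slides; the reference level is simply the first factor level, whatever its label is in your data.


```r
# the category labels; the first one listed is the reference level
levels(factor(coffee$process))

# model.matrix() shows the design matrix lm() actually uses:
# an intercept column plus one 0/1 indicator per non-reference level
head(model.matrix(~ process, data = coffee))

# relevel() changes which category serves as the baseline
# (replace "Dry" with the actual label of the level you want as reference)
# coffee$process <- relevel(factor(coffee$process), ref = "Dry")
```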
---

### More variables

Of course, we can include other variables as well. Take a look at the following dataset. What does each row and column represent?

| Score | Flavor | Other | Wet |
| ----- | ------ | ----- | --- |
| 93    | 92     | 0     | 1   |
| 91    | 91     | 0     | 0   |
| 92    | 92     | 1     | 0   |
| 92    | 88     | 0     | 1   |
| 89    | 85     | 1     | 0   |
| 90    | 90     | 0     | 0   |

`\begin{align*}
Score_i &= \beta_0 + \beta_1flavor_i + \beta_2(I(other)_i) + \beta_3(I(wet)_i) + \epsilon_i
\end{align*}`

.question[
How would you interpret each of those coefficients?
]

---

### Fitting the model


```r
m1 <- lm(score ~ flavor + process, data = coffee)
summary(m1)
```

```
## 
## Call:
## lm(formula = score ~ flavor + process, data = coffee)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4419 -0.3507  0.1730  0.7289  5.2934 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)             34.4155     1.0581  32.526  < 2e-16
## flavor                   6.3137     0.1385  45.586  < 2e-16
## processOther / Missing   0.3966     0.1414   2.805  0.00514
## processWashed / Wet      0.2339     0.1153   2.028  0.04283
## 
## Residual standard error: 1.382 on 919 degrees of freedom
## Multiple R-squared:  0.7025, Adjusted R-squared:  0.7015 
## F-statistic: 723.3 on 3 and 919 DF,  p-value: < 2.2e-16
```

---

### Fitting the model


```r
options(digits = 3)
m2 <- linear_reg() |>
  set_engine("lm") |>
  fit(score ~ flavor + process, data = coffee)
tidy(m2)
```

```
## # A tibble: 4 x 5
##   term                   estimate std.error statistic   p.value
##   <chr>                     <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)              34.4       1.06      32.5  4.89e-155
## 2 flavor                    6.31      0.139     45.6  4.00e-238
## 3 processOther / Missing    0.397     0.141      2.81 5.14e-  3
## 4 processWashed / Wet       0.234     0.115      2.03 4.28e-  2
```

---

### Explained variation

The strength of the fit of a linear model can be evaluated in many ways. One common statistic is `\(R^2\)`, the percentage of the variability in the response variable that is explained by the model. The remainder of the variability is considered "unexplained" by the model and associated with the residuals (errors).

`\(R^2\)` is sometimes called the .vocab[coefficient of determination] (especially in older sources).

.question[
What does "explained variability in the response variable" mean?
]

---

### Obtaining R-squared in R

`\(R^2\)` is reported by the `summary()` function in base `R`. Alternatively, in the tidymodels ecosystem, you can find a quick summary of many model metrics with the `glance()` function (we'll talk about some of the others later!)


```r
glance(m2)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squ~1 sigma stati~2   p.value    df logLik   AIC   BIC devia~3
##       <dbl>       <dbl> <dbl>   <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
## 1     0.702       0.702  1.38    723. 2.44e-241     3 -1607. 3223. 3247.   1756.
## # ... with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## #   variable names 1: adj.r.squared, 2: statistic, 3: deviance
```

Roughly 70% of the variability in the "total score" of cups of coffee can be explained by flavor rating and bean processing method.

---

### R-squared - basic principles

We can write explained variation using the following ratio of sums of squares:

`$$R^2 = 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \right)$$`

where `\(SS_{Error}\)` is the sum of squared residuals and `\(SS_{Total}\)` is the total sum of squares of the response variable (its total variability around its mean).

.question[
Why does this expression make sense?
]

---

### Adjusted R-squared

Note that `\(R^2\)` never decreases when **any** variable is added to the model, even if it's completely unrelated (like what shirt color you have on). To account for this, we can instead use the .vocab[adjusted] `\(R^2\)`:
`$$R^2_{adj} = 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \times \frac{n - 1}{n - k - 1} \right),$$`

where `\(n\)` is the number of observations and `\(k\)` is the number of predictors in the model.

Unlike `\(R^2\)`, the adjusted `\(R^2\)` does not increase when a new variable provides no new information or is completely unrelated to the outcome. However, we can no longer interpret it as the percentage of variation in the outcome explained by the model.

---

### R-squared recap

In our model, `\(R^2 \approx 0.7\)`, suggesting that about 70% of the variability in total score can be explained by our model. However, `\(R^2\)` can never decrease when variables are added to a model, even if they are useless. Thus, we can use *adjusted* `\(R^2\)`, where the adjustment is made to account for the number of predictors.

The adjusted `\(R^2\)` incorporates a penalty for each additional variable in a model, so that the adjusted `\(R^2\)` will go down if a new variable does not improve prediction much, and it will go up if the new variable does improve prediction, conditional on the other variables already in the model. (A quick by-hand check of both quantities is sketched on the final slide.)

With that said, `\(R^2\)` or adj. `\(R^2\)` should never be used as the only reason to select variables for your model - you must rely on scientific knowledge and context. More on variable selection later this semester!

---

### Your turn

.question[
Fit a linear regression model that aims to predict the total score of a cup of coffee based on the process type, the flavor rating, the aroma rating, and the country of origin.
]

1. Adjusting for aroma, flavor, and process, which country has the highest expected score? The lowest?
2. After adjusting for aroma, flavor, and country of origin, is there sufficient evidence to suggest a difference in mean score between dry vs. wet process beans? Explain, using a formal hypothesis test (be careful re: the distribution of the test statistic under `\(H_0\)`!).
3. What percentage of the variability in the total score is explained by a linear relationship with the predictors in your model?

<p style="text-align: center;">
<a href = "https://classroom.github.com/a/2qnnbNbT">https://classroom.github.com/a/2qnnbNbT</a>
</p>
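---

### Aside: computing R-squared by hand

To see where the numbers in the `summary()` and `glance()` output come from, here is a minimal sketch that recomputes `\(R^2\)` and adjusted `\(R^2\)` directly from the fitted model `m1` from earlier in these slides, using the formulas above; the results should match the model output up to rounding.


```r
# data actually used in the fit (lm() drops rows with missing values)
d <- model.frame(m1)
y <- model.response(d)

# sum of squared residuals and total sum of squares of the response
ss_error <- sum(residuals(m1)^2)
ss_total <- sum((y - mean(y))^2)

# R-squared: proportion of variability in score explained by the model
r2 <- 1 - ss_error / ss_total

# adjusted R-squared: same ratio, penalized by the number of predictors k
n <- length(y)
k <- length(coef(m1)) - 1   # predictor columns, not counting the intercept
r2_adj <- 1 - (ss_error / ss_total) * (n - 1) / (n - k - 1)

c(r.squared = r2, adj.r.squared = r2_adj)
```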