class: center, middle, inverse, title-slide

# Classification
## Intro to Data Science
### Shawn Santo
### 03-03-20

---

## Load data

The `pokemon` dataset contains data on 896 Pokemon

- Includes Sword and Shield!
- No alternate forms (mega evolutions, regional forms, etc.)
- Contains information on Pokemon name, baseline battle statistics (base stats), and whether they are legendary (broadly defined).

```r
library(tidyverse)
pokemon <- read_csv("data/pokemon_cleaned.csv", na = c("n/a", "", "NA"))
```

---

## Data preview

```r
pokemon
```

```
#> # A tibble: 896 x 10
#>    n_dex_num name          hp   atk   def   spa   spd   spe total legendary
#>        <dbl> <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#>  1         1 Bulbasaur     45    49    49    65    65    45   318 No
#>  2         2 Ivysaur       60    62    63    80    80    60   405 No
#>  3         3 Venusaur      80    82    83   100   100    80   525 No
#>  4         4 Charmander    39    52    43    60    50    65   309 No
#>  5         5 Charmeleon    58    64    58    80    65    80   405 No
#>  6         6 Charizard     78    84    78   109    85   100   534 No
#>  7         7 Squirtle      44    48    65    50    64    43   314 No
#>  8         8 Wartortle     59    63    80    65    80    58   405 No
#>  9         9 Blastoise     79    83   100    85   105    78   530 No
#> 10        10 Caterpie      45    30    35    20    20    45   195 No
#> # … with 886 more rows
```

---

## Classification

- The previous lectures have focused on using linear regression as a tool to
  + make predictions about new observations, and
  + describe relationships between a response and a set of explanatory variables.

- These examples have all had *continuous* response variables

--

<br/><br/>

Our goal is to make statements about **categorical** response variables.

---

## Research goals

- Predict whether a Pokemon is legendary based on its base stats.

- Describe the relationship between stats and the *probability* of being legendary.

.pull-left[
![pokemon](img/pokemon.png)
]

.pull-right[
![pokemon](img/pokemon.png)
]

.tiny-text[
Pokemon and Pokemon names are trademarks of Nintendo.
]

---

class: center, middle, inverse

# k-nearest neighbors

---

## The general idea

Let's visualize attack and special attack statistics based on legendary status.

<img src="lec-09a-classification_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Classifying hypothetical Pokemon

Suppose we have some hypothetical Pokemon with the following attack and special attack. Would we classify them as legendary Pokemon?

<img src="lec-09a-classification_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## Classifying hypothetical Pokemon

Suppose we have some hypothetical Pokemon with the following attack and special attack. Would we classify them as legendary Pokemon?

<br/><br/>

Given a new data point, predict its class by taking the plurality vote of its *k* nearest neighbors in terms of *their* class memberships.

---

## A hypothetical Pokemon

<img src="lec-09a-classification_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Finding the "nearest" neighbors

<img src="lec-09a-classification_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Classification result

<img src="lec-09a-classification_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

## Distance functions

With only attack and special attack, we can visualize the "nearest" neighbors directly. But suppose we want to use all of a Pokemon's base stats for classification. How can we do that?

--

The k nearest neighbors are calculated based on a pre-specified **distance metric**. We will use **Euclidean distance**, but many other options are available.

--

Suppose `\(\mathbf{x}\)` and `\(\mathbf{y}\)` are p-dimensional vectors. The Euclidean distance between them is

`$$\begin{align*}
D(\mathbf{x}, \mathbf{y}) &= \sqrt{(x_1 - y_1)^2 + \cdots + (x_p - y_p)^2}\\
&= \sqrt{\sum_{i = 1}^p (x_i - y_i)^2}
\end{align*}$$`
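
---

## Euclidean distance in R

As a quick sketch, here is the Euclidean distance between Charizard's and Blastoise's base stats from the data preview:

```r
# Base stats taken from the data preview above
charizard <- c(hp = 78, atk = 84, def = 78, spa = 109, spd = 85, spe = 100)
blastoise <- c(hp = 79, atk = 83, def = 100, spa = 85, spd = 105, spe = 78)

# Square root of the sum of squared coordinate differences
sqrt(sum((charizard - blastoise)^2))  # about 44.1
```

k-NN computes this distance between a new observation and *every* training observation, then keeps the *k* smallest.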

---

## How do we choose *k*?

- A larger *k* reduces variance, but is computationally expensive and makes class boundaries fuzzier.

- A smaller *k* results in a sharp class boundary, but may be too sensitive to the local data structure.

- For binary classification, it's helpful to choose an odd *k* to avoid ties.

- A simple, commonly used heuristic is the square root of the sample size.

--

<br/><br/>

- A more robust and advanced procedure for choosing *k* is known as cross-validation. If you choose to do classification for your project, see me and I can help you find the "best" *k*.

---

## Classifying hypothetical Pokemon

Suppose we have some hypothetical Pokemon with the following features. Would we classify them as legendary Pokemon?

- Crapistat: HP = 55, ATK = 25, DEF = 30, SPA = 60, SPD = 50, SPE = 102
- Mediocra: HP = 90, ATK = 110, DEF = 130, SPA = 75, SPD = 80, SPE = 45
- Broaken: HP = 104, ATK = 125, DEF = 105, SPA = 148, SPD = 102, SPE = 136

<br/>

<b>
We will calculate their k nearest neighbors in 6-dimensional Euclidean space and take the plurality vote as their legendary status.
</b>

---

## Implementation in R

```r
# Create new test values for hypothetical Pokemon
new_pokemon <- tibble(hp  = c(55, 90, 104),
                      atk = c(25, 110, 125),
                      def = c(30, 130, 105),
                      spa = c(60, 75, 148),
                      spd = c(50, 80, 102),
                      spe = c(102, 45, 136))

train <- pokemon %>%
  select(hp, atk, def, spa, spd, spe)

legendary_status <- pokemon %>%
  pull(legendary)      # must pull classes as a vector!

library(class)
mod_knn <- knn(train, new_pokemon, legendary_status, k = 30,
               prob = FALSE, use.all = TRUE)
mod_knn
```

```
#> [1] No  No  Yes
#> Levels: No Yes
```
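
---

## Why *k* = 30?

The choice of `k = 30` above follows the square-root heuristic from a few slides back. A minimal sketch:

```r
# Square-root heuristic: n = 896 training Pokemon
sqrt(nrow(pokemon))   # about 29.9, so k = 30 is a reasonable starting point
```

Cross-validation could refine this choice, but 30 serves as a sensible default here.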
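
---

## A note on scale

The six base stats are already measured on comparable scales, so the distances above are meaningful. If the predictors were not (see the drawbacks ahead), one common fix is to standardize before computing distances. A sketch, not run on these slides:

```r
# Standardize the training predictors, then apply the *training*
# means and standard deviations to the new observations
train_scaled <- scale(train)
new_scaled   <- scale(new_pokemon,
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))

knn(train_scaled, new_scaled, legendary_status, k = 30)
```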

---

## Strengths

- Intuitive to understand and straightforward to implement

- Decision boundary can have an arbitrary shape

- Virtually assumption-free

- Easy to extend to multi-class problems (just take the plurality vote)

- Can be extended to add flexibility (e.g., weighting votes based on distance)

---

## Drawbacks

- Unbalanced class sizes are difficult to resolve, since rare classes are dominated throughout most of the predictor space

- Computationally intensive, since we must compute distances to all points

- Sensitive to high-variance predictors, irrelevant predictors, and outliers

- Completely ignores "far away" points

- Requires that predictors be comparable on the same scale (i.e., a distance of x in feature 1 must have the same meaning for feature 2)

- Need to determine *k* and choose a distance function (how can we choose a distance for categorical predictors?)

- Cannot deal with missing values

- Not suitable in high-dimensional settings

---

class: center, middle, inverse

# Logistic regression

---

## Regression difficulties...

Suppose we consider the following model for `\(p\)`, the probability of being a legendary Pokemon:

`$${p} = \beta_0 + \beta_1\times\mbox{hp} + \beta_2\times\mbox{atk} + \cdots + \beta_6\times\mbox{spe}$$`

<br/><br/>

What can go wrong here?

---

## Residuals

```r
library(broom)
pokemon <- pokemon %>%
  mutate(leg_bin = dplyr::if_else(legendary == "Yes", 1, 0))
lm_legendary <- lm(leg_bin ~ hp + atk + def + spa + spd + spe,
                   data = pokemon)
```

---

```r
ggplot(data = augment(lm_legendary), aes(x = .fitted, y = .resid)) +
  geom_point() +
  labs(x = "Predicted", y = "Residual", title = "Residual plot") +
  theme_bw(base_size = 16)
```

<img src="lec-09a-classification_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

## Predicted legendary status

.tiny[

```r
ggplot(data = augment(lm_legendary), aes(x = .fitted)) +
  geom_histogram(fill = "grey50", alpha = .5, color = "darkgreen") +
  labs(x = "Predicted Values", y = "Count") +
  theme_bw(base_size = 16)
```

<img src="lec-09a-classification_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
]

---

## From probabilities to log-odds

- Suppose the probability of an event is `\(p\)`.

- Then the **odds** that the event occurs is `\(\frac{p}{1-p}\)`.

- Taking the (natural) log of the odds, we have the *logit* of `\(p\)`:

`$$\mbox{logit}(p) = \log\left(\frac{p}{1-p}\right)$$`

--

Note that `\(p\)` is constrained to lie between 0 and 1, but `\(\mbox{logit}(p)\)` can range from `\(-\infty\)` to `\(\infty\)`. Let's instead consider the following linear model for the log-odds of `\(p\)`:

<br/>

`$${\mbox{logit}(p)} = \beta_0 + \beta_1\times\mbox{hp} + \beta_2\times\mbox{atk} + \cdots + \beta_6\times\mbox{spe}$$`

---

## Logistic regression

Since there is a one-to-one relationship between probabilities and log-odds, we can invert the previous function.

--

<br/>

If we fit a linear model on the **log-odds**, we can "work backwards" to obtain predicted probabilities that are guaranteed to lie between 0 and 1.

--

To "work backwards," we use the **logistic function**:

`$$f(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{1 + e^x}$$`

--

So, our *linear* model for `\(\mbox{logit}(p)\)` is equivalent to

`$$p = \frac{e^{\beta_0 + \beta_1\times\mbox{hp} + \beta_2\times\mbox{atk} + \cdots + \beta_6\times\mbox{spe}}}{1 + e^{\beta_0 + \beta_1\times\mbox{hp} + \beta_2\times\mbox{atk} + \cdots + \beta_6\times\mbox{spe}}}$$`

---

## Classification using logistic regression

- With logistic regression, we can obtain predicted *probabilities* of "success" for a yes/no variable.

- Mapping those to binary class probabilities, we have the predicted probability of being in a class.

- By choosing a cut-off value (say, classify as "Yes" if the probability is greater than 0.5), we can create a classifier (sketched a few slides ahead).

- This can be extended to more than 2 categories, but that is beyond the scope of our course (for the curious: multinomial regression).

---

## Returning to Pokemon...

```r
logit_mod <- glm(leg_bin ~ hp + atk + def + spa + spd + spe,
                 data = pokemon, family = "binomial")
pred_log_odds <- augment(logit_mod, newdata = new_pokemon) %>%
  pull(.fitted)
```

--

Let's work backwards to get predicted probabilities.

```r
pred_probs <- exp(pred_log_odds) / (1 + exp(pred_log_odds))
round(pred_probs, 3)
```

```
#> [1] 0.000 0.076 0.999
```

What can we conclude given these predicted probabilities?

---

## Interpreting coefficients

Once again, we have a *linear* model on a transformation of the response. We can interpret estimated coefficients in a familiar way:

```r
tidy(logit_mod) %>%
  select(term, estimate)
```

```
#> # A tibble: 7 x 2
#>   term        estimate
#>   <chr>          <dbl>
#> 1 (Intercept) -26.2
#> 2 hp            0.0447
#> 3 atk           0.0228
#> 4 def           0.0578
#> 5 spa           0.0492
#> 6 spd           0.0406
#> 7 spe           0.0599
```

Holding all other variables constant, for each unit increase in base speed, we would expect the log-odds of being legendary to increase by approximately 0.06.

--

Holding all other variables constant, a Pokemon with a base speed one unit larger than another's would have `\(\exp(0.06) \approx 1.06\)` times the odds of being legendary.
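
---

## Odds ratios in R

Exponentiating every slope turns a change in the log-odds into a multiplicative change in the odds. A quick sketch of that computation:

```r
# Multiplicative change in the odds of being legendary
# for a one-unit increase in each stat
exp(coef(logit_mod)) %>%
  round(3)
```

Each value tells us how many times larger the odds of being legendary become when that stat increases by one unit, holding the others constant (e.g., speed: exp(0.0599) is approximately 1.06).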
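
---

## From probabilities to classes

Instead of back-transforming by hand, `predict()` with `type = "response"` returns predicted probabilities directly; applying the 0.5 cut-off mentioned earlier then yields a classifier. A minimal sketch:

```r
# Same predicted probabilities as the manual computation above
pred_probs <- predict(logit_mod, newdata = new_pokemon, type = "response")

# Classify as legendary when the predicted probability exceeds 0.5
if_else(pred_probs > 0.5, "Yes", "No")
```

Only Broaken clears the 0.5 cut-off, matching the k-NN classification from earlier.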

---

## Strengths

- Linear model on a transformation of the response

- Can straightforwardly interpret coefficients

- Interpretation as log-odds has intuitive appeal in many settings (particularly in health-related or biomedical fields)

- Can easily handle both continuous and categorical predictors

- Can quantify the degree of uncertainty around a prediction

- Can be straightforwardly extended to high-dimensional cases

---

## Drawbacks

- Only considers a linear decision boundary between classes (k-NN can support an arbitrary boundary)

- Requires additional assumptions: independence of observations and a specific shape for the transformation (linearity in the log-odds)

- If predictors are highly correlated, coefficient estimates may be unreliable

---

## Application exercise

https://classroom.github.com/a/kF2w7Abz

1. Train the k-NN model using all existing Pokemon and create a logistic regression model for legendary status. For the following hypothetical Pokemon, classify them as being legendary vs. non-legendary using both k-NN and logistic regression. When using k-NN, try varying the chosen `\(k\)` and compare/contrast the results.

    - HP: 91, ATK: 134, DEF: 95, SPA: 100, SPD: 100, SPE: 80
    - HP: 30, ATK: 60, DEF: 180, SPA: 50, SPD: 180, SPE: 50
    - HP: 105, ATK: 95, DEF: 60, SPA: 95, SPD: 60, SPE: 90
    - HP: 45, ATK: 55, DEF: 60, SPA: 60, SPD: 50, SPE: 35
    - HP: 100, ATK: 130, DEF: 110, SPA: 90, SPD: 80, SPE: 100

<br/><br/><br/>

2. When using logistic regression, which Pokemon has the highest estimated probability of being legendary?

---

## References

1. *class* package documentation (2020). CRAN. Retrieved 2 March 2020, from https://cran.r-project.org/web/packages/class/class.pdf