class: center, middle, inverse, title-slide

# Classification

### Yue Jiang

### 03.02.20

---

layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

## Announcements

---

## Load data

The `pokemon` dataset contains data on 896 Pokemon

- Includes Sword and Shield!
- No alternate forms (mega evolutions, regional forms, etc.)
- Contains information on Pokemon name, baseline battle statistics (base stats), and whether they are legendary (broadly defined)

```r
pokemon <- read_csv("data/pokemon_cleaned.csv",
                    na = c("n/a", "", "NA"))
```

---

## Classification

- The previous lectures have focused on using linear regression as a tool to
    + Make predictions about new observations
    + Describe relationships
- These examples have all had *continuous* response variables

.question[
Our goal is to make statements about **categorical** response variables.
]

---

## Research goals

- Predict whether a Pokemon is legendary based on its base stats
- Describe the relationship between stats and the *probability* of being legendary

.pull-left[
<!-- image -->
]

.pull-right[
<!-- image -->
]

.small[
Pokemon and Pokemon names are trademarks of Nintendo.
]

---

class: center, middle

# k-nearest neighbors

---

## The general idea

Let's visualize attack and special attack statistics based on legendary status

<!-- figure: attack vs. special attack, colored by legendary status -->

---

## Classifying hypothetical Pokemon

Suppose we have some hypothetical Pokemon with the following attack and special attack values. Would we classify them as legendary Pokemon?

<!-- figure: hypothetical Pokemon overlaid on the scatterplot -->

--

.question[
Given a new data point, predict its class status by taking the plurality vote of its *k* nearest neighbors in terms of *their* class memberships.
]

---

## Classifying hypothetical Pokemon

<!-- figure -->

---

## Finding the "nearest" neighbors

<!-- figure -->

---

## Classification result

<!-- figure -->

---

## Distance functions

With only attack and special attack, we can visualize the "nearest" neighbors directly. But suppose we want to use all of the Pokemon's base stats for classification. How can we do that?

--

The k-nearest neighbors are calculated based on a pre-specified **distance metric**. We will use **Euclidean distance**, but many other options are available.

--

Suppose `\(\mathbf{x}\)` and `\(\mathbf{y}\)` are n-dimensional vectors. Then the Euclidean distance between them is

`$$\begin{align*}
D(\mathbf{x}, \mathbf{y}) &= \sqrt{(x_1 - y_1)^2 + \cdots + (x_n - y_n)^2}\\
&= \sqrt{\sum_{i = 1}^n (x_i - y_i)^2}
\end{align*}$$`

---

## How do we choose *k*?

- A larger *k* averages over more neighbors, which reduces variance, but it is computationally expensive and blurs the class boundaries
- A smaller *k* gives sharp class boundaries, but may be too sensitive to local data structure
- For binary classification, it's helpful to choose an odd *k* to avoid ties
- A commonly chosen simple approach is to use the square root of the sample size (see the sketch on the next slide)
- Using a method such as **cross-validation** is often good in practice -- if you are using this for your final project, come see me!
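---

## Choosing *k*: a quick sketch

For these data, the square-root heuristic gives `\(\sqrt{896} \approx 30\)`, the value of *k* used in the implementation later on. Below is a minimal sketch of the heuristic, plus a by-hand Euclidean distance between the first two hypothetical Pokemon introduced on the next slide:

```r
# Square-root-of-n heuristic for choosing k
round(sqrt(nrow(pokemon)))

# Euclidean distance between two hypothetical base-stat vectors
x <- c(55, 25, 30, 60, 50, 102)
y <- c(90, 110, 130, 75, 80, 45)
sqrt(sum((x - y)^2))
```

This same distance computation, repeated against every row of the training data, is what a k-NN implementation performs before taking the plurality vote.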
---

## Classifying hypothetical Pokemon

Suppose we have some hypothetical Pokemon with the following features. Would we classify them as legendary Pokemon?

- Crapistat: HP = 55, ATK = 25, DEF = 30, SPA = 60, SPD = 50, SPE = 102
- Mediocra: HP = 90, ATK = 110, DEF = 130, SPA = 75, SPD = 80, SPE = 45
- Literally Dragonite: HP = 91, ATK = 134, DEF = 95, SPA = 100, SPD = 100, SPE = 80
- Broaken: HP = 104, ATK = 125, DEF = 105, SPA = 148, SPD = 102, SPE = 136

.question[
We will calculate their k nearest neighbors in 6-dimensional Euclidean space and take the plurality vote as their legendary status.
]

---

## Implementation in R

```r
# Create new test values for hypothetical Pokemon
new_pokemon <- tibble(hp = c(55, 90, 91, 104),
                      atk = c(25, 110, 134, 125),
                      def = c(30, 130, 95, 105),
                      spa = c(60, 75, 100, 148),
                      spd = c(50, 80, 100, 102),
                      spe = c(102, 45, 80, 136))

train <- pokemon %>%
  select(hp, atk, def, spa, spd, spe)

which_leg <- pokemon %>%
  pull(legendary) # must pull classes as a vector!

library(class)
mod_knn <- knn(train, new_pokemon, which_leg, k = 30,
               prob = FALSE, use.all = TRUE)
mod_knn
```

```
## [1] No No Yes Yes
## Levels: No Yes
```

---

## Strengths

- Intuitive to understand and straightforward to implement
- Decision boundary can have an arbitrary shape
- Virtually assumption-free
- Easy to extend to multi-class problems (just take the plurality vote)
- Can be extended to add flexibility (e.g., weighting votes based on distance)

---

## Drawbacks

- Unbalanced class sizes are difficult to resolve, since rare classes are dominated in most of the predictor space
- Computationally intensive, since we must compute distances to all points
- Sensitive to high-variance predictors, irrelevant predictors, and outliers
- Completely ignores "far away" points
- Requires that predictors can be compared on the same scale (i.e., a distance of x in predictor 1 must have the same meaning for predictor 2) -- see the appendix for a sketch of one remedy
- Need to determine k and choose a distance function (how can we choose a distance for categorical predictors?)
- Cannot deal with missing values
- Breaks down in high dimensions

---

class: center, middle

# Logistic regression

---

## Regression difficulties...

Suppose we consider the following model for `\(p\)`, the probability of being a legendary Pokemon:

`$${p} = \beta_0 + \beta_1\times hp + \beta_2\times atk + \cdots + \beta_6\times spe$$`

.question[
What can go wrong here?
]

---

## Residuals

```r
library(broom)
pokemon <- pokemon %>%
  mutate(leg_bin = if_else(legendary == "Yes", 1, 0))
m2 <- lm(leg_bin ~ hp + atk + def + spa + spd + spe, data = pokemon)
resids <- augment(m2) %>%
  select(.resid)

ggplot(data = augment(m2), aes(x = .fitted, y = .resid)) +
  geom_point() +
  labs(x = "Predicted", y = "Residual", title = "Residual plot")
```

<!-- figure: residual plot -->

---

## Histogram of predicted binary legendary status

```r
ggplot(data = augment(m2), aes(x = .fitted)) +
  geom_histogram() +
  labs(x = "Predicted Values", y = "Count",
       title = "Histogram of predictions")
```

<!-- figure: histogram of predictions -->

---

## From probabilities to log-odds

- Suppose the probability of an event is `\(p\)`
- Then the **odds** that the event occurs is `\(\frac{p}{1-p}\)`
- Taking the (natural) log of the odds, we have the *logit* of `\(p\)`:

`$$logit(p) = \log\left(\frac{p}{1-p}\right)$$`

--

Note that `\(p\)` is constrained to lie between 0 and 1, but `\(logit(p)\)` can range from `\(-\infty\)` to `\(\infty\)`. Let's instead consider the following linear model for the log-odds of `\(p\)`:

`$${logit(p)} = \beta_0 + \beta_1\times hp + \beta_2\times atk + \cdots + \beta_6\times spe$$`
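---

## The logit in code

As a quick illustration of the mapping above, here is a minimal sketch of the logit and its inverse; the function names `logit` and `inv_logit` are illustrative, not from a package:

```r
# Log-odds (logit) and its inverse, the logistic function
logit <- function(p) log(p / (1 - p))
inv_logit <- function(x) exp(x) / (1 + exp(x))

p <- c(0.1, 0.5, 0.9)
logit(p)            # maps (0, 1) onto the whole real line
inv_logit(logit(p)) # recovers the original probabilities
```

Applying `inv_logit` to a fitted linear predictor is exactly the "working backwards" step described on the next slide.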
---

## Logistic regression

Since there is a one-to-one relationship between probabilities and log-odds, we can invert the previous function.

.question[
If we create a linear model on the **log-odds**, we can "work backwards" to obtain predicted probabilities that are guaranteed to lie between 0 and 1.
]

--

To "work backwards," we use the **logistic function**:

`$$f(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{1 + e^x}$$`

--

So, our *linear* model for `\(logit(p)\)` is equivalent to

`$$p = \frac{e^{\beta_0 + \beta_1\times hp + \beta_2\times atk + \cdots + \beta_6\times spe}}{1 + e^{\beta_0 + \beta_1\times hp + \beta_2\times atk + \cdots + \beta_6\times spe}}$$`

---

## Classification using logistic regression

- With logistic regression, we can obtain predicted *probabilities* of "success" for a yes/no variable
- These map directly onto predicted probabilities of class membership
- By instituting a cut-off value (say, if the probability is greater than 0.5), we can create a classifier (see the appendix for a sketch)
- This can be extended to more than 2 categories, but that is beyond the scope of our course (for the curious: multinomial regression)

---

## Returning to Pokemon...

```r
logit_mod <- glm(leg_bin ~ hp + atk + def + spa + spd + spe,
                 data = pokemon, family = "binomial")
pred_log_odds <- augment(logit_mod, newdata = new_pokemon) %>%
  pull(.fitted)
```

Let's work backwards to get predicted probabilities

```r
pred_probs <- exp(pred_log_odds) / (1 + exp(pred_log_odds))
round(pred_probs, 3)
```

```
## [1] 0.000 0.076 0.553 0.999
```

.question[
What can we conclude given these predicted probabilities?
]

---

## Interpreting coefficients

Once again, we have a *linear* model on a transformation of the response. We can interpret estimated coefficients in a familiar way:

```r
tidy(logit_mod) %>%
  select(term, estimate)
```

```
## # A tibble: 7 x 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept) -26.2
## 2 hp            0.0447
## 3 atk           0.0228
## 4 def           0.0578
## 5 spa           0.0492
## 6 spd           0.0406
## 7 spe           0.0599
```

Holding all other variables constant, for each unit increase in base speed, we would expect the log-odds of being legendary to increase by approximately 0.06.

--

Holding all else constant, a Pokemon with a base speed one unit larger than another's would have `\(\exp(0.06) \approx 1.06\)` times the odds of being legendary.

---

## Strengths

- Linear model of a transformation of the response
- Coefficients can be interpreted straightforwardly
- Interpretation in terms of log-odds has intuitive appeal in many settings (particularly in health-related or biomedical fields)
- Can easily handle both continuous and categorical predictors
- Can quantify the degree of uncertainty around a prediction
- Can be straightforwardly extended to high-dimensional cases

---

## Drawbacks

- Only considers a linear decision boundary between classes (k-NN can support an arbitrary boundary)
- Requires additional assumptions: independence, and a specific shape for the transformation (linearity in the log-odds)
- If predictors are highly correlated, coefficient estimates may be unreliable

---

## Your Turn

Go to [https://classroom.github.com/a/DjCAkxum](https://classroom.github.com/a/DjCAkxum). Clone this repository into RStudio Cloud. Don't forget to stage, commit, and push your changes.
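---

## Appendix: standardizing predictors for k-NN

One common remedy for the k-NN scaling drawback noted earlier is to standardize each predictor before computing distances. A minimal sketch, assuming the `train`, `new_pokemon`, and `which_leg` objects from the implementation slide; standardizing changes which neighbors are "nearest," so results may differ from the unscaled fit:

```r
# Standardize the training predictors, then apply the *same*
# centering and scaling to the new observations
train_std <- scale(train)
new_std <- scale(new_pokemon,
                 center = attr(train_std, "scaled:center"),
                 scale = attr(train_std, "scaled:scale"))

knn(train_std, new_std, which_leg, k = 30)
```

---

## Appendix: from probabilities to classes

As a follow-up to the cut-off idea, here is a minimal sketch that turns the fitted probabilities from `logit_mod` into 0/1 predictions using a 0.5 cut-off; the cut-off value and the name `pred_leg` are illustrative choices:

```r
# Fitted probabilities (rather than log-odds) for the original data
logit_aug <- augment(logit_mod, type.predict = "response")

# Classify with a 0.5 cut-off and cross-tabulate against the truth
logit_aug %>%
  mutate(pred_leg = if_else(.fitted > 0.5, 1, 0)) %>%
  count(leg_bin, pred_leg)
```

Counting agreements and disagreements between `leg_bin` and `pred_leg` gives a rough sense of in-sample accuracy.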