class: center, middle, inverse, title-slide

# Classification
## Intro to Data Science
### Shawn Santo
### 03-03-20

---

## Load data

The `pokemon` dataset contains data on 896 Pokemon

- Includes Sword and Shield!
- No alternate forms (mega evolutions, regional forms, etc.)
- Contains information on Pokemon name, baseline battle statistics (base stats), and whether they are legendary (broadly defined).

```r
library(tidyverse)
pokemon <- read_csv("data/pokemon_cleaned.csv", na = c("n/a", "", "NA"))
```

---

## Data preview

```r
pokemon
```

```
#> # A tibble: 896 x 10
#>    n_dex_num name          hp   atk   def   spa   spd   spe total legendary
#>        <dbl> <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#>  1         1 Bulbasaur     45    49    49    65    65    45   318 No
#>  2         2 Ivysaur       60    62    63    80    80    60   405 No
#>  3         3 Venusaur      80    82    83   100   100    80   525 No
#>  4         4 Charmander    39    52    43    60    50    65   309 No
#>  5         5 Charmeleon    58    64    58    80    65    80   405 No
#>  6         6 Charizard     78    84    78   109    85   100   534 No
#>  7         7 Squirtle      44    48    65    50    64    43   314 No
#>  8         8 Wartortle     59    63    80    65    80    58   405 No
#>  9         9 Blastoise     79    83   100    85   105    78   530 No
#> 10        10 Caterpie      45    30    35    20    20    45   195 No
#> # … with 886 more rows
```

---

## Classification

- The previous lectures have focused on using linear regression as a tool to
  + make predictions about new observations, and
  + describe relationships between a response and a set of explanatory variables.

- These examples have all had *continuous* response variables

--

<br/><br/>

Our goal is to make statements about **categorical** response variables.

---

## Research goals

- Predict whether a Pokemon is legendary based on its base stats.

- Describe the relationship between stats and the *probability* of being legendary.

.pull-left[
![pokemon](img/pokemon.png)
]

.pull-right[
![pokemon](img/pokemon.png)
]

.tiny-text[
Pokemon and Pokemon names are trademarks of Nintendo.
]

---

class: center, middle, inverse

# k-nearest neighbors

---

## The general idea

Let's visualize attack and special attack statistics based on legendary status.

<img src="lec-09a-classification_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Classifying hypothetical Pokemon

Suppose we have some hypothetical Pokemon with the following attack and special attack. Would we classify them as legendary Pokemon?

<img src="lec-09a-classification_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## Classifying hypothetical Pokemon

Suppose we have some hypothetical Pokemon with the following attack and special attack. Would we classify them as legendary Pokemon?

<br/><br/>

Given a new data point, predict its class by taking the plurality vote of its *k* nearest neighbors in terms of *their* class memberships.

---

## A hypothetical Pokemon

<img src="lec-09a-classification_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Finding the "nearest" neighbors

<img src="lec-09a-classification_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Classification result

<img src="lec-09a-classification_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

## Distance functions

With only attack and special attack, we can visualize the "nearest" neighbors directly. But suppose we want to use all of a Pokemon's base stats for classification. How can we do that?

--

The k nearest neighbors are calculated based on a pre-specified **distance metric**. We will use **Euclidean distance**, but many other options are available.

--

Suppose `\(\mathbf{x}\)` and `\(\mathbf{y}\)` are p-dimensional vectors. The Euclidean distance between them is

`$$\begin{align*}
D(\mathbf{x}, \mathbf{y}) &= \sqrt{(x_1 - y_1)^2 + \cdots + (x_p - y_p)^2}\\
&= \sqrt{\sum_{i = 1}^p (x_i - y_i)^2}
\end{align*}$$`
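
---

## Euclidean distance in R

As a quick sketch, here is the Euclidean distance between Charizard's and Blastoise's base stats from the data preview:

```r
# Base stats taken from the data preview above
charizard <- c(hp = 78, atk = 84, def = 78, spa = 109, spd = 85, spe = 100)
blastoise <- c(hp = 79, atk = 83, def = 100, spa = 85, spd = 105, spe = 78)

# Square root of the sum of squared coordinate differences
sqrt(sum((charizard - blastoise)^2))  # about 44.1
```

k-NN computes this distance between a new observation and *every* training observation, then keeps the *k* smallest.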

---

## How do we choose *k*?

- A larger *k* reduces variance, but is computationally expensive and makes class boundaries fuzzier.

- A smaller *k* results in a sharp class boundary, but may be too sensitive to the local data structure.

- For binary classification, it's helpful to choose an odd *k* to avoid ties.

- A simple, commonly used heuristic is the square root of the sample size.

--

<br/><br/>

- A more robust and advanced procedure for choosing *k* is known as cross-validation. If you choose to do classification for your project, see me and I can help you find the "best" *k*.

---

## Classifying hypothetical Pokemon

Suppose we have some hypothetical Pokemon with the following features. Would we classify them as legendary Pokemon?

- Crapistat: HP = 55, ATK = 25, DEF = 30, SPA = 60, SPD = 50, SPE = 102
- Mediocra: HP = 90, ATK = 110, DEF = 130, SPA = 75, SPD = 80, SPE = 45
- Broaken: HP = 104, ATK = 125, DEF = 105, SPA = 148, SPD = 102, SPE = 136

<br/>

<b>
We will calculate their k nearest neighbors in 6-dimensional Euclidean space and take the plurality vote as their legendary status.
</b>

---

## Implementation in R

```r
# Create new test values for hypothetical Pokemon
new_pokemon <- tibble(hp  = c(55, 90, 104),
                      atk = c(25, 110, 125),
                      def = c(30, 130, 105),
                      spa = c(60, 75, 148),
                      spd = c(50, 80, 102),
                      spe = c(102, 45, 136))

train <- pokemon %>%
  select(hp, atk, def, spa, spd, spe)

legendary_status <- pokemon %>%
  pull(legendary)      # must pull classes as a vector!

library(class)
mod_knn <- knn(train, new_pokemon, legendary_status, k = 30,
               prob = FALSE, use.all = TRUE)
mod_knn
```

```
#> [1] No  No  Yes
#> Levels: No Yes
```
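
---

## Why *k* = 30?

The choice of `k = 30` above follows the square-root heuristic from a few slides back. A minimal sketch:

```r
# Square-root heuristic: n = 896 training Pokemon
sqrt(nrow(pokemon))   # about 29.9, so k = 30 is a reasonable starting point
```

Cross-validation could refine this choice, but 30 serves as a sensible default here.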
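
---

## A note on scale

The six base stats are already measured on comparable scales, so the distances above are meaningful. If the predictors were not (see the drawbacks ahead), one common fix is to standardize before computing distances. A sketch, not run on these slides:

```r
# Standardize the training predictors, then apply the *training*
# means and standard deviations to the new observations
train_scaled <- scale(train)
new_scaled   <- scale(new_pokemon,
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))

knn(train_scaled, new_scaled, legendary_status, k = 30)
```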

---

## Strengths

- Intuitive to understand and straightforward to implement

- Decision boundary can have an arbitrary shape

- Virtually assumption-free

- Easy to extend to multi-class problems (just take the plurality vote)

- Can be extended to add flexibility (e.g., weighting votes based on distance)

---

## Drawbacks

- Unbalanced class sizes are difficult to resolve, since rare classes are dominated throughout most of the predictor space

- Computationally intensive, since we must compute distances to all points

- Sensitive to high-variance predictors, irrelevant predictors, and outliers

- Completely ignores "far away" points

- Requires that predictors be comparable on the same scale (i.e., a distance of x in feature 1 must have the same meaning for feature 2)

- Need to determine *k* and choose a distance function (how can we choose a distance for categorical predictors?)

- Cannot deal with missing values

- Not suitable in high-dimensional settings

---

class: center, middle, inverse

# Logistic regression

---

## Regression difficulties...

Suppose we consider the following model for `\(p\)`, the probability of being a legendary Pokemon:

`$${p} = \beta_0 + \beta_1\times\mbox{hp} + \beta_2\times\mbox{atk} + \cdots + \beta_6\times\mbox{spe}$$`

<br/><br/>

What can go wrong here?

---

## Residuals

```r
library(broom)
pokemon <- pokemon %>%
  mutate(leg_bin = dplyr::if_else(legendary == "Yes", 1, 0))
lm_legendary <- lm(leg_bin ~ hp + atk + def + spa + spd + spe,
                   data = pokemon)
```

---

```r
ggplot(data = augment(lm_legendary), aes(x = .fitted, y = .resid)) +
  geom_point() +
  labs(x = "Predicted", y = "Residual", title = "Residual plot") +
  theme_bw(base_size = 16)
```

<img src="lec-09a-classification_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

## Predicted legendary status

.tiny[

```r
ggplot(data = augment(lm_legendary), aes(x = .fitted)) +
  geom_histogram(fill = "grey50", alpha = .5, color = "darkgreen") +
  labs(x = "Predicted Values", y = "Count") +
  theme_bw(base_size = 16)
```

<img src="lec-09a-classification_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
]

---

## From probabilities to log-odds

- Suppose the probability of an event is `\(p\)`.

- Then the **odds** that the event occurs is `\(\frac{p}{1-p}\)`.

- Taking the (natural) log of the odds, we have the *logit* of `\(p\)`:

`$$\mbox{logit}(p) = \log\left(\frac{p}{1-p}\right)$$`

--

Note that `\(p\)` is constrained to lie between 0 and 1, but `\(\mbox{logit}(p)\)` can range from `\(-\infty\)` to `\(\infty\)`. Let's instead consider the following linear model for the log-odds of `\(p\)`:

<br/>

`$${\mbox{logit}(p)} = \beta_0 + \beta_1\times\mbox{hp} + \beta_2\times\mbox{atk} + \cdots + \beta_6\times\mbox{spe}$$`

---

## Logistic regression

Since there is a one-to-one relationship between probabilities and log-odds, we can invert the previous function.

--

<br/>

If we fit a linear model on the **log-odds**, we can "work backwards" to obtain predicted probabilities that are guaranteed to lie between 0 and 1.

--

To "work backwards," we use the **logistic function**:

`$$f(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{1 + e^x}$$`

--

So, our *linear* model for `\(\mbox{logit}(p)\)` is equivalent to

`$$p = \frac{e^{\beta_0 + \beta_1\times\mbox{hp} + \beta_2\times\mbox{atk} + \cdots + \beta_6\times\mbox{spe}}}{1 + e^{\beta_0 + \beta_1\times\mbox{hp} + \beta_2\times\mbox{atk} + \cdots + \beta_6\times\mbox{spe}}}$$`

---

## Classification using logistic regression

- With logistic regression, we can obtain predicted *probabilities* of "success" for a yes/no variable.

- Mapping those to binary class probabilities, we have the predicted probability of being in a class.

- By choosing a cut-off value (say, classify as "Yes" if the probability is greater than 0.5), we can create a classifier (sketched a few slides ahead).

- This can be extended to more than 2 categories, but that is beyond the scope of our course (for the curious: multinomial regression).

---

## Returning to Pokemon...

```r
logit_mod <- glm(leg_bin ~ hp + atk + def + spa + spd + spe,
                 data = pokemon, family = "binomial")
pred_log_odds <- augment(logit_mod, newdata = new_pokemon) %>%
  pull(.fitted)
```

--

Let's work backwards to get predicted probabilities.

```r
pred_probs <- exp(pred_log_odds) / (1 + exp(pred_log_odds))
round(pred_probs, 3)
```

```
#> [1] 0.000 0.076 0.999
```

What can we conclude given these predicted probabilities?

---

## Interpreting coefficients

Once again, we have a *linear* model on a transformation of the response. We can interpret estimated coefficients in a familiar way:

```r
tidy(logit_mod) %>%
  select(term, estimate)
```

```
#> # A tibble: 7 x 2
#>   term        estimate
#>   <chr>          <dbl>
#> 1 (Intercept) -26.2
#> 2 hp            0.0447
#> 3 atk           0.0228
#> 4 def           0.0578
#> 5 spa           0.0492
#> 6 spd           0.0406
#> 7 spe           0.0599
```

Holding all other variables constant, for each unit increase in base speed, we would expect the log-odds of being legendary to increase by approximately 0.06.

--

Holding all other variables constant, a Pokemon with a base speed one unit larger than another's would have `\(\exp(0.06) \approx 1.06\)` times the odds of being legendary.
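
---

## Odds ratios in R

Exponentiating every slope turns a change in the log-odds into a multiplicative change in the odds. A quick sketch of that computation:

```r
# Multiplicative change in the odds of being legendary
# for a one-unit increase in each stat
exp(coef(logit_mod)) %>%
  round(3)
```

Each value tells us how many times larger the odds of being legendary become when that stat increases by one unit, holding the others constant (e.g., speed: exp(0.0599) is approximately 1.06).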
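
---

## From probabilities to classes

Instead of back-transforming by hand, `predict()` with `type = "response"` returns predicted probabilities directly; applying the 0.5 cut-off mentioned earlier then yields a classifier. A minimal sketch:

```r
# Same predicted probabilities as the manual computation above
pred_probs <- predict(logit_mod, newdata = new_pokemon, type = "response")

# Classify as legendary when the predicted probability exceeds 0.5
if_else(pred_probs > 0.5, "Yes", "No")
```

Only Broaken clears the 0.5 cut-off, matching the k-NN classification from earlier.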

---

## Strengths

- Linear model on a transformation of the response

- Can straightforwardly interpret coefficients

- Interpretation as log-odds has intuitive appeal in many settings (particularly in health-related or biomedical fields)

- Can easily handle both continuous and categorical predictors

- Can quantify the degree of uncertainty around a prediction

- Can be straightforwardly extended to high-dimensional cases

---

## Drawbacks

- Only considers a linear decision boundary between classes (k-NN can support an arbitrary boundary)

- Requires additional assumptions: independence of observations and a specific shape for the transformation (linearity in the log-odds)

- If predictors are highly correlated, coefficient estimates may be unreliable

---

## Application exercise

https://classroom.github.com/a/kF2w7Abz

1. Train the k-NN model using all existing Pokemon and create a logistic regression model for legendary status. For the following hypothetical Pokemon, classify them as being legendary vs. non-legendary using both k-NN and logistic regression. When using k-NN, try varying the chosen `\(k\)` and compare/contrast the results.

    - HP: 91, ATK: 134, DEF: 95, SPA: 100, SPD: 100, SPE: 80
    - HP: 30, ATK: 60, DEF: 180, SPA: 50, SPD: 180, SPE: 50
    - HP: 105, ATK: 95, DEF: 60, SPA: 95, SPD: 60, SPE: 90
    - HP: 45, ATK: 55, DEF: 60, SPA: 60, SPD: 50, SPE: 35
    - HP: 100, ATK: 130, DEF: 110, SPA: 90, SPD: 80, SPE: 100

<br/><br/><br/>

2. When using logistic regression, which Pokemon has the highest estimated probability of being legendary?

---

## References

1. *class* package documentation (2020). CRAN. Retrieved 2 March 2020, from https://cran.r-project.org/web/packages/class/class.pdf