class: center, middle, inverse, title-slide

# Classification

### Yue Jiang

### 03.02.20

---

layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

## Announcements

---

## Load data

The `pokemon` dataset contains data on 896 Pokemon

- Includes Sword and Shield!
- No alternate forms (mega evolutions, regional forms, etc.)
- Contains information on Pokemon name, baseline battle statistics (base stats), and whether they are legendary (broadly defined)

```r
pokemon <- read_csv("data/pokemon_cleaned.csv",
                    na = c("n/a", "", "NA"))
```

---

## Classification

- The previous lectures have focused on using linear regression as a tool to
    + Make predictions about new observations
    + Describe relationships
- These examples have all had *continuous* response variables

.question[
Our goal is to make statements about **categorical** response variables.
]

---

## Research goals

- Predict whether a Pokemon is legendary based on its base stats
- Describe the relationship between stats and the *probability* of being legendary

.pull-left[
<!-- image -->
]

.pull-right[
<!-- image -->
]

.small[
Pokemon and Pokemon names are trademarks of Nintendo.
]

---

class: center, middle

# k-nearest neighbors

---

## The general idea

Let's visualize attack and special attack statistics based on legendary status

<!-- figure: attack vs. special attack, colored by legendary status -->

---

## Classifying hypothetical Pokemon

Suppose we have some hypothetical Pokemon with the following attack and special attack values. Would we classify them as legendary Pokemon?

<!-- figure: hypothetical Pokemon overlaid on the scatterplot -->

--

.question[
Given a new data point, predict its class status by taking the plurality vote of its *k* nearest neighbors in terms of *their* class memberships.
]

---

## Classifying hypothetical Pokemon

<!-- figure -->

---

## Finding the "nearest" neighbors

<!-- figure -->

---

## Classification result

<!-- figure -->

---

## Distance functions

With only attack and special attack, we can visualize the "nearest" neighbors directly. But suppose we want to use all of the Pokemon's base stats for classification. How can we do that?

--

The k-nearest neighbors are calculated based on a pre-specified **distance metric**. We will use **Euclidean distance**, but many other options are available.

--

Suppose `\(\mathbf{x}\)` and `\(\mathbf{y}\)` are n-dimensional vectors. Then the Euclidean distance between them is

`$$\begin{align*}
D(\mathbf{x}, \mathbf{y}) &= \sqrt{(x_1 - y_1)^2 + \cdots + (x_n - y_n)^2}\\
&= \sqrt{\sum_{i = 1}^n (x_i - y_i)^2}
\end{align*}$$`

---

## How do we choose *k*?

- A larger *k* averages over more neighbors, which reduces variance, but it is computationally expensive and blurs the class boundaries
- A smaller *k* gives sharp class boundaries, but may be too sensitive to local data structure
- For binary classification, it's helpful to choose an odd *k* to avoid ties
- A commonly chosen simple approach is to use the square root of the sample size (see the sketch on the next slide)
- Using a method such as **cross-validation** is often good in practice -- if you are using this for your final project, come see me!
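---

## Choosing *k*: a quick sketch

For these data, the square-root heuristic gives `\(\sqrt{896} \approx 30\)`, the value of *k* used in the implementation later on. Below is a minimal sketch of the heuristic, plus a by-hand Euclidean distance between the first two hypothetical Pokemon introduced on the next slide:

```r
# Square-root-of-n heuristic for choosing k
round(sqrt(nrow(pokemon)))

# Euclidean distance between two hypothetical base-stat vectors
x <- c(55, 25, 30, 60, 50, 102)
y <- c(90, 110, 130, 75, 80, 45)
sqrt(sum((x - y)^2))
```

This same distance computation, repeated against every row of the training data, is what a k-NN implementation performs before taking the plurality vote.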
---

## Classifying hypothetical Pokemon

Suppose we have some hypothetical Pokemon with the following features. Would we classify them as legendary Pokemon?

- Crapistat: HP = 55, ATK = 25, DEF = 30, SPA = 60, SPD = 50, SPE = 102
- Mediocra: HP = 90, ATK = 110, DEF = 130, SPA = 75, SPD = 80, SPE = 45
- Literally Dragonite: HP = 91, ATK = 134, DEF = 95, SPA = 100, SPD = 100, SPE = 80
- Broaken: HP = 104, ATK = 125, DEF = 105, SPA = 148, SPD = 102, SPE = 136

.question[
We will calculate their k nearest neighbors in 6-dimensional Euclidean space and take the plurality vote as their legendary status.
]

---

## Implementation in R

```r
# Create new test values for hypothetical Pokemon
new_pokemon <- tibble(hp = c(55, 90, 91, 104),
                      atk = c(25, 110, 134, 125),
                      def = c(30, 130, 95, 105),
                      spa = c(60, 75, 100, 148),
                      spd = c(50, 80, 100, 102),
                      spe = c(102, 45, 80, 136))

train <- pokemon %>%
  select(hp, atk, def, spa, spd, spe)

which_leg <- pokemon %>%
  pull(legendary) # must pull classes as a vector!

library(class)
mod_knn <- knn(train, new_pokemon, which_leg, k = 30,
               prob = FALSE, use.all = TRUE)
mod_knn
```

```
## [1] No No Yes Yes
## Levels: No Yes
```

---

## Strengths

- Intuitive to understand and straightforward to implement
- Decision boundary can have an arbitrary shape
- Virtually assumption-free
- Easy to extend to multi-class problems (just take the plurality vote)
- Can be extended to add flexibility (e.g., weighting votes based on distance)

---

## Drawbacks

- Unbalanced class sizes are difficult to resolve, since rare classes are dominated in most of the predictor space
- Computationally intensive, since we must compute distances to all points
- Sensitive to high-variance predictors, irrelevant predictors, and outliers
- Completely ignores "far away" points
- Requires that predictors can be compared on the same scale (i.e., a distance of x in predictor 1 must have the same meaning for predictor 2) -- see the appendix for a sketch of one remedy
- Need to determine k and choose a distance function (how can we choose a distance for categorical predictors?)
- Cannot deal with missing values
- Breaks down in high dimensions

---

class: center, middle

# Logistic regression

---

## Regression difficulties...

Suppose we consider the following model for `\(p\)`, the probability of being a legendary Pokemon:

`$${p} = \beta_0 + \beta_1\times hp + \beta_2\times atk + \cdots + \beta_6\times spe$$`

.question[
What can go wrong here?
]

---

## Residuals

```r
library(broom)
pokemon <- pokemon %>%
  mutate(leg_bin = if_else(legendary == "Yes", 1, 0))
m2 <- lm(leg_bin ~ hp + atk + def + spa + spd + spe, data = pokemon)
resids <- augment(m2) %>%
  select(.resid)

ggplot(data = augment(m2), aes(x = .fitted, y = .resid)) +
  geom_point() +
  labs(x = "Predicted", y = "Residual", title = "Residual plot")
```

<!-- figure: residual plot -->

---

## Histogram of predicted binary legendary status

```r
ggplot(data = augment(m2), aes(x = .fitted)) +
  geom_histogram() +
  labs(x = "Predicted Values", y = "Count",
       title = "Histogram of predictions")
```

<!-- figure: histogram of predictions -->

---

## From probabilities to log-odds

- Suppose the probability of an event is `\(p\)`
- Then the **odds** that the event occurs is `\(\frac{p}{1-p}\)`
- Taking the (natural) log of the odds, we have the *logit* of `\(p\)`:

`$$logit(p) = \log\left(\frac{p}{1-p}\right)$$`

--

Note that `\(p\)` is constrained to lie between 0 and 1, but `\(logit(p)\)` can range from `\(-\infty\)` to `\(\infty\)`. Let's instead consider the following linear model for the log-odds of `\(p\)`:

`$${logit(p)} = \beta_0 + \beta_1\times hp + \beta_2\times atk + \cdots + \beta_6\times spe$$`
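---

## The logit in code

As a quick illustration of the mapping above, here is a minimal sketch of the logit and its inverse; the function names `logit` and `inv_logit` are illustrative, not from a package:

```r
# Log-odds (logit) and its inverse, the logistic function
logit <- function(p) log(p / (1 - p))
inv_logit <- function(x) exp(x) / (1 + exp(x))

p <- c(0.1, 0.5, 0.9)
logit(p)            # maps (0, 1) onto the whole real line
inv_logit(logit(p)) # recovers the original probabilities
```

Applying `inv_logit` to a fitted linear predictor is exactly the "working backwards" step described on the next slide.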
---

## Logistic regression

Since there is a one-to-one relationship between probabilities and log-odds, we can invert the previous function.

.question[
If we create a linear model on the **log-odds**, we can "work backwards" to obtain predicted probabilities that are guaranteed to lie between 0 and 1.
]

--

To "work backwards," we use the **logistic function**:

`$$f(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{1 + e^x}$$`

--

So, our *linear* model for `\(logit(p)\)` is equivalent to

`$$p = \frac{e^{\beta_0 + \beta_1\times hp + \beta_2\times atk + \cdots + \beta_6\times spe}}{1 + e^{\beta_0 + \beta_1\times hp + \beta_2\times atk + \cdots + \beta_6\times spe}}$$`

---

## Classification using logistic regression

- With logistic regression, we can obtain predicted *probabilities* of "success" for a yes/no variable
- These map directly onto predicted probabilities of class membership
- By instituting a cut-off value (say, if the probability is greater than 0.5), we can create a classifier (see the appendix for a sketch)
- This can be extended to more than 2 categories, but that is beyond the scope of our course (for the curious: multinomial regression)

---

## Returning to Pokemon...

```r
logit_mod <- glm(leg_bin ~ hp + atk + def + spa + spd + spe,
                 data = pokemon, family = "binomial")
pred_log_odds <- augment(logit_mod, newdata = new_pokemon) %>%
  pull(.fitted)
```

Let's work backwards to get predicted probabilities

```r
pred_probs <- exp(pred_log_odds) / (1 + exp(pred_log_odds))
round(pred_probs, 3)
```

```
## [1] 0.000 0.076 0.553 0.999
```

.question[
What can we conclude given these predicted probabilities?
]

---

## Interpreting coefficients

Once again, we have a *linear* model on a transformation of the response. We can interpret estimated coefficients in a familiar way:

```r
tidy(logit_mod) %>%
  select(term, estimate)
```

```
## # A tibble: 7 x 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept) -26.2
## 2 hp            0.0447
## 3 atk           0.0228
## 4 def           0.0578
## 5 spa           0.0492
## 6 spd           0.0406
## 7 spe           0.0599
```

Holding all other variables constant, for each unit increase in base speed, we would expect the log-odds of being legendary to increase by approximately 0.06.

--

Holding all else constant, a Pokemon with a base speed one unit larger than another's would have `\(\exp(0.06) \approx 1.06\)` times the odds of being legendary.

---

## Strengths

- Linear model of a transformation of the response
- Coefficients can be interpreted straightforwardly
- Interpretation in terms of log-odds has intuitive appeal in many settings (particularly in health-related or biomedical fields)
- Can easily handle both continuous and categorical predictors
- Can quantify the degree of uncertainty around a prediction
- Can be straightforwardly extended to high-dimensional cases

---

## Drawbacks

- Only considers a linear decision boundary between classes (k-NN can support an arbitrary boundary)
- Requires additional assumptions: independence, and a specific shape for the transformation (linearity in the log-odds)
- If predictors are highly correlated, coefficient estimates may be unreliable

---

## Your Turn

Go to [https://classroom.github.com/a/DjCAkxum](https://classroom.github.com/a/DjCAkxum). Clone this repository into RStudio Cloud. Don't forget to stage, commit, and push your changes.
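---

## Appendix: standardizing predictors for k-NN

One common remedy for the k-NN scaling drawback noted earlier is to standardize each predictor before computing distances. A minimal sketch, assuming the `train`, `new_pokemon`, and `which_leg` objects from the implementation slide; standardizing changes which neighbors are "nearest," so results may differ from the unscaled fit:

```r
# Standardize the training predictors, then apply the *same*
# centering and scaling to the new observations
train_std <- scale(train)
new_std <- scale(new_pokemon,
                 center = attr(train_std, "scaled:center"),
                 scale = attr(train_std, "scaled:scale"))

knn(train_std, new_std, which_leg, k = 30)
```

---

## Appendix: from probabilities to classes

As a follow-up to the cut-off idea, here is a minimal sketch that turns the fitted probabilities from `logit_mod` into 0/1 predictions using a 0.5 cut-off; the cut-off value and the name `pred_leg` are illustrative choices:

```r
# Fitted probabilities (rather than log-odds) for the original data
logit_aug <- augment(logit_mod, type.predict = "response")

# Classify with a 0.5 cut-off and cross-tabulate against the truth
logit_aug %>%
  mutate(pred_leg = if_else(.fitted > 0.5, 1, 0)) %>%
  count(leg_bin, pred_leg)
```

Counting agreements and disagreements between `leg_bin` and `pred_leg` gives a rough sense of in-sample accuracy.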