class: center, middle, inverse, title-slide

# Prediction and model validation

### Dr. Çetinkaya-Rundel

### October 24, 2017

---

class: center, middle

# Getting started

---

## Getting started

- Peer evaluations due Wednesday 5pm
- New teams on Thursday

---

class: center, middle

# Model selection

---

## Data: Course evals

```r
library(tidyverse) # ggplot2 + dplyr + readr + some others
library(broom)
```

```r
# Load data
evals <- read_csv("data/evals-mod.csv")

# Calculate bty_avg
evals <- evals %>%
  rowwise() %>%
  mutate(bty_avg = mean(c(bty_f1lower, bty_f1upper, bty_f2upper,
                          bty_m1lower, bty_m1upper, bty_m2upper))) %>%
  ungroup()
```

---

## Full model

<div class="question">
What percent of the variability in evaluation scores is explained by the model?
</div>

```r
full_model <- lm(score ~ rank + ethnicity + gender + language + age +
                   cls_perc_eval + cls_did_eval + cls_students + cls_level +
                   cls_profs + cls_credits + bty_avg, data = evals)
glance(full_model)$r.squared
```

```
## [1] 0.1644867
```

```r
glance(full_model)$adj.r.squared
```

```
## [1] 0.1402959
```

---

## Akaike Information Criterion

$$ AIC = -2\log(L) + 2k $$

- `\(L\)`: likelihood of the model, i.e. the likelihood of seeing these data given the model parameters
- Applies a penalty for the number of parameters in the model, `\(k\)`
- Used for model selection: the lower the AIC, the better
- The value is not informative on its own
- In R, use `AIC(model)` to get the AIC

```r
AIC(full_model)
```

```
## [1] 695.7457
```
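---

## AIC by hand

Where does that number come from? Here is a minimal sketch of the formula above, assuming the `full_model` fit from the previous slide. `logLik()` gives the `\(\log(L)\)` term, and for `lm` models the parameter count `\(k\)` includes the estimated error variance:

```r
# AIC = -2 log(L) + 2k, with k = number of coefficients + 1 (error variance)
k <- length(coef(full_model)) + 1
-2 * as.numeric(logLik(full_model)) + 2 * k
```

This should reproduce the value reported by `AIC(full_model)`.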
---

## Model selection -- a little faster

```r
selected_model <- step(full_model, direction = "backward")
```

```r
AIC(selected_model)
```

```
## [1] 687.5712
```

---

## Parsimony

<div class="question">
Take a look at the variables in the full and the selected model. Can you guess why some of them may have been dropped? Remember: We like parsimonious models.
</div>

```r
full_model$call
```

```
## lm(formula = score ~ rank + ethnicity + gender + language + age + 
##     cls_perc_eval + cls_did_eval + cls_students + cls_level + 
##     cls_profs + cls_credits + bty_avg, data = evals)
```

```r
selected_model$call
```

```
## lm(formula = score ~ ethnicity + gender + language + age + cls_perc_eval + 
##     cls_credits + bty_avg, data = evals)
```

---

class: center, middle

# Prediction

---

## New observation

To make a prediction for a new observation we need to create a data frame with that observation.

<div class="question">
Suppose we want to make a prediction for a 35-year-old white woman professor who received her education in an English-speaking country and who teaches a multi-credit course. 80% of the students in her classes tend to fill out evaluations, and she's average looking (beauty score = 2.5).
<br><br>
The following won't work. Why? How would you correct it?
</div>

```r
new_prof <- data_frame(ethnicity = "white",
                       sex = "woman",
                       language = "English",
                       age = 35,
                       cls_perc_eval = 0.80,
                       cls_credits = "multi-credit",
                       bty_avg = 2.5)
```

---

## New observation, corrected

```r
new_prof <- data_frame(ethnicity = "not minority",
                       gender = "female",
                       language = "english",
                       age = 35,
                       cls_perc_eval = 0.80,
                       cls_credits = "multi credit",
                       bty_avg = 2.5)
```

---

## Prediction

```r
predict(selected_model, newdata = new_prof)
```

```
##        1 
## 3.642981
```
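A point estimate alone doesn't convey uncertainty. As a follow-up sketch (using the same `selected_model` and `new_prof` as above), `predict()` for `lm` models can also return a prediction interval:

```r
# 95% prediction interval for the new observation;
# the result has columns fit, lwr, and upr
predict(selected_model, newdata = new_prof,
        interval = "prediction", level = 0.95)
```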
---

class: center, middle

# Model validation

---

## Model validation

- One commonly used model validation technique is to partition your data into a training set and a testing set
- That is, fit the model on the training data
- And test it on the testing data
- Evaluate model performance using `\(RMSE\)`, root mean squared error

$$ RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} $$

<div class="question">
Do you think we should prefer low or high RMSE?
</div>

---

## Random sampling and reproducibility

Gotta set a seed!

```r
set.seed(25102017)
```

- Use different seeds from each other
- Need inspiration? https://www.random.org/

---

## Add an identifier column: `index`

```r
evals <- evals %>%
  mutate(index = 1:nrow(evals))
```

---

## 80% training, 20% testing

.pull-left[
Create training dataset:

```r
training_data <- evals %>%
  sample_frac(size = 0.8)
nrow(training_data)
```

```
## [1] 370
```
]

.pull-right[
Create testing dataset:

```r
testing_data <- evals %>%
  anti_join(training_data, by = "index")
nrow(testing_data)
```

```
## [1] 93
```
]

Check:

```r
(nrow(training_data) + nrow(testing_data)) == nrow(evals)
```

```
## [1] TRUE
```

---

## Your turn!

```r
# load packages
library(tidyverse)
library(broom)

# set seed
set.seed(123) # insert some value for the seed

# add index
evals <- evals %>%
  mutate(index = 1:nrow(evals))

# training data
training_data <- evals %>%
  sample_frac(size = 0.8)
nrow(training_data)

# testing data
testing_data <- evals %>%
  anti_join(training_data, by = "index")
nrow(testing_data)

# check
(nrow(training_data) + nrow(testing_data)) == nrow(evals)
```

---

## Step 1: Fit full model to training data

<div class="question">
Compare the output of your full model with your neighbors'. Are the parameter estimates the same? Different? Very different?
</div>

.small[
```r
full <- lm(score ~ rank + ethnicity + gender + language + age +
             cls_perc_eval + cls_did_eval + cls_students + cls_level +
             cls_profs + cls_credits + bty_avg, data = training_data)
full
```

```
## 
## Call:
## lm(formula = score ~ rank + ethnicity + gender + language + age + 
##     cls_perc_eval + cls_did_eval + cls_students + cls_level + 
##     cls_profs + cls_credits + bty_avg, data = training_data)
## 
## Coefficients:
##           (Intercept)       ranktenure track            ranktenured  
##             3.6748141             -0.1778738             -0.1075087  
## ethnicitynot minority             gendermale    languagenon-english  
##             0.1543111              0.1835562             -0.2465718  
##                   age          cls_perc_eval           cls_did_eval  
##            -0.0066084              0.0042989              0.0016229  
##          cls_students         cls_levelupper        cls_profssingle  
##            -0.0007173              0.0165106              0.0169324  
## cls_creditsone credit                bty_avg  
##             0.5098201              0.0691804
```

```r
glance(full)$r.squared
```

```
## [1] 0.1865635
```
]

---

## Step 2: Model selection

```r
selected <- step(full, direction = "backward")
```

<div class="question">
Compare your selected model with your teammates' selected models. Did you all end up with the same variables?
</div>

```r
selected$call
```

```
## lm(formula = score ~ rank + ethnicity + gender + language + age + 
##     cls_perc_eval + cls_credits + bty_avg, data = training_data)
```

---

## Step 3: Prediction and RMSE

Calculate predicted evaluation scores:

```r
y_hat <- predict(selected, newdata = testing_data)
```

Calculate RMSE:

.small[
```r
(rmse <- sqrt(sum((testing_data$score - y_hat)^2) / nrow(testing_data)))
```

```
## [1] 0.5140652
```
]

<div class="question">
What is your RMSE? How does it compare to others'?
</div>

---

## RMSE on training vs. testing

<div class="question">
Would you expect the RMSE to be higher for your training data or your testing data? Why?
</div>

---

## RMSE on training vs. testing

RMSE for testing:

.small[
```r
sqrt(sum((testing_data$score - y_hat)^2) / nrow(testing_data))
```

```
## [1] 0.5140652
```
]

RMSE for training:

.small[
```r
sqrt(sum((training_data$score - selected$fitted.values)^2) / nrow(training_data))
```

```
## [1] 0.4958672
```
]
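---

## Aside: RMSE as a function

The two calculations above differ only in their inputs, so they can be written once as a helper (a sketch; the `rmse` name below is ours, not part of the original code):

```r
# RMSE = sqrt(sum((y - y_hat)^2) / n), i.e. sqrt(mean((y - y_hat)^2))
rmse <- function(y, y_hat) {
  sqrt(mean((y - y_hat)^2))
}

rmse(testing_data$score, y_hat)                    # testing RMSE
rmse(training_data$score, selected$fitted.values)  # training RMSE
```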