class: center, middle, inverse, title-slide

# Model Validation

### Dr. Maria Tackett

### 04.01.19

---

### Announcements

- Lab 08 due Wed at 11:59p
- Team feedback #2 due Wed at 11:59p
- HW 06 due Mon, Apr 8

---

class: middle, center

### Multinomial Logistic Regression Example

### *Sesame Street*

---

### Example: *Sesame Street*

- We will analyze data from an experiment to test the effectiveness of the children's program *Sesame Street*.
- As part of the experiment, children were assigned to one of two groups: those who were encouraged to watch the program and those who were not.
- We want to understand what effect the encouragement had on the frequency of viewing, after adjusting for other factors.

---

### Response Variable

- <font class="vocab">`viewcat`</font>
    + 1: rarely watched the show
    + 2: once or twice a week
    + 3: three to five times a week
    + 4: watched the show on average more than five times a week

---

### Predictor Variables

- <font class="vocab">`age`:</font> child's age in months
- <font class="vocab">`prenumb`:</font> score on numbers pretest (0 to 54)
- <font class="vocab">`prelet`:</font> score on letters pretest (0 to 58)
- <font class="vocab">`viewenc`:</font> 1: encouraged to watch, 2: not encouraged
- <font class="vocab">`site`:</font>
    + 1: three to five year old from disadvantaged inner city area
    + 2: four year old from advantaged suburban area
    + 3: from advantaged rural area
    + 4: from disadvantaged rural area
    + 5: from Spanish-speaking home

.footnote[[Full data description](http://www2.stat.duke.edu/~jerry/sta210/sesamelab.html)]

---

class: middle, center

## [Sesame Street analysis](https://www2.stat.duke.edu/courses/Spring19/sta210.001/appex/18-multinom-logistic.html)

---

class: middle, center

## Model Validation

---

## Model Validation

- <font class="vocab">Goal: </font> find the set of variables that gives the lowest test (not training) error
- We want a model that is generalizable, i.e. one that can be used to make predictions for new observations
- If we have a large data set, we can achieve this goal by randomly splitting the data into training and test (validation) sets
- Use the training set to fit a model, then use the fitted model to predict the responses for the observations in the test set
- Assess the error on the test set and choose the model with the lowest test error

---

## Error

- **Quantitative response**: use Mean Square Error

.alert[
`$$MSE = \frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)^2$$`
]

--

- **Categorical response**: use misclassification rate

.alert[
`$$\text{Misclassification Rate} = \frac{1}{n}\sum\limits_{i=1}^{n}Err_{i} = \frac{1}{n}\sum\limits_{i=1}^{n} I(y_i \neq \hat{y}_i)$$`
]

---

### Training and test set

- There is no set split for the training and test sets. Common splits are
    - 50% training; 50% test
    - 80% training; 20% test
- Ways to assign observations to the training and test sets:
    - Random assignment (see the sketch on the next slide)
    - If the data are collected over time, pick a time point to make the split. Generally, use the earlier data for training and the later data for testing.
    - Use another relevant characteristic to make the split
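---

### Random training/test split in R

A minimal sketch of a random 80%/20% split, assuming a data frame called `my.data` (the same placeholder name used in the cross validation code later):

```r
set.seed(04012019)

# randomly select 80% of the rows for the training set
n <- nrow(my.data)
train_rows <- sample(seq_len(n), size = floor(0.8 * n))

train_data <- my.data[train_rows, ]   # training set
test_data  <- my.data[-train_rows, ]  # test set
```

A model fit on `train_data` with `lm()` (or `glm()`) can then be used with `predict()` to calculate the test error on `test_data`.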
---

## Cautions

- Be sure the training set is large enough to build a reliable model
    - The number of observations should be at least 10 times larger than the number of predictors
- Standard errors for model coefficients fit using the training data are larger than the standard errors we would get if the entire dataset were used
    - If the training set is reasonably large, then the difference in standard errors is small
- The test error is highly variable depending on which observations are in the test set

---

## k-fold Cross Validation

- There are numerous validation methods that address the variability in the testing error; we will focus on <font class="vocab">k-fold cross validation</font>
- There is a more in-depth discussion of model validation in STA 325

--

- **k-fold Cross Validation**
    + Randomly split the data into `\(k\)` folds (typically 5 or 10)
    + Use `\(k-1\)` folds to fit a model (this is the training data)
    + Assess how well the model predicts on the remaining fold (this is the test data)
    + Repeat `\(k\)` times, using a different fold as the test set each time

--

- Calculate the estimated testing error by averaging the `\(k\)` different error rates

--

- Once the variables for the final model have been selected, use the entire dataset to estimate the coefficients for the final model

---

### 5-fold Cross Validation in R

- Split the data into 5 folds. Don't forget to set a seed!

```r
library(modelr)
set.seed(04012019)
mydata_cv <- crossv_kfold(my.data, 5)
```

- Fit the model on each training set

```r
models <- map(mydata_cv$train, ~ lm(Y ~ X1 + ... + XP, data = .))
```

- Calculate the MSE on each test set

```r
# map2_dbl is in the purrr package, which is loaded with the tidyverse
test_mse <- map2_dbl(models, mydata_cv$test, mse)
```

---

## Example: Advertising

We want to use spending on TV, radio, and newspaper advertising ($thousands) to predict total sales ($millions). The data contain the advertising and sales figures for 200 markets.

```r
advertising <- read_csv("data/advertising.csv")
glimpse(advertising)
```

```
## Observations: 200
## Variables: 4
## $ tv        <dbl> 230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6…
## $ radio     <dbl> 37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2…
## $ newspaper <dbl> 69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 2…
## $ sales     <dbl> 22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.…
```

We'll start by looking at the 5-fold cross validation results for the model using the predictors `radio` and `newspaper`.

---

### Advertising: Split into 5 folds

.small[

```r
set.seed(04012019)
ad_cv <- crossv_kfold(advertising, 5)
ad_cv
```

```
## # A tibble: 5 x 3
##   train          test           .id  
##   <list>         <list>         <chr>
## 1 <S3: resample> <S3: resample> 1    
## 2 <S3: resample> <S3: resample> 2    
## 3 <S3: resample> <S3: resample> 3    
## 4 <S3: resample> <S3: resample> 4    
## 5 <S3: resample> <S3: resample> 5    
```
]

---

### Advertising: Fit models

- Fit the model on each training set

.small[

```r
models <- map(ad_cv$train, ~ lm(sales ~ radio + newspaper, data = .))
models
```

```
## $`1`
## 
## Call:
## lm(formula = sales ~ radio + newspaper, data = .)
## 
## Coefficients:
## (Intercept)        radio    newspaper  
##    9.378685     0.198167    -0.001513  
## 
## 
## $`2`
## 
## Call:
## lm(formula = sales ~ radio + newspaper, data = .)
## 
## Coefficients:
## (Intercept)        radio    newspaper  
##     8.74297      0.22174      0.01048  
## 
## 
## $`3`
## 
## Call:
## lm(formula = sales ~ radio + newspaper, data = .)
## 
## Coefficients:
## (Intercept)        radio    newspaper  
##    9.452519     0.195644     0.004422  
## 
## 
## $`4`
## 
## Call:
## lm(formula = sales ~ radio + newspaper, data = .)
## 
## Coefficients:
## (Intercept)        radio    newspaper  
##     8.97255      0.19969      0.01251  
## 
## 
## $`5`
## 
## Call:
## lm(formula = sales ~ radio + newspaper, data = .)
## 
## Coefficients:
## (Intercept)        radio    newspaper  
##    9.459099     0.178465     0.006133  
```
]
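---

### Advertising: What `mse()` computes

For a held-out fold, `modelr::mse()` applies the MSE formula from earlier: the average squared difference between the observed and predicted responses. A rough check for the first fold, assuming the `ad_cv` and `models` objects created above:

```r
# observations held out of the first fold, as a data frame
fold1_test <- as.data.frame(ad_cv$test[[1]])

# predictions for those observations from the model fit on the other folds
fold1_pred <- predict(models[[1]], newdata = fold1_test)

# average squared prediction error;
# should match mse(models[[1]], ad_cv$test[[1]])
mean((fold1_test$sales - fold1_pred)^2)
```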
---

### Advertising: Test error

.small[

```r
test_mse <- map2_dbl(models, ad_cv$test, mse)
test_mse
```

```
##        1        2        3        4        5 
## 22.84647 18.93572 15.38219 20.28784 16.52659
```

```r
(error_model1 <- mean(test_mse))
```

```
## [1] 18.79576
```
]

---

## Advertising

We will look at the cross-validation results for the model that includes `radio`, `newspaper`, and `tv` as predictors.

.small[

```r
models <- map(ad_cv$train, ~ lm(sales ~ radio + newspaper + tv, data = .))
```
]

.small[

```r
test_mse <- map2_dbl(models, ad_cv$test, mse)
test_mse
```

```
##        1        2        3        4        5 
## 4.797087 2.525156 2.012268 2.929083 2.381483
```

```r
(error_model2 <- mean(test_mse))
```

```
## [1] 2.929015
```
]

---

## Comparing Models

- The estimated testing error for
    - Model 1: `radio` and `newspaper`: 18.7957614
    - Model 2: `radio`, `newspaper`, `tv`: 2.9290151
- Model 2 performs better than Model 1 when predicting the sales for new markets.
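As noted earlier, once the variables for the final model have been selected, the entire dataset is used to estimate the coefficients. A minimal sketch for the selected model:

```r
# refit the chosen model (Model 2) using all 200 markets
final_model <- lm(sales ~ radio + newspaper + tv, data = advertising)
summary(final_model)
```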