class: center, middle, inverse, title-slide

# Model Validation

### Dr. Maria Tackett

### 04.01.19

---

### Announcements

- Lab 08 due Wed at 11:59p
- Team feedback #2 due Wed at 11:59p
- HW 06 due Mon, Apr 8

---

class: middle, center

### Multinomial Logistic Regression Example

### *Sesame Street*

---

### Example: *Sesame Street*

- We will analyze data from an experiment to test the effectiveness of the children's program *Sesame Street*.
- As part of the experiment, children were assigned to one of two groups: those who were encouraged to watch the program and those who were not.
- We want to understand what effect the encouragement had on the frequency of viewing, after adjusting for other factors.

---

### Response Variable

- <font class="vocab">`viewcat`</font>
    + 1: rarely watched the show
    + 2: once or twice a week
    + 3: three to five times a week
    + 4: watched the show on average more than five times a week

---

### Predictor Variables

- <font class="vocab">`age`:</font> child's age in months
- <font class="vocab">`prenumb`:</font> score on numbers pretest (0 to 54)
- <font class="vocab">`prelet`:</font> score on letters pretest (0 to 58)
- <font class="vocab">`viewenc`:</font> 1: encouraged to watch, 2: not encouraged
- <font class="vocab">`site`:</font>
    + 1: three to five year old from disadvantaged inner city area
    + 2: four year old from advantaged suburban area
    + 3: from advantaged rural area
    + 4: from disadvantaged rural area
    + 5: from Spanish-speaking home

.footnote[[Full data description](http://www2.stat.duke.edu/~jerry/sta210/sesamelab.html)]

---

class: middle, center

## [Sesame Street analysis](https://www2.stat.duke.edu/courses/Spring19/sta210.001/appex/18-multinom-logistic.html)

---

class: middle, center

## Model Validation

---

## Model Validation

- <font class="vocab">Goal: </font> find the set of variables that gives the lowest test (not training) error
- We want a model that is generalizable, i.e. one that can be used to make predictions for new observations
- If we have a large data set, we can achieve this goal by randomly splitting the data into training and test (validation) sets
- Use the training set to fit a model, then use the fitted model to predict the responses for the observations in the test set
- Assess the error on the test set and choose the model with the lowest test error

---

## Error

- **Quantitative response**: use Mean Square Error

.alert[
`$$MSE = \frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)^2$$`
]

--

- **Categorical response**: use misclassification rate

.alert[
`$$\text{Misclassification Rate} = \frac{1}{n}\sum\limits_{i=1}^{n}Err_{i} = \frac{1}{n}\sum\limits_{i=1}^{n} I(y_i \neq \hat{y}_i)$$`
]

---

### Training and test set

- There is no set split for the training and test sets. Common splits are
    - 50% training; 50% test
    - 80% training; 20% test
- Ways to assign observations to the training and test sets:
    - Random assignment (see the sketch on the next slide)
    - If the data are collected over time, pick a time point to make the split. Generally, use the earlier data for training and the later data for testing.
    - Use another relevant characteristic to make the split
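---

### Random training/test split in R

A minimal sketch of a random 80%/20% split, assuming a data frame called `my.data` (the same placeholder name used in the cross validation code later):

```r
set.seed(04012019)

# randomly select 80% of the rows for the training set
n <- nrow(my.data)
train_rows <- sample(seq_len(n), size = floor(0.8 * n))

train_data <- my.data[train_rows, ]   # training set
test_data  <- my.data[-train_rows, ]  # test set
```

A model fit on `train_data` with `lm()` (or `glm()`) can then be used with `predict()` to calculate the test error on `test_data`.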
---

## Cautions

- Be sure the training set is large enough to build a reliable model
    - The number of observations should be at least 10 times larger than the number of predictors
- Standard errors for model coefficients fit using the training data are larger than the standard errors we would get if the entire dataset were used
    - If the training set is reasonably large, then the difference in standard errors is small
- The test error is highly variable depending on which observations are in the test set

---

## k-fold Cross Validation

- There are numerous validation methods that address the variability in the testing error; we will focus on <font class="vocab">k-fold cross validation</font>
- There is a more in-depth discussion of model validation in STA 325

--

- **k-fold Cross Validation**
    + Randomly split the data into `\(k\)` folds (typically 5 or 10)
    + Use `\(k-1\)` folds to fit a model (this is the training data)
    + Assess how well the model predicts on the remaining fold (this is the test data)
    + Repeat `\(k\)` times, using a different fold as the test set each time

--

- Calculate the estimated testing error by averaging the `\(k\)` different error rates

--

- Once the variables for the final model have been selected, use the entire dataset to estimate the coefficients for the final model

---

### 5-fold Cross Validation in R

- Split the data into 5 folds. Don't forget to set a seed!

```r
library(modelr)
set.seed(04012019)
mydata_cv <- crossv_kfold(my.data, 5)
```

- Fit the model on each training set

```r
models <- map(mydata_cv$train, ~ lm(Y ~ X1 + ... + XP, data = .))
```

- Calculate the MSE on each test set

```r
# map2_dbl is in the purrr package, which is loaded with the tidyverse
test_mse <- map2_dbl(models, mydata_cv$test, mse)
```

---

## Example: Advertising

We want to use spending on TV, radio, and newspaper advertising ($thousands) to predict total sales ($millions). The data contain the advertising and sales figures for 200 markets.

```r
advertising <- read_csv("data/advertising.csv")
glimpse(advertising)
```

```
## Observations: 200
## Variables: 4
## $ tv        <dbl> 230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6…
## $ radio     <dbl> 37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6, 2.1, 2…
## $ newspaper <dbl> 69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6, 1.0, 2…
## $ sales     <dbl> 22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.…
```

We'll start by looking at the 5-fold cross validation results for the model using the predictors `radio` and `newspaper`.

---

### Advertising: Split into 5 folds

.small[

```r
set.seed(04012019)
ad_cv <- crossv_kfold(advertising, 5)
ad_cv
```

```
## # A tibble: 5 x 3
##   train          test           .id  
##   <list>         <list>         <chr>
## 1 <S3: resample> <S3: resample> 1    
## 2 <S3: resample> <S3: resample> 2    
## 3 <S3: resample> <S3: resample> 3    
## 4 <S3: resample> <S3: resample> 4    
## 5 <S3: resample> <S3: resample> 5    
```
]

---

### Advertising: Fit models

- Fit the model on each training set

.small[

```r
models <- map(ad_cv$train, ~ lm(sales ~ radio + newspaper, data = .))
models
```

```
## $`1`
## 
## Call:
## lm(formula = sales ~ radio + newspaper, data = .)
## 
## Coefficients:
## (Intercept)        radio    newspaper  
##    9.378685     0.198167    -0.001513  
## 
## 
## $`2`
## 
## Call:
## lm(formula = sales ~ radio + newspaper, data = .)
## 
## Coefficients:
## (Intercept)        radio    newspaper  
##     8.74297      0.22174      0.01048  
## 
## 
## $`3`
## 
## Call:
## lm(formula = sales ~ radio + newspaper, data = .)
## 
## Coefficients:
## (Intercept)        radio    newspaper  
##    9.452519     0.195644     0.004422  
## 
## 
## $`4`
## 
## Call:
## lm(formula = sales ~ radio + newspaper, data = .)
## 
## Coefficients:
## (Intercept)        radio    newspaper  
##     8.97255      0.19969      0.01251  
## 
## 
## $`5`
## 
## Call:
## lm(formula = sales ~ radio + newspaper, data = .)
## 
## Coefficients:
## (Intercept)        radio    newspaper  
##    9.459099     0.178465     0.006133  
```
]
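---

### Advertising: What `mse()` computes

For a held-out fold, `modelr::mse()` applies the MSE formula from earlier: the average squared difference between the observed and predicted responses. A rough check for the first fold, assuming the `ad_cv` and `models` objects created above:

```r
# observations held out of the first fold, as a data frame
fold1_test <- as.data.frame(ad_cv$test[[1]])

# predictions for those observations from the model fit on the other folds
fold1_pred <- predict(models[[1]], newdata = fold1_test)

# average squared prediction error;
# should match mse(models[[1]], ad_cv$test[[1]])
mean((fold1_test$sales - fold1_pred)^2)
```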
---

### Advertising: Test error

.small[

```r
test_mse <- map2_dbl(models, ad_cv$test, mse)
test_mse
```

```
##        1        2        3        4        5 
## 22.84647 18.93572 15.38219 20.28784 16.52659
```

```r
(error_model1 <- mean(test_mse))
```

```
## [1] 18.79576
```
]

---

## Advertising

We will look at the cross-validation results for the model that includes `radio`, `newspaper`, and `tv` as predictors.

.small[

```r
models <- map(ad_cv$train, ~ lm(sales ~ radio + newspaper + tv, data = .))
```
]

.small[

```r
test_mse <- map2_dbl(models, ad_cv$test, mse)
test_mse
```

```
##        1        2        3        4        5 
## 4.797087 2.525156 2.012268 2.929083 2.381483
```

```r
(error_model2 <- mean(test_mse))
```

```
## [1] 2.929015
```
]

---

## Comparing Models

- The estimated testing error for
    - Model 1: `radio` and `newspaper`: 18.7957614
    - Model 2: `radio`, `newspaper`, `tv`: 2.9290151
- Model 2 performs better than Model 1 when predicting the sales for new markets.
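As noted earlier, once the variables for the final model have been selected, the entire dataset is used to estimate the coefficients. A minimal sketch for the selected model:

```r
# refit the chosen model (Model 2) using all 200 markets
final_model <- lm(sales ~ radio + newspaper + tv, data = advertising)
summary(final_model)
```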