class: center, middle, inverse, title-slide

# Prediction + Model validation

### Dr. Çetinkaya-Rundel

### 2018-03-05

---

## Announcements

- Extra credit 01 announced -- requires you to do something today
    + But there's an alternative if you absolutely can't make the time today
    + Due Monday after spring break in class (on paper, please put your name on it!)
- HW 04 posted, due Monday after break at noon
    + Different from HW 03, so make sure to take a look
    + Limited help will be available over the break, and it might not be prompt
- Will email over break about peer evaluations, please check your email

---

class: center, middle

# Model selection

---

## Data: Course evals

```r
library(tidyverse)
library(broom)
library(modelr) # new!
```

```r
# Load data
evals <- read_csv("data/evals-mod.csv")

# Calculate bty_avg
evals <- evals %>%
  rowwise() %>%
  mutate(bty_avg = mean(c(bty_f1lower, bty_f1upper, bty_f2upper,
                          bty_m1lower, bty_m1upper, bty_m2upper))) %>%
  ungroup()
```

---

## Full model

.question[
What percent of the variability in evaluation scores is explained by the model?
]

```r
full_model <- lm(score ~ rank + ethnicity + gender + language + age +
                   cls_perc_eval + cls_did_eval + cls_students + cls_level +
                   cls_profs + cls_credits + bty_avg, data = evals)
glance(full_model)$r.squared
```

```
## [1] 0.16
```

```r
glance(full_model)$adj.r.squared
```

```
## [1] 0.14
```

---

## Akaike Information Criterion

$$ AIC = -2\log(L) + 2k $$

- `\(L\)`: likelihood of the model
    + Likelihood of seeing these data given the estimated model parameters
    + We won't go into calculating it in this course (but you will in future courses)
- Used for model selection: the lower the AIC, the better
    + The value is not informative on its own
- Applies a penalty for the number of parameters in the model, `\(k\)`
    + A different penalty than adjusted `\(R^2\)`, but a similar idea
- In R, use `AIC(model)` to get the AIC (a quick check against the formula is on the next slide)

```r
glance(full_model)$AIC
```

```
## [1] 696
```
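---

## Aside -- AIC from the formula

Since the formula only needs the log-likelihood `\(\log(L)\)` and the number of parameters `\(k\)`, we can recompute AIC by hand. This is a minimal sketch, assuming `full_model` from the previous slide; note that R counts the estimated error variance of the linear model as one of the `\(k\)` parameters.

```r
# Sketch: recompute AIC as -2 log(L) + 2k
loglik <- logLik(full_model)     # log-likelihood of the fitted model
k      <- attr(loglik, "df")     # number of estimated parameters (incl. sigma)
-2 * as.numeric(loglik) + 2 * k  # should match AIC(full_model) and glance(full_model)$AIC
```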
---

## Model selection -- a little faster

```r
selected_model <- step(full_model, direction = "backward")
```

```r
tidy(selected_model) %>%
  select(term, estimate)
```

```
##                    term estimate
## 1           (Intercept)   3.4470
## 2 ethnicitynot minority   0.2047
## 3            gendermale   0.1848
## 4   languagenon-english  -0.1615
## 5                   age  -0.0050
## 6         cls_perc_eval   0.0051
## 7 cls_creditsone credit   0.5151
## 8               bty_avg   0.0650
```

---

## Selected variables

| variable      | selected    |
| ------------- | :----------:|
| rank          |             |
| ethnicity     | x           |
| gender        | x           |
| language      | x           |
| age           | x           |
| cls_perc_eval | x           |
| cls_did_eval  |             |
| cls_students  |             |
| cls_level     |             |
| cls_profs     |             |
| cls_credits   | x           |
| bty_avg       | x           |

---

## Coefficient interpretation

.question[
Interpret the slopes of `gender` and `bty_avg` in context of the data.
]

```r
tidy(selected_model) %>%
  select(term, estimate) %>%
  mutate(estimate = round(estimate, 3))
```

```
##                    term estimate
## 1           (Intercept)    3.447
## 2 ethnicitynot minority    0.205
## 3            gendermale    0.185
## 4   languagenon-english   -0.161
## 5                   age   -0.005
## 6         cls_perc_eval    0.005
## 7 cls_creditsone credit    0.515
## 8               bty_avg    0.065
```

---

## AIC

```r
glance(full_model)$AIC
```

```
## [1] 696
```

```r
glance(selected_model)$AIC
```

```
## [1] 688
```

---

## Parsimony

<div class="question">
Take a look at the variables in the full and the selected model. Can you guess why some of them may have been dropped? Remember: We like parsimonious models.
</div>

.small[
| variable      | selected    |
| ------------- | :----------:|
| rank          |             |
| ethnicity     | x           |
| gender        | x           |
| language      | x           |
| age           | x           |
| cls_perc_eval | x           |
| cls_did_eval  |             |
| cls_students  |             |
| cls_level     |             |
| cls_profs     |             |
| cls_credits   | x           |
| bty_avg       | x           |
]

---

class: center, middle

# Model validation

---

## Overfitting

- The data we are using to construct our models come from a larger population.
- Ultimately we want our model to tell us how the population works, not just the sample we have.
- If the model we fit is too tailored to our sample, it might not perform as well on the rest of the population. This means the model is "overfitting" our data.
- We measure this using **model validation** techniques.
- Note: Overfitting is not a huge concern for linear models with low-level interactions; however, it can be for more complex and flexible models. What follows is just an example of model validation, even though we're using it in a scenario where the concern for overfitting is not high.

---

## Model validation

- One commonly used model validation technique is to partition your data into a training set and a testing set
    + That is, fit the model on the training data
    + And test it on the testing data
- Evaluate model performance using `\(RMSE\)`, the root mean squared error (a single-split sketch is on the next slide)

$$ RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} $$

.question[
Do you think we should prefer low or high RMSE?
]
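---

## Aside -- a single training/testing split

Before we move on to k-fold cross validation, here is a minimal sketch of a single training/testing split of `evals`. The 80/20 proportion, the seed, and the small model formula below are arbitrary choices for illustration, not part of the analysis in these slides.

```r
# Sketch: one random 80/20 training/testing split of evals
set.seed(1234)                                      # any seed, for reproducibility
train_ids <- sample(1:nrow(evals), size = floor(0.8 * nrow(evals)))
train <- evals[train_ids, ]                         # 80% of rows, used for fitting
test  <- evals[-train_ids, ]                        # remaining 20%, used for testing

mod <- lm(score ~ bty_avg + gender, data = train)   # fit on training data only
rmse(mod, test)                                     # modelr::rmse(model, data) on the test set
```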
---

## Random sampling and reproducibility

Gotta set a seed!

```r
set.seed(3518)
```

- Use different seeds from each other
- Need inspiration? https://www.random.org/

---

## Cross validation

More specifically, **k-fold cross validation**:

- Split your data into k folds.
- Use 1 fold for testing and the remaining (k - 1) folds for training.
- Repeat k times.

---

## Aside -- the modulo operator

```r
9 %% 5
```

```
## [1] 4
```

--

.pull-left[
]

--

.pull-right[
```r
(1:8) %% 5
```

```
## [1] 1 2 3 4 0 1 2 3
```

```r
((1:8) - 1) %% 5
```

```
## [1] 0 1 2 3 4 0 1 2
```

```r
(((1:8) - 1) %% 5) + 1
```

```
## [1] 1 2 3 4 5 1 2 3
```
]

---

## Prepping your data for 5-fold CV

```r
evals_cv <- evals %>%
  mutate(id = 1:n()) %>%
  sample_n(nrow(evals)) %>%
  mutate(fold = (((1:n()) - 1) %% 5) + 1)

evals_cv %>%
  count(fold)
```

```
## # A tibble: 5 x 2
##    fold     n
##   <dbl> <int>
## 1  1.00    93
## 2  2.00    93
## 3  3.00    93
## 4  4.00    92
## 5  5.00    92
```

---

## CV 1

```r
test_fold <- 1
test <- evals_cv %>% filter(fold == test_fold)
train <- evals_cv %>% anti_join(test, by = "id")
mod <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval +
            cls_credits + bty_avg, data = train)
(rmse_test1 <- rmse(mod, test))
```

```
## [1] 0.53
```

---

## RMSE on training vs. testing

.question[
Would you expect the RMSE to be higher for your training data or your testing data? Why?
]

---

## RMSE on training vs. testing

RMSE for testing:
.small[
```r
(rmse_test1 <- rmse(mod, test))
```

```
## [1] 0.53
```
]

RMSE for training:
.small[
```r
(rmse_train1 <- rmse(mod, train))
```

```
## [1] 0.49
```
]

---

## CV 2

```r
test_fold <- 2
test <- evals_cv %>% filter(fold == test_fold)
train <- evals_cv %>% anti_join(test, by = "id")
mod <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval +
            cls_credits + bty_avg, data = train)
```

```r
(rmse_test2 <- rmse(mod, test))
```

```
## [1] 0.48
```

```r
(rmse_train2 <- rmse(mod, train))
```

```
## [1] 0.5
```

---

## CV 3

```r
test_fold <- 3
test <- evals_cv %>% filter(fold == test_fold)
train <- evals_cv %>% anti_join(test, by = "id")
mod <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval +
            cls_credits + bty_avg, data = train)
```

```r
(rmse_test3 <- rmse(mod, test))
```

```
## [1] 0.53
```

```r
(rmse_train3 <- rmse(mod, train))
```

```
## [1] 0.49
```

---

## CV 4

```r
test_fold <- 4
test <- evals_cv %>% filter(fold == test_fold)
train <- evals_cv %>% anti_join(test, by = "id")
mod <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval +
            cls_credits + bty_avg, data = train)
```

```r
(rmse_test4 <- rmse(mod, test))
```

```
## [1] 0.47
```

```r
(rmse_train4 <- rmse(mod, train))
```

```
## [1] 0.51
```

---

## CV 5

```r
test_fold <- 5
test <- evals_cv %>% filter(fold == test_fold)
train <- evals_cv %>% anti_join(test, by = "id")
mod <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval +
            cls_credits + bty_avg, data = train)
```

```r
(rmse_test5 <- rmse(mod, test))
```

```
## [1] 0.52
```

```r
(rmse_train5 <- rmse(mod, train))
```

```
## [1] 0.5
```

---

## Putting it all together

.small[
```r
rmse_evals <- tibble(
  test_fold  = 1:5,
  rmse_train = c(rmse_train1, rmse_train2, rmse_train3, rmse_train4, rmse_train5),
  rmse_test  = c(rmse_test1, rmse_test2, rmse_test3, rmse_test4, rmse_test5)
)
```

```r
ggplot(data = rmse_evals, mapping = aes(x = test_fold, y = rmse_test)) +
  geom_point() +
  geom_line()
```

<!-- -->
]
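---

## Aside -- 5-fold CV without the copy-paste

The five CV slides repeat the same code with only the fold number changing. One possible refactor -- a sketch, not how we did it above -- is to wrap that code in a function and iterate over the folds with `purrr::map_dbl()` (loaded with the tidyverse). The helper name `cv_rmse` is made up for this example; the model formula and `evals_cv` are the same as on the previous slides.

```r
# Sketch: the repeated CV code as a function of the test fold
cv_rmse <- function(test_fold) {
  test  <- evals_cv %>% filter(fold == test_fold)
  train <- evals_cv %>% anti_join(test, by = "id")
  mod   <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval +
                cls_credits + bty_avg, data = train)
  rmse(mod, test)       # test RMSE for this fold
}

map_dbl(1:5, cv_rmse)   # same five values as rmse_test1, ..., rmse_test5
```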
---

## How does RMSE compare to y?

- `score` summary stats:

```
## # A tibble: 1 x 6
##     min   max  mean   med    sd   IQR
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2.30  5.00  4.17  4.30 0.544 0.800
```

- `rmse_test` summary stats:

```
## # A tibble: 1 x 6
##     min   max  mean   med     sd    IQR
##   <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>
## 1 0.473 0.535 0.508 0.520 0.0301 0.0558
```

---

class: center, middle

# Prediction

---

## New observation

To make a prediction for a new observation we need to create a data frame with that observation.

<div class="question">
Suppose we want to make a prediction for a 35-year-old white woman professor who received her education in an English-speaking country and who teaches a multi-credit course. 80% of her classes tend to fill out evaluations, and she's average looking (beauty score = 2.5).
<br><br>
The following won't work. Why? How would you correct it?
</div>

```r
new_prof <- data_frame(ethnicity = "white",
                       sex = "woman",
                       language = "English",
                       age = 35,
                       cls_perc_eval = 0.80,
                       cls_credits = "multi-credit",
                       bty_avg = 2.5)
```

---

## New observation, corrected

```r
new_prof <- data_frame(ethnicity = "not minority",
                       gender = "female",
                       language = "english",
                       age = 35,
                       cls_perc_eval = 0.80,
                       cls_credits = "multi credit",
                       bty_avg = 2.5)
```

---

## Prediction

```r
predict(selected_model, newdata = new_prof)
```

```
##   1 
## 3.6
```

---

## Uncertainty around prediction

- Confidence interval around `\(\bar{y}\)` for new data (average score for professors with the given characteristics):

```r
predict(selected_model, newdata = new_prof, interval = "confidence")
```

```
##   fit lwr upr
## 1 3.6 3.4 3.9
```

- Prediction interval around `\(\hat{y}\)` for new data (score for an individual professor with the given characteristics):

```r
predict(selected_model, newdata = new_prof, interval = "prediction")
```

```
##   fit lwr upr
## 1 3.6 2.6 4.7
```
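---

## Prediction by hand -- a rough check

As a sketch of what `predict()` is doing, we can plug the new professor into the rounded coefficients of the selected model shown earlier: multiply each estimate by her value for that term (0/1 for the indicator variables) and add everything up. Because the coefficients are rounded, this only approximately reproduces the 3.6 above.

```r
# Sketch: point prediction from the rounded coefficients
3.447 +            # intercept
  0.205 * 1 +      # ethnicitynot minority: she is not a minority
  0.185 * 0 +      # gendermale: she is female
  (-0.161) * 0 +   # languagenon-english: her education was in English
  (-0.005) * 35 +  # age
  0.005 * 0.80 +   # cls_perc_eval
  0.515 * 0 +      # cls_creditsone credit: she teaches a multi credit course
  0.065 * 2.5      # bty_avg
# about 3.64, in line with the predict() output of 3.6
```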