class: center, middle, inverse, title-slide

# Prediction and model validation

### Dr. Çetinkaya-Rundel

### October 24, 2017

---

class: center, middle

# Getting started

---

## Getting started

- Peer evaluations due Wednesday 5pm
- New teams on Thursday

---

class: center, middle

# Model selection

---

## Data: Course evals

```r
library(tidyverse) # ggplot2 + dplyr + readr + some others
library(broom)
```

```r
# Load data
evals <- read_csv("data/evals-mod.csv")

# Calculate bty_avg
evals <- evals %>%
  rowwise() %>%
  mutate(bty_avg = mean(c(bty_f1lower, bty_f1upper, bty_f2upper,
                          bty_m1lower, bty_m1upper, bty_m2upper))) %>%
  ungroup()
```

---

## Full model

<div class="question">
What percent of the variability in evaluation scores is explained by the model?
</div>

```r
full_model <- lm(score ~ rank + ethnicity + gender + language + age +
                   cls_perc_eval + cls_did_eval + cls_students + cls_level +
                   cls_profs + cls_credits + bty_avg, data = evals)
glance(full_model)$r.squared
```

```
## [1] 0.1644867
```

```r
glance(full_model)$adj.r.squared
```

```
## [1] 0.1402959
```

---

## Akaike Information Criterion

$$ AIC = -2\log(L) + 2k $$

- `\(L\)`: likelihood of the model, i.e. the likelihood of seeing these data given the model parameters
- Applies a penalty for the number of parameters in the model, `\(k\)`
- Used for model selection: the lower the AIC, the better
- The value is not informative on its own
- In R, use `AIC(model)` to get the AIC

```r
AIC(full_model)
```

```
## [1] 695.7457
```
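---

## AIC by hand

Where does that number come from? Here is a minimal sketch of the formula above, assuming the `full_model` fit from the previous slide. `logLik()` gives the `\(\log(L)\)` term, and for `lm` models the parameter count `\(k\)` includes the estimated error variance:

```r
# AIC = -2 log(L) + 2k, with k = number of coefficients + 1 (error variance)
k <- length(coef(full_model)) + 1
-2 * as.numeric(logLik(full_model)) + 2 * k
```

This should reproduce the value reported by `AIC(full_model)`.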
---

## Model selection -- a little faster

```r
selected_model <- step(full_model, direction = "backward")
```

```r
AIC(selected_model)
```

```
## [1] 687.5712
```

---

## Parsimony

<div class="question">
Take a look at the variables in the full and the selected model. Can you guess why some of them may have been dropped? Remember: We like parsimonious models.
</div>

```r
full_model$call
```

```
## lm(formula = score ~ rank + ethnicity + gender + language + age + 
##     cls_perc_eval + cls_did_eval + cls_students + cls_level + 
##     cls_profs + cls_credits + bty_avg, data = evals)
```

```r
selected_model$call
```

```
## lm(formula = score ~ ethnicity + gender + language + age + cls_perc_eval + 
##     cls_credits + bty_avg, data = evals)
```

---

class: center, middle

# Prediction

---

## New observation

To make a prediction for a new observation we need to create a data frame with that observation.

<div class="question">
Suppose we want to make a prediction for a 35-year-old white woman professor who received her education in an English-speaking country and who teaches a multi-credit course. 80% of the students in her classes tend to fill out evaluations, and she's average looking (beauty score = 2.5).
<br><br>
The following won't work. Why? How would you correct it?
</div>

```r
new_prof <- data_frame(ethnicity = "white",
                       sex = "woman",
                       language = "English",
                       age = 35,
                       cls_perc_eval = 0.80,
                       cls_credits = "multi-credit",
                       bty_avg = 2.5)
```

---

## New observation, corrected

```r
new_prof <- data_frame(ethnicity = "not minority",
                       gender = "female",
                       language = "english",
                       age = 35,
                       cls_perc_eval = 0.80,
                       cls_credits = "multi credit",
                       bty_avg = 2.5)
```

---

## Prediction

```r
predict(selected_model, newdata = new_prof)
```

```
##        1 
## 3.642981
```
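A point estimate alone doesn't convey uncertainty. As a follow-up sketch (using the same `selected_model` and `new_prof` as above), `predict()` for `lm` models can also return a prediction interval:

```r
# 95% prediction interval for the new observation;
# the result has columns fit, lwr, and upr
predict(selected_model, newdata = new_prof,
        interval = "prediction", level = 0.95)
```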
---

class: center, middle

# Model validation

---

## Model validation

- One commonly used model validation technique is to partition your data into a training set and a testing set
- That is, fit the model on the training data
- And test it on the testing data
- Evaluate model performance using `\(RMSE\)`, root mean squared error

$$ RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} $$

<div class="question">
Do you think we should prefer low or high RMSE?
</div>

---

## Random sampling and reproducibility

Gotta set a seed!

```r
set.seed(25102017)
```

- Use different seeds from each other
- Need inspiration? https://www.random.org/

---

## Add an identifier column: `index`

```r
evals <- evals %>%
  mutate(index = 1:nrow(evals))
```

---

## 80% training, 20% testing

.pull-left[
Create training dataset:

```r
training_data <- evals %>%
  sample_frac(size = 0.8)
nrow(training_data)
```

```
## [1] 370
```
]

.pull-right[
Create testing dataset:

```r
testing_data <- evals %>%
  anti_join(training_data, by = "index")
nrow(testing_data)
```

```
## [1] 93
```
]

Check:

```r
(nrow(training_data) + nrow(testing_data)) == nrow(evals)
```

```
## [1] TRUE
```

---

## Your turn!

```r
# load packages
library(tidyverse)
library(broom)

# set seed
set.seed(123) # insert some value for the seed

# add index
evals <- evals %>%
  mutate(index = 1:nrow(evals))

# training data
training_data <- evals %>%
  sample_frac(size = 0.8)
nrow(training_data)

# testing data
testing_data <- evals %>%
  anti_join(training_data, by = "index")
nrow(testing_data)

# check
(nrow(training_data) + nrow(testing_data)) == nrow(evals)
```

---

## Step 1: Fit full model to training data

<div class="question">
Compare the output of your full model with your neighbors'. Are the parameter estimates the same? Different? Very different?
</div>

.small[
```r
full <- lm(score ~ rank + ethnicity + gender + language + age +
             cls_perc_eval + cls_did_eval + cls_students + cls_level +
             cls_profs + cls_credits + bty_avg, data = training_data)
full
```

```
## 
## Call:
## lm(formula = score ~ rank + ethnicity + gender + language + age + 
##     cls_perc_eval + cls_did_eval + cls_students + cls_level + 
##     cls_profs + cls_credits + bty_avg, data = training_data)
## 
## Coefficients:
##           (Intercept)       ranktenure track            ranktenured  
##             3.6748141             -0.1778738             -0.1075087  
## ethnicitynot minority             gendermale    languagenon-english  
##             0.1543111              0.1835562             -0.2465718  
##                   age          cls_perc_eval           cls_did_eval  
##            -0.0066084              0.0042989              0.0016229  
##          cls_students         cls_levelupper        cls_profssingle  
##            -0.0007173              0.0165106              0.0169324  
## cls_creditsone credit                bty_avg  
##             0.5098201              0.0691804
```

```r
glance(full)$r.squared
```

```
## [1] 0.1865635
```
]

---

## Step 2: Model selection

```r
selected <- step(full, direction = "backward")
```

<div class="question">
Compare your selected model with your teammates' selected models. Did you all end up with the same variables?
</div>

```r
selected$call
```

```
## lm(formula = score ~ rank + ethnicity + gender + language + age + 
##     cls_perc_eval + cls_credits + bty_avg, data = training_data)
```

---

## Step 3: Prediction and RMSE

Calculate predicted evaluation scores:

```r
y_hat <- predict(selected, newdata = testing_data)
```

Calculate RMSE:

.small[
```r
(rmse <- sqrt(sum((testing_data$score - y_hat)^2) / nrow(testing_data)))
```

```
## [1] 0.5140652
```
]

<div class="question">
What is your RMSE? How does it compare to others'?
</div>

---

## RMSE on training vs. testing

<div class="question">
Would you expect the RMSE to be higher for your training data or your testing data? Why?
</div>

---

## RMSE on training vs. testing

RMSE for testing:

.small[
```r
sqrt(sum((testing_data$score - y_hat)^2) / nrow(testing_data))
```

```
## [1] 0.5140652
```
]

RMSE for training:

.small[
```r
sqrt(sum((training_data$score - selected$fitted.values)^2) / nrow(training_data))
```

```
## [1] 0.4958672
```
]
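---

## Aside: RMSE as a function

The two calculations above differ only in their inputs, so they can be written once as a helper (a sketch; the `rmse` name below is ours, not part of the original code):

```r
# RMSE = sqrt(sum((y - y_hat)^2) / n), i.e. sqrt(mean((y - y_hat)^2))
rmse <- function(y, y_hat) {
  sqrt(mean((y - y_hat)^2))
}

rmse(testing_data$score, y_hat)                    # testing RMSE
rmse(training_data$score, selected$fitted.values)  # training RMSE
```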