Sta112FS 11. Model selection, prediction, and model validation

October 1, 2015

Today's agenda

Answer any project related questions
Recap App Ex 5
Model selection, prediction, and model validation
- Model selection with step and AIC
- Prediction
- Model validation
Before leaving class: make appoinment to meet with me as a team during my Monday OH
Due Tuesday: Work on project!

Prediction

Setup

Load packages:

library(ggplot2)
library(dplyr)
library(nycflights13) # may need to install first

Load data from nycflights13 package:

data(flights)

Take a look at the `flights` data

str(flights)

## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  16 variables:
##  $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ dep_delay: num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ arr_delay: num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier  : chr  "UA" "UA" "AA" "B6" ...
##  $ tailnum  : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ flight   : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ origin   : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest     : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance : num  1400 1416 1089 1576 762 ...
##  $ hour     : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ minute   : num  17 33 42 44 54 54 55 57 57 58 ...

Goal: Predict airline delays - `arr_delay`

Describe the distribution of arrival delays. What does arr_delay = 0 mean?

ggplot(data = flights, aes(x = arr_delay)) +
  geom_histogram() +
  xlab("arrival delay, mins")

Only consider flights that are actually delayed

Let's say "actually delayed" means delayed more than 1 minute:

delayed_flights <- flights %>%
  filter(arr_delay > 1)

There are 127929 such flights in the sample:

nrow(delayed_flights)

## [1] 127929

Log transformation?

ggplot(data = delayed_flights, 
       aes(x = arr_delay)) +
  geom_histogram() +
  xlab("arrival delay, mins")

ggplot(data = delayed_flights, 
       aes(x = log(arr_delay))) +
  geom_histogram() +
  xlab("logged arrival delay, log(mins)")

Get a feeling for the data

On average flights are delayed 42 minutes. The median is 22 minutes, and the middle 50% of flights are delayed 9 to 53 minutes. The flight that is the most delayed has arrived 21.2 hours late at its destination.

delayed_flights %>%
  summarise(mean_mins = mean(arr_delay), median_mins = median(arr_delay), 
            Q1_mins = quantile(arr_delay, 0.25), Q3_mins = quantile(arr_delay, 0.75),
            max_hrs = max(arr_delay)/60)

## Source: local data frame [1 x 5]
## 
##   mean_mins median_mins Q1_mins Q3_mins max_hrs
## 1  41.90324          22       9      53    21.2

Candidate variables

Variables to exclude: year since they're all 2013, dep_delay since that would be cheating, tailnum and flight since they are not very informative, dep_time since it is redundent with below, and arr_time and air_time since we wouldn't know these before taking the flight
Create a new variable from hour and minute, \((hour \times 60 + minute)\), so that a linear increase of by a minute in departure time can be easily interpreted

delayed_flights <- delayed_flights %>% mutate(time_in_mins = hour * 60 + minute)
ggplot(data = delayed_flights, aes(x = time_in_mins)) + geom_histogram()

Fit full model

full_model <- lm(log(arr_delay) ~ month + day + carrier + origin + dest + 
                   distance + time_in_mins, data = delayed_flights)
summary(full_model)$r.squared

## [1] 0.09497813

Akaike Information Criterion

\[ AIC = -2log(L) + 2k \]

\(L\): likelihood of the model, i.e. likelihood of seeing these data given the model
parameters
Applies a penalty for number of parameters in the model, \(k\)
Used for model selection, lower the better
- Value is not informative on its own
In R to get the AIC, use AIC(model)

AIC(full_model)

## [1] 395172

Model selection – a little faster

final_model <- step(full_model, direction = "backward")

AIC(final_model)

## [1] 395169.2

Parsimony

Take a look at the variables in the full and the final model. Can you guess why some of them may have been dropped? Remember: We like parsimonous models.

full_model$call

## lm(formula = log(arr_delay) ~ month + day + carrier + origin + 
##     dest + distance + time_in_mins, data = delayed_flights)

final_model$call

## lm(formula = log(arr_delay) ~ month + carrier + dest + distance + 
##     time_in_mins, data = delayed_flights)

Prediction

New observation

Suppose I'm taking a flight from JFK to RDU, in October, with Jet Blue (B6), and I'm taking the flight at 4pm (\(16 * 60 + 0 = 960\)).

mine_flight <- data.frame(month = 10, carrier = "B6", origin = "JFK", dest = "RDU", distance = 427, time_in_mins = 960)

Prediction

exp(predict(final_model, newdata = mine_flight))

##        1 
## 23.40069

Model validation

One commonly used model validation technique is to partition your data into training and testing set
That is, fit the model on the training data
And test it on the testing data
Evaluate model performance using \(RMSE\), root-mean squared error

\[ \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} \]

Do you think we should prefer low or high RMSE?

Random sampling and reproducibility

Gotta set a seed!

set.seed(10012015)

Don't reuse your seeds
Use different seeds from each other
Need inspiration? https://www.random.org/

80% training, 20% testing

Find index numbers for the observations to be included in the trainining dataset:

n <- nrow(delayed_flights)
training_index <- sample(1:n, size = n * 0.80, replace = FALSE)

Create training dataset:

training_data <- delayed_flights %>%
  slice(training_index)
nrow(training_data)

## [1] 102343

Create testing dataset:

testing_data <- delayed_flights %>%
  slice(-training_index)
nrow(testing_data)

## [1] 25586

Your turn!

library(ggplot2); library(dplyr); library(nycflights13)
set.seed( ) #insert some value for the seed

# delayed flights
delayed_flights <- flights %>%
  filter(arr_delay > 1)

# time_in_mins
delayed_flights <- delayed_flights 
  %>% mutate(time_in_mins = hour * 60 + minute)

# training indices
n <- nrow(delayed_flights)
training_index <- sample(1:n, size = n * 0.80, replace = FALSE)

# training data
training_data <- delayed_flights %>%
  slice(training_index)
nrow(training_data)

# testing data
testing_data <- delayed_flights %>%
  slice(-training_index)
nrow(testing_data)

Step 1: Fit the full model on your training data

full <- lm(log(arr_delay) ~ month + day + carrier + origin + dest + 
                   distance + time_in_mins, data = training_data)

Compare the output of your full model with neighbors. Are the parameter estimates the same? Different? Very different?

Step 2: Model selection

final <- step(full, direction = "backward")

## Start:  AIC=25649.11
## log(arr_delay) ~ month + day + carrier + origin + dest + distance + 
##     time_in_mins
## 
##                 Df Sum of Sq    RSS   AIC
## - origin         2       1.9 131176 25647
## - distance       1       1.1 131175 25648
## - day            1       1.2 131175 25648
## <none>                       131174 25649
## - month          1      24.3 131198 25666
## - dest         102     441.2 131615 25789
## - carrier       15     784.3 131958 26229
## - time_in_mins   1   11227.4 142402 34052
## 
## Step:  AIC=25646.6
## log(arr_delay) ~ month + day + carrier + dest + distance + time_in_mins
## 
##                 Df Sum of Sq    RSS   AIC
## - day            1       1.2 131177 25646
## <none>                       131176 25647
## - distance       1       8.3 131184 25651
## - month          1      23.7 131200 25663
## - dest         102     454.5 131631 25797
## - carrier       15     825.6 132002 26259
## - time_in_mins   1   11234.0 142410 34054
## 
## Step:  AIC=25645.53
## log(arr_delay) ~ month + carrier + dest + distance + time_in_mins
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       131177 25646
## - distance       1       8.3 131185 25650
## - month          1      23.9 131201 25662
## - dest         102     454.5 131632 25795
## - carrier       15     825.5 132003 26258
## - time_in_mins   1   11236.5 142414 34055

Compare your final model with mine. Do we have the same variables chosen?

final$call

## lm(formula = log(arr_delay) ~ month + carrier + dest + distance + 
##     time_in_mins, data = training_data)

Step 3: Prediction and RMSE

Calculate predicted delay time, and don't forget to exponentiate!

log_y_hat <- predict(final, newdata = testing_data)
y_hat <- exp(log_y_hat)

Calculate RMSE

rmse <- sqrt(sum((y_hat - testing_data$arr_delay)^2) / nrow(testing_data))
rmse

## [1] 54.62043

What is your RMSE? How does it compare to others'?

Today's agenda

Today's agenda

Prediction

Setup

Take a look at the flights data

Goal: Predict airline delays - arr_delay

Only consider flights that are actually delayed

Log transformation?

Get a feeling for the data

Candidate variables

Fit full model

Akaike Information Criterion

Model selection – a little faster

Parsimony

Prediction

New observation

Prediction

Model validation

Model validation

Random sampling and reproducibility

80% training, 20% testing

Your turn!

Step 1: Fit the full model on your training data

Step 2: Model selection

Step 3: Prediction and RMSE

Take a look at the `flights` data

Goal: Predict airline delays - `arr_delay`