October 1, 2015

Today's agenda

Today's agenda

  • Answer any project related questions

  • Recap App Ex 5

  • Model selection, prediction, and model validation
    • Model selection with step and AIC
    • Prediction
    • Model validation
  • Before leaving class: make appoinment to meet with me as a team during my Monday OH

  • Due Tuesday: Work on project!

Prediction

Setup

Load packages:

library(ggplot2)
library(dplyr)
library(nycflights13) # may need to install first

Load data from nycflights13 package:

data(flights)

Take a look at the flights data

str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  16 variables:
##  $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ dep_delay: num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ arr_delay: num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier  : chr  "UA" "UA" "AA" "B6" ...
##  $ tailnum  : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ flight   : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ origin   : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest     : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance : num  1400 1416 1089 1576 762 ...
##  $ hour     : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ minute   : num  17 33 42 44 54 54 55 57 57 58 ...

Goal: Predict airline delays - arr_delay

Describe the distribution of arrival delays. What does arr_delay = 0 mean?
ggplot(data = flights, aes(x = arr_delay)) +
  geom_histogram() +
  xlab("arrival delay, mins")

Only consider flights that are actually delayed

Let's say "actually delayed" means delayed more than 1 minute:

delayed_flights <- flights %>%
  filter(arr_delay > 1)

There are 127929 such flights in the sample:

nrow(delayed_flights)
## [1] 127929

Log transformation?

ggplot(data = delayed_flights, 
       aes(x = arr_delay)) +
  geom_histogram() +
  xlab("arrival delay, mins")

ggplot(data = delayed_flights, 
       aes(x = log(arr_delay))) +
  geom_histogram() +
  xlab("logged arrival delay, log(mins)")

Get a feeling for the data

On average flights are delayed 42 minutes. The median is 22 minutes, and the middle 50% of flights are delayed 9 to 53 minutes. The flight that is the most delayed has arrived 21.2 hours late at its destination.

delayed_flights %>%
  summarise(mean_mins = mean(arr_delay), median_mins = median(arr_delay), 
            Q1_mins = quantile(arr_delay, 0.25), Q3_mins = quantile(arr_delay, 0.75),
            max_hrs = max(arr_delay)/60)
## Source: local data frame [1 x 5]
## 
##   mean_mins median_mins Q1_mins Q3_mins max_hrs
## 1  41.90324          22       9      53    21.2

Candidate variables

  • Variables to exclude: year since they're all 2013, dep_delay since that would be cheating, tailnum and flight since they are not very informative, dep_time since it is redundent with below, and arr_time and air_time since we wouldn't know these before taking the flight

  • Create a new variable from hour and minute, \((hour \times 60 + minute)\), so that a linear increase of by a minute in departure time can be easily interpreted

delayed_flights <- delayed_flights %>% mutate(time_in_mins = hour * 60 + minute)
ggplot(data = delayed_flights, aes(x = time_in_mins)) + geom_histogram()

Fit full model

full_model <- lm(log(arr_delay) ~ month + day + carrier + origin + dest + 
                   distance + time_in_mins, data = delayed_flights)
summary(full_model)$r.squared
## [1] 0.09497813

Akaike Information Criterion

\[ AIC = -2log(L) + 2k \]

  • \(L\): likelihood of the model, i.e. likelihood of seeing these data given the model
    parameters
  • Applies a penalty for number of parameters in the model, \(k\)
  • Used for model selection, lower the better
    • Value is not informative on its own
  • In R to get the AIC, use AIC(model)
AIC(full_model)
## [1] 395172

Model selection – a little faster

final_model <- step(full_model, direction = "backward")
AIC(final_model)
## [1] 395169.2

Parsimony

Take a look at the variables in the full and the final model. Can you guess why some of them may have been dropped? Remember: We like parsimonous models.
full_model$call
## lm(formula = log(arr_delay) ~ month + day + carrier + origin + 
##     dest + distance + time_in_mins, data = delayed_flights)
final_model$call
## lm(formula = log(arr_delay) ~ month + carrier + dest + distance + 
##     time_in_mins, data = delayed_flights)

Prediction

New observation

Suppose I'm taking a flight from JFK to RDU, in October, with Jet Blue (B6), and I'm taking the flight at 4pm (\(16 * 60 + 0 = 960\)).

mine_flight <- data.frame(month = 10, carrier = "B6", origin = "JFK", dest = "RDU", distance = 427, time_in_mins = 960)

Prediction

exp(predict(final_model, newdata = mine_flight))
##        1 
## 23.40069

Model validation

Model validation

  • One commonly used model validation technique is to partition your data into training and testing set

  • That is, fit the model on the training data

  • And test it on the testing data

  • Evaluate model performance using \(RMSE\), root-mean squared error

\[ \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} \]

Do you think we should prefer low or high RMSE?

Random sampling and reproducibility

Gotta set a seed!

set.seed(10012015)

80% training, 20% testing

Find index numbers for the observations to be included in the trainining dataset:

n <- nrow(delayed_flights)
training_index <- sample(1:n, size = n * 0.80, replace = FALSE)


Create training dataset:

training_data <- delayed_flights %>%
  slice(training_index)
nrow(training_data)
## [1] 102343


Create testing dataset:

testing_data <- delayed_flights %>%
  slice(-training_index)
nrow(testing_data)
## [1] 25586

Your turn!

library(ggplot2); library(dplyr); library(nycflights13)
set.seed( ) #insert some value for the seed

# delayed flights
delayed_flights <- flights %>%
  filter(arr_delay > 1)

# time_in_mins
delayed_flights <- delayed_flights 
  %>% mutate(time_in_mins = hour * 60 + minute)

# training indices
n <- nrow(delayed_flights)
training_index <- sample(1:n, size = n * 0.80, replace = FALSE)

# training data
training_data <- delayed_flights %>%
  slice(training_index)
nrow(training_data)

# testing data
testing_data <- delayed_flights %>%
  slice(-training_index)
nrow(testing_data)

Step 1: Fit the full model on your training data

full <- lm(log(arr_delay) ~ month + day + carrier + origin + dest + 
                   distance + time_in_mins, data = training_data)
Compare the output of your full model with neighbors. Are the parameter estimates the same? Different? Very different?

Step 2: Model selection

final <- step(full, direction = "backward")
## Start:  AIC=25649.11
## log(arr_delay) ~ month + day + carrier + origin + dest + distance + 
##     time_in_mins
## 
##                 Df Sum of Sq    RSS   AIC
## - origin         2       1.9 131176 25647
## - distance       1       1.1 131175 25648
## - day            1       1.2 131175 25648
## <none>                       131174 25649
## - month          1      24.3 131198 25666
## - dest         102     441.2 131615 25789
## - carrier       15     784.3 131958 26229
## - time_in_mins   1   11227.4 142402 34052
## 
## Step:  AIC=25646.6
## log(arr_delay) ~ month + day + carrier + dest + distance + time_in_mins
## 
##                 Df Sum of Sq    RSS   AIC
## - day            1       1.2 131177 25646
## <none>                       131176 25647
## - distance       1       8.3 131184 25651
## - month          1      23.7 131200 25663
## - dest         102     454.5 131631 25797
## - carrier       15     825.6 132002 26259
## - time_in_mins   1   11234.0 142410 34054
## 
## Step:  AIC=25645.53
## log(arr_delay) ~ month + carrier + dest + distance + time_in_mins
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       131177 25646
## - distance       1       8.3 131185 25650
## - month          1      23.9 131201 25662
## - dest         102     454.5 131632 25795
## - carrier       15     825.5 132003 26258
## - time_in_mins   1   11236.5 142414 34055
Compare your final model with mine. Do we have the same variables chosen?
final$call
## lm(formula = log(arr_delay) ~ month + carrier + dest + distance + 
##     time_in_mins, data = training_data)

Step 3: Prediction and RMSE

Calculate predicted delay time, and don't forget to exponentiate!

log_y_hat <- predict(final, newdata = testing_data)
y_hat <- exp(log_y_hat)

Calculate RMSE

rmse <- sqrt(sum((y_hat - testing_data$arr_delay)^2) / nrow(testing_data))
rmse
## [1] 54.62043
What is your RMSE? How does it compare to others'?