October 1, 2015

Answer any project related questions

Recap App Ex 5

- Model selection, prediction, and model validation
  - Model selection with `step` and AIC
  - Prediction
  - Model validation

Before leaving class: make an appointment to meet with me as a team during my Monday OH

**Due Tuesday:** Work on project!

Load packages:

```r
library(ggplot2)
library(dplyr)
library(nycflights13)  # may need to install first
```

Load data from the `nycflights13` package:

```r
data(flights)
```

Take a look at the `flights` data:

```r
str(flights)
```

```
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 16 variables:
##  $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ dep_delay: num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ arr_delay: num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier  : chr  "UA" "UA" "AA" "B6" ...
##  $ tailnum  : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ flight   : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ origin   : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest     : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance : num  1400 1416 1089 1576 762 ...
##  $ hour     : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ minute   : num  17 33 42 44 54 54 55 57 57 58 ...
```

Describe the distribution of arrival delays (`arr_delay`). What does `arr_delay = 0` mean?

```r
ggplot(data = flights, aes(x = arr_delay)) +
  geom_histogram() +
  xlab("arrival delay, mins")
```

Let's say "actually delayed" means delayed more than 1 minute:

```r
delayed_flights <- flights %>%
  filter(arr_delay > 1)
```

There are 127929 such flights in the sample:

```r
nrow(delayed_flights)

## [1] 127929
```

```r
ggplot(data = delayed_flights, aes(x = arr_delay)) +
  geom_histogram() +
  xlab("arrival delay, mins")
```

```r
ggplot(data = delayed_flights, aes(x = log(arr_delay))) +
  geom_histogram() +
  xlab("logged arrival delay, log(mins)")
```

On average, delayed flights arrive 42 minutes late. The median delay is 22 minutes, and the middle 50% of flights are delayed between 9 and 53 minutes. The most delayed flight arrived at its destination 21.2 hours late.

```r
delayed_flights %>%
  summarise(mean_mins = mean(arr_delay),
            median_mins = median(arr_delay),
            Q1_mins = quantile(arr_delay, 0.25),
            Q3_mins = quantile(arr_delay, 0.75),
            max_hrs = max(arr_delay) / 60)

## Source: local data frame [1 x 5]
##
##   mean_mins median_mins Q1_mins Q3_mins max_hrs
## 1  41.90324          22       9      53    21.2
```

Variables to exclude:

- `year`, since they're all 2013
- `dep_delay`, since that would be cheating
- `tailnum` and `flight`, since they are not very informative
- `dep_time`, since it is redundant with `hour` and `minute` below
- `arr_time` and `air_time`, since we wouldn't know these before taking the flight

Create a new variable from `hour` and `minute`, \((hour \times 60 + minute)\), so that a one-minute increase in departure time can be easily interpreted:

```r
delayed_flights <- delayed_flights %>%
  mutate(time_in_mins = hour * 60 + minute)

ggplot(data = delayed_flights, aes(x = time_in_mins)) +
  geom_histogram()
```

```r
full_model <- lm(log(arr_delay) ~ month + day + carrier + origin + dest +
                   distance + time_in_mins,
                 data = delayed_flights)
summary(full_model)$r.squared

## [1] 0.09497813
```

\[ AIC = -2 \log(L) + 2k \]

- \(L\): likelihood of the model, i.e. the likelihood of seeing these data given the model parameters
- Applies a penalty for the number of parameters in the model, \(k\)
- Used for model selection; the lower the AIC, the better
- The value is not informative on its own
- In R, use `AIC(model)` to get the AIC

```r
AIC(full_model)

## [1] 395172
```
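As a sanity check on the formula above, AIC can also be computed by hand from the model's log-likelihood. A minimal sketch using the built-in `mtcars` data as a stand-in (not the flights model):

```r
# Fit a small linear model on built-in data
m <- lm(mpg ~ wt + hp, data = mtcars)

# k counts all estimated parameters: the coefficients plus the error variance
k <- length(coef(m)) + 1

# AIC = -2 * log(L) + 2 * k
aic_by_hand <- -2 * as.numeric(logLik(m)) + 2 * k

all.equal(aic_by_hand, AIC(m))  # TRUE
```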

```r
final_model <- step(full_model, direction = "backward")
```

```r
AIC(final_model)

## [1] 395169.2
```

Take a look at the variables in the full and the final model. Can you guess why some of them may have been dropped? Remember: we like parsimonious models.

```r
full_model$call

## lm(formula = log(arr_delay) ~ month + day + carrier + origin +
##     dest + distance + time_in_mins, data = delayed_flights)
```

```r
final_model$call

## lm(formula = log(arr_delay) ~ month + carrier + dest + distance +
##     time_in_mins, data = delayed_flights)
```

Suppose I'm taking a flight from JFK to RDU in October with JetBlue (`B6`), and I'm taking the flight at 4pm (\(16 \times 60 + 0 = 960\)).

```r
mine_flight <- data.frame(month = 10, carrier = "B6", origin = "JFK",
                          dest = "RDU", distance = 427, time_in_mins = 960)

exp(predict(final_model, newdata = mine_flight))

##        1
## 23.40069
```
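Note that because the model was fit to `log(arr_delay)`, `predict()` returns a value on the log scale, and `exp()` converts it back to minutes. A minimal sketch with synthetic data (the variable names here are illustrative, not the flights data):

```r
set.seed(1)
# Synthetic positive response with multiplicative noise
d <- data.frame(x = 1:50)
d$y <- exp(0.5 + 0.02 * d$x + rnorm(50, sd = 0.1))

m <- lm(log(y) ~ x, data = d)

# predict() is on the log scale; exp() returns to the original units
log_pred <- predict(m, newdata = data.frame(x = 25))
exp(log_pred)  # close to exp(0.5 + 0.02 * 25) = exp(1)
```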

One commonly used model validation technique is to partition your data into a training set and a testing set: fit the model on the training data, then test it on the testing data.

Evaluate model performance using RMSE, the root mean squared error:

\[ RMSE = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} \]

Do you think we should prefer low or high RMSE?
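The formula above can be wrapped in a small helper function. A minimal sketch with toy vectors (the function name `compute_rmse` and the numbers are illustrative only, not part of the class code):

```r
# Root mean squared error between observed y and predicted y_hat
compute_rmse <- function(y, y_hat) {
  sqrt(mean((y - y_hat)^2))
}

# Toy example: predictions off by 3, 0, and -3 minutes
y     <- c(10, 20, 30)
y_hat <- c(13, 20, 27)
compute_rmse(y, y_hat)  # sqrt((9 + 0 + 9) / 3) = sqrt(6), about 2.45
```

`mean()` here divides by \(n\), matching the formula.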

Gotta set a seed!

```r
set.seed(10012015)
```

- Don't reuse your seeds
- Use different seeds from each other
- Need inspiration? https://www.random.org/

Find index numbers for the observations to be included in the training dataset:

```r
n <- nrow(delayed_flights)
training_index <- sample(1:n, size = n * 0.80, replace = FALSE)
```

Create training dataset:

```r
training_data <- delayed_flights %>%
  slice(training_index)
nrow(training_data)

## [1] 102343
```

Create testing dataset:

```r
testing_data <- delayed_flights %>%
  slice(-training_index)
nrow(testing_data)

## [1] 25586
```

```r
library(ggplot2); library(dplyr); library(nycflights13)

set.seed( )  # insert some value for the seed

# delayed flights
delayed_flights <- flights %>%
  filter(arr_delay > 1)

# time_in_mins
delayed_flights <- delayed_flights %>%
  mutate(time_in_mins = hour * 60 + minute)

# training indices
n <- nrow(delayed_flights)
training_index <- sample(1:n, size = n * 0.80, replace = FALSE)

# training data
training_data <- delayed_flights %>%
  slice(training_index)
nrow(training_data)

# testing data
testing_data <- delayed_flights %>%
  slice(-training_index)
nrow(testing_data)
```

```r
full <- lm(log(arr_delay) ~ month + day + carrier + origin + dest +
             distance + time_in_mins,
           data = training_data)
```

Compare the output of your full model with neighbors. Are the parameter estimates the same? Different? Very different?

```r
final <- step(full, direction = "backward")

## Start:  AIC=25649.11
## log(arr_delay) ~ month + day + carrier + origin + dest + distance +
##     time_in_mins
##
##                  Df Sum of Sq    RSS   AIC
## - origin          2       1.9 131176 25647
## - distance        1       1.1 131175 25648
## - day             1       1.2 131175 25648
## <none>                        131174 25649
## - month           1      24.3 131198 25666
## - dest          102     441.2 131615 25789
## - carrier        15     784.3 131958 26229
## - time_in_mins    1   11227.4 142402 34052
##
## Step:  AIC=25646.6
## log(arr_delay) ~ month + day + carrier + dest + distance + time_in_mins
##
##                  Df Sum of Sq    RSS   AIC
## - day             1       1.2 131177 25646
## <none>                        131176 25647
## - distance        1       8.3 131184 25651
## - month           1      23.7 131200 25663
## - dest          102     454.5 131631 25797
## - carrier        15     825.6 132002 26259
## - time_in_mins    1   11234.0 142410 34054
##
## Step:  AIC=25645.53
## log(arr_delay) ~ month + carrier + dest + distance + time_in_mins
##
##                  Df Sum of Sq    RSS   AIC
## <none>                        131177 25646
## - distance        1       8.3 131185 25650
## - month           1      23.9 131201 25662
## - dest          102     454.5 131632 25795
## - carrier        15     825.5 132003 26258
## - time_in_mins    1   11236.5 142414 34055
```

Compare your final model with mine. Do we have the same variables chosen?

```r
final$call

## lm(formula = log(arr_delay) ~ month + carrier + dest + distance +
##     time_in_mins, data = training_data)
```

Calculate predicted delay time, and don't forget to exponentiate!

log_y_hat <- predict(final, newdata = testing_data) y_hat <- exp(log_y_hat)

Calculate RMSE:

```r
rmse <- sqrt(sum((y_hat - testing_data$arr_delay)^2) / nrow(testing_data))
rmse

## [1] 54.62043
```

What is your RMSE? How does it compare to others'?