October 1, 2015
Answer any project related questions
Recap App Ex 5
step
and AICBefore leaving class: make appoinment to meet with me as a team during my Monday OH
Due Tuesday: Work on project!
Load packages:
library(ggplot2) library(dplyr) library(nycflights13) # may need to install first
Load data from nycflights13
package:
data(flights)
flights
datastr(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 16 variables: ## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ... ## $ month : int 1 1 1 1 1 1 1 1 1 1 ... ## $ day : int 1 1 1 1 1 1 1 1 1 1 ... ## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ... ## $ dep_delay: num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ... ## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ... ## $ arr_delay: num 11 20 33 -18 -25 12 19 -14 -8 8 ... ## $ carrier : chr "UA" "UA" "AA" "B6" ... ## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ... ## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ... ## $ origin : chr "EWR" "LGA" "JFK" "JFK" ... ## $ dest : chr "IAH" "IAH" "MIA" "BQN" ... ## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ... ## $ distance : num 1400 1416 1089 1576 762 ... ## $ hour : num 5 5 5 5 5 5 5 5 5 5 ... ## $ minute : num 17 33 42 44 54 54 55 57 57 58 ...
arr_delay
arr_delay = 0
mean?ggplot(data = flights, aes(x = arr_delay)) + geom_histogram() + xlab("arrival delay, mins")
Let's say "actually delayed" means delayed more than 1 minute:
delayed_flights <- flights %>% filter(arr_delay > 1)
There are 127929 such flights in the sample:
nrow(delayed_flights)
## [1] 127929
ggplot(data = delayed_flights, aes(x = arr_delay)) + geom_histogram() + xlab("arrival delay, mins")
ggplot(data = delayed_flights, aes(x = log(arr_delay))) + geom_histogram() + xlab("logged arrival delay, log(mins)")
On average flights are delayed 42 minutes. The median is 22 minutes, and the middle 50% of flights are delayed 9 to 53 minutes. The flight that is the most delayed has arrived 21.2 hours late at its destination.
delayed_flights %>% summarise(mean_mins = mean(arr_delay), median_mins = median(arr_delay), Q1_mins = quantile(arr_delay, 0.25), Q3_mins = quantile(arr_delay, 0.75), max_hrs = max(arr_delay)/60)
## Source: local data frame [1 x 5] ## ## mean_mins median_mins Q1_mins Q3_mins max_hrs ## 1 41.90324 22 9 53 21.2
Variables to exclude: year
since they're all 2013, dep_delay
since that would be cheating, tailnum
and flight
since they are not very informative, dep_time
since it is redundent with below, and arr_time
and air_time
since we wouldn't know these before taking the flight
Create a new variable from hour
and minute
, \((hour \times 60 + minute)\), so that a linear increase of by a minute in departure time can be easily interpreted
delayed_flights <- delayed_flights %>% mutate(time_in_mins = hour * 60 + minute) ggplot(data = delayed_flights, aes(x = time_in_mins)) + geom_histogram()
full_model <- lm(log(arr_delay) ~ month + day + carrier + origin + dest + distance + time_in_mins, data = delayed_flights) summary(full_model)$r.squared
## [1] 0.09497813
\[ AIC = -2log(L) + 2k \]
AIC(model)
AIC(full_model)
## [1] 395172
final_model <- step(full_model, direction = "backward")
AIC(final_model)
## [1] 395169.2
full_model$call
## lm(formula = log(arr_delay) ~ month + day + carrier + origin + ## dest + distance + time_in_mins, data = delayed_flights)
final_model$call
## lm(formula = log(arr_delay) ~ month + carrier + dest + distance + ## time_in_mins, data = delayed_flights)
Suppose I'm taking a flight from JFK to RDU, in October, with Jet Blue (B6
), and I'm taking the flight at 4pm (\(16 * 60 + 0 = 960\)).
mine_flight <- data.frame(month = 10, carrier = "B6", origin = "JFK", dest = "RDU", distance = 427, time_in_mins = 960)
exp(predict(final_model, newdata = mine_flight))
## 1 ## 23.40069
One commonly used model validation technique is to partition your data into training and testing set
That is, fit the model on the training data
And test it on the testing data
Evaluate model performance using \(RMSE\), root-mean squared error
\[ \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n}} \]
Gotta set a seed!
set.seed(10012015)
Don't reuse your seeds
Use different seeds from each other
Need inspiration? https://www.random.org/
Find index numbers for the observations to be included in the trainining dataset:
n <- nrow(delayed_flights) training_index <- sample(1:n, size = n * 0.80, replace = FALSE)
Create training dataset:
training_data <- delayed_flights %>% slice(training_index) nrow(training_data)
## [1] 102343
Create testing dataset:
testing_data <- delayed_flights %>% slice(-training_index) nrow(testing_data)
## [1] 25586
library(ggplot2); library(dplyr); library(nycflights13) set.seed( ) #insert some value for the seed # delayed flights delayed_flights <- flights %>% filter(arr_delay > 1) # time_in_mins delayed_flights <- delayed_flights %>% mutate(time_in_mins = hour * 60 + minute) # training indices n <- nrow(delayed_flights) training_index <- sample(1:n, size = n * 0.80, replace = FALSE) # training data training_data <- delayed_flights %>% slice(training_index) nrow(training_data) # testing data testing_data <- delayed_flights %>% slice(-training_index) nrow(testing_data)
full <- lm(log(arr_delay) ~ month + day + carrier + origin + dest + distance + time_in_mins, data = training_data)
final <- step(full, direction = "backward")
## Start: AIC=25649.11 ## log(arr_delay) ~ month + day + carrier + origin + dest + distance + ## time_in_mins ## ## Df Sum of Sq RSS AIC ## - origin 2 1.9 131176 25647 ## - distance 1 1.1 131175 25648 ## - day 1 1.2 131175 25648 ## <none> 131174 25649 ## - month 1 24.3 131198 25666 ## - dest 102 441.2 131615 25789 ## - carrier 15 784.3 131958 26229 ## - time_in_mins 1 11227.4 142402 34052 ## ## Step: AIC=25646.6 ## log(arr_delay) ~ month + day + carrier + dest + distance + time_in_mins ## ## Df Sum of Sq RSS AIC ## - day 1 1.2 131177 25646 ## <none> 131176 25647 ## - distance 1 8.3 131184 25651 ## - month 1 23.7 131200 25663 ## - dest 102 454.5 131631 25797 ## - carrier 15 825.6 132002 26259 ## - time_in_mins 1 11234.0 142410 34054 ## ## Step: AIC=25645.53 ## log(arr_delay) ~ month + carrier + dest + distance + time_in_mins ## ## Df Sum of Sq RSS AIC ## <none> 131177 25646 ## - distance 1 8.3 131185 25650 ## - month 1 23.9 131201 25662 ## - dest 102 454.5 131632 25795 ## - carrier 15 825.5 132003 26258 ## - time_in_mins 1 11236.5 142414 34055
final$call
## lm(formula = log(arr_delay) ~ month + carrier + dest + distance + ## time_in_mins, data = training_data)
Calculate predicted delay time, and don't forget to exponentiate!
log_y_hat <- predict(final, newdata = testing_data) y_hat <- exp(log_y_hat)
Calculate RMSE
rmse <- sqrt(sum((y_hat - testing_data$arr_delay)^2) / nrow(testing_data)) rmse
## [1] 54.62043