Pipes

magrittr

    

You can think about the following sequence of actions - find key, unlock car, start car, drive to school, park.

Expressed as a set of nested functions in R pseudocode this would look like:

park(drive(start_car(find("keys")), to="campus"))

Writing it out using pipes give it a more natural (and easier to read) structure:

find("keys") %>%
    start_car() %>%
    drive(to="campus") %>%
    park()

Approaches

All of the following are fine, it comes down to personal preference:


Nested:

h( g( f(x), y=1), z=1 )

Piped:

f(x) %>% g(y=1) %>% h(z=1)

Intermediate:

res = f(x)
res = g(res, y=1)
res = h(res, z=1)

What about other arguments?

Sometimes we want to send our results to an function argument other than first one or we want to use the previous result for multiple arguments. In these cases we can refer to the previous result using ..

data.frame(a=1:3,b=3:1) %>% lm(a~b,data=.)
## 
## Call:
## lm(formula = a ~ b, data = .)
## 
## Coefficients:
## (Intercept)            b  
##           4           -1
data.frame(a=1:3,b=3:1) %>% .[[1]]
## [1] 1 2 3
data.frame(a=1:3,b=3:1) %>% .[[length(.)]]
## [1] 3 2 1

dplyr

A Grammar of Data Manipulation

dplyr is based on the concepts of functions as verbs that manipulate data frames.

Single data frame functions / verbs:

  • tbl_df(): add the tbl_df class
  • filter(): pick rows matching criteria
  • slice(): pick rows using index(es)
  • select(): pick columns by name
  • rename(): rename specific columns
  • arrange(): reorder rows
  • mutate(): add new variables
  • transmute(): create new data frame with variables
  • distinct(): filter for unique rows
  • sample_n() / sample_frac(): randomly sample rows
  • summarise(): reduce variables to values
  • … (many more)

dplyr rules for functions

  1. First argument is a data frame

  2. Subsequent arguments say what to do with data frame

  3. Always return a data frame

  4. Don't modify in place

  5. Performance via lazy evaluation

Example Data

We will demonstrate dplyr's functionality using the nycflights13 data.

suppressMessages(library(dplyr))
library(nycflights13)
flights = flights %>% tbl_df()
str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  19 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num  1400 1416 1089 1576 762 ...
##  $ hour          : num  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

tbl_df()?

flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>
## 1   2013     1     1      517            515         2      830            819        11      UA   1545
## 2   2013     1     1      533            529         4      850            830        20      UA   1714
## 3   2013     1     1      542            540         2      923            850        33      AA   1141
## 4   2013     1     1      544            545        -1     1004           1022       -18      B6    725
## 5   2013     1     1      554            600        -6      812            837       -25      DL    461
## 6   2013     1     1      554            558        -4      740            728        12      UA   1696
## 7   2013     1     1      555            600        -5      913            854        19      B6    507
## 8   2013     1     1      557            600        -3      709            723       -14      EV   5708
## 9   2013     1     1      557            600        -3      838            846        -8      B6     79
## 10  2013     1     1      558            600        -2      753            745         8      AA    301
## # ... with 336,766 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <time>

tbl_df() - Class?

class(flights)
## [1] "tbl_df"     "tbl"        "data.frame"

dplyr verbs

filter() - March flights

flights %>% filter(month == 3)
## # A tibble: 28,834 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>
## 1   2013     3     1        4           2159       125      318             56       142      B6     11
## 2   2013     3     1       50           2358        52      526            438        48      B6    707
## 3   2013     3     1      117           2245       152      223           2354       149      B6    608
## 4   2013     3     1      454            500        -6      633            648       -15      US   1117
## 5   2013     3     1      505            515       -10      746            810       -24      UA    475
## 6   2013     3     1      521            530        -9      813            827       -14      UA   1714
## 7   2013     3     1      537            540        -3      856            850         6      AA   1141
## 8   2013     3     1      541            545        -4     1014           1023        -9      B6    725
## 9   2013     3     1      549            600       -11      639            703       -24      US   2114
## 10  2013     3     1      550            600       -10      747            801       -14      EV   4911
## # ... with 28,824 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <time>

filter() - Flights in the first 7 days of March

flights %>% filter(month == 3, day <= 7)
## # A tibble: 6,530 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>
## 1   2013     3     1        4           2159       125      318             56       142      B6     11
## 2   2013     3     1       50           2358        52      526            438        48      B6    707
## 3   2013     3     1      117           2245       152      223           2354       149      B6    608
## 4   2013     3     1      454            500        -6      633            648       -15      US   1117
## 5   2013     3     1      505            515       -10      746            810       -24      UA    475
## 6   2013     3     1      521            530        -9      813            827       -14      UA   1714
## 7   2013     3     1      537            540        -3      856            850         6      AA   1141
## 8   2013     3     1      541            545        -4     1014           1023        -9      B6    725
## 9   2013     3     1      549            600       -11      639            703       -24      US   2114
## 10  2013     3     1      550            600       -10      747            801       -14      EV   4911
## # ... with 6,520 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <time>

filter() - Flights to LAX or RDU in March

flights %>% filter(dest == "LAX" | dest == "RDU", month==3)
## # A tibble: 1,935 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>
## 1   2013     3     1      607            610        -3      832            925       -53      UA    797
## 2   2013     3     1      608            615        -7      737            750       -13      MQ   4518
## 3   2013     3     1      623            630        -7      753            810       -17      EV   5667
## 4   2013     3     1      629            632        -3      844            952       -68      UA   1702
## 5   2013     3     1      657            700        -3      953           1034       -41      DL    763
## 6   2013     3     1      714            715        -1      939           1037       -58      B6    671
## 7   2013     3     1      716            710         6      958           1035       -37      VX    399
## 8   2013     3     1      727            730        -3     1007           1100       -53      AA     33
## 9   2013     3     1      803            810        -7      923            955       -32      MQ   4406
## 10  2013     3     1      823            824        -1      954           1014       -20      EV   4548
## # ... with 1,925 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <time>

slice() - First 10 flights

flights %>% slice(1:10)
## # A tibble: 10 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>
## 1   2013     1     1      517            515         2      830            819        11      UA   1545
## 2   2013     1     1      533            529         4      850            830        20      UA   1714
## 3   2013     1     1      542            540         2      923            850        33      AA   1141
## 4   2013     1     1      544            545        -1     1004           1022       -18      B6    725
## 5   2013     1     1      554            600        -6      812            837       -25      DL    461
## 6   2013     1     1      554            558        -4      740            728        12      UA   1696
## 7   2013     1     1      555            600        -5      913            854        19      B6    507
## 8   2013     1     1      557            600        -3      709            723       -14      EV   5708
## 9   2013     1     1      557            600        -3      838            846        -8      B6     79
## 10  2013     1     1      558            600        -2      753            745         8      AA    301
## # ... with 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <time>

slice() - Last 5 flights

flights %>% slice((n()-5):n())
## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>
## 1  2013     9    30       NA           1842        NA       NA           2019        NA      EV   5274
## 2  2013     9    30       NA           1455        NA       NA           1634        NA      9E   3393
## 3  2013     9    30       NA           2200        NA       NA           2312        NA      9E   3525
## 4  2013     9    30       NA           1210        NA       NA           1330        NA      MQ   3461
## 5  2013     9    30       NA           1159        NA       NA           1344        NA      MQ   3572
## 6  2013     9    30       NA            840        NA       NA           1020        NA      MQ   3531
## # ... with 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <time>

select() - Individual Columns

flights %>% select(year, month, day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
## 1   2013     1     1
## 2   2013     1     1
## 3   2013     1     1
## 4   2013     1     1
## 5   2013     1     1
## 6   2013     1     1
## 7   2013     1     1
## 8   2013     1     1
## 9   2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

select() - Exclude Columns

flights %>% select(-year, -month, -day)
## # A tibble: 336,776 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin  dest
##       <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>   <chr>  <chr> <chr>
## 1       517            515         2      830            819        11      UA   1545  N14228    EWR   IAH
## 2       533            529         4      850            830        20      UA   1714  N24211    LGA   IAH
## 3       542            540         2      923            850        33      AA   1141  N619AA    JFK   MIA
## 4       544            545        -1     1004           1022       -18      B6    725  N804JB    JFK   BQN
## 5       554            600        -6      812            837       -25      DL    461  N668DN    LGA   ATL
## 6       554            558        -4      740            728        12      UA   1696  N39463    EWR   ORD
## 7       555            600        -5      913            854        19      B6    507  N516JB    EWR   FLL
## 8       557            600        -3      709            723       -14      EV   5708  N829AS    LGA   IAD
## 9       557            600        -3      838            846        -8      B6     79  N593JB    JFK   MCO
## 10      558            600        -2      753            745         8      AA    301  N3ALAA    LGA   ORD
## # ... with 336,766 more rows, and 5 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <time>

select() - Ranges

flights %>% select(year:day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
## 1   2013     1     1
## 2   2013     1     1
## 3   2013     1     1
## 4   2013     1     1
## 5   2013     1     1
## 6   2013     1     1
## 7   2013     1     1
## 8   2013     1     1
## 9   2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

select() - Exclusion Ranges

flights %>% select(-(year:day))
## # A tibble: 336,776 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin  dest
##       <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>   <chr>  <chr> <chr>
## 1       517            515         2      830            819        11      UA   1545  N14228    EWR   IAH
## 2       533            529         4      850            830        20      UA   1714  N24211    LGA   IAH
## 3       542            540         2      923            850        33      AA   1141  N619AA    JFK   MIA
## 4       544            545        -1     1004           1022       -18      B6    725  N804JB    JFK   BQN
## 5       554            600        -6      812            837       -25      DL    461  N668DN    LGA   ATL
## 6       554            558        -4      740            728        12      UA   1696  N39463    EWR   ORD
## 7       555            600        -5      913            854        19      B6    507  N516JB    EWR   FLL
## 8       557            600        -3      709            723       -14      EV   5708  N829AS    LGA   IAD
## 9       557            600        -3      838            846        -8      B6     79  N593JB    JFK   MCO
## 10      558            600        -2      753            745         8      AA    301  N3ALAA    LGA   ORD
## # ... with 336,766 more rows, and 5 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <time>

select() - Matching

flights %>% select(contains("dep"), 
                   contains("arr"))
## # A tibble: 336,776 x 7
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>
## 1       517            515         2      830            819        11      UA
## 2       533            529         4      850            830        20      UA
## 3       542            540         2      923            850        33      AA
## 4       544            545        -1     1004           1022       -18      B6
## 5       554            600        -6      812            837       -25      DL
## 6       554            558        -4      740            728        12      UA
## 7       555            600        -5      913            854        19      B6
## 8       557            600        -3      709            723       -14      EV
## 9       557            600        -3      838            846        -8      B6
## 10      558            600        -2      753            745         8      AA
## # ... with 336,766 more rows
flights %>% select(starts_with("dep"), 
                   starts_with("arr"))
## # A tibble: 336,776 x 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
## 1       517         2      830        11
## 2       533         4      850        20
## 3       542         2      923        33
## 4       544        -1     1004       -18
## 5       554        -6      812       -25
## 6       554        -4      740        12
## 7       555        -5      913        19
## 8       557        -3      709       -14
## 9       557        -3      838        -8
## 10      558        -2      753         8
## # ... with 336,766 more rows

rename() - Change column names

flights %>% rename(tail_number = tailnum)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>
## 1   2013     1     1      517            515         2      830            819        11      UA   1545
## 2   2013     1     1      533            529         4      850            830        20      UA   1714
## 3   2013     1     1      542            540         2      923            850        33      AA   1141
## 4   2013     1     1      544            545        -1     1004           1022       -18      B6    725
## 5   2013     1     1      554            600        -6      812            837       -25      DL    461
## 6   2013     1     1      554            558        -4      740            728        12      UA   1696
## 7   2013     1     1      555            600        -5      913            854        19      B6    507
## 8   2013     1     1      557            600        -3      709            723       -14      EV   5708
## 9   2013     1     1      557            600        -3      838            846        -8      B6     79
## 10  2013     1     1      558            600        -2      753            745         8      AA    301
## # ... with 336,766 more rows, and 8 more variables: tail_number <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <time>

arrange() - Sort data

flights %>% filter(month==3,day==2) %>% arrange(origin, dest)
## # A tibble: 765 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>
## 1   2013     3     2     1336           1329         7     1426           1432        -6      EV   4263
## 2   2013     3     2      628            629        -1      837            849       -12      DL    575
## 3   2013     3     2      637            640        -3      903            915       -12      EV   4209
## 4   2013     3     2      743            745        -2      945           1010       -25      DL    807
## 5   2013     3     2      857            900        -3     1117           1126        -9      DL    485
## 6   2013     3     2     1027           1030        -3     1234           1247       -13      DL   2343
## 7   2013     3     2     1134           1145       -11     1332           1359       -27      DL    401
## 8   2013     3     2     1412           1415        -3     1636           1630         6      DL    935
## 9   2013     3     2     1633           1636        -3     1848           1908       -20      EV   3273
## 10  2013     3     2     1655           1700        -5     1857           1924       -27      DL   2042
## # ... with 755 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <time>

arrange() & desc() - Descending order

flights %>% filter(month==3,day==2) %>% arrange(desc(origin), dest) %>% select(origin, dest, tailnum)
## # A tibble: 765 x 3
##    origin  dest tailnum
##     <chr> <chr>   <chr>
## 1     LGA   ATL  N928AT
## 2     LGA   ATL  N623DL
## 3     LGA   ATL  N680DA
## 4     LGA   ATL  N996AT
## 5     LGA   ATL  N510MQ
## 6     LGA   ATL  N663DN
## 7     LGA   ATL  N942DL
## 8     LGA   ATL  N511MQ
## 9     LGA   ATL  N910DE
## 10    LGA   ATL  N902DE
## # ... with 755 more rows

mutate() - Modify columns

suppressMessages(library(lubridate))
flights %>% select(1:3) %>% mutate(date = paste(month,day,year,sep="/") %>% mdy())
## # A tibble: 336,776 x 4
##     year month   day       date
##    <int> <int> <int>     <date>
## 1   2013     1     1 2013-01-01
## 2   2013     1     1 2013-01-01
## 3   2013     1     1 2013-01-01
## 4   2013     1     1 2013-01-01
## 5   2013     1     1 2013-01-01
## 6   2013     1     1 2013-01-01
## 7   2013     1     1 2013-01-01
## 8   2013     1     1 2013-01-01
## 9   2013     1     1 2013-01-01
## 10  2013     1     1 2013-01-01
## # ... with 336,766 more rows

transmute() - Create new tibble from existing columns

flights %>% select(1:3) %>% transmute(date = paste(month,day,year,sep="/") %>% mdy())
## # A tibble: 336,776 x 1
##          date
##        <date>
## 1  2013-01-01
## 2  2013-01-01
## 3  2013-01-01
## 4  2013-01-01
## 5  2013-01-01
## 6  2013-01-01
## 7  2013-01-01
## 8  2013-01-01
## 9  2013-01-01
## 10 2013-01-01
## # ... with 336,766 more rows

distinct() - Find unique rows

flights %>% select(origin, dest) %>% distinct() %>% arrange(origin,dest)
## # A tibble: 224 x 2
##    origin  dest
##     <chr> <chr>
## 1     EWR   ALB
## 2     EWR   ANC
## 3     EWR   ATL
## 4     EWR   AUS
## 5     EWR   AVL
## 6     EWR   BDL
## 7     EWR   BNA
## 8     EWR   BOS
## 9     EWR   BQN
## 10    EWR   BTV
## # ... with 214 more rows

sample_n() - Sample rows

flights %>% sample_n(10)
## # A tibble: 10 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>
## 1   2013     1    18     1029           1030        -1     1251           1252        -1      DL   2343
## 2   2013     8    27      828            836        -8      931            949       -18      B6   2280
## 3   2013     1     9     1841           1855       -14       24           2245        NA      VX     29
## 4   2013     6    12      833            840        -7     1103           1100         3      MQ   3419
## 5   2013     8     5      624            630        -6      853            847         6      B6    211
## 6   2013     2     6     1156           1200        -4     1438           1457       -19      UA    394
## 7   2013    11    14     1750           1755        -5     2009           2015        -6      MQ   3713
## 8   2013    11     8     1137           1129         8     1420           1429        -9      AA   1139
## 9   2013     9    13     1013           1015        -2     1122           1139       -17      EV   4187
## 10  2013    11    18     1937           1930         7     2130           2047        43      EV   5714
## # ... with 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <time>

sample_frac() - Sample rows

flights %>% sample_frac(0.001)
## # A tibble: 337 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>
## 1   2013     1    10      847            855        -8     1111           1101        10      EV   4383
## 2   2013     1    25     2045           2045         0     2215           2154        21      B6   1178
## 3   2013    11    14     1025           1025         0     1311           1325       -14      AA   1131
## 4   2013     7    17      805            810        -5     1019           1020        -1      FL    346
## 5   2013     1    26     1451           1440        11     1746           1746         0      UA   1579
## 6   2013     4     4     1038           1044        -6     1344           1356       -12      UA   1183
## 7   2013     1    22     1713           1640        33     1912           1835        37      WN    372
## 8   2013    11     9     1652           1700        -8     1806           1837       -31      EV   5513
## 9   2013     3    11      855            900        -5     1211           1216        -5      DL    874
## 10  2013     4    11     1158           1200        -2     1447           1510       -23      AA    977
## # ... with 327 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <time>

summarise()

flights %>% mutate(date = paste(month,day,year,sep="/") %>% mdy()) %>% 
            summarize(n(), min(date), max(date))
## # A tibble: 1 x 3
##      n()  min(date)  max(date)
##    <int>     <date>     <date>
## 1 336776 2013-01-01 2013-12-31

group_by()

flights %>% group_by(origin)
## Source: local data frame [336,776 x 19]
## Groups: origin [3]
## 
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>
## 1   2013     1     1      517            515         2      830            819        11      UA   1545
## 2   2013     1     1      533            529         4      850            830        20      UA   1714
## 3   2013     1     1      542            540         2      923            850        33      AA   1141
## 4   2013     1     1      544            545        -1     1004           1022       -18      B6    725
## 5   2013     1     1      554            600        -6      812            837       -25      DL    461
## 6   2013     1     1      554            558        -4      740            728        12      UA   1696
## 7   2013     1     1      555            600        -5      913            854        19      B6    507
## 8   2013     1     1      557            600        -3      709            723       -14      EV   5708
## 9   2013     1     1      557            600        -3      838            846        -8      B6     79
## 10  2013     1     1      558            600        -2      753            745         8      AA    301
## # ... with 336,766 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <time>

summarise() with group_by()

flights %>% group_by(origin) %>%
            mutate(date = paste(month,day,year,sep="/") %>% mdy()) %>% 
            summarize(n(), min(date), max(date))
## # A tibble: 3 x 4
##   origin    n()  min(date)  max(date)
##    <chr>  <int>     <date>     <date>
## 1    EWR 120835 2013-01-01 2013-12-31
## 2    JFK 111279 2013-01-01 2013-12-31
## 3    LGA 104662 2013-01-01 2013-12-31

Demo 1

How many flights to Los Angeles (LAX) did each of the legacy carriers (AA, UA, DL or US) have in May from JFK, and what was their average duration?

Demo 2

Create a time series plot of each of the legacy carriers' average departure delay by day and origin airport.

Exercises

  1. Which plane (check the tail number) flew out of each New York airport the most?

  2. What was the shortest flight out of each airport in terms of distance? In terms of duration?

  3. Which date should you fly on if you want to have the lowest possible average departure delay? What about arrival delay?