dplyr


dplyr

A Grammar of Data Manipulation

dplyr is based on the concepts of functions as verbs that manipulate data frames.


Single data frame functions / verbs:

  • tbl_df(): add the tbl_df class
  • filter(): pick rows matching criteria
  • slice(): pick rows using index(es)
  • select(): pick columns by name
  • rename(): rename specific columns
  • arrange(): reorder rows
  • mutate(): add new variables
  • transmute(): create new data frame with variables
  • distinct(): filter for unique rows
  • sample_n() / sample_frac(): randomly sample rows
  • summarise(): reduce variables to values

dplyr rules for functions

  • First argument is a data frame

  • Subsequent arguments say what to do with data frame

  • Always return a data frame

  • Don’t modify in place

Example Data

We will demonstrate dplyr’s functionality using the nycflights13 data.

suppressMessages(library(dplyr))
library(nycflights13)

flights = flights %>% tbl_df()
str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  16 variables:
##  $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ dep_delay: num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ arr_delay: num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier  : chr  "UA" "UA" "AA" "B6" ...
##  $ tailnum  : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ flight   : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ origin   : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest     : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance : num  1400 1416 1089 1576 762 ...
##  $ hour     : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ minute   : num  17 33 42 44 54 54 55 57 57 58 ...

tbl_df()?

flights
## Source: local data frame [336,776 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     1     1      517         2      830        11      UA  N14228   1545    EWR   IAH      227
## 2   2013     1     1      533         4      850        20      UA  N24211   1714    LGA   IAH      227
## 3   2013     1     1      542         2      923        33      AA  N619AA   1141    JFK   MIA      160
## 4   2013     1     1      544        -1     1004       -18      B6  N804JB    725    JFK   BQN      183
## 5   2013     1     1      554        -6      812       -25      DL  N668DN    461    LGA   ATL      116
## 6   2013     1     1      554        -4      740        12      UA  N39463   1696    EWR   ORD      150
## 7   2013     1     1      555        -5      913        19      B6  N516JB    507    EWR   FLL      158
## 8   2013     1     1      557        -3      709       -14      EV  N829AS   5708    LGA   IAD       53
## 9   2013     1     1      557        -3      838        -8      B6  N593JB     79    JFK   MCO      140
## 10  2013     1     1      558        -2      753         8      AA  N3ALAA    301    LGA   ORD      138
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

filter() - March flights

flights %>% filter(month == 3)
## Source: local data frame [28,834 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     3     1        4       125      318       142      B6  N706JB     11    JFK   FLL      166
## 2   2013     3     1       50        52      526        48      B6  N794JB    707    JFK   SJU      198
## 3   2013     3     1      117       152      223       149      B6  N328JB    608    JFK   PWM       48
## 4   2013     3     1      454        -6      633       -15      US  N177US   1117    EWR   CLT       79
## 5   2013     3     1      505       -10      746       -24      UA  N527UA    475    EWR   IAH      199
## 6   2013     3     1      521        -9      813       -14      UA  N76523   1714    LGA   IAH      213
## 7   2013     3     1      537        -3      856         6      AA  N5EPAA   1141    JFK   MIA      173
## 8   2013     3     1      541        -4     1014        -9      B6  N653JB    725    JFK   BQN      191
## 9   2013     3     1      549       -11      639       -24      US  N749US   2114    LGA   BOS       31
## 10  2013     3     1      550       -10      747       -14      EV  N760EV   4911    EWR   DTW       89
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

filter() - Flights in the first 7 days of March

flights %>% filter(month == 3, day <= 7)
## Source: local data frame [6,530 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     3     1        4       125      318       142      B6  N706JB     11    JFK   FLL      166
## 2   2013     3     1       50        52      526        48      B6  N794JB    707    JFK   SJU      198
## 3   2013     3     1      117       152      223       149      B6  N328JB    608    JFK   PWM       48
## 4   2013     3     1      454        -6      633       -15      US  N177US   1117    EWR   CLT       79
## 5   2013     3     1      505       -10      746       -24      UA  N527UA    475    EWR   IAH      199
## 6   2013     3     1      521        -9      813       -14      UA  N76523   1714    LGA   IAH      213
## 7   2013     3     1      537        -3      856         6      AA  N5EPAA   1141    JFK   MIA      173
## 8   2013     3     1      541        -4     1014        -9      B6  N653JB    725    JFK   BQN      191
## 9   2013     3     1      549       -11      639       -24      US  N749US   2114    LGA   BOS       31
## 10  2013     3     1      550       -10      747       -14      EV  N760EV   4911    EWR   DTW       89
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

filter() - Flights to LAX or RDU in March

flights %>% filter(dest == "LAX" | dest == "RDU", month==3)
## Source: local data frame [1,935 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     3     1      607        -3      832       -53      UA  N557UA    797    JFK   LAX      292
## 2   2013     3     1      608        -7      737       -13      MQ  N723MQ   4518    LGA   RDU       64
## 3   2013     3     1      623        -7      753       -17      EV  N14959   5667    EWR   RDU       64
## 4   2013     3     1      629        -3      844       -68      UA  N77518   1702    EWR   LAX      294
## 5   2013     3     1      657        -3      953       -41      DL  N712TW    763    JFK   LAX      326
## 6   2013     3     1      714        -1      939       -58      B6  N804JB    671    JFK   LAX      297
## 7   2013     3     1      716         6      958       -37      VX  N636VA    399    JFK   LAX      315
## 8   2013     3     1      727        -3     1007       -53      AA  N319AA     33    JFK   LAX      315
## 9   2013     3     1      803        -7      923       -32      MQ  N853MQ   4406    JFK   RDU       67
## 10  2013     3     1      823        -1      954       -20      EV  N22909   4548    EWR   RDU       66
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

slice() - First 10 flights

flights %>% slice(1:10)
## Source: local data frame [10 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     1     1      517         2      830        11      UA  N14228   1545    EWR   IAH      227
## 2   2013     1     1      533         4      850        20      UA  N24211   1714    LGA   IAH      227
## 3   2013     1     1      542         2      923        33      AA  N619AA   1141    JFK   MIA      160
## 4   2013     1     1      544        -1     1004       -18      B6  N804JB    725    JFK   BQN      183
## 5   2013     1     1      554        -6      812       -25      DL  N668DN    461    LGA   ATL      116
## 6   2013     1     1      554        -4      740        12      UA  N39463   1696    EWR   ORD      150
## 7   2013     1     1      555        -5      913        19      B6  N516JB    507    EWR   FLL      158
## 8   2013     1     1      557        -3      709       -14      EV  N829AS   5708    LGA   IAD       53
## 9   2013     1     1      557        -3      838        -8      B6  N593JB     79    JFK   MCO      140
## 10  2013     1     1      558        -2      753         8      AA  N3ALAA    301    LGA   ORD      138
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

slice() - Last 5 flights

flights %>% slice((n()-5):n())
## Source: local data frame [6 x 16]
## 
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##   (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1  2013     9    30       NA        NA       NA        NA      EV  N740EV   5274    LGA   BNA       NA
## 2  2013     9    30       NA        NA       NA        NA      9E           3393    JFK   DCA       NA
## 3  2013     9    30       NA        NA       NA        NA      9E           3525    LGA   SYR       NA
## 4  2013     9    30       NA        NA       NA        NA      MQ  N535MQ   3461    LGA   BNA       NA
## 5  2013     9    30       NA        NA       NA        NA      MQ  N511MQ   3572    LGA   CLE       NA
## 6  2013     9    30       NA        NA       NA        NA      MQ  N839MQ   3531    LGA   RDU       NA
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

select() - Individual Columns

flights %>% select(year, month, day)
## Source: local data frame [336,776 x 3]
## 
##     year month   day
##    (int) (int) (int)
## 1   2013     1     1
## 2   2013     1     1
## 3   2013     1     1
## 4   2013     1     1
## 5   2013     1     1
## 6   2013     1     1
## 7   2013     1     1
## 8   2013     1     1
## 9   2013     1     1
## 10  2013     1     1
## ..   ...   ...   ...

select() - Exclude Columns

flights %>% select(-year, -month, -day)
## Source: local data frame [336,776 x 13]
## 
##    dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time distance  hour minute
##       (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)    (dbl) (dbl)  (dbl)
## 1       517         2      830        11      UA  N14228   1545    EWR   IAH      227     1400     5     17
## 2       533         4      850        20      UA  N24211   1714    LGA   IAH      227     1416     5     33
## 3       542         2      923        33      AA  N619AA   1141    JFK   MIA      160     1089     5     42
## 4       544        -1     1004       -18      B6  N804JB    725    JFK   BQN      183     1576     5     44
## 5       554        -6      812       -25      DL  N668DN    461    LGA   ATL      116      762     5     54
## 6       554        -4      740        12      UA  N39463   1696    EWR   ORD      150      719     5     54
## 7       555        -5      913        19      B6  N516JB    507    EWR   FLL      158     1065     5     55
## 8       557        -3      709       -14      EV  N829AS   5708    LGA   IAD       53      229     5     57
## 9       557        -3      838        -8      B6  N593JB     79    JFK   MCO      140      944     5     57
## 10      558        -2      753         8      AA  N3ALAA    301    LGA   ORD      138      733     5     58
## ..      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...      ...   ...    ...

select() - Ranges

flights %>% select(year:day)
## Source: local data frame [336,776 x 3]
## 
##     year month   day
##    (int) (int) (int)
## 1   2013     1     1
## 2   2013     1     1
## 3   2013     1     1
## 4   2013     1     1
## 5   2013     1     1
## 6   2013     1     1
## 7   2013     1     1
## 8   2013     1     1
## 9   2013     1     1
## 10  2013     1     1
## ..   ...   ...   ...

select() - Exclusion Ranges

flights %>% select(-(year:day))
## Source: local data frame [336,776 x 13]
## 
##    dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time distance  hour minute
##       (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)    (dbl) (dbl)  (dbl)
## 1       517         2      830        11      UA  N14228   1545    EWR   IAH      227     1400     5     17
## 2       533         4      850        20      UA  N24211   1714    LGA   IAH      227     1416     5     33
## 3       542         2      923        33      AA  N619AA   1141    JFK   MIA      160     1089     5     42
## 4       544        -1     1004       -18      B6  N804JB    725    JFK   BQN      183     1576     5     44
## 5       554        -6      812       -25      DL  N668DN    461    LGA   ATL      116      762     5     54
## 6       554        -4      740        12      UA  N39463   1696    EWR   ORD      150      719     5     54
## 7       555        -5      913        19      B6  N516JB    507    EWR   FLL      158     1065     5     55
## 8       557        -3      709       -14      EV  N829AS   5708    LGA   IAD       53      229     5     57
## 9       557        -3      838        -8      B6  N593JB     79    JFK   MCO      140      944     5     57
## 10      558        -2      753         8      AA  N3ALAA    301    LGA   ORD      138      733     5     58
## ..      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...      ...   ...    ...

select() - Matching

flights %>% select(contains("dep"), 
                   contains("arr"))
## Source: local data frame [336,776 x 5]
## 
##    dep_time dep_delay arr_time arr_delay carrier
##       (int)     (dbl)    (int)     (dbl)   (chr)
## 1       517         2      830        11      UA
## 2       533         4      850        20      UA
## 3       542         2      923        33      AA
## 4       544        -1     1004       -18      B6
## 5       554        -6      812       -25      DL
## 6       554        -4      740        12      UA
## 7       555        -5      913        19      B6
## 8       557        -3      709       -14      EV
## 9       557        -3      838        -8      B6
## 10      558        -2      753         8      AA
## ..      ...       ...      ...       ...     ...
flights %>% select(contains("dep"), 
                   starts_with("arr"))
## Source: local data frame [336,776 x 4]
## 
##    dep_time dep_delay arr_time arr_delay
##       (int)     (dbl)    (int)     (dbl)
## 1       517         2      830        11
## 2       533         4      850        20
## 3       542         2      923        33
## 4       544        -1     1004       -18
## 5       554        -6      812       -25
## 6       554        -4      740        12
## 7       555        -5      913        19
## 8       557        -3      709       -14
## 9       557        -3      838        -8
## 10      558        -2      753         8
## ..      ...       ...      ...       ...

rename()

flights %>% rename(tail_number = tailnum)
## Source: local data frame [336,776 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tail_number flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)       (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     1     1      517         2      830        11      UA      N14228   1545    EWR   IAH      227
## 2   2013     1     1      533         4      850        20      UA      N24211   1714    LGA   IAH      227
## 3   2013     1     1      542         2      923        33      AA      N619AA   1141    JFK   MIA      160
## 4   2013     1     1      544        -1     1004       -18      B6      N804JB    725    JFK   BQN      183
## 5   2013     1     1      554        -6      812       -25      DL      N668DN    461    LGA   ATL      116
## 6   2013     1     1      554        -4      740        12      UA      N39463   1696    EWR   ORD      150
## 7   2013     1     1      555        -5      913        19      B6      N516JB    507    EWR   FLL      158
## 8   2013     1     1      557        -3      709       -14      EV      N829AS   5708    LGA   IAD       53
## 9   2013     1     1      557        -3      838        -8      B6      N593JB     79    JFK   MCO      140
## 10  2013     1     1      558        -2      753         8      AA      N3ALAA    301    LGA   ORD      138
## ..   ...   ...   ...      ...       ...      ...       ...     ...         ...    ...    ...   ...      ...
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

arrange()

flights %>% filter(month==3,day==2) %>% arrange(origin, dest)
## Source: local data frame [765 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     3     2     1336         7     1426        -6      EV  N11535   4263    EWR   ALB       32
## 2   2013     3     2      628        -1      837       -12      DL  N357NB    575    EWR   ATL      109
## 3   2013     3     2      637        -3      903       -12      EV  N16919   4209    EWR   ATL      120
## 4   2013     3     2      743        -2      945       -25      DL  N338NB    807    EWR   ATL      106
## 5   2013     3     2      857        -3     1117        -9      DL  N306DQ    485    EWR   ATL      110
## 6   2013     3     2     1027        -3     1234       -13      DL  N301DQ   2343    EWR   ATL      109
## 7   2013     3     2     1134       -11     1332       -27      DL  N301NB    401    EWR   ATL      102
## 8   2013     3     2     1412        -3     1636         6      DL  N368NB    935    EWR   ATL      108
## 9   2013     3     2     1633        -3     1848       -20      EV  N16183   3273    EWR   ATL      108
## 10  2013     3     2     1655        -5     1857       -27      DL  N305DQ   2042    EWR   ATL      103
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

arrange() & desc()

flights %>% filter(month==3,day==2) %>% arrange(desc(origin), dest)
## Source: local data frame [765 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     3     2      555        -5      822         7      FL  N928AT    345    LGA   ATL      113
## 2   2013     3     2      556        -4      822        -7      DL  N623DL    461    LGA   ATL      110
## 3   2013     3     2      758        -2     1009       -22      DL  N680DA   2047    LGA   ATL      105
## 4   2013     3     2      808        -2     1022       -21      FL  N996AT    361    LGA   ATL      107
## 5   2013     3     2      828        -7     1050       -15      MQ  N510MQ   4610    LGA   ATL      110
## 6   2013     3     2      857        -3     1126       -16      DL  N663DN   1747    LGA   ATL      110
## 7   2013     3     2      957        -3     1224       -10      DL  N942DL   1847    LGA   ATL      105
## 8   2013     3     2     1158        -2     1411       -19      MQ  N511MQ   4658    LGA   ATL      105
## 9   2013     3     2     1159        -1     1429        -9      DL  N910DE   1947    LGA   ATL      107
## 10  2013     3     2     1259        -1     1510       -25      DL  N902DE    781    LGA   ATL      106
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

mutate()

library(lubridate)
## Loading required package: methods
flights %>% select(1:3) %>% mutate(date = paste(month,day,year,sep="/") %>% mdy())
## Source: local data frame [336,776 x 4]
## 
##     year month   day       date
##    (int) (int) (int)     (time)
## 1   2013     1     1 2013-01-01
## 2   2013     1     1 2013-01-01
## 3   2013     1     1 2013-01-01
## 4   2013     1     1 2013-01-01
## 5   2013     1     1 2013-01-01
## 6   2013     1     1 2013-01-01
## 7   2013     1     1 2013-01-01
## 8   2013     1     1 2013-01-01
## 9   2013     1     1 2013-01-01
## 10  2013     1     1 2013-01-01
## ..   ...   ...   ...        ...

transmute()

flights %>% select(1:3) %>% transmute(date = paste(month,day,year,sep="/") %>% mdy())
## Source: local data frame [336,776 x 1]
## 
##          date
##        (time)
## 1  2013-01-01
## 2  2013-01-01
## 3  2013-01-01
## 4  2013-01-01
## 5  2013-01-01
## 6  2013-01-01
## 7  2013-01-01
## 8  2013-01-01
## 9  2013-01-01
## 10 2013-01-01
## ..        ...

distinct()

flights %>% select(origin, dest) %>% distinct() %>% arrange(origin,dest)
## Source: local data frame [224 x 2]
## 
##    origin  dest
##     (chr) (chr)
## 1     EWR   ALB
## 2     EWR   ANC
## 3     EWR   ATL
## 4     EWR   AUS
## 5     EWR   AVL
## 6     EWR   BDL
## 7     EWR   BNA
## 8     EWR   BOS
## 9     EWR   BQN
## 10    EWR   BTV
## ..    ...   ...

sample_n()

flights %>% sample_n(10)
## Source: local data frame [10 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013    12    12     1936        31     2123        23      9E  N292PQ   2931    JFK   PIT       74
## 2   2013     3    12     2233        93     2331        71      MQ  N521MQ   3744    EWR   ORD      102
## 3   2013    11     6     1529        -1     1905         4      DL  N187DN    417    JFK   LAX      354
## 4   2013    11    18     1953        -7     2111        -6      9E  N8960A   2950    JFK   BWI       44
## 5   2013     8     1      818        -3      926       -12      UA  N17730   1199    EWR   BOS       40
## 6   2013     6    25     2058        90     2313       102      EV  N12569   4480    EWR   CLT       82
## 7   2013     8    30      813        -2     1058       -23      UA  N825UA    504    EWR   PDX      327
## 8   2013     1    29     1453        -7     1633       -22      MQ  N725MQ   4429    LGA   CMH       79
## 9   2013     5    23     1816       241     1941       226      AA  N4WNAA    337    LGA   ORD      116
## 10  2013     8    31     1054        -1     1433       -12      B6  N644JB    403    JFK   SJU      198
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

sample_frac()

flights %>% sample_frac(0.001)
## Source: local data frame [337 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     9     1      822        -7     1108       -31      AA  N3KNAA   1938    LGA   MIA      146
## 2   2013     2     7     2228       143      108       117      UA  N826UA    754    EWR   TPA      147
## 3   2013     8     1     1111       -14     1205       -26      B6  N236JB   2380    EWR   BOS       36
## 4   2013     8     3     1701         2     1834         2      UA  N71411   1190    EWR   ORD      127
## 5   2013     2    28      738        33      953        13      MQ  N737MQ   4534    LGA   XNA      179
## 6   2013    10    12     1706        -9     1935       -46      UA  N461UA    387    EWR   PDX      309
## 7   2013     5    17     1902        -3     2200       -24      B6  N635JB    111    JFK   FLL      140
## 8   2013     5    27     1506        -4     1742       -33      UA  N451UA    490    EWR   AUS      198
## 9   2013     8    21     1107        -8     1341       -23      UA  N512UA    703    JFK   LAX      313
## 10  2013    10    20     1802         2     2040        10      DL  N342NW    926    EWR   ATL      112
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

summarise()

flights %>% mutate(date = paste(month,day,year,sep="/") %>% mdy()) %>% 
            summarize(n(), min(date), max(date))
## Source: local data frame [1 x 3]
## 
##      n()  min(date)  max(date)
##    (int)     (time)     (time)
## 1 336776 2013-01-01 2013-12-31

group_by()

flights %>% group_by(origin)
## Source: local data frame [336,776 x 16]
## Groups: origin [3]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     1     1      517         2      830        11      UA  N14228   1545    EWR   IAH      227
## 2   2013     1     1      533         4      850        20      UA  N24211   1714    LGA   IAH      227
## 3   2013     1     1      542         2      923        33      AA  N619AA   1141    JFK   MIA      160
## 4   2013     1     1      544        -1     1004       -18      B6  N804JB    725    JFK   BQN      183
## 5   2013     1     1      554        -6      812       -25      DL  N668DN    461    LGA   ATL      116
## 6   2013     1     1      554        -4      740        12      UA  N39463   1696    EWR   ORD      150
## 7   2013     1     1      555        -5      913        19      B6  N516JB    507    EWR   FLL      158
## 8   2013     1     1      557        -3      709       -14      EV  N829AS   5708    LGA   IAD       53
## 9   2013     1     1      557        -3      838        -8      B6  N593JB     79    JFK   MCO      140
## 10  2013     1     1      558        -2      753         8      AA  N3ALAA    301    LGA   ORD      138
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

summarise() with group_by()

flights %>% group_by(origin) %>%
            mutate(date = paste(month,day,year,sep="/") %>% mdy()) %>% 
            summarize(n(), min(date), max(date))
## Source: local data frame [3 x 4]
## 
##   origin    n()  min(date)  max(date)
##    (chr)  (int)     (time)     (time)
## 1    EWR 120835 2013-01-01 2013-12-31
## 2    JFK 111279 2013-01-01 2013-12-31
## 3    LGA 104662 2013-01-01 2013-12-31

Exercises

  1. Which plane (check the tail number) flew out of each New York airport the most?

  2. What was the shortest flight out of each airport in terms of distance? In terms of duration?

  3. How many flights to Los Angeles (LAX) did each of the legacy carriers (AA, UA, DL or US) have in May from JFK, and what was their average duration?

  4. Which date should you fly on if you want to have the lowest possible average depature delay? What about arrival delay?

  5. Create a time series plot of each of the legacy carriers’ average departure delay by day and origin airport.