dplyr

A Grammar of Data Manipulation

dplyr is based on the concepts of functions as verbs that manipulate data frames.


Single data frame functions / verbs:

  • tbl_df(): add the tbl_df class
  • filter(): pick rows matching criteria
  • slice(): pick rows using index(es)
  • select(): pick columns by name
  • rename(): rename specific columns
  • arrange(): reorder rows
  • mutate(): add new variables
  • transmute(): create new data frame with variables
  • distinct(): filter for unique rows
  • sample_n() / sample_frac(): randomly sample rows
  • summarise(): reduce variables to values

dplyr rules for functions

  • First argument is a data frame

  • Subsequent arguments say what to do with data frame

  • Always return a data frame

  • Don't modify in place

Example Data

We will demonstrate dplyr's functionality using the nycflights13 data.

suppressMessages(library(dplyr))
library(nycflights13)

flights = flights %>% tbl_df()
str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  16 variables:
##  $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ dep_delay: num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ arr_delay: num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier  : chr  "UA" "UA" "AA" "B6" ...
##  $ tailnum  : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ flight   : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ origin   : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest     : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance : num  1400 1416 1089 1576 762 ...
##  $ hour     : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ minute   : num  17 33 42 44 54 54 55 57 57 58 ...

tbl_df()?

flights
## Source: local data frame [336,776 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     1     1      517         2      830        11      UA  N14228   1545    EWR   IAH      227
## 2   2013     1     1      533         4      850        20      UA  N24211   1714    LGA   IAH      227
## 3   2013     1     1      542         2      923        33      AA  N619AA   1141    JFK   MIA      160
## 4   2013     1     1      544        -1     1004       -18      B6  N804JB    725    JFK   BQN      183
## 5   2013     1     1      554        -6      812       -25      DL  N668DN    461    LGA   ATL      116
## 6   2013     1     1      554        -4      740        12      UA  N39463   1696    EWR   ORD      150
## 7   2013     1     1      555        -5      913        19      B6  N516JB    507    EWR   FLL      158
## 8   2013     1     1      557        -3      709       -14      EV  N829AS   5708    LGA   IAD       53
## 9   2013     1     1      557        -3      838        -8      B6  N593JB     79    JFK   MCO      140
## 10  2013     1     1      558        -2      753         8      AA  N3ALAA    301    LGA   ORD      138
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

filter() - March flights

flights %>% filter(month == 3)
## Source: local data frame [28,834 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     3     1        4       125      318       142      B6  N706JB     11    JFK   FLL      166
## 2   2013     3     1       50        52      526        48      B6  N794JB    707    JFK   SJU      198
## 3   2013     3     1      117       152      223       149      B6  N328JB    608    JFK   PWM       48
## 4   2013     3     1      454        -6      633       -15      US  N177US   1117    EWR   CLT       79
## 5   2013     3     1      505       -10      746       -24      UA  N527UA    475    EWR   IAH      199
## 6   2013     3     1      521        -9      813       -14      UA  N76523   1714    LGA   IAH      213
## 7   2013     3     1      537        -3      856         6      AA  N5EPAA   1141    JFK   MIA      173
## 8   2013     3     1      541        -4     1014        -9      B6  N653JB    725    JFK   BQN      191
## 9   2013     3     1      549       -11      639       -24      US  N749US   2114    LGA   BOS       31
## 10  2013     3     1      550       -10      747       -14      EV  N760EV   4911    EWR   DTW       89
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

filter() - Flights in the first 7 days of March

flights %>% filter(month == 3, day <= 7)
## Source: local data frame [6,530 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     3     1        4       125      318       142      B6  N706JB     11    JFK   FLL      166
## 2   2013     3     1       50        52      526        48      B6  N794JB    707    JFK   SJU      198
## 3   2013     3     1      117       152      223       149      B6  N328JB    608    JFK   PWM       48
## 4   2013     3     1      454        -6      633       -15      US  N177US   1117    EWR   CLT       79
## 5   2013     3     1      505       -10      746       -24      UA  N527UA    475    EWR   IAH      199
## 6   2013     3     1      521        -9      813       -14      UA  N76523   1714    LGA   IAH      213
## 7   2013     3     1      537        -3      856         6      AA  N5EPAA   1141    JFK   MIA      173
## 8   2013     3     1      541        -4     1014        -9      B6  N653JB    725    JFK   BQN      191
## 9   2013     3     1      549       -11      639       -24      US  N749US   2114    LGA   BOS       31
## 10  2013     3     1      550       -10      747       -14      EV  N760EV   4911    EWR   DTW       89
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

filter() - Flights to LAX or RDU in March

flights %>% filter(dest == "LAX" | dest == "RDU", month==3)
## Source: local data frame [1,935 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     3     1      607        -3      832       -53      UA  N557UA    797    JFK   LAX      292
## 2   2013     3     1      608        -7      737       -13      MQ  N723MQ   4518    LGA   RDU       64
## 3   2013     3     1      623        -7      753       -17      EV  N14959   5667    EWR   RDU       64
## 4   2013     3     1      629        -3      844       -68      UA  N77518   1702    EWR   LAX      294
## 5   2013     3     1      657        -3      953       -41      DL  N712TW    763    JFK   LAX      326
## 6   2013     3     1      714        -1      939       -58      B6  N804JB    671    JFK   LAX      297
## 7   2013     3     1      716         6      958       -37      VX  N636VA    399    JFK   LAX      315
## 8   2013     3     1      727        -3     1007       -53      AA  N319AA     33    JFK   LAX      315
## 9   2013     3     1      803        -7      923       -32      MQ  N853MQ   4406    JFK   RDU       67
## 10  2013     3     1      823        -1      954       -20      EV  N22909   4548    EWR   RDU       66
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

slice() - First 10 flights

flights %>% slice(1:10)
## Source: local data frame [10 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     1     1      517         2      830        11      UA  N14228   1545    EWR   IAH      227
## 2   2013     1     1      533         4      850        20      UA  N24211   1714    LGA   IAH      227
## 3   2013     1     1      542         2      923        33      AA  N619AA   1141    JFK   MIA      160
## 4   2013     1     1      544        -1     1004       -18      B6  N804JB    725    JFK   BQN      183
## 5   2013     1     1      554        -6      812       -25      DL  N668DN    461    LGA   ATL      116
## 6   2013     1     1      554        -4      740        12      UA  N39463   1696    EWR   ORD      150
## 7   2013     1     1      555        -5      913        19      B6  N516JB    507    EWR   FLL      158
## 8   2013     1     1      557        -3      709       -14      EV  N829AS   5708    LGA   IAD       53
## 9   2013     1     1      557        -3      838        -8      B6  N593JB     79    JFK   MCO      140
## 10  2013     1     1      558        -2      753         8      AA  N3ALAA    301    LGA   ORD      138
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

slice() - Last 5 flights

flights %>% slice((n()-5):n())
## Source: local data frame [6 x 16]
## 
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##   (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1  2013     9    30       NA        NA       NA        NA      EV  N740EV   5274    LGA   BNA       NA
## 2  2013     9    30       NA        NA       NA        NA      9E           3393    JFK   DCA       NA
## 3  2013     9    30       NA        NA       NA        NA      9E           3525    LGA   SYR       NA
## 4  2013     9    30       NA        NA       NA        NA      MQ  N535MQ   3461    LGA   BNA       NA
## 5  2013     9    30       NA        NA       NA        NA      MQ  N511MQ   3572    LGA   CLE       NA
## 6  2013     9    30       NA        NA       NA        NA      MQ  N839MQ   3531    LGA   RDU       NA
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

select() - Individual Columns

flights %>% select(year, month, day)
## Source: local data frame [336,776 x 3]
## 
##     year month   day
##    (int) (int) (int)
## 1   2013     1     1
## 2   2013     1     1
## 3   2013     1     1
## 4   2013     1     1
## 5   2013     1     1
## 6   2013     1     1
## 7   2013     1     1
## 8   2013     1     1
## 9   2013     1     1
## 10  2013     1     1
## ..   ...   ...   ...

select() - Exclude Columns

flights %>% select(-year, -month, -day)
## Source: local data frame [336,776 x 13]
## 
##    dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time distance  hour minute
##       (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)    (dbl) (dbl)  (dbl)
## 1       517         2      830        11      UA  N14228   1545    EWR   IAH      227     1400     5     17
## 2       533         4      850        20      UA  N24211   1714    LGA   IAH      227     1416     5     33
## 3       542         2      923        33      AA  N619AA   1141    JFK   MIA      160     1089     5     42
## 4       544        -1     1004       -18      B6  N804JB    725    JFK   BQN      183     1576     5     44
## 5       554        -6      812       -25      DL  N668DN    461    LGA   ATL      116      762     5     54
## 6       554        -4      740        12      UA  N39463   1696    EWR   ORD      150      719     5     54
## 7       555        -5      913        19      B6  N516JB    507    EWR   FLL      158     1065     5     55
## 8       557        -3      709       -14      EV  N829AS   5708    LGA   IAD       53      229     5     57
## 9       557        -3      838        -8      B6  N593JB     79    JFK   MCO      140      944     5     57
## 10      558        -2      753         8      AA  N3ALAA    301    LGA   ORD      138      733     5     58
## ..      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...      ...   ...    ...

select() - Ranges

flights %>% select(year:day)
## Source: local data frame [336,776 x 3]
## 
##     year month   day
##    (int) (int) (int)
## 1   2013     1     1
## 2   2013     1     1
## 3   2013     1     1
## 4   2013     1     1
## 5   2013     1     1
## 6   2013     1     1
## 7   2013     1     1
## 8   2013     1     1
## 9   2013     1     1
## 10  2013     1     1
## ..   ...   ...   ...

select() - Exclusion Ranges

flights %>% select(-(year:day))
## Source: local data frame [336,776 x 13]
## 
##    dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time distance  hour minute
##       (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)    (dbl) (dbl)  (dbl)
## 1       517         2      830        11      UA  N14228   1545    EWR   IAH      227     1400     5     17
## 2       533         4      850        20      UA  N24211   1714    LGA   IAH      227     1416     5     33
## 3       542         2      923        33      AA  N619AA   1141    JFK   MIA      160     1089     5     42
## 4       544        -1     1004       -18      B6  N804JB    725    JFK   BQN      183     1576     5     44
## 5       554        -6      812       -25      DL  N668DN    461    LGA   ATL      116      762     5     54
## 6       554        -4      740        12      UA  N39463   1696    EWR   ORD      150      719     5     54
## 7       555        -5      913        19      B6  N516JB    507    EWR   FLL      158     1065     5     55
## 8       557        -3      709       -14      EV  N829AS   5708    LGA   IAD       53      229     5     57
## 9       557        -3      838        -8      B6  N593JB     79    JFK   MCO      140      944     5     57
## 10      558        -2      753         8      AA  N3ALAA    301    LGA   ORD      138      733     5     58
## ..      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...      ...   ...    ...

select() - Matching

flights %>% select(contains("dep"), 
                   contains("arr"))
## Source: local data frame [336,776 x 5]
## 
##    dep_time dep_delay arr_time arr_delay carrier
##       (int)     (dbl)    (int)     (dbl)   (chr)
## 1       517         2      830        11      UA
## 2       533         4      850        20      UA
## 3       542         2      923        33      AA
## 4       544        -1     1004       -18      B6
## 5       554        -6      812       -25      DL
## 6       554        -4      740        12      UA
## 7       555        -5      913        19      B6
## 8       557        -3      709       -14      EV
## 9       557        -3      838        -8      B6
## 10      558        -2      753         8      AA
## ..      ...       ...      ...       ...     ...
flights %>% select(contains("dep"), 
                   starts_with("arr"))
## Source: local data frame [336,776 x 4]
## 
##    dep_time dep_delay arr_time arr_delay
##       (int)     (dbl)    (int)     (dbl)
## 1       517         2      830        11
## 2       533         4      850        20
## 3       542         2      923        33
## 4       544        -1     1004       -18
## 5       554        -6      812       -25
## 6       554        -4      740        12
## 7       555        -5      913        19
## 8       557        -3      709       -14
## 9       557        -3      838        -8
## 10      558        -2      753         8
## ..      ...       ...      ...       ...

rename()

flights %>% rename(tail_number = tailnum)
## Source: local data frame [336,776 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tail_number flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)       (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     1     1      517         2      830        11      UA      N14228   1545    EWR   IAH      227
## 2   2013     1     1      533         4      850        20      UA      N24211   1714    LGA   IAH      227
## 3   2013     1     1      542         2      923        33      AA      N619AA   1141    JFK   MIA      160
## 4   2013     1     1      544        -1     1004       -18      B6      N804JB    725    JFK   BQN      183
## 5   2013     1     1      554        -6      812       -25      DL      N668DN    461    LGA   ATL      116
## 6   2013     1     1      554        -4      740        12      UA      N39463   1696    EWR   ORD      150
## 7   2013     1     1      555        -5      913        19      B6      N516JB    507    EWR   FLL      158
## 8   2013     1     1      557        -3      709       -14      EV      N829AS   5708    LGA   IAD       53
## 9   2013     1     1      557        -3      838        -8      B6      N593JB     79    JFK   MCO      140
## 10  2013     1     1      558        -2      753         8      AA      N3ALAA    301    LGA   ORD      138
## ..   ...   ...   ...      ...       ...      ...       ...     ...         ...    ...    ...   ...      ...
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

arrange()

flights %>% filter(month==3,day==2) %>% arrange(origin, dest)
## Source: local data frame [765 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     3     2     1336         7     1426        -6      EV  N11535   4263    EWR   ALB       32
## 2   2013     3     2      628        -1      837       -12      DL  N357NB    575    EWR   ATL      109
## 3   2013     3     2      637        -3      903       -12      EV  N16919   4209    EWR   ATL      120
## 4   2013     3     2      743        -2      945       -25      DL  N338NB    807    EWR   ATL      106
## 5   2013     3     2      857        -3     1117        -9      DL  N306DQ    485    EWR   ATL      110
## 6   2013     3     2     1027        -3     1234       -13      DL  N301DQ   2343    EWR   ATL      109
## 7   2013     3     2     1134       -11     1332       -27      DL  N301NB    401    EWR   ATL      102
## 8   2013     3     2     1412        -3     1636         6      DL  N368NB    935    EWR   ATL      108
## 9   2013     3     2     1633        -3     1848       -20      EV  N16183   3273    EWR   ATL      108
## 10  2013     3     2     1655        -5     1857       -27      DL  N305DQ   2042    EWR   ATL      103
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

arrange() & desc()

flights %>% filter(month==3,day==2) %>% arrange(desc(origin), dest)
## Source: local data frame [765 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     3     2      555        -5      822         7      FL  N928AT    345    LGA   ATL      113
## 2   2013     3     2      556        -4      822        -7      DL  N623DL    461    LGA   ATL      110
## 3   2013     3     2      758        -2     1009       -22      DL  N680DA   2047    LGA   ATL      105
## 4   2013     3     2      808        -2     1022       -21      FL  N996AT    361    LGA   ATL      107
## 5   2013     3     2      828        -7     1050       -15      MQ  N510MQ   4610    LGA   ATL      110
## 6   2013     3     2      857        -3     1126       -16      DL  N663DN   1747    LGA   ATL      110
## 7   2013     3     2      957        -3     1224       -10      DL  N942DL   1847    LGA   ATL      105
## 8   2013     3     2     1158        -2     1411       -19      MQ  N511MQ   4658    LGA   ATL      105
## 9   2013     3     2     1159        -1     1429        -9      DL  N910DE   1947    LGA   ATL      107
## 10  2013     3     2     1259        -1     1510       -25      DL  N902DE    781    LGA   ATL      106
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

mutate()

library(lubridate)
## Loading required package: methods
flights %>% select(1:3) %>% mutate(date = paste(month,day,year,sep="/") %>% mdy())
## Source: local data frame [336,776 x 4]
## 
##     year month   day       date
##    (int) (int) (int)     (time)
## 1   2013     1     1 2013-01-01
## 2   2013     1     1 2013-01-01
## 3   2013     1     1 2013-01-01
## 4   2013     1     1 2013-01-01
## 5   2013     1     1 2013-01-01
## 6   2013     1     1 2013-01-01
## 7   2013     1     1 2013-01-01
## 8   2013     1     1 2013-01-01
## 9   2013     1     1 2013-01-01
## 10  2013     1     1 2013-01-01
## ..   ...   ...   ...        ...

transmute()

flights %>% select(1:3) %>% transmute(date = paste(month,day,year,sep="/") %>% mdy())
## Source: local data frame [336,776 x 1]
## 
##          date
##        (time)
## 1  2013-01-01
## 2  2013-01-01
## 3  2013-01-01
## 4  2013-01-01
## 5  2013-01-01
## 6  2013-01-01
## 7  2013-01-01
## 8  2013-01-01
## 9  2013-01-01
## 10 2013-01-01
## ..        ...

distinct()

flights %>% select(origin, dest) %>% distinct() %>% arrange(origin,dest)
## Source: local data frame [224 x 2]
## 
##    origin  dest
##     (chr) (chr)
## 1     EWR   ALB
## 2     EWR   ANC
## 3     EWR   ATL
## 4     EWR   AUS
## 5     EWR   AVL
## 6     EWR   BDL
## 7     EWR   BNA
## 8     EWR   BOS
## 9     EWR   BQN
## 10    EWR   BTV
## ..    ...   ...

sample_n()

flights %>% sample_n(10)
## Source: local data frame [10 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013    12     5      843        -2     1153        47      EV  N13133   4087    EWR   IND      116
## 2   2013     6    20     1536         7     1737       -23      UA  N75428   1107    EWR   DEN      211
## 3   2013     3    24     1610         2     1805       -10      9E  N935XJ   3436    JFK   RDU       81
## 4   2013     9     5     1350        -5     1451       -29      WN  N213WN    491    LGA   MDW      107
## 5   2013     4     4     1935         0     2228       -14      DL  N354NW   2370    LGA   PBI      155
## 6   2013     4    13      637        -7      855       -21      UA  N504UA    669    EWR   LAS      299
## 7   2013     3    13     1621        85     1721        78      EV  N14923   4372    EWR   DCA       40
## 8   2013     3    13     1957        22     2206        -1      EV  N13124   4333    EWR   TUL      168
## 9   2013    11     6     2130       -10       44         4      AA  N339AA    185    JFK   LAX      341
## 10  2013     9     9     1201        -9     1341        11      MQ  N527MQ   3461    LGA   BNA       97
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

sample_frac()

flights %>% sample_frac(0.001)
## Source: local data frame [337 x 16]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     4    14      621        -8      805       -21      US  N432US   1125    EWR   CLT       80
## 2   2013    11    17     1538         8     1805        12      DL  N773NC   1942    EWR   ATL      120
## 3   2013    11    23     1111        -8     1437         5      UA  N588UA    703    JFK   LAX      359
## 4   2013     8     2     1952        -8     2125       -15      MQ  N735MQ   3591    LGA   RDU       65
## 5   2013     9    20     1200        -5     1341        -9      MQ  N724MQ   3404    LGA   RDU       73
## 6   2013    10    15      829        24     1014        14      EV  N744EV   5242    LGA   GSO       69
## 7   2013     1    25       NA        NA       NA        NA      9E           3961    LGA   GRR       NA
## 8   2013     8    16     2141         1     2306        -4      MQ  N516MQ   3317    LGA   RDU       73
## 9   2013    12     8     1547         2     1925         5      AA  N335AA    133    JFK   LAX      371
## 10  2013     3     4      757        -7     1006       -20      EV  N12135   4157    EWR   OMA      170
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

summarise()

flights %>% mutate(date = paste(month,day,year,sep="/") %>% mdy()) %>% 
            summarize(n(), min(date), max(date))
## Source: local data frame [1 x 3]
## 
##      n()  min(date)  max(date)
##    (int)     (time)     (time)
## 1 336776 2013-01-01 2013-12-31

group_by()

flights %>% group_by(origin)
## Source: local data frame [336,776 x 16]
## Groups: origin [3]
## 
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time
##    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)  (int)  (chr) (chr)    (dbl)
## 1   2013     1     1      517         2      830        11      UA  N14228   1545    EWR   IAH      227
## 2   2013     1     1      533         4      850        20      UA  N24211   1714    LGA   IAH      227
## 3   2013     1     1      542         2      923        33      AA  N619AA   1141    JFK   MIA      160
## 4   2013     1     1      544        -1     1004       -18      B6  N804JB    725    JFK   BQN      183
## 5   2013     1     1      554        -6      812       -25      DL  N668DN    461    LGA   ATL      116
## 6   2013     1     1      554        -4      740        12      UA  N39463   1696    EWR   ORD      150
## 7   2013     1     1      555        -5      913        19      B6  N516JB    507    EWR   FLL      158
## 8   2013     1     1      557        -3      709       -14      EV  N829AS   5708    LGA   IAD       53
## 9   2013     1     1      557        -3      838        -8      B6  N593JB     79    JFK   MCO      140
## 10  2013     1     1      558        -2      753         8      AA  N3ALAA    301    LGA   ORD      138
## ..   ...   ...   ...      ...       ...      ...       ...     ...     ...    ...    ...   ...      ...
## Variables not shown: distance (dbl), hour (dbl), minute (dbl)

summarise() with group_by()

flights %>% group_by(origin) %>%
            mutate(date = paste(month,day,year,sep="/") %>% mdy()) %>% 
            summarize(n(), min(date), max(date))
## Source: local data frame [3 x 4]
## 
##   origin    n()  min(date)  max(date)
##    (chr)  (int)     (time)     (time)
## 1    EWR 120835 2013-01-01 2013-12-31
## 2    JFK 111279 2013-01-01 2013-12-31
## 3    LGA 104662 2013-01-01 2013-12-31

Exercises

  1. Which plane (check the tail number) flew out of each New York airport the most?

  2. What was the shortest flight out of each airport in terms of distance? In terms of duration?

  3. How many flights to Los Angeles (LAX) did each of the legacy carriers (AA, UA, DL or US) have in May from JFK, and what was their average duration?

  4. Which date should you fly on if you want to have the lowest possible average depature delay? What about arrival delay?

  5. Create a time series plot of each of the legacy carriers' average departure delay by day and origin airport.