Â
You can think about the following sequence of actions - find key, unlock car, start car, drive to school, park.
Expressed as a set of nested functions in R pseudocode this would look like:
park(drive(start_car(find("keys")), to="campus"))
Writing it out using pipes give it a more natural (and easier to read) structure:
find("keys") %>% start_car() %>% drive(to="campus") %>% park()
All of the following are fine, it comes down to personal preference:
Nested:
h( g( f(x), y=1), z=1 )
Piped:
f(x) %>% g(y=1) %>% h(z=1)
Intermediate:
res = f(x) res = g(res, y=1) res = h(res, z=1)
Sometimes we want to send our results to an function argument other than first one or we want to use the previous result for multiple arguments. In these cases we can refer to the previous result using .
.
data.frame(a=1:3,b=3:1) %>% lm(a~b,data=.)
## ## Call: ## lm(formula = a ~ b, data = .) ## ## Coefficients: ## (Intercept) b ## 4 -1
data.frame(a=1:3,b=3:1) %>% .[[1]]
## [1] 1 2 3
data.frame(a=1:3,b=3:1) %>% .[[length(.)]]
## [1] 3 2 1
Hadley Wickham has a package that modifies data frames to be more modern, or as he calls them surly and lazy.
library(tibble) class(iris)
## [1] "data.frame"
tbl_iris = as_tibble(iris) class(tbl_iris)
## [1] "tbl_df" "tbl" "data.frame"
tbl_iris
## # A tibble: 150 x 5 ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## <dbl> <dbl> <dbl> <dbl> <fctr> ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 7 4.6 3.4 1.4 0.3 setosa ## 8 5.0 3.4 1.5 0.2 setosa ## 9 4.4 2.9 1.4 0.2 setosa ## 10 4.9 3.1 1.5 0.1 setosa ## # ... with 140 more rows
tbl_iris[1,]
## # A tibble: 1 x 5 ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## <dbl> <dbl> <dbl> <dbl> <fctr> ## 1 5.1 3.5 1.4 0.2 setosa
tbl_iris[,"Species"]
## # A tibble: 150 x 1 ## Species ## <fctr> ## 1 setosa ## 2 setosa ## 3 setosa ## 4 setosa ## 5 setosa ## 6 setosa ## 7 setosa ## 8 setosa ## 9 setosa ## 10 setosa ## # ... with 140 more rows
data_frame(x = 1:3, y=c("A","B","C"))
## # A tibble: 3 x 2 ## x y ## <int> <chr> ## 1 1 A ## 2 2 B ## 3 3 C
tbl_iris[,"Name"]
## Error: Column `Name` not found
tbl_iris$Name
## Warning: Unknown or uninitialised column: 'Name'.
## NULL
tbl_iris[160,]
## # A tibble: 1 x 5 ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## <dbl> <dbl> <dbl> <dbl> <fctr> ## 1 NA NA NA NA <NA>
dplyr is based on the concepts of functions as verbs that manipulate data frames.
Single data frame functions / verbs:
tbl_df()
: add the tbl_df
classfilter()
: pick rows matching criteriaslice()
: pick rows using index(es)select()
: pick columns by namepull()
: grab a column as a vectorrename()
: rename specific columnsarrange()
: reorder rowsmutate()
: add new variablestransmute()
: create new data frame with variablesdistinct()
: filter for unique rowssample_n()
/ sample_frac()
: randomly sample rowssummarise()
: reduce variables to valuesFirst argument is always a data frame
Subsequent arguments say what to do with that data frame
Always\(^*\) return a data frame
Don't modify in place
Performance via lazy evaluation
We will demonstrate dplyr's functionality using the nycflights13 data.
suppressMessages(library(dplyr)) library(nycflights13)
flights = flights %>% tbl_df() str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 19 variables: ## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ... ## $ month : int 1 1 1 1 1 1 1 1 1 1 ... ## $ day : int 1 1 1 1 1 1 1 1 1 1 ... ## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ... ## $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ... ## $ dep_delay : num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ... ## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ... ## $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ... ## $ arr_delay : num 11 20 33 -18 -25 12 19 -14 -8 8 ... ## $ carrier : chr "UA" "UA" "AA" "B6" ... ## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ... ## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ... ## $ origin : chr "EWR" "LGA" "JFK" "JFK" ... ## $ dest : chr "IAH" "IAH" "MIA" "BQN" ... ## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ... ## $ distance : num 1400 1416 1089 1576 762 ... ## $ hour : num 5 5 5 5 6 5 6 6 6 6 ... ## $ minute : num 15 29 40 45 0 58 0 0 0 0 ... ## $ time_hour : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
flights
## # A tibble: 336,776 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 ## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 ## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 ## 7 2013 1 1 555 600 -5 913 854 19 B6 507 ## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 ## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 ## 10 2013 1 1 558 600 -2 753 745 8 AA 301 ## # ... with 336,766 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
class(flights)
## [1] "tbl_df" "tbl" "data.frame"
flights %>% filter(month == 3)
## # A tibble: 28,834 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 2013 3 1 4 2159 125 318 56 142 B6 11 ## 2 2013 3 1 50 2358 52 526 438 48 B6 707 ## 3 2013 3 1 117 2245 152 223 2354 149 B6 608 ## 4 2013 3 1 454 500 -6 633 648 -15 US 1117 ## 5 2013 3 1 505 515 -10 746 810 -24 UA 475 ## 6 2013 3 1 521 530 -9 813 827 -14 UA 1714 ## 7 2013 3 1 537 540 -3 856 850 6 AA 1141 ## 8 2013 3 1 541 545 -4 1014 1023 -9 B6 725 ## 9 2013 3 1 549 600 -11 639 703 -24 US 2114 ## 10 2013 3 1 550 600 -10 747 801 -14 EV 4911 ## # ... with 28,824 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(month == 3, day <= 7)
## # A tibble: 6,530 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 2013 3 1 4 2159 125 318 56 142 B6 11 ## 2 2013 3 1 50 2358 52 526 438 48 B6 707 ## 3 2013 3 1 117 2245 152 223 2354 149 B6 608 ## 4 2013 3 1 454 500 -6 633 648 -15 US 1117 ## 5 2013 3 1 505 515 -10 746 810 -24 UA 475 ## 6 2013 3 1 521 530 -9 813 827 -14 UA 1714 ## 7 2013 3 1 537 540 -3 856 850 6 AA 1141 ## 8 2013 3 1 541 545 -4 1014 1023 -9 B6 725 ## 9 2013 3 1 549 600 -11 639 703 -24 US 2114 ## 10 2013 3 1 550 600 -10 747 801 -14 EV 4911 ## # ... with 6,520 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(dest == "LAX" | dest == "RDU", month==3)
## # A tibble: 1,935 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 2013 3 1 607 610 -3 832 925 -53 UA 797 ## 2 2013 3 1 608 615 -7 737 750 -13 MQ 4518 ## 3 2013 3 1 623 630 -7 753 810 -17 EV 5667 ## 4 2013 3 1 629 632 -3 844 952 -68 UA 1702 ## 5 2013 3 1 657 700 -3 953 1034 -41 DL 763 ## 6 2013 3 1 714 715 -1 939 1037 -58 B6 671 ## 7 2013 3 1 716 710 6 958 1035 -37 VX 399 ## 8 2013 3 1 727 730 -3 1007 1100 -53 AA 33 ## 9 2013 3 1 803 810 -7 923 955 -32 MQ 4406 ## 10 2013 3 1 823 824 -1 954 1014 -20 EV 4548 ## # ... with 1,925 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% slice(1:10)
## # A tibble: 10 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 ## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 ## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 ## 7 2013 1 1 555 600 -5 913 854 19 B6 507 ## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 ## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 ## 10 2013 1 1 558 600 -2 753 745 8 AA 301 ## # ... with 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% slice((n()-5):n())
## # A tibble: 6 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 2013 9 30 NA 1842 NA NA 2019 NA EV 5274 ## 2 2013 9 30 NA 1455 NA NA 1634 NA 9E 3393 ## 3 2013 9 30 NA 2200 NA NA 2312 NA 9E 3525 ## 4 2013 9 30 NA 1210 NA NA 1330 NA MQ 3461 ## 5 2013 9 30 NA 1159 NA NA 1344 NA MQ 3572 ## 6 2013 9 30 NA 840 NA NA 1020 NA MQ 3531 ## # ... with 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% select(year, month, day)
## # A tibble: 336,776 x 3 ## year month day ## <int> <int> <int> ## 1 2013 1 1 ## 2 2013 1 1 ## 3 2013 1 1 ## 4 2013 1 1 ## 5 2013 1 1 ## 6 2013 1 1 ## 7 2013 1 1 ## 8 2013 1 1 ## 9 2013 1 1 ## 10 2013 1 1 ## # ... with 336,766 more rows
flights %>% select(-year, -month, -day)
## # A tibble: 336,776 x 16 ## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest ## <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> ## 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH ## 2 533 529 4 850 830 20 UA 1714 N24211 LGA IAH ## 3 542 540 2 923 850 33 AA 1141 N619AA JFK MIA ## 4 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN ## 5 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL ## 6 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD ## 7 555 600 -5 913 854 19 B6 507 N516JB EWR FLL ## 8 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD ## 9 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO ## 10 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD ## # ... with 336,766 more rows, and 5 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, ## # time_hour <dttm>
flights %>% select(year:day)
## # A tibble: 336,776 x 3 ## year month day ## <int> <int> <int> ## 1 2013 1 1 ## 2 2013 1 1 ## 3 2013 1 1 ## 4 2013 1 1 ## 5 2013 1 1 ## 6 2013 1 1 ## 7 2013 1 1 ## 8 2013 1 1 ## 9 2013 1 1 ## 10 2013 1 1 ## # ... with 336,766 more rows
flights %>% select(-(year:day))
## # A tibble: 336,776 x 16 ## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest ## <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> ## 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH ## 2 533 529 4 850 830 20 UA 1714 N24211 LGA IAH ## 3 542 540 2 923 850 33 AA 1141 N619AA JFK MIA ## 4 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN ## 5 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL ## 6 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD ## 7 555 600 -5 913 854 19 B6 507 N516JB EWR FLL ## 8 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD ## 9 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO ## 10 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD ## # ... with 336,766 more rows, and 5 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, ## # time_hour <dttm>
flights %>% select(contains("dep"), contains("arr"))
## # A tibble: 336,776 x 7 ## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier ## <int> <int> <dbl> <int> <int> <dbl> <chr> ## 1 517 515 2 830 819 11 UA ## 2 533 529 4 850 830 20 UA ## 3 542 540 2 923 850 33 AA ## 4 544 545 -1 1004 1022 -18 B6 ## 5 554 600 -6 812 837 -25 DL ## 6 554 558 -4 740 728 12 UA ## 7 555 600 -5 913 854 19 B6 ## 8 557 600 -3 709 723 -14 EV ## 9 557 600 -3 838 846 -8 B6 ## 10 558 600 -2 753 745 8 AA ## # ... with 336,766 more rows
flights %>% select(starts_with("dep"), starts_with("arr"))
## # A tibble: 336,776 x 4 ## dep_time dep_delay arr_time arr_delay ## <int> <dbl> <int> <dbl> ## 1 517 2 830 11 ## 2 533 4 850 20 ## 3 542 2 923 33 ## 4 544 -1 1004 -18 ## 5 554 -6 812 -25 ## 6 554 -4 740 12 ## 7 555 -5 913 19 ## 8 557 -3 709 -14 ## 9 557 -3 838 -8 ## 10 558 -2 753 8 ## # ... with 336,766 more rows
Other helpers: current_vars
, starts_with
, ends_with
, contains
, matches
, num_range
, one_of
, everything
.
names(flights)
## [1] "year" "month" "day" "dep_time" "sched_dep_time" "dep_delay" ## [7] "arr_time" "sched_arr_time" "arr_delay" "carrier" "flight" "tailnum" ## [13] "origin" "dest" "air_time" "distance" "hour" "minute" ## [19] "time_hour"
flights %>% pull("year") %>% head()
## [1] 2013 2013 2013 2013 2013 2013
flights %>% pull(1) %>% head()
## [1] 2013 2013 2013 2013 2013 2013
flights %>% pull(-1) %>% head()
## [1] "2013-01-01 05:00:00 UTC" "2013-01-01 05:00:00 UTC" "2013-01-01 05:00:00 UTC" "2013-01-01 05:00:00 UTC" ## [5] "2013-01-01 06:00:00 UTC" "2013-01-01 05:00:00 UTC"
flights %>% rename(tail_number = tailnum)
## # A tibble: 336,776 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 ## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 ## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 ## 7 2013 1 1 555 600 -5 913 854 19 B6 507 ## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 ## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 ## 10 2013 1 1 558 600 -2 753 745 8 AA 301 ## # ... with 336,766 more rows, and 8 more variables: tail_number <chr>, origin <chr>, dest <chr>, ## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% select(tail_number = tailnum, -tailnum)
## # A tibble: 336,776 x 0
flights %>% select(-tailnum, tail_number = tailnum)
## # A tibble: 336,776 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 ## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 ## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 ## 7 2013 1 1 555 600 -5 913 854 19 B6 507 ## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 ## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 ## 10 2013 1 1 558 600 -2 753 745 8 AA 301 ## # ... with 336,766 more rows, and 8 more variables: origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm>, tail_number <chr>
flights %>% filter(month==3,day==2) %>% arrange(origin, dest)
## # A tibble: 765 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 2013 3 2 1336 1329 7 1426 1432 -6 EV 4263 ## 2 2013 3 2 628 629 -1 837 849 -12 DL 575 ## 3 2013 3 2 637 640 -3 903 915 -12 EV 4209 ## 4 2013 3 2 743 745 -2 945 1010 -25 DL 807 ## 5 2013 3 2 857 900 -3 1117 1126 -9 DL 485 ## 6 2013 3 2 1027 1030 -3 1234 1247 -13 DL 2343 ## 7 2013 3 2 1134 1145 -11 1332 1359 -27 DL 401 ## 8 2013 3 2 1412 1415 -3 1636 1630 6 DL 935 ## 9 2013 3 2 1633 1636 -3 1848 1908 -20 EV 3273 ## 10 2013 3 2 1655 1700 -5 1857 1924 -27 DL 2042 ## # ... with 755 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(month==3,day==2) %>% arrange(desc(origin), dest) %>% select(origin, dest, tailnum)
## # A tibble: 765 x 3 ## origin dest tailnum ## <chr> <chr> <chr> ## 1 LGA ATL N928AT ## 2 LGA ATL N623DL ## 3 LGA ATL N680DA ## 4 LGA ATL N996AT ## 5 LGA ATL N510MQ ## 6 LGA ATL N663DN ## 7 LGA ATL N942DL ## 8 LGA ATL N511MQ ## 9 LGA ATL N910DE ## 10 LGA ATL N902DE ## # ... with 755 more rows
suppressMessages(library(lubridate)) flights %>% select(1:3) %>% mutate(date = paste(month,day,year,sep="/") %>% mdy())
## # A tibble: 336,776 x 4 ## year month day date ## <int> <int> <int> <date> ## 1 2013 1 1 2013-01-01 ## 2 2013 1 1 2013-01-01 ## 3 2013 1 1 2013-01-01 ## 4 2013 1 1 2013-01-01 ## 5 2013 1 1 2013-01-01 ## 6 2013 1 1 2013-01-01 ## 7 2013 1 1 2013-01-01 ## 8 2013 1 1 2013-01-01 ## 9 2013 1 1 2013-01-01 ## 10 2013 1 1 2013-01-01 ## # ... with 336,766 more rows
flights %>% select(1:3) %>% transmute(date = paste(month,day,year,sep="/") %>% mdy())
## # A tibble: 336,776 x 1 ## date ## <date> ## 1 2013-01-01 ## 2 2013-01-01 ## 3 2013-01-01 ## 4 2013-01-01 ## 5 2013-01-01 ## 6 2013-01-01 ## 7 2013-01-01 ## 8 2013-01-01 ## 9 2013-01-01 ## 10 2013-01-01 ## # ... with 336,766 more rows
flights %>% select(origin, dest) %>% distinct() %>% arrange(origin,dest)
## # A tibble: 224 x 2 ## origin dest ## <chr> <chr> ## 1 EWR ALB ## 2 EWR ANC ## 3 EWR ATL ## 4 EWR AUS ## 5 EWR AVL ## 6 EWR BDL ## 7 EWR BNA ## 8 EWR BOS ## 9 EWR BQN ## 10 EWR BTV ## # ... with 214 more rows
flights %>% sample_n(10)
## # A tibble: 10 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 2013 7 11 1512 1453 19 1812 1755 17 B6 1171 ## 2 2013 3 23 558 600 -2 723 735 -12 MQ 3768 ## 3 2013 9 24 826 830 -4 956 1005 -9 9E 3579 ## 4 2013 9 13 555 600 -5 828 840 -12 B6 27 ## 5 2013 4 24 1915 1830 45 2109 2010 59 MQ 4674 ## 6 2013 12 10 1935 1930 5 2241 2238 3 B6 225 ## 7 2013 10 10 1450 1455 -5 1652 1645 7 MQ 2815 ## 8 2013 12 30 1507 1441 26 1749 1724 25 9E 2918 ## 9 2013 12 1 1629 1455 94 1822 1650 92 MQ 3391 ## 10 2013 11 20 1307 1315 -8 1639 1657 -18 US 456 ## # ... with 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% sample_frac(0.001)
## # A tibble: 337 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 2013 7 8 1947 1711 156 2246 2039 127 B6 167 ## 2 2013 7 1 1439 1155 164 1624 1359 145 UA 779 ## 3 2013 1 17 746 745 1 1135 1125 10 AA 59 ## 4 2013 9 4 1116 1128 -12 1212 1230 -18 B6 2380 ## 5 2013 11 18 1738 1740 -2 1845 1900 -15 B6 1407 ## 6 2013 3 14 853 900 -7 1113 1129 -16 DL 485 ## 7 2013 7 31 2022 2030 -8 2132 2159 -27 B6 686 ## 8 2013 6 28 1403 1300 63 1502 1425 37 UA 1734 ## 9 2013 2 16 836 839 -3 1008 1005 3 EV 4497 ## 10 2013 1 11 2243 2250 -7 2349 2359 -10 B6 608 ## # ... with 327 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% mutate(date = paste(month,day,year,sep="/") %>% mdy()) %>% summarize(n(), min(date), max(date))
## # A tibble: 1 x 3 ## `n()` `min(date)` `max(date)` ## <int> <date> <date> ## 1 336776 2013-01-01 2013-12-31
flights %>% group_by(origin)
## # A tibble: 336,776 x 19 ## # Groups: origin [3] ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight ## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> ## 1 2013 1 1 517 515 2 830 819 11 UA 1545 ## 2 2013 1 1 533 529 4 850 830 20 UA 1714 ## 3 2013 1 1 542 540 2 923 850 33 AA 1141 ## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 ## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 ## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 ## 7 2013 1 1 555 600 -5 913 854 19 B6 507 ## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 ## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 ## 10 2013 1 1 558 600 -2 753 745 8 AA 301 ## # ... with 336,766 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, ## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% group_by(origin) %>% mutate(date = paste(month,day,year,sep="/") %>% mdy()) %>% summarize(n(), min(date), max(date))
## # A tibble: 3 x 4 ## origin `n()` `min(date)` `max(date)` ## <chr> <int> <date> <date> ## 1 EWR 120835 2013-01-01 2013-12-31 ## 2 JFK 111279 2013-01-01 2013-12-31 ## 3 LGA 104662 2013-01-01 2013-12-31
How many flights to Los Angeles (LAX) did each of the legacy carriers (AA, UA, DL or US) have in May from JFK, and what was their average duration?
Create a time series plot of each of the legacy carriers' average departure delay by day and origin airport.
Which plane (check the tail number) flew out of each New York airport the most?
What was the shortest flight out of each airport in terms of distance? In terms of duration?
Which date should you fly on if you want to have the lowest possible average departure delay? What about arrival delay?