--- title: Bigish data author: "Colin Rundel" date: "2018-04-03" output: xaringan::moon_reader: css: "slides.css" lib_dir: libs nature: highlightStyle: github highlightLines: true countIncrementalSlides: false --- exclude: true ```{r setup, echo=FALSE, message=FALSE, warning=FALSE, include=FALSE} options( htmltools.dir.version = FALSE, # for blogdown width = 80, tibble.width = 80 ) knitr::opts_chunk$set( fig.align = "center" ) htmltools::tagList(rmarkdown::html_dependency_font_awesome()) library(dplyr) ``` --- class: middle count: false # Background --- ## Big vs Bigish data * We will be working with data that is large, but will still fit in memory. * R *loves* to make extra copies of objects, so we need to be careful - even a handful of copies with exhaust the memory on most systems. * Less of an issue on saxon (256 GB of Ram), but this is a shared resource - use too much and your session will be killed. * In general you should prefer: .center[
*subsetting/vectorization >> apply > loops*
*built-in/base > user C/C++ functions > user R functions* ] --- class: middle count: false # Reading Data --- ## Reading parking data If we use the basic approach of `read.csv`, we end up waiting a really long time, ```{r, eval=FALSE} system.time(read.csv("/data/nyc_parking/NYParkingViolations.csv")) ## user system elapsed ## 377.359 7.080 384.411 ``` Over 6 minutes to read in a 1.7 gigabyte CSV file. --- ## Improvements If we use `stringsAsFactors` and `comment.char` arguments we can speed things up a bit. ```{r, eval=FALSE} system.time( read.csv( "/data/nyc_parking/NYParkingViolations.csv", stringsAsFactors=FALSE, comment.char="" ) ) ## user system elapsed ## 281.399 4.615 285.975 ``` We can take this farther by specifying the structure of the data using the `colClasses` argument. --- ## Alternatives - data.table ```{r, eval=FALSE} system.time({ nyc = data.table::fread("/data/nyc_parking/NYParkingViolations.csv") }) ## Read 9100278 rows and 43 (of 43) columns from 1.673 GB file in 00:00:52 ## user system elapsed ## 50.855 0.970 51.793 class(nyc) ## "data.table" "data.frame" nyc = as.data.frame(nyc) class(nyc) ## [1] "data.frame" ``` --- ## Alternatives - readr ```{r, eval=FALSE} nyc = readr::read_csv("/data/nyc_parking/NYParkingViolations.csv") ## Parsed with column specification: ## cols( ## .default = col_character(), ## `Summons Number` = col_double(), ## `Violation Code` = col_integer(), ## `Street Code1` = col_integer(), ## `Street Code2` = col_integer(), ## `Street Code3` = col_integer(), ## `Vehicle Expiration Date` = col_integer(), ## `Violation Precinct` = col_integer(), ## `Issuer Precinct` = col_integer(), ## `Issuer Code` = col_integer(), ## `Date First Observed` = col_integer(), ## `Law Section` = col_integer(), ## `Violation Legal Code` = col_integer(), ## `Unregistered Vehicle?` = col_integer(), ## `Vehicle Year` = col_integer(), ## `Feet From Curb` = col_integer() ## ) ## See spec(...) for full column specifications. ## |========================================| 100% 1713 MB ``` --- ## ```{r, eval=FALSE} ## Warning: 654437 parsing failures. ## row col expected actual ## 2647 Violation Legal Code an integer T ## 3792 Violation Legal Code an integer T ## 4001 Violation Legal Code an integer T ## 4002 Violation Legal Code an integer T ## 4003 Violation Legal Code an integer T ## .... .................... .......... ...... ## See problems(...) for more details. ## ## user system elapsed ## 103.196 6.792 108.993 ``` --- ## readr This is a recent package that is designed to be a fast and friendly way of reading tabular data into R.
Core features: * Faster than base R (~3-4x) * No strings as factors * No column name mangling * Consistent argument/function naming scheme * Plays nice with dplyr (`tbl_df`) * Progress bars --- ```{r, eval=FALSE} nyc ## Source: local data frame [9,100,278 x 43] ## ## Summons Number Plate ID Registration State Plate Type Issue_Date Violation Code ## (dbl) (chr) (chr) (chr) (chr) (int) ## 1 1361929741 FCJ5493 NY PAS 12/18/1970 20 ## 2 1366962000 63540MC NY COM 02/02/1971 46 ## 3 1356906515 GFM1421 NY PAS 09/18/1971 40 ## 4 1342296217 FYM5117 NY SRF 09/18/1971 21 ## 5 1342296199 95V6675 TX PAS 09/18/1971 21 ## 6 1342296187 GCY4187 NY SRF 09/18/1971 21 ## 7 1337077380 18972BB NY 999 10/10/1971 14 ## 8 1364523796 WNJ4730 VA PAS 04/05/1973 14 ## 9 1359914924 68091JZ NY COM 07/22/1973 46 ## 10 1355498326 EWV4127 NY PAS 08/12/1973 21 ## .. ... ... ... ... ... ... ## Variables not shown: Vehicle Body Type (chr), Vehicle Make (chr), Issuing Agency (chr), ## Street Code1 (int), Street Code2 (int), Street Code3 (int), Vehicle Expiration Date ## (int), Violation Location (chr), Violation Precinct (int), Issuer Precinct (int), ## Issuer Code (int), Issuer Command (chr), Issuer Squad (chr), Violation Time (chr), Time ## First Observed (chr), Violation County (chr), Violation In Front Of Or Opposite (chr), ## House Number (chr), Street Name (chr), Intersecting Street (chr), Date First Observed ## (int), Law Section (int), Sub Division (chr), Violation Legal Code (int), Days Parking ## In Effect (chr), From Hours In Effect (chr), To Hours In Effect (chr), Vehicle Color ## (chr), Unregistered Vehicle? (int), Vehicle Year (int), Meter Number (chr), Feet From ## Curb (int), Violation Post Code (chr), Violation Description (chr), No Standing or ## Stopping Violation (chr), Hydrant Violation (chr), Double Parking Violation (chr) ``` --- ## Fixing column names ```{r eval=FALSE} library(stringr) names(nyc) = str_replace_all(names(nyc), " ", "_") nyc ## # A tibble: 9,100,278 x 43 ## Summons_Number Plate_ID Registration_State Plate_Type Issue_Date Violation_Code ## ## 1 1361929741 FCJ5493 NY PAS 12/18/1970 20 ## 2 1366962000 63540MC NY COM 02/02/1971 46 ## 3 1356906515 GFM1421 NY PAS 09/18/1971 40 ## 4 1342296217 FYM5117 NY SRF 09/18/1971 21 ## 5 1342296199 95V6675 TX PAS 09/18/1971 21 ## 6 1342296187 GCY4187 NY SRF 09/18/1971 21 ## 7 1337077380 18972BB NY 999 10/10/1971 14 ## 8 1364523796 WNJ4730 VA PAS 04/05/1973 14 ## 9 1359914924 68091JZ NY COM 07/22/1973 46 ## 10 1355498326 EWV4127 NY PAS 08/12/1973 21 ## # ... with 9,100,268 more rows, and 37 more variables: Vehicle_Body_Type , ## # Vehicle_Make , Issuing_Agency , Street_Code1 , Street_Code2 , ## # Street_Code3 , Vehicle_Expiration_Date , Violation_Location , ## # Violation_Precinct , Issuer_Precinct , Issuer_Code , Issuer_Command , ## # Issuer_Squad , Violation_Time , Time_First_Observed , ## # Violation_County , Violation_In_Front_Of_Or_Opposite , House_Number , ## # Street_Name , Intersecting_Street , Date_First_Observed , ## # Law_Section , Sub_Division , Violation_Legal_Code , ## # Days_Parking_In_Effect , From_Hours_In_Effect , To_Hours_In_Effect , ## # Vehicle_Color , `Unregistered_Vehicle?` , Vehicle_Year , ## # Meter_Number , Feet_From_Curb , Violation_Post_Code , ## # Violation_Description , No_Standing_or_Stopping_Violation , ## # Hydrant_Violation , Double_Parking_Violation ``` --- ## Simplifying There is a lot of variables we won't care about for the time being, so lets make life easier by selecting a subset of columns. ```{r eval=FALSE} (nyc %<>% select(Registration_State:Issuing_Agency, Violation_Location, Violation_Precinct, Violation_Time, House_Number:Intersecting_Street, Vehicle_Color)) ## # A tibble: 9,100,278 x 14 ## Registration_State Plate_Type Issue_Date Violation_Code Vehicle_Body_Type Vehicle_Make ## ## 1 NY PAS 12/18/1970 20 SUBN GMC ## 2 NY COM 02/02/1971 46 DELV FRUEH ## 3 NY PAS 09/18/1971 40 SDN MAZDA ## 4 NY SRF 09/18/1971 21 SUBN NISSA ## 5 TX PAS 09/18/1971 21 GMC ## 6 NY SRF 09/18/1971 21 VAN FORD ## 7 NY 999 10/10/1971 14 BUS INTER ## 8 VA PAS 04/05/1973 14 SDN TOYOT ## 9 NY COM 07/22/1973 46 DELV TOYOT ## 10 NY PAS 08/12/1973 21 SUBN ACURA ## # ... with 9,100,268 more rows, and 8 more variables: Issuing_Agency , ## # Violation_Location , Violation_Precinct , Violation_Time , ## # House_Number , Street_Name , Intersecting_Street , Vehicle_Color ``` --- ## Fixing Dates ```{r, eval=FALSE} library(lubridate) class(nyc$Issue_Date) ## [1] "character" nyc %<>% mutate(Issue_Date = mdy(Issue_Date, tz="America/New_York")) class(nyc$Issue_Date) ## [1] "Date" ``` --- ## ```{r, eval=FALSE} nyc ## # A tibble: 9,100,278 × 14 ## Registration.State Plate.Type Issue_Date Violation.Code Vehicle.Body.Type Vehicle.Make ## ## 1 NY PAS 1970-12-18 20 SUBN GMC ## 2 NY COM 1971-02-02 46 DELV FRUEH ## 3 NY PAS 1971-09-18 40 SDN MAZDA ## 4 NY SRF 1971-09-18 21 SUBN NISSA ## 5 TX PAS 1971-09-18 21 GMC ## 6 NY SRF 1971-09-18 21 VAN FORD ## 7 NY 999 1971-10-10 14 BUS INTER ## 8 VA PAS 1973-04-05 14 SDN TOYOT ## 9 NY COM 1973-07-22 46 DELV TOYOT ## 10 NY PAS 1973-08-12 21 SUBN ACURA ## # ... with 9,100,268 more rows, and 8 more variables: Issuing.Agency , ## # Violation.Location , Violation.Precinct , Violation.Time , ## # House.Number , Street.Name , Intersecting.Street , Vehicle.Color ``` --- ## More fixing dates ```{r, eval=FALSE} range(nyc$Issue_Date) ## [1] "1970-12-18 EST" "2069-12-23 EST" nyc$Issue_Date %>% year() %>% table() ## 1970 1971 1973 1974 1976 1977 1979 1981 1983 1984 1987 ## 1 6 10 1 2 1 2 4 1 2 3 ## 1990 1991 1996 2000 2001 2002 2003 2004 2005 2006 2007 ## 2 1 1 319 91 7 39 77 9 11 13 ## 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 ## 8 9 129 251 618 4379109 4716512 1522 296 309 181 ## 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 ## 329 18 26 1 31 23 10 4 4 7 3 ## 2030 2031 2032 2033 2040 2041 2043 2044 2045 2046 2047 ## 45 93 3 8 1 39 9 9 2 7 6 ## 2048 2049 2050 2051 2052 2053 2060 2061 2063 2064 2066 ## 1 3 1 12 2 1 3 10 9 5 3 ## 2067 2069 ## 2 1 ``` --- ## ```{r, eval=FALSE} filter(nyc, Issue_Date >= mdy("1/1/2013"), Issue_Date <= mdy("12/31/2014")) ## # A tibble: 9,095,621 x 43 ## Summons_Number Plate_ID Registration_State Plate_Type Issue_Date Violation_Code Vehicle_Body_Type ## ## 1 1354300671 S28CMN NJ PAS 2013-01-01 20 SUBN ## 2 1349345910 XTX057 MI PAS 2013-01-01 99 SDN ## 3 1268869855 GJK5565 NY PAS 2013-01-01 20 SUBN ## 4 1268869843 EPS8803 NY PAS 2013-01-01 20 SDN ## 5 1365149122 FWZ5341 NY SRF 2013-01-01 20 SDN ## 6 1364348044 EPE8859 NY PAS 2013-01-01 46 SUBN ## 7 1364348032 ERT3706 NY PAS 2013-01-01 62 SDN ## 8 1364838760 DPA3951 NY PAS 2013-01-01 98 SDN ## 9 1364832835 ETS1289 NY PAS 2013-01-01 71 SDN ## 10 1364805819 FGE4351 NY PAS 2013-01-01 71 SUBN ## # ... with 9,095,611 more rows, and 36 more variables: Vehicle_Make , Issuing_Agency , ## # Street_Code1 , Street_Code2 , Street_Code3 , Vehicle_Expiration_Date , ## # Violation_Location , Violation_Precinct , Issuer_Precinct , Issuer_Code , ## # Issuer_Command , Issuer_Squad , Violation_Time , Time_First_Observed , ## # Violation_County , Violation_In_Front_Of_Or_Opposite , House_Number , Street_Name , ## # Intersecting_Street , Date_First_Observed , Law_Section , Sub_Division , ## # Violation_Legal_Code , Days_Parking_In_Effect , From_Hours_In_Effect , ## # To_Hours_In_Effect , Vehicle_Color , `Unregistered_Vehicle?` , Vehicle_Year , ## # Meter_Number , Feet_From_Curb , Violation_Post_Code , Violation_Description , ## # No_Standing_or_Stopping_Violation , Hydrant_Violation , Double_Parking_Violation ``` --- ## Performance? ```{r, eval=FALSE} system.time(filter(nyc, Issue_Date >= mdy("1/1/2013"), Issue_Date <= mdy("12/31/2014"))) ## user system elapsed ## 7.560 1.800 9.474 system.time(filter(nyc, year(Issue_Date) %in% c(2013,2014))) ## user system elapsed ## 7.003 1.761 8.794 ``` --- ## Putting it all together ```{r, eval=FALSE} nyc = read_csv("/data/nyc_parking/NYParkingViolations.csv") %>% setNames(str_replace_all(names(.)," ", "_")) %>% select(Registration_State:Issuing_Agency, Violation_Location, Violation_Precinct, Violation_Time, House_Number:Intersecting_Street, Vehicle_Color) %>% mutate(Issue_Date = mdy(Issue_Date)) %>% mutate(Issue_Day = day(Issue_Date), Issue_Month = month(Issue_Date), Issue_Year = year(Issue_Date), Issue_WDay = wday(Issue_Date, label=TRUE)) %>% filter(Issue_Year %in% 2013:2014) ``` --- ## ```{r, eval=FALSE} ## # A tibble: 9,095,621 x 18 ## Registration_State Plate_Type Issue_Date Violation_Code Vehicle_Body_Type Vehicle_Make Issuing_Agency ## ## 1 NJ PAS 2013-01-01 20 SUBN CHRYS P ## 2 MI PAS 2013-01-01 99 SDN TOYOT P ## 3 NY PAS 2013-01-01 20 SUBN FORD P ## 4 NY PAS 2013-01-01 20 SDN INFIN P ## 5 NY SRF 2013-01-01 20 SDN NISSA P ## 6 NY PAS 2013-01-01 46 SUBN CHRYS P ## 7 NY PAS 2013-01-01 62 SDN HYUND P ## 8 NY PAS 2013-01-01 98 SDN TOYOT P ## 9 NY PAS 2013-01-01 71 SDN TOYOT P ## 10 NY PAS 2013-01-01 71 SUBN FORD P ## # ... with 9,095,611 more rows, and 11 more variables: Violation_Location , Violation_Precinct , ## # Violation_Time , House_Number , Street_Name , Intersecting_Street , Vehicle_Color , ## # Issue_Day , Issue_Month , Issue_Year , Issue_WDay ``` --- ## Ticket Frequency ```{r, eval=FALSE} nyc %>% group_by(Issue_Date) %>% summarize(n=n()) %>% ggplot(aes(x=Issue_Date, y=n)) + geom_line() + xlim(mdy("7/1/2013"), mdy("6/30/2014")) ``` ```{r echo=FALSE} knitr::include_graphics("imgs/nyc_date_freq.png") ``` --- ## Exercise 1 Some more dplyr practice, 1. Create a plot of the weekly pattern (tickets issued per day of the week) - When are you most likely to get a ticket and when are you least likely to get a ticket? 2. Which precinct issued the most tickets to Toyotas? 3. How many different colors of cars were ticketed? --- class: middle count: false # dbplyr --- ## Creating an sqlite database .normal[ ```{r eval=FALSE} (db = src_sqlite("/data/nyc_parking/NYParkingViolations.sqlite", create = TRUE)) ## src: sqlite 3.8.6 [/data/nyc_parking/NYParkingViolations.sqlite] ## tbls: nyc_sql = copy_to(db, nyc, temporary = FALSE) db ## src: sqlite 3.8.6 [/data/nyc_parking/NYParkingViolations.sqlite] ## tbls: nyc, sqlite_stat1 nyc_sql = tbl(db,"nyc") str(nyc_sql) ## List of 9 ## $ src :List of 3 ## ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots ## .. .. ..@ Id : ## .. .. ..@ dbname : chr "/data/nyc_parking/NYParkingViolations.sqlite" ## .. .. ..@ loadable.extensions: logi TRUE ## .. .. ..@ flags : int 6 ## .. .. ..@ vfs : chr "" ## ..$ path: chr "/data/nyc_parking/NYParkingViolations.sqlite" ## ..$ info:List of 2 ## .. ..$ serverVersion: chr "3.8.6" ## .. ..$ results : logi FALSE ## ..- attr(*, "class")= chr [1:3] "src_sqlite" "src_sql" "src" ## $ from :Classes 'ident', 'sql', 'character' chr "nyc" ## ... ``` ] --- ## ```{r eval=FALSE} nyc_sql ## # Source: table [?? x 18] ## # Database: sqlite 3.19.3 [/data/nyc_parking/NYParkingViolations.sqlite] ## Registration_State Plate_Type Issue_Date Violation_Code Vehicle_Body_Type Vehicle_Make Issuing_Agency ## ## 1 NJ PAS 15706 20 SUBN CHRYS P ## 2 MI PAS 15706 99 SDN TOYOT P ## 3 NY PAS 15706 20 SUBN FORD P ## 4 NY PAS 15706 20 SDN INFIN P ## 5 NY SRF 15706 20 SDN NISSA P ## 6 NY PAS 15706 46 SUBN CHRYS P ## 7 NY PAS 15706 62 SDN HYUND P ## 8 NY PAS 15706 98 SDN TOYOT P ## 9 NY PAS 15706 71 SDN TOYOT P ## 10 NY PAS 15706 71 SUBN FORD P ## # ... with more rows, and 11 more variables: Violation_Location , Violation_Precinct , ## # Violation_Time , House_Number , Street_Name , Intersecting_Street , ## # Vehicle_Color , Issue_Day , Issue_Month , Issue_Year , Issue_WDay ``` --- ## Using dplyr with sqlite ```{r, eval=FALSE} (addr = nyc_sql %>% select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>% filter(Violation_Precinct >=1, Violation_Precinct <= 34) ) ## # Source: lazy query [?? x 5] ## # Database: sqlite 3.19.3 [/data/nyc_parking/NYParkingViolations.sqlite] ## Issue_Date Issuing_Agency Violation_Precinct House_Number Street_Name ## ## 1 15706 P 28 102 W 123 ST ## 2 15706 P 23 2121 1 AVE ## 3 15706 P 23 60 E 106 ST ## 4 15706 P 5 54 ELIZABETH ST ## 5 15706 P 26 488-490 ST NICHOLAS AVE ## 6 15706 P 26 1420 AMSTERDAM AVE ## 7 15706 P 25 219 E 121 ST ## 8 15706 P 12 630 LEXINGTON AVE ## 9 15706 P 18 413 48 TH ST ## 10 15706 P 25 2123 MADISON AVE--- class: middle count: false # ... with more rows ``` --- ## SQL Query ```{r, eval=FALSE} class(addr) ## [1] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl" show_query(addr) ## ## SELECT * ## FROM (SELECT `Issue_Date` AS `Issue_Date`, `Issuing_Agency` AS `Issuing_Agency`, `Violation_Precinct` AS `Violation_Precinct`, `House_Number` AS `House_Number`, `Street_Name` AS `Street_Name` ## FROM `nyc`) ## WHERE ((`Violation_Precinct` >= 1.0) AND (`Violation_Precinct` <= 34.0)) ``` --- ## Limitations ```{r, eval=FALSE} addr %>% mutate(address = paste(House_Number, Street_Name)) ## Error in rsqlite_send_query(conn@ptr, statement) : ## no such function: PASTE addr %>% summarize(mean = mean(Violation_Precinct, na.rm=TRUE)) ## na.rm not needed in SQL: NULL are always droppedFALSE ## # Source: lazy query [?? x 1] ## # Database: sqlite 3.19.3 [/data/nyc_parking/NYParkingViolations.sqlite] ## mean ## ## 1 16.09762 addr %>% summarize(mean = mean(Violation_Precinct)) ## # Source: lazy query [?? x 1] ## # Database: sqlite 3.19.3 [/data/nyc_parking/NYParkingViolations.sqlite] ## mean ## ## 1 16.09762 ``` --- ## SQL Grouping ```{r, eval=FALSE} addr %>% group_by(Issuing_Agency, Violation_Precinct) %>% summarize(n=n()) ## # Source: lazy query [?? x 3] ## # Database: sqlite 3.19.3 [/data/nyc_parking/NYParkingViolations.sqlite] ## # Groups: Issuing_Agency ## Issuing_Agency Violation_Precinct n ## ## 1 A 1 13 ## 2 A 7 1 ## 3 A 10 24 ## 4 A 11 1 ## 5 A 14 47 ## 6 A 33 11 ## 7 B 25 2 ## 8 C 5 73 ## 9 C 13 7 ## 10 D 1 1 ## # ... with more rows ``` --- ## SQL Query ```{r, eval=FALSE} addr %>% group_by(Issuing_Agency, Violation_Precinct) %>% summarize(n=n()) %>% show_query() ## ## SELECT `Issuing_Agency`, `Violation_Precinct`, COUNT() AS `n` ## FROM (SELECT `Issue_Date` AS `Issue_Date`, `Issuing_Agency` AS `Issuing_Agency`, `Violation_Precinct` AS `## Violation_Precinct`, `House_Number` AS `House_Number`, `Street_Name` AS `Street_Name` ## FROM `nyc`) ## WHERE ((`Violation_Precinct` >= 1.0) AND (`Violation_Precinct` <= 34.0)) ## GROUP BY `Issuing_Agency`, `Violation_Precinct` ``` --- ## SQL Translation dbplyr has a function, `translate_sql`, that lets you experiment with how R functions are translated to SQL .pull-left[ ```{r, error=TRUE} dbplyr::translate_sql(x == 1 & (y < 2 | z > 3)) dbplyr::translate_sql(x ^ 2 < 10) dbplyr::translate_sql(x %% 2 == 10) ``` ] .pull-right[ ```{r error=TRUE} dbplyr::translate_sql(paste(x,y)) dbplyr::translate_sql(mean(x)) dbplyr::translate_sql(mean(x, na.rm=TRUE)) ``` ] In general, dplyr / dbplyr knows how to translate basic math, logical, and summary functions from R to SQL. --- ## (Unfair) Timings ```{r, eval=FALSE} system.time( nyc %>% select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>% filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>% group_by(Issuing_Agency, Violation_Precinct) %>% summarize(n=n()) ) ## user system elapsed ## 0.639 0.099 0.740 system.time( nyc_sql %>% select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>% filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>% group_by(Issuing_Agency, Violation_Precinct) %>% summarize(n=n()) ) ## user system elapsed ## 0.045 0.011 0.056 ``` `nyc_sql` was 30x times faster than `nyc`, but the former is disk based while the latter is in memory, why this discrepancy? --- ## Laziness dplyr / dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends. * When building a query, we don't want the entire table, often we want just enough to check if our query is working. * Since we would prefer to run one complex query over many simple queries, laziness allows for verbs to be strung together. * Therefore, by default `dplyr` * won't connect and query the database until absolutely necessary (e.g. show output), * and unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like --- ## ```{r eval=FALSE} nyc_sql %>% select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>% filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>% group_by(Issuing_Agency, Violation_Precinct) %>% summarize(n=n()) ## # Source: lazy query [?? x 3] ## # Database: sqlite 3.19.3 [/data/nyc_parking/NYParkingViolations.sqlite] ## # Groups: Issuing_Agency ## Issuing_Agency Violation_Precinct n ## ## 1 A 1 13 ## 2 A 7 1 ## 3 A 10 24 ## 4 A 11 1 ## 5 A 14 47 ## 6 A 33 11 ## 7 B 25 2 ## 8 C 5 73 ## 9 C 13 7 ## 10 D 1 1 ## # ... with more rows ``` --- ## Full query To force a full query and return a complete it is necessart to use the `collect` function. ```{r, eval=FALSE} nyc_sql %>% select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>% filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>% group_by(Issuing_Agency, Violation_Precinct) %>% summarize(n=n()) %>% collect() ## # A tibble: 199 x 3 ## # Groups: Issuing_Agency [15] ## Issuing_Agency Violation_Precinct n ## ## 1 A 1 13 ## 2 A 7 1 ## 3 A 10 24 ## 4 A 11 1 ## 5 A 14 47 ## 6 A 33 11 ## 7 B 25 2 ## 8 C 5 73 ## 9 C 13 7 ## 10 D 1 1 ## # ... with 189 more rows ``` `compute` and `collapse` also force a full query but have slightly different behavior and return types. --- ## Creating Indexes ```{r, eval=FALSE} (db_index = src_sqlite("/data/nyc_parking/NYParkingViolations_index.sqlite", create = TRUE)) ## src: sqlite 3.8.6 [/data/nyc_parking/NYParkingViolations_index.sqlite] ## tbls: nyc_index = copy_to(db_index, nyc, temporary = FALSE, overwrite = TRUE, index = list("Violation_Precinct")) ``` --- The indexed database takes up more disk space: ```shell cr173@saxon [~]$ ls -lh /data/nyc_parking/*.sqlite -rwxr--r--. 1 cr173 visitor 1.8G Nov 6 12:18 /data/nyc_parking/NYParkingViolations_index.sqlite -rwxr--r--. 1 cr173 visitor 1.7G Nov 6 12:01 /data/nyc_parking/NYParkingViolations.sqlite ``` --- ## Timings for filtering precincts ```{r, eval=FALSE} system.time(nyc_sql %>% filter(Violation_Precinct <= 34, Violation_Precinct >= 1) %>% collect()) ## user system elapsed ## 22.204 1.524 23.832 system.time(nyc_index %>% filter(Violation_Precinct <= 34, Violation_Precinct >= 1) %>% collect()) ## user system elapsed ## 21.547 1.647 23.284 ``` --- ## Timings for grouping ```{r, eval=FALSE} system.time(nyc_sql %>% group_by(Violation_Precinct) %>% summarize(n=n()) %>% collect()) ## user system elapsed ## 5.131 0.652 5.804 system.time(nyc_index %>% group_by(Violation_Precinct) %>% summarize(n=n()) %>% collect()) ## user system elapsed ## 1.227 0.082 1.314 ``` --- class: middle count: false # Acknowledgments --- ## Acknowledgments Above materials are derived in part from the following sources: * [dbplyr - Introduction Vignette](https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html) * [dbplyr - SQL Translation](https://cran.r-project.org/web/packages/dbplyr/vignettes/sql-translation.html)