Bigish data

Background

Big vs Bigish data

We will be working with data that is large, but will still fit in memory.
R loves to make extra copies of objects, so we need to be careful - even a handful of copies with exhaust the memory on most systems.
- Less of an issue on saxon (256 GB of Ram), but this is a shared resource - use too much and your session will be killed.
In general you should prefer:

subsetting/vectorization >> apply > loops

built-in/base > user C/C++ functions > user R functions

Reading Data

Reading parking data

If we use the basic approach of read.csv, we end up waiting a really long time,

system.time(read.csv("/data/nyc_parking/NYParkingViolations.csv"))

##    user  system elapsed  
## 377.359   7.080 384.411

Over 6 minutes to read in a 1.7 gigabyte CSV file.

Improvements

If we use stringsAsFactors and comment.char arguments we can speed things up a bit.

system.time(read.csv("/data/nyc_parking/NYParkingViolations.csv", 
                     stringsAsFactors=FALSE, 
                     comment.char="")
           )

##    user  system elapsed 
## 281.399   4.615 285.975

We can take this farther by specifying the structure of the data using the colClasses argument.

Alternatives - data.table

library(data.table)
system.time({nyc = fread("/data/nyc_parking/NYParkingViolations.csv")})

## Read 9100278 rows and 43 (of 43) columns from 1.673 GB file in 00:00:52
##    user  system elapsed 
##  50.855   0.970  51.793 

class(nyc)

## "data.table" "data.frame"

nyc = as.data.frame(nyc)
class(nyc)

## [1] "data.frame"

Alternatives - readr

nyc = read_csv("/data/nyc_parking/NYParkingViolations.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Summons Number` = col_double(),
##   `Violation Code` = col_integer(),
##   `Street Code1` = col_integer(),
##   `Street Code2` = col_integer(),
##   `Street Code3` = col_integer(),
##   `Vehicle Expiration Date` = col_integer(),
##   `Violation Precinct` = col_integer(),
##   `Issuer Precinct` = col_integer(),
##   `Issuer Code` = col_integer(),
##   `Date First Observed` = col_integer(),
##   `Law Section` = col_integer(),
##   `Violation Legal Code` = col_integer(),
##   `Unregistered Vehicle?` = col_integer(),
##   `Vehicle Year` = col_integer(),
##   `Feet From Curb` = col_integer()
## )
## See spec(...) for full column specifications.
## |================================================================================| 100% 1713 MB

## Warning: 654437 parsing failures.
##  row                  col   expected actual
## 2647 Violation Legal Code an integer      T
## 3792 Violation Legal Code an integer      T
## 4001 Violation Legal Code an integer      T
## 4002 Violation Legal Code an integer      T
## 4003 Violation Legal Code an integer      T
## .... .................... .......... ......
## See problems(...) for more details.
## 
##    user  system elapsed 
## 103.196   6.792 108.993

readr

This is a recent package that is designed to be a fast and friendly way of reading tabular data into R.

Core features:

Faster than base R (~3-4x)
No strings as factors
No column name mangling
Consistent argument/function naming scheme
Plays nice with dplyr (tbl_df)
Progress bars

nyc

## Source: local data frame [9,100,278 x 43]
## 
##    Summons Number Plate ID Registration State Plate Type Issue_Date Violation Code
##             (dbl)    (chr)              (chr)      (chr)      (chr)          (int)
## 1      1361929741  FCJ5493                 NY        PAS 12/18/1970             20
## 2      1366962000  63540MC                 NY        COM 02/02/1971             46
## 3      1356906515  GFM1421                 NY        PAS 09/18/1971             40
## 4      1342296217  FYM5117                 NY        SRF 09/18/1971             21
## 5      1342296199  95V6675                 TX        PAS 09/18/1971             21
## 6      1342296187  GCY4187                 NY        SRF 09/18/1971             21
## 7      1337077380  18972BB                 NY        999 10/10/1971             14
## 8      1364523796  WNJ4730                 VA        PAS 04/05/1973             14
## 9      1359914924  68091JZ                 NY        COM 07/22/1973             46
## 10     1355498326  EWV4127                 NY        PAS 08/12/1973             21
## ..            ...      ...                ...        ...        ...            ...
## Variables not shown: Vehicle Body Type (chr), Vehicle Make (chr), Issuing Agency (chr),
##   Street Code1 (int), Street Code2 (int), Street Code3 (int), Vehicle Expiration Date
##   (int), Violation Location (chr), Violation Precinct (int), Issuer Precinct (int),
##   Issuer Code (int), Issuer Command (chr), Issuer Squad (chr), Violation Time (chr), Time
##   First Observed (chr), Violation County (chr), Violation In Front Of Or Opposite (chr),
##   House Number (chr), Street Name (chr), Intersecting Street (chr), Date First Observed
##   (int), Law Section (int), Sub Division (chr), Violation Legal Code (int), Days Parking
##   In Effect (chr), From Hours In Effect (chr), To Hours In Effect (chr), Vehicle Color
##   (chr), Unregistered Vehicle? (int), Vehicle Year (int), Meter Number (chr), Feet From
##   Curb (int), Violation Post Code (chr), Violation Description (chr), No Standing or
##   Stopping Violation (chr), Hydrant Violation (chr), Double Parking Violation (chr)

Fixing column names

library(stringr)
names(nyc) = str_replace_all(names(nyc), " ", "_")
nyc

## # A tibble: 9,100,278 x 43
##    Summons_Number Plate_ID Registration_State Plate_Type Issue_Date Violation_Code
##             <dbl>    <chr>              <chr>      <chr>      <chr>          <int>
##  1     1361929741  FCJ5493                 NY        PAS 12/18/1970             20
##  2     1366962000  63540MC                 NY        COM 02/02/1971             46
##  3     1356906515  GFM1421                 NY        PAS 09/18/1971             40
##  4     1342296217  FYM5117                 NY        SRF 09/18/1971             21
##  5     1342296199  95V6675                 TX        PAS 09/18/1971             21
##  6     1342296187  GCY4187                 NY        SRF 09/18/1971             21
##  7     1337077380  18972BB                 NY        999 10/10/1971             14
##  8     1364523796  WNJ4730                 VA        PAS 04/05/1973             14
##  9     1359914924  68091JZ                 NY        COM 07/22/1973             46
## 10     1355498326  EWV4127                 NY        PAS 08/12/1973             21
## # ... with 9,100,268 more rows, and 37 more variables: Vehicle_Body_Type <chr>,
## #   Vehicle_Make <chr>, Issuing_Agency <chr>, Street_Code1 <int>, Street_Code2 <int>,
## #   Street_Code3 <int>, Vehicle_Expiration_Date <int>, Violation_Location <chr>,
## #   Violation_Precinct <int>, Issuer_Precinct <int>, Issuer_Code <int>, Issuer_Command <chr>,
## #   Issuer_Squad <chr>, Violation_Time <chr>, Time_First_Observed <chr>,
## #   Violation_County <chr>, Violation_In_Front_Of_Or_Opposite <chr>, House_Number <chr>,
## #   Street_Name <chr>, Intersecting_Street <chr>, Date_First_Observed <int>,
## #   Law_Section <int>, Sub_Division <chr>, Violation_Legal_Code <int>,
## #   Days_Parking_In_Effect <chr>, From_Hours_In_Effect <chr>, To_Hours_In_Effect <chr>,
## #   Vehicle_Color <chr>, `Unregistered_Vehicle?` <int>, Vehicle_Year <int>,
## #   Meter_Number <chr>, Feet_From_Curb <int>, Violation_Post_Code <chr>,
## #   Violation_Description <chr>, No_Standing_or_Stopping_Violation <chr>,
## #   Hydrant_Violation <chr>, Double_Parking_Violation <chr>

Simplifying

There is a lot of variables we won't care about for the time being, so lets make life easier by selecting a subset of columns.

(nyc %<>% 
  select(Registration_State:Issuing_Agency, 
         Violation_Location, Violation_Precinct, Violation_Time,
         House_Number:Intersecting_Street, Vehicle_Color))

## # A tibble: 9,100,278 x 14
##    Registration_State Plate_Type Issue_Date Violation_Code Vehicle_Body_Type Vehicle_Make
##                 <chr>      <chr>      <chr>          <int>             <chr>        <chr>
##  1                 NY        PAS 12/18/1970             20              SUBN          GMC
##  2                 NY        COM 02/02/1971             46              DELV        FRUEH
##  3                 NY        PAS 09/18/1971             40               SDN        MAZDA
##  4                 NY        SRF 09/18/1971             21              SUBN        NISSA
##  5                 TX        PAS 09/18/1971             21              <NA>          GMC
##  6                 NY        SRF 09/18/1971             21               VAN         FORD
##  7                 NY        999 10/10/1971             14               BUS        INTER
##  8                 VA        PAS 04/05/1973             14               SDN        TOYOT
##  9                 NY        COM 07/22/1973             46              DELV        TOYOT
## 10                 NY        PAS 08/12/1973             21              SUBN        ACURA
## # ... with 9,100,268 more rows, and 8 more variables: Issuing_Agency <chr>,
## #   Violation_Location <chr>, Violation_Precinct <int>, Violation_Time <chr>,
## #   House_Number <chr>, Street_Name <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>

Fixing Dates

library(lubridate)
class(nyc$Issue_Date)

## [1] "character"

nyc %<>% mutate(Issue_Date = mdy(Issue_Date, tz="America/New_York"))
class(nyc$Issue_Date)

## [1] "Date"

nyc

## # A tibble: 9,100,278 × 14
##    Registration.State Plate.Type Issue_Date Violation.Code Vehicle.Body.Type Vehicle.Make
##                 <chr>      <chr>     <date>          <int>             <chr>        <chr>
## 1                  NY        PAS 1970-12-18             20              SUBN          GMC
## 2                  NY        COM 1971-02-02             46              DELV        FRUEH
## 3                  NY        PAS 1971-09-18             40               SDN        MAZDA
## 4                  NY        SRF 1971-09-18             21              SUBN        NISSA
## 5                  TX        PAS 1971-09-18             21                            GMC
## 6                  NY        SRF 1971-09-18             21               VAN         FORD
## 7                  NY        999 1971-10-10             14               BUS        INTER
## 8                  VA        PAS 1973-04-05             14               SDN        TOYOT
## 9                  NY        COM 1973-07-22             46              DELV        TOYOT
## 10                 NY        PAS 1973-08-12             21              SUBN        ACURA
## # ... with 9,100,268 more rows, and 8 more variables: Issuing.Agency <chr>,
## #   Violation.Location <int>, Violation.Precinct <int>, Violation.Time <chr>,
## #   House.Number <chr>, Street.Name <chr>, Intersecting.Street <chr>, Vehicle.Color <chr>

More fixing dates

range(nyc$Issue_Date)

## [1] "1970-12-18 EST" "2069-12-23 EST"

nyc$Issue_Date %>% year() %>% table()

##   1970    1971    1973    1974    1976    1977    1979    1981    1983    1984    1987 
##      1       6      10       1       2       1       2       4       1       2       3 
##   1990    1991    1996    2000    2001    2002    2003    2004    2005    2006    2007 
##      2       1       1     319      91       7      39      77       9      11      13 
##   2008    2009    2010    2011    2012    2013    2014    2015    2016    2017    2018 
##      8       9     129     251     618 4379109 4716512    1522     296     309     181 
##   2019    2020    2021    2022    2023    2024    2025    2026    2027    2028    2029 
##    329      18      26       1      31      23      10       4       4       7       3 
##   2030    2031    2032    2033    2040    2041    2043    2044    2045    2046    2047 
##     45      93       3       8       1      39       9       9       2       7       6 
##   2048    2049    2050    2051    2052    2053    2060    2061    2063    2064    2066 
##      1       3       1      12       2       1       3      10       9       5       3 
##   2067    2069 
##      2       1

filter(nyc, Issue_Date >= mdy("1/1/2013"), Issue_Date <= mdy("12/31/2014"))

## # A tibble: 9,095,621 x 43
##    Summons_Number Plate_ID Registration_State Plate_Type Issue_Date Violation_Code Vehicle_Body_Type
##             <dbl>    <chr>              <chr>      <chr>     <dttm>          <int>             <chr>
##  1     1354300671   S28CMN                 NJ        PAS 2013-01-01             20              SUBN
##  2     1349345910   XTX057                 MI        PAS 2013-01-01             99               SDN
##  3     1268869855  GJK5565                 NY        PAS 2013-01-01             20              SUBN
##  4     1268869843  EPS8803                 NY        PAS 2013-01-01             20               SDN
##  5     1365149122  FWZ5341                 NY        SRF 2013-01-01             20               SDN
##  6     1364348044  EPE8859                 NY        PAS 2013-01-01             46              SUBN
##  7     1364348032  ERT3706                 NY        PAS 2013-01-01             62               SDN
##  8     1364838760  DPA3951                 NY        PAS 2013-01-01             98               SDN
##  9     1364832835  ETS1289                 NY        PAS 2013-01-01             71               SDN
## 10     1364805819  FGE4351                 NY        PAS 2013-01-01             71              SUBN
## # ... with 9,095,611 more rows, and 36 more variables: Vehicle_Make <chr>, Issuing_Agency <chr>,
## #   Street_Code1 <int>, Street_Code2 <int>, Street_Code3 <int>, Vehicle_Expiration_Date <int>,
## #   Violation_Location <chr>, Violation_Precinct <int>, Issuer_Precinct <int>, Issuer_Code <int>,
## #   Issuer_Command <chr>, Issuer_Squad <chr>, Violation_Time <chr>, Time_First_Observed <chr>,
## #   Violation_County <chr>, Violation_In_Front_Of_Or_Opposite <chr>, House_Number <chr>, Street_Name <chr>,
## #   Intersecting_Street <chr>, Date_First_Observed <int>, Law_Section <int>, Sub_Division <chr>,
## #   Violation_Legal_Code <int>, Days_Parking_In_Effect <chr>, From_Hours_In_Effect <chr>,
## #   To_Hours_In_Effect <chr>, Vehicle_Color <chr>, `Unregistered_Vehicle?` <int>, Vehicle_Year <int>,
## #   Meter_Number <chr>, Feet_From_Curb <int>, Violation_Post_Code <chr>, Violation_Description <chr>,
## #   No_Standing_or_Stopping_Violation <chr>, Hydrant_Violation <chr>, Double_Parking_Violation <chr>

Performance?

system.time(filter(nyc, Issue_Date >= mdy("1/1/2013"), Issue_Date <= mdy("12/31/2014")))
  
##     user  system elapsed  
##    7.560   1.800   9.474  

system.time(filter(nyc, year(Issue_Date) %in% c(2013,2014)))
  
##     user  system elapsed  
##    7.003   1.761   8.794

Putting it all together

nyc = read_csv("/data/nyc_parking/NYParkingViolations.csv") %>%
  setNames(str_replace_all(names(.)," ", "_")) %>%
  select(Registration_State:Issuing_Agency, 
       Violation_Location, Violation_Precinct, Violation_Time,
       House_Number:Intersecting_Street, Vehicle_Color) %>%
  mutate(Issue_Date = mdy(Issue_Date)) %>% 
  mutate(Issue_Day = day(Issue_Date),
         Issue_Month = month(Issue_Date),
         Issue_Year = year(Issue_Date),
         Issue_WDay = wday(Issue_Date, label=TRUE)) %>%
  filter(Issue_Year %in% 2013:2014)

## # A tibble: 9,095,621 x 18
##    Registration_State Plate_Type Issue_Date Violation_Code Vehicle_Body_Type Vehicle_Make Issuing_Agency
##                 <chr>      <chr>     <date>          <int>             <chr>        <chr>          <chr>
##  1                 NJ        PAS 2013-01-01             20              SUBN        CHRYS              P
##  2                 MI        PAS 2013-01-01             99               SDN        TOYOT              P
##  3                 NY        PAS 2013-01-01             20              SUBN         FORD              P
##  4                 NY        PAS 2013-01-01             20               SDN        INFIN              P
##  5                 NY        SRF 2013-01-01             20               SDN        NISSA              P
##  6                 NY        PAS 2013-01-01             46              SUBN        CHRYS              P
##  7                 NY        PAS 2013-01-01             62               SDN        HYUND              P
##  8                 NY        PAS 2013-01-01             98               SDN        TOYOT              P
##  9                 NY        PAS 2013-01-01             71               SDN        TOYOT              P
## 10                 NY        PAS 2013-01-01             71              SUBN         FORD              P
## # ... with 9,095,611 more rows, and 11 more variables: Violation_Location <chr>, Violation_Precinct <int>,
## #   Violation_Time <chr>, House_Number <chr>, Street_Name <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>,
## #   Issue_Day <int>, Issue_Month <dbl>, Issue_Year <dbl>, Issue_WDay <ord>

Ticket Frequency

nyc %>% 
  group_by(Issue_Date) %>% 
  summarize(n=n()) %>%
  ggplot(aes(x=Issue_Date, y=n)) + 
    geom_line() + 
    xlim(mdy("7/1/2013"), mdy("6/30/2014"))

Exercise 1

Some more dplyr practice,

Create a plot of the weekly pattern (tickets issued per day of the week) - When are you most likely to get a ticket and when are you least likely to get a ticket?
Which precinct issued the most tickets to Toyotas?
How many different colors of cars were ticketed?

dbplyr

Creating an sqlite database

(db = src_sqlite("/data/nyc_parking/NYParkingViolations.sqlite", create = TRUE))

##  src:  sqlite 3.8.6 [/data/nyc_parking/NYParkingViolations.sqlite]
##  tbls:

nyc_sql = copy_to(db, nyc, temporary = FALSE)
db

##  src:  sqlite 3.8.6 [/data/nyc_parking/NYParkingViolations.sqlite]
##  tbls: nyc, sqlite_stat1

nyc_sql = tbl(db,"nyc")
str(nyc_sql)

##  List of 9
##   $ src      :List of 3
##    ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
##    .. .. ..@ Id                 :<externalptr> 
##    .. .. ..@ dbname             : chr "/data/nyc_parking/NYParkingViolations.sqlite"
##    .. .. ..@ loadable.extensions: logi TRUE
##    .. .. ..@ flags              : int 6
##    .. .. ..@ vfs                : chr ""
##    ..$ path: chr "/data/nyc_parking/NYParkingViolations.sqlite"
##    ..$ info:List of 2
##    .. ..$ serverVersion: chr "3.8.6"
##    .. ..$ results      : logi FALSE
##    ..- attr(*, "class")= chr [1:3] "src_sqlite" "src_sql" "src"
##   $ from     :Classes 'ident', 'sql', 'character'  chr "nyc"
## ...

nyc_sql

## # Source:   table<nyc> [?? x 18]
## # Database: sqlite 3.19.3 [/data/nyc_parking/NYParkingViolations.sqlite]
##    Registration_State Plate_Type Issue_Date Violation_Code Vehicle_Body_Type Vehicle_Make Issuing_Agency
##                 <chr>      <chr>      <dbl>          <int>             <chr>        <chr>          <chr>
##  1                 NJ        PAS      15706             20              SUBN        CHRYS              P
##  2                 MI        PAS      15706             99               SDN        TOYOT              P
##  3                 NY        PAS      15706             20              SUBN         FORD              P
##  4                 NY        PAS      15706             20               SDN        INFIN              P
##  5                 NY        SRF      15706             20               SDN        NISSA              P
##  6                 NY        PAS      15706             46              SUBN        CHRYS              P
##  7                 NY        PAS      15706             62               SDN        HYUND              P
##  8                 NY        PAS      15706             98               SDN        TOYOT              P
##  9                 NY        PAS      15706             71               SDN        TOYOT              P
## 10                 NY        PAS      15706             71              SUBN         FORD              P
## # ... with more rows, and 11 more variables: Violation_Location <chr>, Violation_Precinct <int>,
## #   Violation_Time <chr>, House_Number <chr>, Street_Name <chr>, Intersecting_Street <chr>,
## #   Vehicle_Color <chr>, Issue_Day <int>, Issue_Month <dbl>, Issue_Year <dbl>, Issue_WDay <chr>

Using dplyr with sqlite

(addr = nyc_sql %>%
        select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
        filter(Violation_Precinct >=1, Violation_Precinct <= 34)
)

##  # Source:   lazy query [?? x 5]
##  # Database: sqlite 3.19.3 [/data/nyc_parking/NYParkingViolations.sqlite]
##     Issue_Date Issuing_Agency Violation_Precinct House_Number     Street_Name
##          <dbl>          <chr>              <int>        <chr>           <chr>
##   1      15706              P                 28          102        W 123 ST
##   2      15706              P                 23         2121           1 AVE
##   3      15706              P                 23           60        E 106 ST
##   4      15706              P                  5           54    ELIZABETH ST
##   5      15706              P                 26      488-490 ST NICHOLAS AVE
##   6      15706              P                 26         1420   AMSTERDAM AVE
##   7      15706              P                 25          219        E 121 ST
##   8      15706              P                 12          630   LEXINGTON AVE
##   9      15706              P                 18          413        48 TH ST
##  10      15706              P                 25         2123     MADISON AVE
# ... with more rows

SQL Query

class(addr)

## [1] "tbl_dbi"  "tbl_sql"  "tbl_lazy" "tbl" 

show_query(addr)

## <SQL>
## SELECT *
## FROM (SELECT `Issue_Date` AS `Issue_Date`, `Issuing_Agency` AS `Issuing_Agency`, `Violation_Precinct` AS `Violation_Precinct`, `House_Number` AS `House_Number`, `Street_Name` AS `Street_Name`
## FROM `nyc`)
## WHERE ((`Violation_Precinct` >= 1.0) AND (`Violation_Precinct` <= 34.0))

Limitations

addr %>% mutate(address = paste(House_Number, Street_Name))

## Error in rsqlite_send_query(conn@ptr, statement) : 
##   no such function: PASTE

addr %>% summarize(mean = mean(Violation_Precinct, na.rm=TRUE))

## na.rm not needed in SQL: NULL are always droppedFALSE
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.19.3 [/data/nyc_parking/NYParkingViolations.sqlite]
##       mean
##      <dbl>
## 1 16.09762

addr %>% summarize(mean = mean(Violation_Precinct))

## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.19.3 [/data/nyc_parking/NYParkingViolations.sqlite]
##       mean
##      <dbl>
## 1 16.09762

SQL Grouping

addr %>% group_by(Issuing_Agency, Violation_Precinct) %>% summarize(n=n())

##  # Source:   lazy query [?? x 3]
##  # Database: sqlite 3.19.3 [/data/nyc_parking/NYParkingViolations.sqlite]
##  # Groups:   Issuing_Agency
##     Issuing_Agency Violation_Precinct     n
##              <chr>              <int> <int>
##   1              A                  1    13
##   2              A                  7     1
##   3              A                 10    24
##   4              A                 11     1
##   5              A                 14    47
##   6              A                 33    11
##   7              B                 25     2
##   8              C                  5    73
##   9              C                 13     7
##  10              D                  1     1
##  # ... with more rows

SQL Query

addr %>% group_by(Issuing_Agency, Violation_Precinct) %>% summarize(n=n()) %>% show_query()

## <SQL>
## SELECT `Issuing_Agency`, `Violation_Precinct`, COUNT() AS `n`
## FROM (SELECT `Issue_Date` AS `Issue_Date`, `Issuing_Agency` AS `Issuing_Agency`, `Violation_Precinct` AS `## Violation_Precinct`, `House_Number` AS `House_Number`, `Street_Name` AS `Street_Name`
## FROM `nyc`)
## WHERE ((`Violation_Precinct` >= 1.0) AND (`Violation_Precinct` <= 34.0))
## GROUP BY `Issuing_Agency`, `Violation_Precinct`

SQL Translation

dbplyr has a function, translate_sql, that lets you experiment with how R functions are translated to SQL

dbplyr::translate_sql(x == 1 & (y < 2 | z > 3))

## <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)

dbplyr::translate_sql(x ^ 2 < 10)

## <SQL> POWER("x", 2.0) < 10.0

dbplyr::translate_sql(x %% 2 == 10)

## <SQL> "x" % 2.0 = 10.0

dbplyr::translate_sql(paste(x,y))

## <SQL> PASTE("x", "y")

dbplyr::translate_sql(mean(x))

## <SQL> avg("x") OVER ()

dbplyr::translate_sql(mean(x, na.rm=TRUE))

## Error in mean(x, na.rm = TRUE): unused argument (na.rm = TRUE)

In general, dplyr / dbplyr knows how to translate basic math, logical, and summary functions from R to SQL.

(Unfair) Timings

system.time(
  nyc %>%
    select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
    filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
    group_by(Issuing_Agency, Violation_Precinct) %>%
    summarize(n=n())
)

##   user  system elapsed 
##  0.639   0.099   0.740 

system.time(
  nyc_sql %>%
    select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
    filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
    group_by(Issuing_Agency, Violation_Precinct) %>%
    summarize(n=n())
)

##   user  system elapsed 
##  0.045   0.011   0.056

nyc_sql was 30x times faster than nyc, but the former is disk based while the latter is in memory, why this discrepancy?

Laziness

dplyr / dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.

When building a query, we don't want the entire table, often we want just enough to check if our query is working.
Since we would prefer to run one complex query over many simple queries, laziness allows for verbs to be strung together.
Therefore, by default dplyr
- won't connect and query the database until absolutely necessary (e.g. show output),
- and unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like

nyc_sql %>%
  select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
  filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
  group_by(Issuing_Agency, Violation_Precinct) %>%
  summarize(n=n())

## # Source:   lazy query [?? x 3]
## # Database: sqlite 3.19.3 [/data/nyc_parking/NYParkingViolations.sqlite]
## # Groups:   Issuing_Agency
##    Issuing_Agency Violation_Precinct     n
##             <chr>              <int> <int>
##  1              A                  1    13
##  2              A                  7     1
##  3              A                 10    24
##  4              A                 11     1
##  5              A                 14    47
##  6              A                 33    11
##  7              B                 25     2
##  8              C                  5    73
##  9              C                 13     7
## 10              D                  1     1
## # ... with more rows

Full query

To force a full query and return a complete it is necessart to use the collect function.

nyc_sql %>%
  select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
  filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
  group_by(Issuing_Agency, Violation_Precinct) %>%
  summarize(n=n()) %>%
  collect()

## # A tibble: 199 x 3
## # Groups:   Issuing_Agency [15]
##    Issuing_Agency Violation_Precinct     n
##             <chr>              <int> <int>
##  1              A                  1    13
##  2              A                  7     1
##  3              A                 10    24
##  4              A                 11     1
##  5              A                 14    47
##  6              A                 33    11
##  7              B                 25     2
##  8              C                  5    73
##  9              C                 13     7
## 10              D                  1     1
## # ... with 189 more rows

compute and collapse also force a full query but have slightly different behavior and return types.

Creating Indexes

(db_index = src_sqlite("/data/nyc_parking/NYParkingViolations_index.sqlite", create = TRUE))

## src:  sqlite 3.8.6 [/data/nyc_parking/NYParkingViolations_index.sqlite]
## tbls:

nyc_index = copy_to(db_index, nyc, temporary = FALSE, overwrite = TRUE,
                    index = list("Violation_Precinct"))

The indexed database takes up more disk space:

cr173@saxon [~]$ ls -lh /data/nyc_parking/*.sqlite

-rwxr--r--. 1 cr173 visitor 1.8G Nov  6 12:18 /data/nyc_parking/NYParkingViolations_index.sqlite
-rwxr--r--. 1 cr173 visitor 1.7G Nov  6 12:01 /data/nyc_parking/NYParkingViolations.sqlite

Timings for filtering precincts

system.time(nyc_sql %>% filter(Violation_Precinct <= 34, Violation_Precinct >= 1) %>% collect())

##    user  system elapsed  
##  22.204   1.524  23.832  

system.time(nyc_index %>% filter(Violation_Precinct <= 34, Violation_Precinct >= 1) %>% collect())

##    user  system elapsed  
##  21.547   1.647  23.284

Timings for grouping

system.time(nyc_sql %>% group_by(Violation_Precinct) %>% summarize(n=n()) %>% collect())

##   user  system elapsed  
##  5.131   0.652   5.804  

system.time(nyc_index %>% group_by(Violation_Precinct) %>% summarize(n=n()) %>% collect())

##   user  system elapsed  
##  1.227   0.082   1.314

Acknowledgments

Above materials are derived in part from the following sources: