Task | Timing |
---|---|
L1 cache reference | 0.5 ns |
L2 cache reference | 7 ns |
Main memory reference | 100 ns |
Read 1 MB sequentially from memory | 250,000 ns |
Disk seek | 10,000,000 ns |
Read 1 MB sequentially from network | 10,000,000 ns |
Read 1 MB sequentially from disk | 30,000,000 ns |
Let's imagine we have a 10 GB flat data file and that we want to select certain rows based on a given criterion. This requires a sequential read across the entire data set.
If we can store the file in memory: 10 GB × 250,000 ns/MB ≈ 2.5 seconds.
If we have to access the file from disk: 10 GB × 30,000,000 ns/MB ≈ 300 seconds (5 minutes).
This is just for reading data; if we make any modifications (writes), things are much worse.
What about a 100 GB flat data file?
If we can store the file in memory: 100 GB × 250,000 ns/MB ≈ 25 seconds.
If we have to access the file from disk: 100 GB × 30,000,000 ns/MB ≈ 3,000 seconds (50 minutes).
This is actually incredibly optimistic since it assumes that all the data on disk can be read in one continuous read.
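The arithmetic behind these numbers is worth sketching out in R. This is a back-of-envelope sketch using the timings from the table above, taking 1 GB as 1,000 MB:

mem_ns_per_mb  = 250e3   # read 1 MB sequentially from memory
disk_ns_per_mb = 30e6    # read 1 MB sequentially from disk

scan_secs = function(gb, ns_per_mb) gb * 1000 * ns_per_mb / 1e9

scan_secs(10,  mem_ns_per_mb)    # 2.5 s
scan_secs(10,  disk_ns_per_mb)   # 300 s  (5 min)
scan_secs(100, mem_ns_per_mb)    # 25 s
scan_secs(100, disk_ns_per_mb)   # 3000 s (50 min)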
Cost: Disk << Memory
Speed: Disk <<< Memory
So it is usually possible to grow our disk storage to accommodate our data; memory, however, is usually the limiting resource. What can we do if we can't fit everything into memory?
Create blocks: group rows based on similar attributes and read in multiple rows at a time. The optimal block size will depend on the task and the properties of the disk.
Even with blocks, any kind of subsetting of rows requires a linear search, which requires O(N) accesses where N is the number of blocks.
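As a toy illustration (the file name, block size, and filter pattern here are all hypothetical), a block-wise linear scan in R looks like this; each readLines() call fetches one block, and every query must touch all N blocks:

con = file("data.csv", open = "r")
matches = character()
while (length(block <- readLines(con, n = 10000)) > 0) {
  # one sequential read fetches a block of 10,000 rows;
  # the filtering itself then happens in memory
  matches = c(matches, block[grepl("^22,", block)])
}
close(con)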
We can do much better if we properly structure our data, specifically sorting some or all of the columns.
Sorting is expensive, O(N log N), but it only needs to be done once.
After sorting, we can use a binary search for any subsetting task, which is much faster: O(log N).
These sorted columns are known as indexes.
Indexes require additional storage, but they are usually small enough to be kept in memory while the blocks stay on disk.
Age | Name |
---|---|
19 | Carol |
20 | Greg |
21 | Alice |
21 | Dave |
22 | Eve |
23 | Bob |
23 | Frank |
Let's search for the records of people who are 22 or older.
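Here is a minimal sketch of that lookup in base R; findInterval() performs a binary search on the sorted Age column, so we locate the boundary in O(log N) steps and then read off the matching rows:

age  = c(19, 20, 21, 21, 22, 23, 23)
name = c("Carol", "Greg", "Alice", "Dave", "Eve", "Bob", "Frank")

# binary search for the last position with age < 22;
# everything after it satisfies age >= 22
i = findInterval(22, age, left.open = TRUE)
name[(i + 1):length(name)]   # "Eve" "Bob" "Frank"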
This is just barely scratching the surface:

- Efficiency gains are not just for disk; access is access.
- In general, there is a trade-off between storage and efficiency.
- Reality is a lot more complicated than everything mentioned so far; lots of very smart people have spent a lot of time thinking about and implementing these tools.
- Different tasks with different requirements require different implementations, and have different criteria for optimization.
Structured Query Language (SQL) is a special-purpose language for interacting with (querying and modifying) these indexed tabular data structures.
It is an ANSI standard, but with some divergence between dialects.
This functionality maps very closely (but not exactly) onto the data manipulation verbs present in dplyr.
We will see this mapping in more detail in a bit.
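As a rough preview, the correspondence looks approximately like this (an approximation, not an exact equivalence):

dplyr verb | SQL equivalent |
---|---|
select() | SELECT |
filter() | WHERE |
arrange() | ORDER BY |
group_by() | GROUP BY |
summarise() | aggregation functions in SELECT |
mutate() | computed columns in SELECT |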
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
## Loading required package: methods
library(stringr)
park = read.csv("/home/vis/cr173/Sta523/data/parking/NYParkingViolations_small.csv",
stringsAsFactors=FALSE) %>%
as.data.frame() %>%
tbl_df()
dir.create("~/db",showWarnings = FALSE)
db = src_sqlite("~/db/park_db.sqlite3", create = TRUE)
## Loading required package: RSQLite
## Loading required package: DBI
## Loading required package: RSQLite.extfuns
dir("~/db/")
## [1] "park_db.sqlite3" "park_full_db.sqlite3" "park_full_indexed_db.sqlite3"
park_sqlite = copy_to(db, park, temporary = FALSE)
park_sqlite
## Source: sqlite 3.7.17 [~/db/park_db.sqlite3]
## From: park [91,003 x 43]
##
## Summons.Number Plate.ID Registration.State Plate.Type Issue.Date Violation.Code Vehicle.Body.Type
## 1 1.359e+09 FXX9781 NY PAS 02/20/2014 20 SUBN
## 2 7.486e+09 FLZ6021 NY PAS 08/12/2013 37 4DSD
## 3 1.354e+09 53902MB NY COM 10/24/2013 14 VAN
## 4 1.342e+09 FYM2426 NY PAS 09/16/2013 21 SDN
## 5 1.372e+09 GPV3714 NY PAS 06/10/2014 71 SUBN
## 6 1.362e+09 XBAV11 NJ PAS 12/27/2013 78 VAN
## 7 7.004e+09 59305JY NY COM 09/18/2013 38 DELV
## 8 1.362e+09 2146518 IN PAS 11/22/2013 41 VAN
## 9 7.541e+09 49138M2 MD PAS 12/18/2013 38 SUBN
## 10 7.311e+09 GHD5283 NY PAS 09/24/2013 46 4DSD
## .. ... ... ... ... ... ... ...
## Variables not shown: Vehicle.Make (chr), Issuing.Agency (chr), Street.Code1 (int), Street.Code2 (int),
## Street.Code3 (int), Vehicle.Expiration.Date (int), Violation.Location (int), Violation.Precinct (int),
## Issuer.Precinct (int), Issuer.Code (int), Issuer.Command (chr), Issuer.Squad (chr), Violation.Time (chr),
## Time.First.Observed (chr), Violation.County (chr), Violation.In.Front.Of.Or.Opposite (chr), House.Number
## (chr), Street.Name (chr), Intersecting.Street (chr), Date.First.Observed (int), Law.Section (int),
## Sub.Division (chr), Violation.Legal.Code (chr), Days.Parking.In.Effect.... (chr), From.Hours.In.Effect
## (chr), To.Hours.In.Effect (chr), Vehicle.Color (chr), Unregistered.Vehicle. (int), Vehicle.Year (int),
## Meter.Number (chr), Feet.From.Curb (int), Violation.Post.Code (chr), Violation.Description (chr),
## No.Standing.or.Stopping.Violation (int), Hydrant.Violation (int), Double.Parking.Violation (int)
Let's build up a query with dplyr verbs against the SQLite-backed table, using R functions (paste, str_trim) that SQLite doesn't know about:
(addr = select(park_sqlite, Violation.Precinct, House.Number, Street.Name) %>%
filter(Violation.Precinct <= 34) %>%
mutate(House.Number = str_trim(House.Number), Street.Name = str_trim(Street.Name)) %>%
filter(House.Number != "" & Street.Name != "") %>%
filter(str_detect(House.Number,"[0-9]+")) %>%
transmute(Violation.Precinct = Violation.Precinct, addr = paste(House.Number, Street.Name)) %>%
mutate(addr = tolower(addr)))
## Source: sqlite 3.7.17 [~/db/park_db.sqlite3]
## Error: RS-DBI driver: (error in statement: no such function: PASTE)
SQLite has no PASTE (or str_trim) function, so we drop the paste step and use SQLite's native trim instead:
(addr = select(park_sqlite, Violation.Precinct, House.Number, Street.Name) %>%
filter(Violation.Precinct <= 34) %>%
mutate(House.Number = trim(House.Number), Street.Name = trim(Street.Name)) %>%
filter(House.Number != "" & Street.Name != "")
)
## Source: sqlite 3.7.17 [~/db/park_db.sqlite3]
## From: park [35,283 x 5]
## Filter: Violation.Precinct <= 34, House.Number != "" & Street.Name != ""
##
## Violation.Precinct House.Number Street.Name House.Number.1 Street.Name.1
## 1 0 840 8 AVE 840 8 AVE
## 2 17 643 Lexington Ave 643 Lexington Ave
## 3 17 305 E 47th St 305 E 47th St
## 4 14 N W 43rd St N W 43rd St
## 5 24 39 W 105th St 39 W 105th St
## 6 18 220 W 49th St 220 W 49th St
## 7 18 6 W 48th St 6 W 48th St
## 8 14 205 W 39th St 205 W 39th St
## 9 10 255 W 14th St 255 W 14th St
## 10 24 S W 97th St S W 97th St
## .. ... ... ... ... ...
The trimming can also be done directly inside filter (note this version tests House.Number twice and never checks Street.Name, hence the slightly different row count):
(addr = select(park_sqlite, Violation.Precinct, House.Number, Street.Name) %>%
filter(Violation.Precinct <= 34) %>%
filter(trim(House.Number) != "" & trim(House.Number) != "")
)
## Source: sqlite 3.7.17 [~/db/park_db.sqlite3]
## From: park [35,288 x 3]
## Filter: Violation.Precinct <= 34, trim(House.Number) != "" & trim(House.Number) != ""
##
## Violation.Precinct House.Number Street.Name
## 1 0 840 8 AVE
## 2 17 643 Lexington Ave
## 3 17 305 E 47th St
## 4 14 N W 43rd St
## 5 24 39 W 105th St
## 6 18 220 W 49th St
## 7 18 6 W 48th St
## 8 14 205 W 39th St
## 9 10 255 W 14th St
## 10 24 S W 97th St
## .. ... ... ...
addr$query
## <Query> SELECT "Violation.Precinct" AS "Violation.Precinct", "House.Number" AS "House.Number", "Street.Name" AS "Street.Name"
## FROM "park"
## WHERE "Violation.Precinct" <= 34.0 AND TRIM("House.Number") != '' AND TRIM("House.Number") != ''
## <SQLiteConnection: DBI CON (28575, 0)>
dplyr has a function, translate_sql, that lets you experiment with how R functions are translated to SQL:
translate_sql(x == 1 & (y < 2 | z > 3))
## <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)
translate_sql(x ^ 2 < 10)
## <SQL> POWER("x", 2.0) < 10.0
translate_sql(x %% 2 == 10)
## <SQL> "x" % 2.0 = 10.0
translate_sql(paste(x,y))
## <SQL> PASTE("x", "y")
translate_sql(mean(x))
## <SQL> AVG("x")
translate_sql(mean(x, na.rm=TRUE))
## Error: na.rm not needed in SQL: NULL are always dropped
In general, dplyr knows how to translate basic math, logical, and summary functions from R to SQL. Functions it does not recognize (like paste above) are passed through untranslated; that is why the PASTE query earlier failed in SQLite, and also why calling SQLite's native trim from R worked.
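For example, following the same pass-through pattern as paste above, an unrecognized trim should come out unchanged (a sketch of the expected translation, not captured output):

translate_sql(trim(x))   # <SQL> TRIM("x")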
system.time(
select(park, Violation.Precinct, House.Number, Street.Name) %>%
filter(Violation.Precinct <= 34) %>%
filter(str_trim(House.Number) != "" & str_trim(House.Number) != "")
)
## user system elapsed
## 0.061 0.001 0.062
system.time(
select(park_sqlite, Violation.Precinct, House.Number, Street.Name) %>%
filter(Violation.Precinct <= 34) %>%
filter(trim(House.Number) != "" & trim(House.Number) != "")
)
## user system elapsed
## 0.018 0.000 0.019
park_sqlite appears more than 3x faster than park, even though park_sqlite is disk based while park is held wholly in memory. Why this discrepancy?
dplyr uses lazy evaluation as much as possible, particularly when working with SQL backends; the timing above mostly measured building the query, not running it.
When building a query we don't want to retrieve the entire table; often just enough rows to check that the query is working.
Since we would prefer to run one complex query over many simple queries, laziness allows verbs to be strung together.
Therefore, by default dplyr won't connect to and query the database until absolutely necessary (e.g. to show output), and unless explicitly told otherwise it will only query a handful of rows.
To force a full query and return a complete tbl_df object, dplyr provides the collect function.
collect(addr)
## Source: local data frame [35,288 x 3]
##
## Violation.Precinct House.Number Street.Name
## 1 0 840 8 AVE
## 2 17 643 Lexington Ave
## 3 17 305 E 47th St
## 4 14 N W 43rd St
## 5 24 39 W 105th St
## 6 18 220 W 49th St
## 7 18 6 W 48th St
## 8 14 205 W 39th St
## 9 10 255 W 14th St
## 10 24 S W 97th St
## .. ... ... ...
compute and collapse also force a full query but have slightly different behavior and return types.
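A sketch of the difference (collect, compute, and collapse are real dplyr verbs; the pipeline here is illustrative):

q = filter(park_sqlite, Violation.Precinct <= 34)
collect(q)    # runs the query, returns a local tbl_df
compute(q)    # runs the query, stores the result in a temporary table in the database
collapse(q)   # runs nothing; wraps the current query as a subquery for further verbs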
First, some data loading and cleaning:
library(data.table)
library(bit64)
park_full = fread("/home/vis/cr173/Sta523/data/parking/NYParkingViolations.csv") %>%
data.frame() %>%
tbl_df() %>%
select(Plate.ID:Violation.Code, Violation.Precinct, House.Number:Intersecting.Street) %>%
mutate(Issue.Date = mdy(Issue.Date)) %>%
mutate(Year = year(Issue.Date), Month = month(Issue.Date), Day = day(Issue.Date)) %>%
select(-Issue.Date)
full = src_sqlite("~/db/park_full_db.sqlite3", create = TRUE) %>%
copy_to(park_full, temporary = FALSE)
full_index = src_sqlite("~/db/park_full_indexed_db.sqlite3", create = TRUE) %>%
copy_to(park_full, temporary = FALSE,
index = list(c("Year", "Month", "Day"), "Violation.Precinct", "Violation.Code"))
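Under the hood the index argument issues CREATE INDEX statements. Roughly equivalent code using the current DBI interface would be (a sketch; the index names here are made up):

library(DBI)
con = dbConnect(RSQLite::SQLite(), path.expand("~/db/park_full_indexed_db.sqlite3"))
dbExecute(con, 'CREATE INDEX IF NOT EXISTS idx_ymd      ON park_full ("Year", "Month", "Day")')
dbExecute(con, 'CREATE INDEX IF NOT EXISTS idx_precinct ON park_full ("Violation.Precinct")')
dbExecute(con, 'CREATE INDEX IF NOT EXISTS idx_code     ON park_full ("Violation.Code")')
dbDisconnect(con)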
The indexed database takes up more disk space:
cr173@saxon [Sta523_Fa14]$ ls -lh ~/db
total 740M
-rw-r--r--+ 1 cr173 visitor 0 Oct 19 22:24 park_db.sqlite3
-rw-r--r--+ 1 cr173 visitor 483M Oct 20 11:59 park_full_db.sqlite3
-rw-r--r--+ 1 cr173 visitor 792M Oct 20 12:04 park_full_indexed_db.sqlite3
system.time(full %>% filter( !(Year %in% 2013:2014) ) %>% collect)
## user system elapsed
## 3.272 0.466 3.747
system.time(full_index %>% filter( !(Year %in% 2013:2014) ) %>% collect)
## user system elapsed
## 3.261 0.499 3.765
system.time(full %>% filter( Year %in% 2013:2014 ) %>%
group_by(Year, Month, Violation.Precinct) %>% summarize(count = n()) %>%
collect)
## user system elapsed
## 41.238 2.097 43.347
system.time(full_index %>% filter( Year %in% 2013:2014 ) %>%
group_by(Year, Month, Violation.Precinct) %>% summarize(count = n()) %>%
collect)
## user system elapsed
## 40.488 2.175 42.659
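Note that the indexed and unindexed timings are essentially identical, suggesting SQLite was unable to use the indexes to avoid scanning most of the table for these queries (a NOT IN filter, and an aggregation that touches nearly every row). You can ask the query planner what it actually did; a sketch using DBI directly:

library(DBI)
con = dbConnect(RSQLite::SQLite(), path.expand("~/db/park_full_indexed_db.sqlite3"))
dbGetQuery(con, '
  EXPLAIN QUERY PLAN
  SELECT "Year", "Month", "Violation.Precinct", count(*)
  FROM park_full
  WHERE "Year" IN (2013, 2014)
  GROUP BY "Year", "Month", "Violation.Precinct"')
dbDisconnect(con)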