full_join
- Join data, preserving all rows in both a
and b
.inner_join
- Join data, preserving only rows with keys in both a
and b
.left_join
- Join matching rows from b
to a
, preserving all rows of a
right_join
- Join matching rows from a
to b
, preserving all rows of b
.semi_join
- Subset rows in a
that have a match in b
.anti_join
- Subset rows in a
that do not have a match in b
.addr = data.frame(name = c("Alice","Bob",
"Carol","dave",
"Eve"),
email= c("alice@company.com",
"bob@company.com",
"carol@company.com",
"dave@company.com",
"eve@company.com"),
stringsAsFactors = FALSE)
phone = data.frame(name = c("Bob","Carol",
"Eve","Eve",
"Frank"),
phone= c("919 555-1111",
"919 555-2222",
"919 555-3333",
"310 555-3333",
"919 555-4444"),
stringsAsFactors = FALSE)
addr
## name email
## 1 Alice alice@company.com
## 2 Bob bob@company.com
## 3 Carol carol@company.com
## 4 dave dave@company.com
## 5 Eve eve@company.com
phone
## name phone
## 1 Bob 919 555-1111
## 2 Carol 919 555-2222
## 3 Eve 919 555-3333
## 4 Eve 310 555-3333
## 5 Frank 919 555-4444
dplyr:
full_join(addr, phone)
## Joining, by = "name"
## name email phone
## 1 Alice alice@company.com <NA>
## 2 Bob bob@company.com 919 555-1111
## 3 Carol carol@company.com 919 555-2222
## 4 dave dave@company.com <NA>
## 5 Eve eve@company.com 919 555-3333
## 6 Eve eve@company.com 310 555-3333
## 7 Frank <NA> 919 555-4444
Base R:
merge(addr, phone, all=TRUE)
## name email phone
## 1 Alice alice@company.com <NA>
## 2 Bob bob@company.com 919 555-1111
## 3 Carol carol@company.com 919 555-2222
## 4 dave dave@company.com <NA>
## 5 Eve eve@company.com 919 555-3333
## 6 Eve eve@company.com 310 555-3333
## 7 Frank <NA> 919 555-4444
dplyr:
inner_join(addr,phone)
## Joining, by = "name"
## name email phone
## 1 Bob bob@company.com 919 555-1111
## 2 Carol carol@company.com 919 555-2222
## 3 Eve eve@company.com 919 555-3333
## 4 Eve eve@company.com 310 555-3333
Base R:
merge(addr, phone, all=FALSE)
## name email phone
## 1 Bob bob@company.com 919 555-1111
## 2 Carol carol@company.com 919 555-2222
## 3 Eve eve@company.com 919 555-3333
## 4 Eve eve@company.com 310 555-3333
dplyr:
left_join(addr,phone)
## Joining, by = "name"
## name email phone
## 1 Alice alice@company.com <NA>
## 2 Bob bob@company.com 919 555-1111
## 3 Carol carol@company.com 919 555-2222
## 4 dave dave@company.com <NA>
## 5 Eve eve@company.com 919 555-3333
## 6 Eve eve@company.com 310 555-3333
Base R:
merge(addr, phone, all.x=TRUE)
## name email phone
## 1 Alice alice@company.com <NA>
## 2 Bob bob@company.com 919 555-1111
## 3 Carol carol@company.com 919 555-2222
## 4 dave dave@company.com <NA>
## 5 Eve eve@company.com 919 555-3333
## 6 Eve eve@company.com 310 555-3333
dplyr:
right_join(addr, phone)
## Joining, by = "name"
## name email phone
## 1 Bob bob@company.com 919 555-1111
## 2 Carol carol@company.com 919 555-2222
## 3 Eve eve@company.com 919 555-3333
## 4 Eve eve@company.com 310 555-3333
## 5 Frank <NA> 919 555-4444
Base R:
merge(addr, phone, all.y=TRUE)
## name email phone
## 1 Bob bob@company.com 919 555-1111
## 2 Carol carol@company.com 919 555-2222
## 3 Eve eve@company.com 919 555-3333
## 4 Eve eve@company.com 310 555-3333
## 5 Frank <NA> 919 555-4444
semi_join(addr, phone)
## Joining, by = "name"
## name email
## 1 Bob bob@company.com
## 2 Carol carol@company.com
## 3 Eve eve@company.com
anti_join(addr, phone)
## Joining, by = "name"
## name email
## 1 Alice alice@company.com
## 2 dave dave@company.com
addr = data.frame(name = c("Alice","Alice", "Bob","Bob"),
email= c("alice@company.com","alice@gmail.com", "bob@company.com","bob@hotmail.com"),
stringsAsFactors = FALSE)
phone = data.frame(name = c("Alice","Alice", "Bob","Bob"),
phone= c("919 555-1111", "310 555-2222", "919 555-3333", "310 555-3333"),
stringsAsFactors = FALSE)
dplyr:
full_join(addr, phone, by="name")
## name email phone
## 1 Alice alice@company.com 919 555-1111
## 2 Alice alice@company.com 310 555-2222
## 3 Alice alice@gmail.com 919 555-1111
## 4 Alice alice@gmail.com 310 555-2222
## 5 Bob bob@company.com 919 555-3333
## 6 Bob bob@company.com 310 555-3333
## 7 Bob bob@hotmail.com 919 555-3333
## 8 Bob bob@hotmail.com 310 555-3333
Base R:
merge(addr, phone)
## name email phone
## 1 Alice alice@company.com 919 555-1111
## 2 Alice alice@company.com 310 555-2222
## 3 Alice alice@gmail.com 919 555-1111
## 4 Alice alice@gmail.com 310 555-2222
## 5 Bob bob@company.com 919 555-3333
## 6 Bob bob@company.com 310 555-3333
## 7 Bob bob@hotmail.com 919 555-3333
## 8 Bob bob@hotmail.com 310 555-3333
Task | Timing |
---|---|
L1 cache reference | 0.5 ns |
L2 cache reference | 7 ns |
Main memory reference | 100 ns |
Read 1 MB sequentially from memory | 250,000 ns |
Disk seek | 10,000,000 ns |
Read 1 MB sequentially from network | 10,000,000 ns |
Read 1 MB sequentially from disk | 30,000,000 ns |
Lets imagine we have a 10 GB flat data file and that we want to select certain rows based on a given criteria. This requires a sequential read across the entire data set.
If we can store the file in memory:
If we have to access the file from disk:
This is just for reading data, if we make any modifications (writing) things are much worse.
Cost: Disk << Memory
Speed: Disk <<< Memory
So usually possible to grow our disk storage to accommodate our data. However, memory is usually the limiting resource, and if we can’t fit everything into memory?
Create blocks - group rows based on similar attributes and read in multiple rows at a time. Optimal size will depend on the task and the properties of the disk.
Even with blocks, any kind of subsetting of rows requires a linear search, which requires \(\mathcal{O}(N)\) accesses where \(N\) is the number of blocks.
We can do much better if we are careful about how we structure our data, specifically sorting some or all of the columns.
Sorting is expensive, \(\mathcal{O}(N \log N)\), but it only needs to be done once.
After sorting, we can use a binary search for any subsetting tasks (\(\mathcal{O}(\log N)\)).
These sorted columns are known as indexes.
Indexes require additional storage, but usually small enough to be kept in memory while blocks stay on disk.
Age | Name |
---|---|
19 | Carol |
20 | Greg |
21 | Alice |
21 | Dave |
22 | Eve |
23 | Bob |
23 | Frank |
Lets search for records for people who are 22 or older.
This is just barely scratching the surface,
Efficiency gains are not just for disk, access is access
In general, trade off between storage and efficiency
Reality is a lot more complicated for everything mentioned so far, lots of very smart people have spent a lot of time thinking about and implementing tools
Different tasks with different requirements require different implementations and have different criteria for optimization
Structures Query Language is a special purpose language for interacting with (querying and modifying) these indexed tabular data structures.
ANSI Standard but with some dialect divergence
This functionality maps very closely (but not exactly) with the data manipulation verbs present in dplyr.
We will see this mapping in more detail in a bit.
Low level package for interfacing R with Database management systems (DBMS) that provides a common interface to achieve the following functionality:
Provides the implementation necessary to use DBI to interface with an SQLite database.
library(RSQLite)
## Loading required package: DBI
con = dbConnect(RSQLite::SQLite(), ":memory:")
str(con)
## Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
## ..@ Id :<externalptr>
## ..@ dbname : chr ":memory:"
## ..@ loadable.extensions: logi TRUE
## ..@ flags : int 6
## ..@ vfs : chr ""
employees = data.frame(name = c("Alice","Bob","Carol","Dave","Eve","Frank"),
email = c("alice@company.com", "bob@company.com",
"carol@company.com", "dave@company.com",
"eve@company.com", "frank@comany.com"),
salary = c(52000, 40000, 30000, 33000, 44000, 37000),
dept = c("Accounting", "Accounting","Sales",
"Accounting","Sales","Sales"),
stringsAsFactors = FALSE)
dbWriteTable(con, "employees", employees)
## [1] TRUE
dbListTables(con)
## [1] "employees"
dbWriteTable(con, "employs", employees)
## [1] TRUE
dbListTables(con)
## [1] "employees" "employs"
dbRemoveTable(con,"employs")
## [1] TRUE
dbListTables(con)
## [1] "employees"
res = dbSendQuery(con, "SELECT * FROM employees")
dbFetch(res)
## name email salary dept
## 1 Alice alice@company.com 52000 Accounting
## 2 Bob bob@company.com 40000 Accounting
## 3 Carol carol@company.com 30000 Sales
## 4 Dave dave@company.com 33000 Accounting
## 5 Eve eve@company.com 44000 Sales
## 6 Frank frank@comany.com 37000 Sales
dbClearResult(res)
## [1] TRUE
dbDisconnect(con)
## [1] TRUE
cr173@saxon [2017-11-01-sqlite]$ sqlite3 employees.sqlite
SQLite version 3.11.0 2016-02-15 17:29:24
Enter ".help" for usage hints.
The following is specific to SQLite
sqlite> .tables
employees
sqlite> .schema employees
CREATE TABLE `employees` (
`name` TEXT,
`email` TEXT,
`salary` REAL,
`dept` TEXT
);
sqlite> .indices employees
sqlite> SELECT * FROM employees;
Alice|alice@company.com|52000.0|Accounting
Bob|bob@company.com|40000.0|Accounting
Carol|carol@company.com|30000.0|Sales
Dave|dave@company.com|33000.0|Accounting
Eve|eve@company.com|44000.0|Sales
Frank|frank@comany.com|37000.0|Sales
We can make this table output a little nicer with some additonal SQLite options:
sqlite> .mode column
sqlite> .headers on
sqlite> SELECT * FROM employees;
name email salary dept
---------- ----------------- ---------- ----------
Alice alice@company.com 52000.0 Accounting
Bob bob@company.com 40000.0 Accounting
Carol carol@company.com 30000.0 Sales
Dave dave@company.com 33000.0 Accounting
Eve eve@company.com 44000.0 Sales
Frank frank@comany.com 37000.0 Sales
We can subset for certain columns (and rename them) using SELECT
sqlite> SELECT name AS first_name, salary FROM employees;
first_name salary
---------- ----------
Alice 52000.0
Bob 40000.0
Carol 30000.0
Dave 33000.0
Eve 44000.0
Frank 37000.0
We can sort our results by adding ORDER BY
to our SELECT
statement
sqlite> SELECT name AS first_name, salary FROM employees ORDER BY salary;
first_name salary
---------- ----------
Carol 30000.0
Dave 33000.0
Frank 37000.0
Bob 40000.0
Eve 44000.0
Alice 52000.0
We can sort in the opposite order by adding DESC
SELECT name AS first_name, salary FROM employees ORDER BY salary DESC;
first_name salary
---------- ----------
Alice 52000.0
Eve 44000.0
Bob 40000.0
Frank 37000.0
Dave 33000.0
Carol 30000.0
We can filter rows by adding WHERE
to our statements
sqlite> SELECT * FROM employees WHERE salary < 40000;
name email salary dept
---------- ----------------- ---------- ----------
Carol carol@company.com 30000.0 Sales
Dave dave@company.com 33000.0 Accounting
Frank frank@comany.com 37000.0 Sales
sqlite> SELECT * FROM employees WHERE salary < 40000 AND dept = "Sales";
name email salary dept
---------- ----------------- ---------- ----------
Carol carol@company.com 30000.0 Sales
Frank frank@comany.com 37000.0 Sales
We can create groups for the purpose of summarizing using GROUP BY
. As with dplyr it is not terribly useful by itself.
sqlite> SELECT * FROM employees GROUP BY dept;
name email salary dept
---------- ---------------- ---------- ----------
Dave dave@company.com 33000.0 Accounting
Frank frank@comany.com 37000.0 Sales
sqlite> SELECT dept, AVG(salary) AS mean_salary, COUNT(*) AS n FROM employees GROUP BY dept;
dept mean_salary n
---------- ---------------- ----------
Accounting 41666.6666666667 3
Sales 37000.0 3
We can limit the number of rows we get by using LIMIT
.
sqlite> SELECT * FROM employees LIMIT 3;
name email salary dept
---------- ----------------- ---------- ----------
Alice alice@company.com 52000.0 Accounting
Bob bob@company.com 40000.0 Accounting
Carol carol@company.com 30000.0 Sales
sqlite> SELECT * FROM employees ORDER BY name DESC LIMIT 3;
name email salary dept
---------- ---------------- ---------- ----------
Frank frank@comany.com 37000.0 Sales
Eve eve@company.com 44000.0 Sales
Dave dave@company.com 33000.0 Accounting
sqlite> .mode csv
sqlite> .import phone.csv phone
sqlite> .tables
employees phone
sqlite> .mode column
sqlite> SELECT * FROM phone;
name phone
---------- ------------
Bob 919 555-1111
Carol 919 555-2222
Eve 919 555-3333
Frank 919 555-4444
By default SQLite uses a CROSS JOIN
which is not terribly useful
sqlite> SELECT * FROM employees JOIN phone;
name email salary dept name phone
---------- ----------------- ---------- ---------- ---------- ------------
Alice alice@company.com 52000.0 Accounting Bob 919 555-1111
Alice alice@company.com 52000.0 Accounting Carol 919 555-2222
Alice alice@company.com 52000.0 Accounting Eve 919 555-3333
Alice alice@company.com 52000.0 Accounting Frank 919 555-4444
Bob bob@company.com 40000.0 Accounting Bob 919 555-1111
Bob bob@company.com 40000.0 Accounting Carol 919 555-2222
Bob bob@company.com 40000.0 Accounting Eve 919 555-3333
Bob bob@company.com 40000.0 Accounting Frank 919 555-4444
Carol carol@company.com 30000.0 Sales Bob 919 555-1111
Carol carol@company.com 30000.0 Sales Carol 919 555-2222
Carol carol@company.com 30000.0 Sales Eve 919 555-3333
Carol carol@company.com 30000.0 Sales Frank 919 555-4444
Dave dave@company.com 33000.0 Accounting Bob 919 555-1111
Dave dave@company.com 33000.0 Accounting Carol 919 555-2222
Dave dave@company.com 33000.0 Accounting Eve 919 555-3333
Dave dave@company.com 33000.0 Accounting Frank 919 555-4444
Eve eve@company.com 44000.0 Sales Bob 919 555-1111
Eve eve@company.com 44000.0 Sales Carol 919 555-2222
Eve eve@company.com 44000.0 Sales Eve 919 555-3333
Eve eve@company.com 44000.0 Sales Frank 919 555-4444
Frank frank@comany.com 37000.0 Sales Bob 919 555-1111
Frank frank@comany.com 37000.0 Sales Carol 919 555-2222
Frank frank@comany.com 37000.0 Sales Eve 919 555-3333
Frank frank@comany.com 37000.0 Sales Frank 919 555-4444
If you want SQLite to find the columns to merge on automatically then we prefix the join with NATURAL
.
sqlite> SELECT * FROM employees NATURAL JOIN phone;
name email salary dept phone
---------- --------------- ---------- ---------- ------------
Bob bob@company.com 40000.0 Accounting 919 555-1111
Carol carol@company.c 30000.0 Sales 919 555-2222
Eve eve@company.com 44000.0 Sales 919 555-3333
Frank frank@comany.co 37000.0 Sales 919 555-4444
sqlite> SELECT * FROM employees JOIN phone ON employees.name = phone.name;
name email salary dept name phone
---------- --------------- ---------- ---------- ---------- ------------
Bob bob@company.com 40000.0 Accounting Bob 919 555-1111
Carol carol@company.c 30000.0 Sales Carol 919 555-2222
Eve eve@company.com 44000.0 Sales Eve 919 555-3333
Frank frank@comany.co 37000.0 Sales Frank 919 555-4444
sqlite> SELECT * FROM employees NATURAL LEFT JOIN phone;
name email salary dept phone
---------- ----------------- ---------- ---------- ----------
Alice alice@company.com 52000.0 Accounting
Bob bob@company.com 40000.0 Accounting 919 555-11
Carol carol@company.com 30000.0 Sales 919 555-22
Dave dave@company.com 33000.0 Accounting
Eve eve@company.com 44000.0 Sales 919 555-33
Frank frank@comany.com 37000.0 Sales 919 555-44
sqlite> SELECT * FROM employees LEFT JOIN phone ON employees.name = phone.name;
name email salary dept name phone
---------- ----------------- ---------- ---------- ---------- ----------
Alice alice@company.com 52000.0 Accounting
Bob bob@company.com 40000.0 Accounting Bob 919 555-11
Carol carol@company.com 30000.0 Sales Carol 919 555-22
Dave dave@company.com 33000.0 Accounting
Eve eve@company.com 44000.0 Sales Eve 919 555-33
Frank frank@comany.com 37000.0 Sales Frank 919 555-44
Note that SQLite does not support directly support an OUTER JOIN
or a RIGHT JOIN
.
sqlite> CREATE INDEX index_name ON employees (name);
sqlite> .indices
index_name
sqlite> CREATE INDEX index_name_email ON employees (name,email);
sqlite> .indices
index_name
index_name_email
We can nest tables within tables for the purpose of queries.
SELECT * FROM (SELECT * FROM employees NATURAL LEFT JOIN phone) WHERE phone IS NULL;
name email salary dept phone
---------- ----------------- ---------- ---------- ----------
Alice alice@company.com 52000.0 Accounting
Dave dave@company.com 33000.0 Accounting
sqlite> SELECT * FROM (SELECT * FROM employees NATURAL LEFT JOIN phone) WHERE phone IS NOT NULL;
name email salary dept phone
---------- --------------- ---------- ---------- ------------
Bob bob@company.com 40000.0 Accounting 919 555-1111
Carol carol@company.c 30000.0 Sales 919 555-2222
Eve eve@company.com 44000.0 Sales 919 555-3333
Frank frank@comany.co 37000.0 Sales 919 555-4444
Lets try to create a table that has a new column - abv_avg
which contains how much more (or less) than the average, for their department, each person is paid.
Hint - This will require joining a subquery.
employees.sqlite
is available in /data/Sta523
on saxon.