---
title: databases & sql
author: "Colin Rundel"
date: "2018-11-12"
output:
xaringan::moon_reader:
css: "slides.css"
lib_dir: libs
nature:
highlightStyle: github
highlightLines: true
countIncrementalSlides: false
---
exclude: true
```{r setup, echo=FALSE, message=FALSE, warning=FALSE, include=FALSE}
options(
htmltools.dir.version = FALSE, # for blogdown
width = 80,
tibble.width = 80
)
knitr::opts_chunk$set(
fig.align = "center"
)
htmltools::tagList(rmarkdown::html_dependency_font_awesome())
library(dplyr)
```
---
class: middle
count: false
# The why of databases
---
## Numbers everyone should know
| Task | Timing |
|-------------------------------------|-------------------|
| L1 cache reference | 0.5 ns |
| L2 cache reference | 7 ns |
| Main memory reference | 100 ns |
| Read 1 MB sequentially from memory | 250,000 ns |
| Disk seek | 10,000,000 ns |
| Read 1 MB sequentially from network | 10,000,000 ns |
| Read 1 MB sequentially from disk | 30,000,000 ns |
.footnote[
From: http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/
]
---
## Implications for bigish data
Lets imagine we have a *10 GB* flat data file and that we want to select certain rows based on a given criteria. This requires a sequential read across the entire data set.
If we can store the file in memory:
* $10~GB \times (250~\mu s / 1~MB) = 0.25$ seconds
If we have to access the file from disk:
* $10~GB \times (30~ms / 1~MB) = 30$ seconds
This is just for *reading* data, if we make any modifications (*writing*) things are much worse.
---
## Blocks
.center[
*Cost*: Disk << Memory
]
.center[
*Speed*: Disk <<< Memory
]
So usually possible to grow our disk storage to accommodate our data. However, memory is usually the limiting resource, and if we can't fit everything into memory?
Create *blocks* - group related data (i.e. rows) and read in multiple rows at a time. Optimal size will depend on the task and the properties of the disk.
---
## Linear vs Binary Search
Even with blocks, any kind of subsetting of rows requires a linear search, which requires $\mathcal{O}(N)$ accesses where $N$ is the number of blocks.
We can do much better if we are careful about how we structure our data, specifically sorting some or all of the columns.
* Sorting is expensive, $\mathcal{O}(N \log N)$, but it only needs to be done once.
* After sorting, we can use a binary search for any subsetting tasks ( $\mathcal{O}(\log N)$ ).
* These sorted columns are known as *indexes*.
* Indexes require additional storage, but usually small enough to be kept in memory while blocks stay on disk.
---
## Binary Search Example
| Age | Name |
|------|--------------|
| 19 | Carol |
| 20 | Greg |
| 21 | Alice |
| 21 | Dave |
| 22 | Eve |
| 23 | Bob |
| 23 | Frank |
Lets search for records for people who are 22 or older.
---
## and then?
This is just barely scratching the surface,
* Efficiency gains are not just for disk, access is access
* In general, trade off between storage and efficiency
* Reality is a lot more complicated for everything mentioned so far, lots of very smart people have spent a lot of time thinking about and implementing tools
* Different tasks with different requirements require different implementations and have different criteria for optimization
---
class: middle
count: false
# Databases
---
## SQL
Structures Query Language is a special purpose language for interacting with (querying and modifying) these indexed tabular data structures.
* ANSI Standard but with some dialect divergence
* This functionality maps very closely (but not exactly) with the data manipulation verbs present in dplyr.
* We will see this mapping in more detail in a bit.
---
## DBI
Low level package for interfacing R with Database management systems (DBMS) that provides a common interface to achieve the following functionality:
* connect/disconnect from DB
* create and execute statements in the DB
* extract results/output from statements
* error/exception handling
* information (meta-data) from database objects
* transaction management (optional)
---
## RSQLite
Provides the implementation necessary to use DBI to interface with an SQLite database.
```r
library(RSQLite)
---
## Loading required package: DBI
con = dbConnect(RSQLite::SQLite(), ":memory:")
str(con)
## Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
## ..@ Id :
## ..@ dbname : chr ":memory:"
## ..@ loadable.extensions: logi TRUE
## ..@ flags : int 6
## ..@ vfs : chr ""
```
---
## Example Table
```r
employees = data.frame(name = c("Alice","Bob","Carol","Dave","Eve","Frank"),
email = c("alice@company.com", "bob@company.com",
"carol@company.com", "dave@company.com",
"eve@company.com", "frank@comany.com"),
salary = c(52000, 40000, 30000, 33000, 44000, 37000),
dept = c("Accounting", "Accounting","Sales",
"Accounting","Sales","Sales"),
stringsAsFactors = FALSE)
```
```r
dbWriteTable(con, "employees", employees)
## [1] TRUE
dbListTables(con)
## [1] "employees"
```
---
## Removing Tables
```r
dbWriteTable(con, "employs", employees)
## [1] TRUE
dbListTables(con)
## [1] "employees" "employs"
dbRemoveTable(con,"employs")
## [1] TRUE
dbListTables(con)
## [1] "employees"
```
---
## Querying Tables
```r
res = dbSendQuery(con, "SELECT * FROM employees")
dbFetch(res)
## name email salary dept
## 1 Alice alice@company.com 52000 Accounting
## 2 Bob bob@company.com 40000 Accounting
## 3 Carol carol@company.com 30000 Sales
## 4 Dave dave@company.com 33000 Accounting
## 5 Eve eve@company.com 44000 Sales
## 6 Frank frank@comany.com 37000 Sales
dbClearResult(res)
## [1] TRUE
```
---
## Closing the connection
```r
dbDisconnect(con)
## [1] TRUE
```
---
class: middle
count: false
# SQL Queries
---
## Connecting
```shell
cr173@saxon [2018-11-12-sqlite]$ sqlite3 employees.sqlite
SQLite version 3.20.1 2017-08-24 16:21:36
Enter ".help" for usage hints.
```
---
## Table information
The following is specific to SQLite
```sqlite
sqlite> .tables
employees
```
```sqlite
sqlite> .schema employees
CREATE TABLE `employees` (
`name` TEXT,
`email` TEXT,
`salary` REAL,
`dept` TEXT
);
```
```sqlite
sqlite> .indices employees
```
---
## SELECT Statements
```sqlite
sqlite> SELECT * FROM employees;
Alice|alice@company.com|52000.0|Accounting
Bob|bob@company.com|40000.0|Accounting
Carol|carol@company.com|30000.0|Sales
Dave|dave@company.com|33000.0|Accounting
Eve|eve@company.com|44000.0|Sales
Frank|frank@comany.com|37000.0|Sales
```
---
## Pretty Output
We can make this table output a little nicer with some additonal SQLite options:
```sqlite
sqlite> .mode column
sqlite> .headers on
```
```sqlite
sqlite> SELECT * FROM employees;
name email salary dept
---------- ----------------- ---------- ----------
Alice alice@company.com 52000.0 Accounting
Bob bob@company.com 40000.0 Accounting
Carol carol@company.com 30000.0 Sales
Dave dave@company.com 33000.0 Accounting
Eve eve@company.com 44000.0 Sales
Frank frank@comany.com 37000.0 Sales
```
---
## select using SELECT
We can subset for certain columns (and rename them) using `SELECT`
```sqlite
sqlite> SELECT name AS first_name, salary FROM employees;
first_name salary
---------- ----------
Alice 52000.0
Bob 40000.0
Carol 30000.0
Dave 33000.0
Eve 44000.0
Frank 37000.0
```
---
## arrange using ORDER BY
We can sort our results by adding `ORDER BY` to our `SELECT` statement
```sqlite
sqlite> SELECT name AS first_name, salary FROM employees ORDER BY salary;
first_name salary
---------- ----------
Carol 30000.0
Dave 33000.0
Frank 37000.0
Bob 40000.0
Eve 44000.0
Alice 52000.0
```
---
We can sort in the opposite order by adding `DESC`
```sqlite
SELECT name AS first_name, salary FROM employees ORDER BY salary DESC;
first_name salary
---------- ----------
Alice 52000.0
Eve 44000.0
Bob 40000.0
Frank 37000.0
Dave 33000.0
Carol 30000.0
```
---
## filter via WHERE
We can filter rows by adding `WHERE` to our statements
```sqlite
sqlite> SELECT * FROM employees WHERE salary < 40000;
name email salary dept
---------- ----------------- ---------- ----------
Carol carol@company.com 30000.0 Sales
Dave dave@company.com 33000.0 Accounting
Frank frank@comany.com 37000.0 Sales
sqlite> SELECT * FROM employees WHERE salary < 40000 AND dept = "Sales";
name email salary dept
---------- ----------------- ---------- ----------
Carol carol@company.com 30000.0 Sales
Frank frank@comany.com 37000.0 Sales
```
---
## group_by via GROUP BY
We can create groups for the purpose of summarizing using `GROUP BY`. As with dplyr it is not terribly useful by itself.
```sqlite
sqlite> SELECT * FROM employees GROUP BY dept;
name email salary dept
---------- ---------------- ---------- ----------
Dave dave@company.com 33000.0 Accounting
Frank frank@comany.com 37000.0 Sales
sqlite> SELECT dept, AVG(salary) AS mean_salary, COUNT(*) AS n FROM employees GROUP BY dept;
dept mean_salary n
---------- ---------------- ----------
Accounting 41666.6666666667 3
Sales 37000.0 3
```
---
## head via LIMIT
We can limit the number of rows we get by using `LIMIT`.
```sqlite
sqlite> SELECT * FROM employees LIMIT 3;
name email salary dept
---------- ----------------- ---------- ----------
Alice alice@company.com 52000.0 Accounting
Bob bob@company.com 40000.0 Accounting
Carol carol@company.com 30000.0 Sales
sqlite> SELECT * FROM employees ORDER BY name DESC LIMIT 3;
name email salary dept
---------- ---------------- ---------- ----------
Frank frank@comany.com 37000.0 Sales
Eve eve@company.com 44000.0 Sales
Dave dave@company.com 33000.0 Accounting
```
---
## Import CSV files
```sqlite
sqlite> .mode csv
sqlite> .import phone.csv phone
sqlite> .tables
employees phone
sqlite> .mode column
sqlite> SELECT * FROM phone;
name phone
---------- ------------
Bob 919 555-1111
Carol 919 555-2222
Eve 919 555-3333
Frank 919 555-4444
```
---
## SQL Joins
.center[
```{r echo=FALSE}
knitr::include_graphics("imgs/sql_joins.png")
```
]
---
## Joins - Default
By default SQLite uses a `CROSS JOIN` which is not terribly useful
```sqlite
sqlite> SELECT * FROM employees JOIN phone;
name email salary dept name phone
---------- ----------------- ---------- ---------- ---------- ------------
Alice alice@company.com 52000.0 Accounting Bob 919 555-1111
Alice alice@company.com 52000.0 Accounting Carol 919 555-2222
Alice alice@company.com 52000.0 Accounting Eve 919 555-3333
Alice alice@company.com 52000.0 Accounting Frank 919 555-4444
Bob bob@company.com 40000.0 Accounting Bob 919 555-1111
Bob bob@company.com 40000.0 Accounting Carol 919 555-2222
Bob bob@company.com 40000.0 Accounting Eve 919 555-3333
Bob bob@company.com 40000.0 Accounting Frank 919 555-4444
Carol carol@company.com 30000.0 Sales Bob 919 555-1111
Carol carol@company.com 30000.0 Sales Carol 919 555-2222
Carol carol@company.com 30000.0 Sales Eve 919 555-3333
Carol carol@company.com 30000.0 Sales Frank 919 555-4444
Dave dave@company.com 33000.0 Accounting Bob 919 555-1111
Dave dave@company.com 33000.0 Accounting Carol 919 555-2222
Dave dave@company.com 33000.0 Accounting Eve 919 555-3333
Dave dave@company.com 33000.0 Accounting Frank 919 555-4444
Eve eve@company.com 44000.0 Sales Bob 919 555-1111
Eve eve@company.com 44000.0 Sales Carol 919 555-2222
Eve eve@company.com 44000.0 Sales Eve 919 555-3333
Eve eve@company.com 44000.0 Sales Frank 919 555-4444
Frank frank@comany.com 37000.0 Sales Bob 919 555-1111
Frank frank@comany.com 37000.0 Sales Carol 919 555-2222
Frank frank@comany.com 37000.0 Sales Eve 919 555-3333
Frank frank@comany.com 37000.0 Sales Frank 919 555-4444
```
---
## Inner Join
If you want SQLite to find the columns to merge on automatically then we prefix the join with `NATURAL`.
```sqlite
sqlite> SELECT * FROM employees NATURAL JOIN phone;
name email salary dept phone
---------- --------------- ---------- ---------- ------------
Bob bob@company.com 40000.0 Accounting 919 555-1111
Carol carol@company.c 30000.0 Sales 919 555-2222
Eve eve@company.com 44000.0 Sales 919 555-3333
Frank frank@comany.co 37000.0 Sales 919 555-4444
```
---
## Inner Join - Explicit
```sqlite
sqlite> SELECT * FROM employees JOIN phone ON employees.name = phone.name;
name email salary dept name phone
---------- --------------- ---------- ---------- ---------- ------------
Bob bob@company.com 40000.0 Accounting Bob 919 555-1111
Carol carol@company.c 30000.0 Sales Carol 919 555-2222
Eve eve@company.com 44000.0 Sales Eve 919 555-3333
Frank frank@comany.co 37000.0 Sales Frank 919 555-4444
```
---
## Left Join - Natural
```sqlite
sqlite> SELECT * FROM employees NATURAL LEFT JOIN phone;
name email salary dept phone
---------- ----------------- ---------- ---------- ----------
Alice alice@company.com 52000.0 Accounting
Bob bob@company.com 40000.0 Accounting 919 555-11
Carol carol@company.com 30000.0 Sales 919 555-22
Dave dave@company.com 33000.0 Accounting
Eve eve@company.com 44000.0 Sales 919 555-33
Frank frank@comany.com 37000.0 Sales 919 555-44
```
---
## Left Join - Explicit
```sqlite
sqlite> SELECT * FROM employees LEFT JOIN phone
ON employees.name = phone.name;
name email salary dept name phone
---------- ----------------- ---------- ---------- ---------- ----------
Alice alice@company.com 52000.0 Accounting
Bob bob@company.com 40000.0 Accounting Bob 919 555-11
Carol carol@company.com 30000.0 Sales Carol 919 555-22
Dave dave@company.com 33000.0 Accounting
Eve eve@company.com 44000.0 Sales Eve 919 555-33
Frank frank@comany.com 37000.0 Sales Frank 919 555-44
```
---
## Other Joins
Note that SQLite does not support directly support an `OUTER JOIN` or a `RIGHT JOIN`.
---
## Creating an index
```sqlite
sqlite> CREATE INDEX index_name ON employees (name);
sqlite> .indices
index_name
sqlite> CREATE INDEX index_name_email ON employees (name,email);
sqlite> .indices
index_name
index_name_email
```
---
## Subqueries
We can nest tables within tables for the purpose of queries.
```sqlite
SELECT * FROM (SELECT * FROM employees NATURAL LEFT JOIN phone) WHERE phone IS NULL;
name email salary dept phone
---------- ----------------- ---------- ---------- ----------
Alice alice@company.com 52000.0 Accounting
Dave dave@company.com 33000.0 Accounting
```
```sqlite
sqlite> SELECT * FROM (SELECT * FROM employees NATURAL LEFT JOIN phone) WHERE phone IS NOT NULL;
name email salary dept phone
---------- --------------- ---------- ---------- ------------
Bob bob@company.com 40000.0 Accounting 919 555-1111
Carol carol@company.c 30000.0 Sales 919 555-2222
Eve eve@company.com 44000.0 Sales 919 555-3333
Frank frank@comany.co 37000.0 Sales 919 555-4444
```
---
## Excercise - Stupid SQL Tricks
Lets try to create a table that has a new column - `abv_avg` which contains how much more (or less) than the average, for their department, each person is paid.
Hint - This will require joining a subquery.
`employees.sqlite` is available in the exercises repo.