---
title: "Spark & sparklyr part I"
subtitle: "Statistical Computing & Programming"
author: "Shawn Santo"
institute: ""
date: "06-19-20"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
editor_options:
  chunk_output_type: console
---
```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
                      comment = "#>", highlight = TRUE,
                      fig.align = "center")
library(sparklyr)
library(tidyverse)
```
## Supplementary materials
Companion videos
- [Introduction to Spark](https://warpwire.duke.edu/w/xdkDAA/)
- [Getting started with `sparklyr`](https://warpwire.duke.edu/w/ydkDAA/)
- [Fixing variable names](https://warpwire.duke.edu/w/y9kDAA/)
- [Summaries and analysis](https://warpwire.duke.edu/w/zdkDAA/)
Additional resources
- [`sparklyr`: R interface for Apache Spark](https://spark.rstudio.com/)
- [R Front End for Apache Spark](http://spark.apache.org/docs/latest/api/R/index.html)
---
class: inverse, center, middle
# Spark
---
## What is Apache Spark?
- As described by Databricks, "Spark is a unified computing engine and a set
of libraries for parallel data processing on computing clusters".
- Spark's goal is to support data analytics tasks within a single ecosystem:
data loading, SQL queries, machine learning, and streaming computations.
- Spark is written in Scala and runs on the Java Virtual Machine (JVM).
  However, Spark can be used from R, Python, SQL, Scala, or Java.
---
## Spark ecosystem
![](images/spark-ecosystem.png)
---
## Key features
- In-memory computation
- Fast and scalable
  - efficiently scale up from one to many thousands of compute nodes
- Access data on a multitude of platforms
  - SQL and NoSQL databases
  - Cloud storage
  - Hadoop Distributed File System
- Real-time stream processing
- Libraries
  - Spark SQL
  - MLlib
  - Spark Streaming
  - GraphX
---
class: inverse, center, middle
# Installing and connecting to Spark
---
## Install
We'll be able to install Spark and set up a connection through the helper
functions in package `sparklyr`. More on this in a moment.
```{r}
library(sparklyr)
```
```{r}
sparklyr::spark_available_versions()
```
--
Let's install version 2.4 of Spark for use with a local Spark connection
via `spark_install(version = "2.4")`, as shown below.
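
The installation only needs to happen once; a minimal call looks like this
(the download may take a few minutes).

```{r eval=FALSE}
# one-time local installation of Spark 2.4
spark_install(version = "2.4")
```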
---
## Configure and connect
```{r eval=FALSE}
# add some custom configurations
conf <- list(
  sparklyr.cores.local = 4,
  `sparklyr.shell.driver-memory` = "16G",
  spark.memory.fraction = 0.5
)
```
- `sparklyr.cores.local` - defaults to using all of the available cores
- `sparklyr.shell.driver-memory` - limit is the amount of RAM available in the
  computer minus what would be needed for OS operations
- `spark.memory.fraction` - default is set to 60% of the requested memory
  per executor
```{r echo=FALSE}
Sys.setenv(JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home")
```
```{r eval=FALSE}
# create a spark connection
sc <- spark_connect(master = "local",
                    version = "2.4.0",
                    config = conf)
```
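
When you are finished working, the connection can be closed with
`spark_disconnect()`.

```{r eval=FALSE}
# close the Spark connection when done
spark_disconnect(sc)
```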
---
class: inverse, center, middle
# Working with `sparklyr`
---
## What is `sparklyr`?
Package `sparklyr` provides an R interface for Spark.
- Use `dplyr` to translate R code into Spark SQL
- Work with Spark's MLlib
- Interact with a stream of data
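
As a minimal sketch of the first point (assuming the local connection `sc`
from the previous slide), copy a small local data frame to Spark and inspect
the SQL that `dplyr` generates:

```{r eval=FALSE}
# copy a small local data frame into Spark
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark")

# dplyr verbs are translated to Spark SQL and run inside Spark
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  show_query()
```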
---
## Adding data
https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Put the January 2009 Yellow Cab data in your working directory or another
location where it can be accessed.
.tiny[
```{r eval=FALSE}
cab <- spark_read_csv(sc, name = "cab",
                      path = "~/.public_html/data/taxi/yellow_tripdata_2009-01.csv")
```
]
Data can also be read into Spark DataFrames with `spark_read_json()`
and `spark_read_parquet()`.
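
For reference, the JSON and Parquet readers follow the same pattern as
`spark_read_csv()`; the paths below are hypothetical.

```{r eval=FALSE}
# hypothetical file locations, same interface as spark_read_csv()
trips_json    <- spark_read_json(sc, name = "trips_json",
                                 path = "data/trips.json")
trips_parquet <- spark_read_parquet(sc, name = "trips_parquet",
                                    path = "data/trips_parquet")
```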
Go to http://www2.stat.duke.edu/~sms185/data/taxi/yellow_tripdata_2009-01.csv to
download the data.
---
.tiny[
```{r eval=FALSE}
glimpse(cab)
```
```{r eval=FALSE}
Observations: ??
Variables: 18
Database: spark_connection
$ vendor_name "VTS", "VTS", "VTS", "DDS", "DDS", "DDS", "DDS", "VTS"…
$ Trip_Pickup_DateTime 2009-01-04 07:52:00, 2009-01-04 08:31:00, 2009-01-03 …
$ Trip_Dropoff_DateTime 2009-01-04 08:02:00, 2009-01-04 08:38:00, 2009-01-03 …
$ Passenger_Count 1, 3, 5, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, …
$ Trip_Distance 2.63, 4.55, 10.35, 5.00, 0.40, 1.20, 0.40, 1.72, 1.60,…
$ Start_Lon -73.99196, -73.98210, -74.00259, -73.97427, -74.00158,…
$ Start_Lat 40.72157, 40.73629, 40.73975, 40.79095, 40.71938, 40.7…
$ Rate_Code NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ store_and_forward NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ End_Lon -73.99380, -73.95585, -73.86998, -73.99656, -74.00838,…
$ End_Lat 40.69592, 40.76803, 40.77023, 40.73185, 40.72035, 40.7…
$ Payment_Type "CASH", "Credit", "Credit", "CREDIT", "CASH", "CASH", …
$ Fare_Amt 8.9, 12.1, 23.7, 14.9, 3.7, 6.1, 5.7, 6.1, 8.7, 5.9, 2…
$ surcharge 0.5, 0.5, 0.0, 0.5, 0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0,…
$ mta_tax NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Tip_Amt 0.00, 2.00, 4.74, 3.05, 0.00, 0.00, 1.00, 0.00, 1.30, …
$ Tolls_Amt 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Total_Amt 9.40, 14.60, 28.44, 18.45, 3.70, 6.60, 6.70, 6.60, 10.…
```
]
It looks like we have some variable naming inconsistencies we should clean
up.
---
## Try some cleaning
Clean the names up with `janitor::clean_names()`.
```{r eval=FALSE}
janitor::clean_names(cab)
```
```{r eval=FALSE}
Error in clean_names.default(cab) :
clean_names() must be called on a data.frame.
Consider janitor::make_clean_names() for other
cases of manipulating vectors of names.
```
What's happening?
--
```{r eval=FALSE}
names(cab)
[1] "src" "ops"
```
Object `cab` is not a local data frame. It is a remote table reference: a list
containing the connection (`src`) and the accumulated query operations (`ops`).
What can we do?
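
One option, hinted at by the error message, is to build cleaned names from the
existing ones (a sketch; assumes `janitor` is installed):

```{r eval=FALSE}
# generate lower, snake_case names from the current column names
janitor::make_clean_names(colnames(cab))
```

The next slides take a similar, more targeted approach.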
---
## Fix names
Function `colnames()` seems to work.
```{r eval=FALSE}
colnames(cab)
```
```{r eval=FALSE}
[1] "vendor_name" "Trip_Pickup_DateTime" "Trip_Dropoff_DateTime" "Passenger_Count"
[5] "Trip_Distance" "Start_Lon" "Start_Lat" "Rate_Code"
[9] "store_and_forward" "End_Lon" "End_Lat" "Payment_Type"
[13] "Fare_Amt" "surcharge" "mta_tax" "Tip_Amt"
[17] "Tolls_Amt" "Total_Amt"
```
--
Create a function, `fix_names()`, that gives our `cab` object a more uniform
naming structure.
```{r}
fix_names <- function(x) {
  colnames(x) %>%                               # current column names
    tolower() %>%                               # make them all lower case
    stringr::str_remove(pattern = "trip_") %>%  # drop the "trip_" prefix
    setNames(x, .)                              # set the cleaned names on x
}
```
---
```{r eval=FALSE}
cab <- fix_names(cab)
```
--
```{r eval=FALSE}
colnames(cab)
```
```{r eval=FALSE}
[1] "vendor_name" "pickup_datetime" "dropoff_datetime"
[4] "passenger_count" "distance" "start_lon"
[7] "start_lat" "rate_code" "store_and_forward"
[10] "end_lon" "end_lat" "payment_type"
[13] "fare_amt" "surcharge" "mta_tax"
[16] "tip_amt" "tolls_amt" "total_amt"
```
---
## Fix `payment_type`
.tiny[
```{r eval=FALSE}
cab %>% group_by(payment_type) %>% summarise(count = n())
```
```{r eval=FALSE}
# Source: spark<?> [?? x 2] #<<
payment_type count
1 No Charge 40118
2 CASH 6024471
3 Credit 2865982
4 Cash 4995101
5 Dispute 8050
6 CREDIT 158691
```
]
--
.tiny[
```{r eval=FALSE}
cab %>%
mutate(payment_type = tolower(payment_type)) %>%
show_query() #<<
```
```{r eval=FALSE}
SELECT `vendor_name`, `pickup_datetime`, `dropoff_datetime`,
`passenger_count`, `distance`, `start_lon`, `start_lat`, `rate_code`,
`store_and_forward`, `end_lon`, `end_lat`, LOWER(`payment_type`) AS
`payment_type`, `fare_amt`, `surcharge`, `mta_tax`, `tip_amt`,
`tolls_amt`, `total_amt`
FROM `sparklyr_tmp_614420ac929`
```
]
---
Transform all payment types to lower case and register the result as another
Spark DataFrame.
```{r eval=FALSE}
cab %>%
mutate(payment_type = tolower(payment_type)) %>%
sdf_register("cab_clean_pymt") #<<
```
```{r eval=FALSE}
cab_clean_pymt <- sdf_load_table(sc, "cab_clean_pymt")
```
```{r eval=FALSE}
src_tbls(sc)
```
```{r eval=FALSE}
[1] "cab" "cab_clean_pymt"
```
--
Using the new Spark DataFrame, bring everything back to R with `collect()`.
```{r eval=FALSE}
cab_clean_pymt %>%
group_by(payment_type) %>%
summarise(count = n()) %>%
collect() #<<
```
```{r eval=FALSE}
# A tibble: 4 x 2
payment_type count
1 cash 11019572
2 no charge 40118
3 dispute 8050
4 credit 3024673
```
---
## Summaries and analysis
Let's compute some summary information about taxi trips.
```{r eval=FALSE}
cab_clean_pymt %>%
  select(passenger_count, distance, tip_amt, fare_amt, total_amt) %>%
  mutate(
    cost_per_passenger = total_amt / passenger_count,
    tip_pct = tip_amt / fare_amt,
    cost_per_mile = fare_amt / distance
  )
```
--
.tiny[
```{r eval=FALSE}
# Source: spark<?> [?? x 8]
passenger_count distance tip_amt fare_amt total_amt
1 1 2.63 0 8.9 9.4
2 3 4.55 2 12.1 14.6
3 5 10.4 4.74 23.7 28.4
4 1 5 3.05 14.9 18.4
5 1 0.4 0 3.7 3.7
6 2 1.2 0 6.1 6.6
7 1 0.4 1 5.7 6.7
8 1 1.72 0 6.1 6.6
9 1 1.6 1.3 8.7 10
10 1 0.7 0 5.9 5.9
# … with more rows, and 3 more variables: cost_per_passenger <dbl>,
#   tip_pct <dbl>, cost_per_mile <dbl>
```
]
---
What query is being sent to Spark?
```{r eval=FALSE}
cab_clean_pymt %>%
  select(passenger_count, distance, tip_amt, fare_amt, total_amt) %>%
  mutate(
    cost_per_passenger = total_amt / passenger_count,
    tip_pct = tip_amt / fare_amt,
    cost_per_mile = fare_amt / distance
  ) %>%
  show_query() #<<
```
--
```{r eval=FALSE}
SELECT `passenger_count`, `distance`, `tip_amt`, `fare_amt`,
`total_amt`,
`total_amt` / `passenger_count` AS `cost_per_passenger`,
`tip_amt` / `fare_amt` AS `tip_pct`,
`fare_amt` / `distance` AS `cost_per_mile`
FROM `sparklyr_tmp_a2a7a060b11`
```
---
Next, let's parse `pickup_datetime` to get the hour, the day (as a name), and
the month (as a name). We'll also include the tip percentage and the fare cost
per mile. Lastly, we'll compute some summary measures.
See http://spark.apache.org/docs/latest/api/R/index.html for how to
use `date_format()`.
```{r eval=FALSE}
cab_summary <- cab_clean_pymt %>%
  select(pickup_datetime, dropoff_datetime, distance,
         fare_amt, tip_amt, total_amt) %>%
  mutate(
    pickup_hour = hour(pickup_datetime),
    pickup_day = date_format(pickup_datetime, "EEE"),
    pickup_month = date_format(pickup_datetime, "MMM"),
    tip_pct = tip_amt / fare_amt,
    fare_per_mile = fare_amt / distance
  ) %>%
  group_by(pickup_hour, pickup_day) %>%
  summarise(
    avg_dist = mean(distance),
    avg_fare = mean(fare_amt),
    avg_tip_pct = mean(tip_pct),
    avg_fare_per_mile = mean(fare_per_mile)
  ) %>%
  collect()
```
---
```{r eval=FALSE}
cab_summary
```
Note that this is now a local R object (a tibble), since we used `collect()`.
.tiny[
```{r eval=FALSE}
# A tibble: 168 x 6 #<<
pickup_hour pickup_day avg_dist avg_fare avg_tip_pct avg_fare_per_mile
1 1 Thu 3.08 10.5 0.0448 5.33
2 7 Sun 3.68 11.6 0.0401 5.32
3 8 Sat 2.82 9.37 0.0377 5.00
4 6 Tue 2.85 9.33 0.0375 5.29
5 11 Fri 2.33 9.23 0.0400 6.25
6 0 Wed 3.16 10.4 0.0530 5.38
7 6 Mon 3.12 9.95 0.0377 5.16
8 17 Mon 2.41 9.36 0.0440 5.91
9 16 Thu 2.34 9.85 0.0408 13.0
10 10 Wed 2.13 8.98 0.0457 6.49
# … with 158 more rows
```
]
---
We can now use `cab_summary` just as we would any other object in R.
```{r eval=FALSE}
cab_summary %>%
  ggplot(aes(x = pickup_hour, y = avg_tip_pct)) +
  geom_line() +
  facet_wrap(~factor(pickup_day, levels = c("Sun", "Mon", "Tue", "Wed",
                                            "Thu", "Fri", "Sat"))) +
  labs(x = "Pick-up hour", y = "Average tip percentage") +
  theme_bw()
```
---
class: center, middle
![](images/taxi_plot.png)
---
## Other functions
- Perform joins with the `*_join()` family of functions.
- Sampling can be done with `sample_n()` and `sample_frac()`.
- Write the results of your analysis into persistent storage with
`spark_write_parquet()`, `spark_write_csv()`, or `spark_write_json()`.
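
For instance, a small sketch combining sampling and writing (this assumes the
connection and the `cab_clean_pymt` Spark DataFrame from earlier; the output
path is hypothetical):

```{r eval=FALSE}
# sample 1% of the trips in Spark, then write the result to disk as CSV
cab_clean_pymt %>%
  sample_frac(0.01) %>%
  spark_write_csv(path = "yellow_tripdata_sample_csv")
```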
---
## Family of `sparklyr` functions
| Sparklyr family of functions | Description                                                                                 |
|-----------------------------:|:--------------------------------------------------------------------------------------------|
| `spark_*()`                  | functions to manage and configure Spark connections; functions to read and write data        |
| `sdf_*()`                    | functions for manipulating Spark DataFrames                                                   |
| `ft_*()`                     | feature transformers for manipulating individual features                                    |
| `ml_*()`                     | machine learning algorithms: K-means, GLM, survival regression, PCA, naive Bayes, and more   |
| `stream_*()`                 | functions for handling stream data                                                           |
---
## Exercise
Navigate to the [BTS Reporting Carrier On-Time Performance (1987-present)](https://www.transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data) and click on "Download" under Reporting Carrier On-Time Performance
(1987-present).
1. Download the 2001 airline data. Choose a subset of variables as detailed
   [here](http://stat-computing.org/dataexpo/2009/the-data.html).
2. Read it into a Spark DataFrame.
3. Drop columns that have all `NA` values, add names for month and day of
week, and try to fix the plane tail number.
4. Summarize the data based on on-time performance for flights before, on,
   and after September 11.
---
## References
- https://spark.rstudio.com/
- http://spark.apache.org/docs/latest/api/R/index.html
- http://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/databricks/spark-intro.pdf