---
title: "Spark & sparklyr part II"
subtitle: "Statistical Computing & Programming"
author: "Shawn Santo"
institute: ""
date: "06-22-20"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
editor_options:
  chunk_output_type: console
---
```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
                      comment = "#>", highlight = TRUE,
                      fig.align = "center")
library(sparklyr)
library(tidyverse)
```
## Supplementary materials
Companion videos
- [Data manipulation and partition](https://warpwire.duke.edu/w/S9sDAA/)
- [ML models](https://warpwire.duke.edu/w/TdsDAA/)
- [ML pipelines](https://warpwire.duke.edu/w/T9sDAA/)
Additional resources
- [A Gentle Introduction to Apache Spark](http://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/databricks/spark-intro.pdf)
---
class: inverse, center, middle
# Recall
---
## What is Apache Spark?
- As described by Databricks, "Spark is a unified computing engine and a set
  of libraries for parallel data processing on computing clusters".
- Spark's goal is to support data analytics tasks within a single ecosystem:
  data loading, SQL queries, machine learning, and streaming computations.
- Spark is written in Scala and runs on the Java Virtual Machine (JVM). However,
  Spark can be used from R, Python, SQL, Scala, or Java.
---
## Key features
- In-memory computation
- Fast and scalable
  - efficiently scale up from one to many thousands of compute nodes
- Access data on a multitude of platforms
  - SQL and NoSQL databases
  - Cloud storage
  - Hadoop Distributed File System
- Real-time stream processing
- Libraries
  - Spark SQL
  - MLlib
  - Spark Streaming
  - GraphX
---
## Install
We'll be able to install Spark and set up a connection through the helper
functions in package `sparklyr`. More on this in a moment.
```{r}
library(sparklyr)
```
```{r}
sparklyr::spark_available_versions()
```
--
Let's install version 2.4 of Spark for use with a local Spark connection
via `spark_install(version = "2.4")`.
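A minimal sketch of the install step (the second call simply confirms what is
now installed locally):
```{r eval=FALSE}
# install Spark 2.4 for local use (a one-time step)
spark_install(version = "2.4")

# list the locally installed Spark versions
spark_installed_versions()
```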
---
## Configure and connect
```{r eval=FALSE}
# add some custom configurations
conf <- list(
  sparklyr.cores.local = 4,
  `sparklyr.shell.driver-memory` = "16G",
  spark.memory.fraction = 0.5
)
```
- `sparklyr.cores.local` - defaults to using all of the available cores
- `sparklyr.shell.driver-memory` - limit is the amount of RAM available in the
  computer minus what would be needed for OS operations
- `spark.memory.fraction` - default is set to 60% of the requested memory
  per executor
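As an alternative to building the list by hand, `spark_config()` returns Spark's
default configuration as a list you can inspect and extend (a minimal sketch;
the object name `conf_default` is just illustrative):
```{r eval=FALSE}
# start from the defaults and override only what you need
conf_default <- spark_config()
conf_default$`sparklyr.shell.driver-memory` <- "16G"
conf_default$spark.memory.fraction <- 0.5
```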
```{r echo=FALSE}
Sys.setenv(JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home")
```
```{r eval=FALSE}
# create a spark connection
sc <- spark_connect(master = "local",
                    version = "2.4.0",
                    config = conf)
```
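Once connected, a quick sanity check, and a clean way to end the session when
you are done (a minimal sketch):
```{r eval=FALSE}
# confirm the connection and the Spark version it is running
spark_version(sc)

# close the connection when finished
spark_disconnect(sc)
```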
---
## What is `sparklyr`?
Package `sparklyr` provides an R interface for Spark.
- Use `dplyr` to translate R code into Spark SQL (see the sketch below)
- Work with Spark's MLlib
- Interact with a stream of data
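For example, `dplyr` verbs applied to a Spark table are translated into Spark
SQL and executed by Spark, not in R. A minimal sketch, assuming a Spark table
such as the `bike_2017` table created later in these slides:
```{r eval=FALSE}
# dplyr verbs on a tbl_spark build a Spark SQL query lazily;
# show_query() prints the generated SQL rather than collecting results
bike_2017 %>%
  filter(Duration > 600) %>%
  count(Member_type) %>%
  show_query()
```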
---
## Family of `sparklyr` functions
| Sparklyr family of functions | Description |
|-----------------------------:|:------------|
| `spark_*()` | functions to manage and configure Spark connections; functions to read and write data |
| `sdf_*()` | functions for manipulating Spark DataFrames |
| `ft_*()` | feature transformers for manipulating individual features |
| `ml_*()` | machine learning algorithms - K-Means, GLM, Survival Regression, PCA, Naive-Bayes, and more |
| `stream_*()` | functions for handling stream data |
---
class: inverse, center, middle
# Machine learning
---
## Logistic regression
Our goal will be to classify member type based on some of the predictors
in the 2017 Capital Bikeshare data. First, let's create a Spark table object.
```{r eval=FALSE}
bike_2017 <- spark_read_csv(sc, name = "cbs_bike_2017",
                            path = "~/.public_html/data/bike/cbs_2017.csv")
```
--
```{r eval=FALSE}
> glimpse(bike_2017)
Rows: ??
Columns: 9
Database: spark_connection
$ Duration 221, 1676, 1356, 1327, 1636, 1603, 473, 200, 748, 912,…
$ Start_date 2017-01-01 00:00:41, 2017-01-01 00:06:53, 2017-01-01 …
$ End_date 2017-01-01 00:04:23, 2017-01-01 00:34:49, 2017-01-01 …
$ Start_station_number 31634, 31258, 31289, 31289, 31258, 31258, 31611, 31104…
$ Start_station "3rd & Tingey St SE", "Lincoln Memorial", "Henry Bacon…
$ End_station_number 31208, 31270, 31222, 31222, 31270, 31270, 31616, 31121…
$ End_station "M St & New Jersey Ave SE", "8th & D St NW", "New York…
$ Bike_number "W00869", "W00894", "W21945", "W20012", "W22786", "W20…
$ Member_type "Member", "Casual", "Casual", "Casual", "Casual", "Cas…
```
Go to http://www2.stat.duke.edu/~sms185/data/bike/cbs_2017.csv to download the
data.
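If you download the file to your own machine, the same call works with a local
path (a sketch; the path shown is hypothetical, so adjust it to wherever you
saved the file):
```{r eval=FALSE}
# read a local copy of the data instead
bike_2017 <- spark_read_csv(sc, name = "cbs_bike_2017",
                            path = "cbs_2017.csv")
```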
---
## Data partitions
Create training and test data.
```{r eval=FALSE}
partitions <- bike_2017 %>%
  mutate(Member_type = as.numeric(Member_type == "Member")) %>%
  group_by(Member_type) %>%
  sdf_random_split(train = .8, test = .2, seed = 4718) #<<
```
`sdf_random_split()` returns an R list of `tbl_spark` objects.
--
**Where is `partitions` in the Connections tab?**
--
It exists, but if you want it to appear as a table available to Spark SQL, you
need to register it with `sdf_register()`.
---
```{r eval=FALSE}
bike_train <- partitions$train
bike_test <- partitions$test
```
--
Register these (not strictly necessary, but it makes them visible to Spark SQL).
```{r eval=FALSE}
sdf_register(bike_train, name = "train")
sdf_register(bike_test, name = "test")
```
You should now see `train` and `test` in your connections tab.
---
Verify that our partitions are properly stratified.
```{r eval=FALSE}
bike_train %>%
count(Member_type) %>%
mutate(prop = n / sum(n, na.rm = T))
```
```{r eval=FALSE}
# Source: spark> [?? x 3]
Member_type n prop
1 0 786021 0.261
2 1 2220182 0.739
```
```{r eval=FALSE}
bike_test %>%
count(Member_type) %>%
mutate(prop = n / sum(n, na.rm = T))
```
```{r eval=FALSE}
# Source: spark> [?? x 3]
Member_type n prop
1 0 195777 0.260
2 1 555797 0.740
```
---
## Model fit
```{r eval=FALSE}
bike_logistic <- bike_train %>%
  ml_logistic_regression(formula = Member_type ~ Duration) #<<
```
```{r eval=FALSE}
Formula: Member_type ~ Duration
Coefficients:
(Intercept) Duration
2.496742551 -0.001325904
```
---
## Model predictions
```{r eval=FALSE}
logistic_pred <- ml_predict(bike_logistic, bike_test)
logistic_pred
```
```{r eval=FALSE}
# Source: spark [?? x 7]
features label rawPrediction probability prediction probability_0 probability_1
1 1 1 0.0820 0.918
2 1 1 0.0820 0.918
3 1 1 0.0820 0.918
4 0 1 0.0820 0.918
5 1 1 0.0820 0.918
6 1 1 0.0820 0.918
7 0 1 0.0820 0.918
8 1 1 0.0820 0.918
9 1 1 0.0820 0.918
10 1 1 0.0820 0.918
# … with more rows
```
---
## Model evaluation
To compute the AUROC
```{r eval=FALSE}
ml_binary_classification_evaluator(logistic_pred)
```
```{r eval=FALSE}
[1] 0.8264824
```
--
To compute the confusion matrix
```{r eval=FALSE}
table(pull(logistic_pred, label), pull(logistic_pred, prediction))
```
```{r eval=FALSE}
0 1
0 68991 126786
1 14492 541305
```
---
## Model evaluation metrics
For the `ml_*_evaluator()` functions, the following metrics are supported
(a usage sketch follows the list):
- Binary classification: areaUnderROC (default) or areaUnderPR
  (not available in Spark 2.X)
- Multiclass classification: f1 (default), precision, recall, weightedPrecision,
  weightedRecall, or accuracy; for Spark 2.X: f1 (default), weightedPrecision,
  weightedRecall, or accuracy
- Regression: rmse (root mean squared error, default),
  mse (mean squared error), r2, or mae (mean absolute error)
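Each evaluator exposes the metric through its `metric_name` argument;
`ml_multiclass_classification_evaluator()` and `ml_regression_evaluator()` work
the same way. A minimal sketch using the predictions from the previous slides:
```{r eval=FALSE}
# request a specific metric by name (areaUnderROC is the default)
ml_binary_classification_evaluator(logistic_pred,
                                   metric_name = "areaUnderROC")
```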
---
## Random forest
```{r eval=FALSE}
rf_model <- bike_train %>%
  ml_random_forest(Member_type ~ Duration,
                   type = "classification", seed = 402081)
```
--
```{r eval=FALSE}
rf_pred <- ml_predict(rf_model, bike_test)
ml_binary_classification_evaluator(rf_pred)
```
```{r eval=FALSE}
[1] 0.7450385
```
--
```{r eval=FALSE}
table(pull(rf_pred, label), pull(rf_pred, prediction))
```
```{r eval=FALSE}
0 1
0 90965 104812
1 31504 524293
```
---
## SVM
```{r eval=FALSE}
svm_model <- bike_train %>%
  ml_linear_svc(Member_type ~ Duration)
```
--
```{r eval=FALSE}
svm_pred <- ml_predict(svm_model, bike_test)
ml_binary_classification_evaluator(svm_pred)
```
```{r eval=FALSE}
[1] 0.8264824
```
--
```{r eval=FALSE}
table(pull(svm_pred, label), pull(svm_pred, prediction))
```
```{r eval=FALSE}
0 1
0 21914 173863
1 1710 554087
```
---
## ML classification summary
| **Method** | **Area Under ROC** |
|------------------------------:|--------------------|
| Logistic regression | 0.826 |
| Random forest classification | 0.745 |
| Linear support vector machine | 0.826 |
Which simple model should we go with?
---
## Exercise
Using a small dataset (`mtcars`), explore `ml_logistic_regression()` and
`ml_generalized_linear_regression()` with family "binomial".
Fit some models with various continuous
and categorical predictors. Do you get the same result?
Also, compare `ml_linear_regression()` and `ml_generalized_linear_regression()`
with the default family gaussian.
```{r eval=FALSE}
cars <- sdf_copy_to(sc, mtcars, "cars")
```
---
class: inverse, center, middle
# ML Pipelines
---
## What is an `ml_pipeline`?
Spark’s ML Pipelines provide a way to easily combine multiple transformations
and algorithms into a single workflow, or pipeline.
Some Spark terminology:
- **Transformer**: A Transformer is an algorithm which can transform one
DataFrame into another DataFrame.
- **Estimator**: An Estimator is an algorithm which can be fit on a DataFrame
to produce a Transformer.
- **Pipeline**: A Pipeline chains multiple Transformers and Estimators
together to specify a machine learning workflow.
---
## ML pipelines for reproducibility
Let's create an ML pipeline to classify whether flights departing from NC
airports in February 2020 were delayed.
```{r eval=FALSE}
flights <- spark_read_csv(sc, name = "nc_flights_feb_2020",
                          path = "~/.public_html/data/flights/nc_flights_feb_20.csv")
```
Go to http://www2.stat.duke.edu/~sms185/data/flights/nc_flights_feb_20.csv to
download the
data.
---
```{r eval=FALSE}
df <- flights %>%
  mutate(DEP_DELAY = as.numeric(DEP_DELAY),
         ARR_DELAY = as.numeric(ARR_DELAY),
         MONTH = as.character(MONTH),
         DAY_OF_WEEK = as.character(DAY_OF_WEEK)
  ) %>%
  filter(!is.na(DEP_DELAY)) %>%
  select(DEP_DELAY, CRS_DEP_TIME, MONTH, DAY_OF_WEEK, DISTANCE)
df
```
.tiny[
```{r eval=FALSE}
# Source: spark> [?? x 5]
DEP_DELAY CRS_DEP_TIME MONTH DAY_OF_WEEK DISTANCE
1 -7 830 2 5 365
2 7 835 2 6 365
3 -8 830 2 7 365
4 -10 830 2 1 365
5 -3 830 2 2 365
6 -12 830 2 3 365
7 -4 830 2 4 365
8 -9 830 2 5 365
9 -8 835 2 6 365
10 -10 830 2 7 365
# … with more rows
```
]
---
```{r eval=FALSE}
flights_pipe <- ml_pipeline(sc) %>% #<<
  ft_dplyr_transformer(tbl = df) %>% #<<
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression()
```
```{r eval=FALSE}
Pipeline (Estimator) with 5 stages
Stages
|--1 SQLTransformer (Transformer) #<<
|
| (Parameters -- Column Names)
```
--
`ft_dplyr_transformer()` extracts the dplyr transformations used to
generate object `tbl` as a SQL statement and then passes it on
to `ft_sql_transformer()`. The result is an `ml_pipeline` object.
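To preview the SQL that the SQLTransformer stage will capture, you can render
the dplyr query directly (a sketch):
```{r eval=FALSE}
# show the Spark SQL generated from the dplyr transformations in df
df %>% show_query()
```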
---
```{r eval=FALSE}
flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY", #<<
               output_col = "DELAYED", #<<
               threshold = 15) %>% #<<
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression()
```
```{r eval=FALSE}
|--2 Binarizer (Transformer) #<<
|
| (Parameters -- Column Names)
| input_col: DEP_DELAY
| output_col: DELAYED
```
---
```{r eval=FALSE}
flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME", #<<
                output_col = "HOURS", #<<
                splits = seq(0, 2400, 400)) %>% #<<
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression()
```
```{r eval=FALSE}
|--3 Bucketizer (Transformer) #<<
|
| (Parameters -- Column Names)
| input_col: CRS_DEP_TIME
| output_col: HOURS
```
---
```{r eval=FALSE}
flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>% #<<
  ml_logistic_regression()
```
```{r eval=FALSE}
|--4 RFormula (Estimator) #<<
|
| (Parameters -- Column Names)
| features_col: features
| label_col: label
| (Parameters)
| force_index_label: FALSE
| formula: DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE
| handle_invalid: error
| stringIndexerOrderType: frequencyDesc
```
---
```{r eval=FALSE}
flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression() #<<
```
```{r eval=FALSE}
|--5 LogisticRegression (Estimator) #<<
|
| (Parameters -- Column Names)
| features_col: features
| label_col: label
| prediction_col: prediction
| probability_col: probability
| raw_prediction_col: rawPrediction
| (Parameters)
| aggregation_depth: 2
| elastic_net_param: 0
| family: auto
| fit_intercept: TRUE
| max_iter: 100
| reg_param: 0
| standardization: TRUE
| threshold: 0.5
| tol: 1e-06
```
---
## Full pipeline
.pull-left[
.tiny[
```{r eval=FALSE}
Pipeline (Estimator) with 5 stages
Stages
|--1 SQLTransformer (Transformer)
|
| (Parameters -- Column Names)
|--2 Binarizer (Transformer)
|
| (Parameters -- Column Names)
| input_col: DEP_DELAY
| output_col: DELAYED
|--3 Bucketizer (Transformer)
|
| (Parameters -- Column Names)
| input_col: CRS_DEP_TIME
| output_col: HOURS
|--4 RFormula (Estimator)
|
| (Parameters -- Column Names)
| features_col: features
| label_col: label
| (Parameters)
| force_index_label: FALSE
| formula: DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE
| handle_invalid: error
| stringIndexerOrderType: frequencyDesc
```
]
]
.pull-right[
.tiny[
```{r eval=FALSE}
|--5 LogisticRegression (Estimator)
|
| (Parameters -- Column Names)
| features_col: features
| label_col: label
| prediction_col: prediction
| probability_col: probability
| raw_prediction_col: rawPrediction
| (Parameters)
| aggregation_depth: 2
| elastic_net_param: 0
| family: auto
| fit_intercept: TRUE
| max_iter: 100
| reg_param: 0
| standardization: TRUE
| threshold: 0.5
| tol: 1e-06
```
]
]
---
## What can we do with this pipeline?
1. Easily fit the pipeline to data with `ml_fit()`.
2. Make predictions with a fitted pipeline and `ml_transform()`.
3. Save pipelines to disk with `ml_save()`; saved pipelines can be loaded back
into `sparklyr` with `ml_load()` or read by the Scala and PySpark APIs
(see the sketch below).
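A minimal sketch of that workflow, assuming the `flights_pipe` and `flights`
objects from the previous slides (the save path is hypothetical):
```{r eval=FALSE}
# fit every stage of the pipeline on the raw flights table
fitted_pipe <- ml_fit(flights_pipe, flights)

# apply the fitted pipeline to data to generate predictions
ml_transform(fitted_pipe, flights)

# persist the fitted pipeline and load it back later (or from Scala/PySpark)
ml_save(fitted_pipe, path = "flights_pipeline_model", overwrite = TRUE)
reloaded_pipe <- ml_load(sc, path = "flights_pipeline_model")
```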
---
## Exercise
Use `bike_2017` to create an `ml_pipeline` object. Consider classification
with member type as the response. Also, consider creating buckets for
duration and a binary variable for round trips (bike starts and ends at the
same location).
---
## Distributed R with `spark_apply()`
Apply an R function to each partition (or group) of a Spark DataFrame. Your
function must return an R data frame, or something coercible to one, which
`spark_apply()` combines into a new Spark DataFrame.
```{r eval=FALSE}
spark_apply(
  bike_train,
  function(x) broom::tidy(lm(Duration ~ Member_type, data = x)),
  group_by = "Start_station"
)
```
.tiny[
```{r eval=FALSE}
# Source: spark> [?? x 6]
Start_station term estimate std_error statistic p_value
1 10th St & L'Enfant Plaza SW (Intercept) 2746. 55.4 49.6 0.
2 10th St & L'Enfant Plaza SW Member_type -1921. 67.9 -28.3 6.07e-164
3 28th St S & S Meade St (Intercept) 2965. 141. 21.0 8.41e- 82
4 28th St S & S Meade St Member_type -1999. 166. -12.0 2.37e- 31
5 18th & C St NW (Intercept) 2329. 38.3 60.7 0.
6 18th & C St NW Member_type -1433. 51.8 -27.7 3.40e-159
7 Georgia Ave & Spring St (Intercept) 2840. 273. 10.4 2.80e- 23
8 Georgia Ave & Spring St Member_type -1856. 310. -5.99 3.78e- 9
9 19th & Savannah St SE (Intercept) 1180. 127. 9.28 7.50e- 4
10 19th & Savannah St SE Member_type -449. 220. -2.04 1.11e- 1
# … with more rows
```
]
---
## References
- https://spark.rstudio.com/
- http://spark.apache.org/docs/latest/api/R/index.html
- http://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/databricks/spark-intro.pdf
- OST_R | BTS | Transtats. (2020). Transtats.bts.gov.
https://www.transtats.bts.gov/tables.asp?Table_ID=236&SYS_Table_Name=T_ONTIME_REPORTING