class: center, middle, inverse, title-slide

# Spark & sparklyr part II
## Statistical Computing & Programming
### Shawn Santo
### 06-22-20

---

## Supplementary materials

Companion videos

- [Data manipulation and partition](https://warpwire.duke.edu/w/S9sDAA/)
- [ML models](https://warpwire.duke.edu/w/TdsDAA/)
- [ML pipelines](https://warpwire.duke.edu/w/T9sDAA/)

Additional resources

- [A Gentle Introduction to Apache Spark](http://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/databricks/spark-intro.pdf)

---

class: inverse, center, middle

# Recall

---

## What is Apache Spark?

- As described by Databricks, "Spark is a unified computing engine and a set
  of libraries for parallel data processing on computing clusters".

<br/><br/><br/>

- Spark's goal is to support data analytics tasks within a single ecosystem:
  data loading, SQL queries, machine learning, and streaming computations.

<br/><br/><br/>

- Spark is written in Scala and runs on Java. However, Spark can be used from
  R, Python, SQL, Scala, or Java.

---

## Key features

- In-memory computation

- Fast and scalable
  - efficiently scale up from one to many thousands of compute nodes

- Access data on a multitude of platforms
  - SQL and NoSQL databases
  - Cloud storage
  - Hadoop Distributed File System

- Real-time stream processing

- Libraries
  - Spark SQL
  - MLlib
  - Spark streaming
  - GraphX

---

## Install

We'll be able to install Spark and set up a connection through the helper
functions in package `sparklyr`. More on this in a moment.

```r
library(sparklyr)
```

```r
sparklyr::spark_available_versions()
```

```
#>   spark
#> 1   1.6
#> 2   2.0
#> 3   2.1
#> 4   2.2
#> 5   2.3
#> 6   2.4
```

--

<br/><br/>

Let's install version 2.4 of Spark for use with a local Spark connection via
`spark_install(version = "2.4")`.

---

## Configure and connect

```r
# add some custom configurations
conf <- list(
  sparklyr.cores.local = 4,
  `sparklyr.shell.driver-memory` = "16G",
  spark.memory.fraction = 0.5
)
```

`sparklyr.cores.local` - defaults to using all of the available cores

`sparklyr.shell.driver-memory` - limit is the amount of RAM available in the
computer minus what would be needed for OS operations

`spark.memory.fraction` - default is set to 60% of the requested memory per
executor

```r
# create a spark connection
sc <- spark_connect(master = "local", version = "2.4.0", config = conf)
```

---

## What is `sparklyr`?

Package `sparklyr` provides an R interface for Spark.

- Use `dplyr` to translate R code into Spark SQL
- Work with Spark's MLlib
- Interact with a stream of data
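As a quick sketch of the `dplyr` translation (the `cars_demo` table name here
is just illustrative):

```r
# copy a small local data frame to Spark, then inspect the SQL that
# dplyr generates for a simple aggregation
cars_demo <- sdf_copy_to(sc, mtcars, "cars_demo", overwrite = TRUE)

cars_demo %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  dplyr::show_query()
```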
---

## Family of `sparklyr` functions

| Sparklyr family of functions | Description                                                                                      |
|-----------------------------:|:-------------------------------------------------------------------------------------------------|
| `spark_*()`                  | functions to manage and configure spark connections; <br>functions to read and write data       |
| `sdf_*()`                    | functions for manipulating SparkDataFrames                                                      |
| `ft_*()`                     | feature transformers for manipulating individual features                                      |
| `ml_*()`                     | machine learning algorithms - K-Means, GLM, Survival Regression, <br>PCA, Naive-Bayes, and more |
| `stream_*()`                 | functions for handling stream data                                                              |

---

class: inverse, center, middle

# Machine learning

---

## Logistic regression

Our goal will be to classify member type based on some of the predictors in
the 2017 Capital Bikeshare Data.

First, let's create a spark table object.

```r
bike_2017 <- spark_read_csv(sc, name = "cbs_bike_2017",
                            path = "~/.public_html/data/bike/cbs_2017.csv")
```

--

```r
> glimpse(bike_2017)
Rows: ??
Columns: 9
Database: spark_connection
$ Duration             <int> 221, 1676, 1356, 1327, 1636, 1603, 473, 200, 748, 912,…
$ Start_date           <dttm> 2017-01-01 00:00:41, 2017-01-01 00:06:53, 2017-01-01 …
$ End_date             <dttm> 2017-01-01 00:04:23, 2017-01-01 00:34:49, 2017-01-01 …
$ Start_station_number <int> 31634, 31258, 31289, 31289, 31258, 31258, 31611, 31104…
$ Start_station        <chr> "3rd & Tingey St SE", "Lincoln Memorial", "Henry Bacon…
$ End_station_number   <int> 31208, 31270, 31222, 31222, 31270, 31270, 31616, 31121…
$ End_station          <chr> "M St & New Jersey Ave SE", "8th & D St NW", "New York…
$ Bike_number          <chr> "W00869", "W00894", "W21945", "W20012", "W22786", "W20…
$ Member_type          <chr> "Member", "Casual", "Casual", "Casual", "Casual", "Cas…
```

Go to http://www2.stat.duke.edu/~sms185/data/bike/cbs_2017.csv to download
the data.

---

## Data partitions

Create training and test data.

```r
partitions <- bike_2017 %>%
  mutate(Member_type = as.numeric(Member_type == "Member")) %>%
  group_by(Member_type) %>%
* sdf_random_split(train = .8, test = .2, seed = 4718)
```

`sdf_random_split()` creates an R list of tbl_sparks.

--

<br/><br/>

**Where is `partitions` under the Connections tab?**

--

It exists, but if you want to see it as a table for Spark SQL you need to
register it with `sdf_register()`.

---

```r
bike_train <- partitions$train
bike_test <- partitions$test
```

--

Register these (not strictly necessary).

```r
sdf_register(bike_train, name = "train")
sdf_register(bike_test, name = "test")
```

You should now see `train` and `test` in your connections tab.

---

Verify our partition is properly stratified.

```r
bike_train %>%
  count(Member_type) %>%
  mutate(prop = n / sum(n, na.rm = T))
```

```r
# Source: spark<?> [?? x 3]
  Member_type       n  prop
        <dbl>   <dbl> <dbl>
1           0  786021 0.261
2           1 2220182 0.739
```

```r
bike_test %>%
  count(Member_type) %>%
  mutate(prop = n / sum(n, na.rm = T))
```

```r
# Source: spark<?> [?? x 3]
  Member_type      n  prop
        <dbl>  <dbl> <dbl>
1           0 195777 0.260
2           1 555797 0.740
```

---

## Model fit

```r
bike_logistic <- bike_train %>%
* ml_logistic_regression(formula = Member_type ~ Duration)
```

```r
Formula: Member_type ~ Duration

Coefficients:
 (Intercept)     Duration 
 2.496742551 -0.001325904 
```

---

## Model predictions

```r
logistic_pred <- ml_predict(bike_logistic, bike_test)

logistic_pred
```

```r
# Source: spark<prediction_logistic_tbl> [?? x 7]
   features  label rawPrediction probability prediction probability_0 probability_1
   <list>    <dbl> <list>        <list>           <dbl>         <dbl>         <dbl>
 1 <dbl [1]>     1 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
 2 <dbl [1]>     1 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
 3 <dbl [1]>     1 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
 4 <dbl [1]>     0 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
 5 <dbl [1]>     1 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
 6 <dbl [1]>     1 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
 7 <dbl [1]>     0 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
 8 <dbl [1]>     1 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
 9 <dbl [1]>     1 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
10 <dbl [1]>     1 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
# … with more rows
```

---

## Model evaluation

To compute the AUROC

```r
ml_binary_classification_evaluator(logistic_pred)
```

```r
[1] 0.8264824
```

--

To compute the confusion matrix

```r
table(pull(logistic_pred, label), pull(logistic_pred, prediction))
```

```r
        0      1
  0 68991 126786
  1 14492 541305
```

---

## Model evaluation metrics

For these `ml_*_evaluator()` functions the following metrics are supported

- Binary Classification: areaUnderROC (default) or areaUnderPR (not available
  in Spark 2.X.)
- Multiclass Classification: f1 (default), precision, recall,
  weightedPrecision, weightedRecall or accuracy; for Spark 2.X: f1 (default),
  weightedPrecision, weightedRecall or accuracy
- Regression: rmse (root mean squared error, default), mse (mean squared
  error), r2, or mae (mean absolute error)
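Each evaluator has a `metric_name` argument to select one of these. A quick
sketch using `logistic_pred` from earlier (this assumes the default `label`
and `prediction` columns produced by `ml_predict()`):

```r
# overall accuracy via the multiclass evaluator (binary labels work too)
ml_multiclass_classification_evaluator(logistic_pred,
                                       metric_name = "accuracy")

# weighted precision for the same predictions
ml_multiclass_classification_evaluator(logistic_pred,
                                       metric_name = "weightedPrecision")
```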
---

## Random forest

```r
rf_model <- bike_train %>%
  ml_random_forest(Member_type ~ Duration, type = "classification",
                   seed = 402081)
```

--

```r
rf_pred <- ml_predict(rf_model, bike_test)

ml_binary_classification_evaluator(rf_pred)
```

```r
[1] 0.7450385
```

--

```r
table(pull(rf_pred, label), pull(rf_pred, prediction))
```

```r
        0      1
  0 90965 104812
  1 31504 524293
```

---

## SVM

```r
svm_model <- bike_train %>%
  ml_linear_svc(Member_type ~ Duration)
```

--

```r
svm_pred <- ml_predict(svm_model, bike_test)

ml_binary_classification_evaluator(svm_pred)
```

```r
[1] 0.8264824
```

--

```r
table(pull(svm_pred, label), pull(svm_pred, prediction))
```

```r
        0      1
  0 21914 173863
  1  1710 554087
```

---

## ML classification summary

| **Method**                     | **Area Under ROC** |
|-------------------------------:|--------------------|
| Logistic regression            | 0.826              |
| Random forest classification   | 0.745              |
| Linear support vector machine  | 0.826              |

<br/><br/>

Which simple model should we go with?

---

## Exercise

Using a small dataset (`mtcars`), explore `ml_logistic_regression()` and
`ml_generalized_linear_regression()` with family "binomial". Fit some models
with various continuous and categorical predictors. Do you get the same
result?

Also, compare `ml_linear_regression()` and `ml_generalized_linear_regression()`
with the default family gaussian.

```r
cars <- sdf_copy_to(sc, mtcars, "cars")
```
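A possible starting point (the response and predictors here are just
illustrative choices):

```r
# logistic regression and a binomial GLM on the same formula;
# compare the fitted coefficients
ml_logistic_regression(cars, am ~ mpg + hp)
ml_generalized_linear_regression(cars, am ~ mpg + hp, family = "binomial")
```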
---

class: inverse, center, middle

# ML Pipelines

---

## What is an `ml_pipeline`?

Spark's ML Pipelines provide a way to easily combine multiple transformations
and algorithms into a single workflow, or pipeline.

Some Spark terminology:

- **Transformer**: A Transformer is an algorithm which can transform one
  DataFrame into another DataFrame.

- **Estimator**: An Estimator is an algorithm which can be fit on a DataFrame
  to produce a Transformer.

- **Pipeline**: A Pipeline chains multiple Transformers and Estimators
  together to specify a machine learning workflow.

---

## ML pipelines for reproducibility

Let's create an ML pipeline to classify whether a flight is delayed in
February 2020 for all NC airports.

```r
flights <- spark_read_csv(sc, name = "nc_flights_feb_2020",
                          path = "~/.public_html/data/flights/nc_flights_feb_20.csv")
```

Go to http://www2.stat.duke.edu/~sms185/data/flights/nc_flights_feb_20.csv to
download the data.

---

```r
df <- flights %>%
  mutate(DEP_DELAY = as.numeric(DEP_DELAY),
         ARR_DELAY = as.numeric(ARR_DELAY),
         MONTH = as.character(MONTH),
         DAY_OF_WEEK = as.character(DAY_OF_WEEK)
  ) %>%
  filter(!is.na(DEP_DELAY)) %>%
  select(DEP_DELAY, CRS_DEP_TIME, MONTH, DAY_OF_WEEK, DISTANCE)

df
```

.tiny[
```r
# Source: spark<?> [?? x 5]
   DEP_DELAY CRS_DEP_TIME MONTH DAY_OF_WEEK DISTANCE
       <dbl>        <int> <chr> <chr>          <dbl>
 1        -7          830 2     5                365
 2         7          835 2     6                365
 3        -8          830 2     7                365
 4       -10          830 2     1                365
 5        -3          830 2     2                365
 6       -12          830 2     3                365
 7        -4          830 2     4                365
 8        -9          830 2     5                365
 9        -8          835 2     6                365
10       -10          830 2     7                365
# … with more rows
```
]

---

```r
*flights_pipe <- ml_pipeline(sc) %>%
* ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression()
```

```r
Pipeline (Estimator) with 5 stages
<pipeline_187aa28dcf960> 
  Stages 
* |--1 SQLTransformer (Transformer)
  |    <dplyr_transformer_187aaca3f397> 
  |     (Parameters -- Column Names)
```

--

<br/>

`ft_dplyr_transformer()` extracts the dplyr transformations used to generate
object `tbl` as a SQL statement, then passes it on to `ft_sql_transformer()`.
The result is an `ml_pipeline` object.

---

```r
flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
* ft_binarizer(input_col = "DEP_DELAY",
*              output_col = "DELAYED",
*              threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression()
```

```r
* |--2 Binarizer (Transformer)
  |    <binarizer_187aa5412bd32> 
  |     (Parameters -- Column Names)
  |      input_col: DEP_DELAY
  |      output_col: DELAYED
```

---

```r
flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
* ft_bucketizer(input_col = "CRS_DEP_TIME",
*               output_col = "HOURS",
*               splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression()
```

```r
* |--3 Bucketizer (Transformer)
  |    <bucketizer_187aa90f07cf> 
  |     (Parameters -- Column Names)
  |      input_col: CRS_DEP_TIME
  |      output_col: HOURS
```

---

```r
flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
* ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression()
```

```r
* |--4 RFormula (Estimator)
  |    <r_formula_187aa79a9bb9b> 
  |     (Parameters -- Column Names)
  |      features_col: features
  |      label_col: label
  |     (Parameters)
  |      force_index_label: FALSE
  |      formula: DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE
  |      handle_invalid: error
  |      stringIndexerOrderType: frequencyDesc
```

---

```r
flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
* ml_logistic_regression()
```

```r
* |--5 LogisticRegression (Estimator)
  |    <logistic_regression_187aa3ccd7a92> 
  |     (Parameters -- Column Names)
  |      features_col: features
  |      label_col: label
  |      prediction_col: prediction
  |      probability_col: probability
  |      raw_prediction_col: rawPrediction
  |     (Parameters)
  |      aggregation_depth: 2
  |      elastic_net_param: 0
  |      family: auto
  |      fit_intercept: TRUE
  |      max_iter: 100
  |      reg_param: 0
  |      standardization: TRUE
  |      threshold: 0.5
  |      tol: 1e-06
```

---

## Full pipeline

.pull-left[
.tiny[
```r
Pipeline (Estimator) with 5 stages
<pipeline_187aa28dcf960> 
  Stages 
  |--1 SQLTransformer (Transformer)
  |    <dplyr_transformer_187aaca3f397> 
  |     (Parameters -- Column Names)
  |--2 Binarizer (Transformer)
  |    <binarizer_187aa5412bd32> 
  |     (Parameters -- Column Names)
  |      input_col: DEP_DELAY
  |      output_col: DELAYED
  |--3 Bucketizer (Transformer)
  |    <bucketizer_187aa90f07cf> 
  |     (Parameters -- Column Names)
  |      input_col: CRS_DEP_TIME
  |      output_col: HOURS
  |--4 RFormula (Estimator)
  |    <r_formula_187aa79a9bb9b> 
  |     (Parameters -- Column Names)
  |      features_col: features
  |      label_col: label
  |     (Parameters)
  |      force_index_label: FALSE
  |      formula: DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE
  |      handle_invalid: error
  |      stringIndexerOrderType: frequencyDesc
```
]
]

.pull-right[
.tiny[
```r
  |--5 LogisticRegression (Estimator)
  |    <logistic_regression_187aa3ccd7a92> 
  |     (Parameters -- Column Names)
  |      features_col: features
  |      label_col: label
  |      prediction_col: prediction
  |      probability_col: probability
  |      raw_prediction_col: rawPrediction
  |     (Parameters)
  |      aggregation_depth: 2
  |      elastic_net_param: 0
  |      family: auto
  |      fit_intercept: TRUE
  |      max_iter: 100
  |      reg_param: 0
  |      standardization: TRUE
  |      threshold: 0.5
  |      tol: 1e-06
```
]
]

---

## What can we do with this pipeline?

1. Easily fit data with `ml_fit()`.

2. Make predictions with a fitted pipeline and `ml_transform()`.

3. Save pipelines with `ml_save()`; saved pipelines can be read back into
   `sparklyr` (with `ml_load()`) or by the Scala or PySpark APIs.
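A minimal sketch using `flights_pipe` and the `flights` table from earlier
(the save path is illustrative):

```r
# fit the whole pipeline (dplyr transformer, feature transformers, model)
fitted_pipe <- ml_fit(flights_pipe, flights)

# apply the fitted pipeline to data to get predictions
ml_transform(fitted_pipe, flights) %>%
  select(DELAYED, prediction) %>%
  head()

# persist the fitted pipeline; reload later with ml_load(sc, "flights_model")
ml_save(fitted_pipe, path = "flights_model", overwrite = TRUE)
```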
---

## Exercise

Use `bike_2017` to create an `ml_pipeline` object. Consider classification
with member type as the response. Also, consider creating buckets for
duration and a binary variable for round trips (bike starts and ends at the
same location).

---

## Distributed R with `spark_apply()`

Apply an R function to a Spark DataFrame. Your function is applied to each
partition (or group) as an R data frame and must return an R data frame, or
something coercible to one.

```r
spark_apply(
  bike_train,
  function(x) broom::tidy(lm(Duration ~ Member_type, data = x)),
  group_by = "Start_station"
)
```

.tiny[
```r
# Source: spark<?> [?? x 6]
   Start_station               term        estimate std_error statistic   p_value
   <chr>                       <chr>          <dbl>     <dbl>     <dbl>     <dbl>
 1 10th St & L'Enfant Plaza SW (Intercept)    2746.      55.4     49.6  0.       
 2 10th St & L'Enfant Plaza SW Member_type   -1921.      67.9    -28.3  6.07e-164
 3 28th St S & S Meade St      (Intercept)    2965.     141.      21.0  8.41e- 82
 4 28th St S & S Meade St      Member_type   -1999.     166.     -12.0  2.37e- 31
 5 18th & C St NW              (Intercept)    2329.      38.3     60.7  0.       
 6 18th & C St NW              Member_type   -1433.      51.8    -27.7  3.40e-159
 7 Georgia Ave & Spring St     (Intercept)    2840.     273.      10.4  2.80e- 23
 8 Georgia Ave & Spring St     Member_type   -1856.     310.      -5.99 3.78e-  9
 9 19th & Savannah St SE       (Intercept)    1180.     127.       9.28 7.50e-  4
10 19th & Savannah St SE       Member_type    -449.     220.      -2.04 1.11e-  1
# … with more rows
```
]

---

## References

- https://spark.rstudio.com/
- http://spark.apache.org/docs/latest/api/R/index.html
- http://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/databricks/spark-intro.pdf
- OST_R | BTS | Transtats. (2020). Transtats.bts.gov.
  https://www.transtats.bts.gov/tables.asp?Table_ID=236&SYS_Table_Name=T_ONTIME_REPORTING
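---

## Cleanup

When you are done working, disconnect from the local Spark instance to free
its resources.

```r
# close the connection created at the start of these slides
spark_disconnect(sc)
```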