---
title: "Spark & sparklyr part II"
subtitle: "Statistical Computing & Programming"
author: "Shawn Santo"
institute: ""
date: "06-22-20"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
editor_options:
  chunk_output_type: console
---
```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
                      comment = "#>", highlight = TRUE,
                      fig.align = "center")
library(sparklyr)
library(tidyverse)
```
## Supplementary materials
Companion videos
- [Data manipulation and partition](https://warpwire.duke.edu/w/S9sDAA/)
- [ML models](https://warpwire.duke.edu/w/TdsDAA/)
- [ML pipelines](https://warpwire.duke.edu/w/T9sDAA/)
Additional resources
- [A Gentle Introduction to Apache Spark](http://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/databricks/spark-intro.pdf)
---
class: inverse, center, middle
# Recall
---
## What is Apache Spark?
- As described by Databricks, "Spark is a unified computing engine and a set
  of libraries for parallel data processing on computing clusters".
- Spark's goal is to support data analytics tasks within a single ecosystem:
  data loading, SQL queries, machine learning, and streaming computations.
- Spark is written in Scala and runs on the Java Virtual Machine (JVM). However,
  Spark can be used from R, Python, SQL, Scala, or Java.
---
## Key features
- In-memory computation
- Fast and scalable
  - efficiently scale up from one to many thousands of compute nodes
- Access data on a multitude of platforms
  - SQL and NoSQL databases
  - Cloud storage
  - Hadoop Distributed File System
- Real-time stream processing
- Libraries
  - Spark SQL
  - MLlib
  - Spark Streaming
  - GraphX
---
## Install
We'll be able to install Spark and set up a connection through the helper
functions in package `sparklyr`. More on this in a moment.
```{r}
library(sparklyr)
```
```{r}
sparklyr::spark_available_versions()
```
--
Let's install version 2.4 of Spark for use with a local Spark connection
via `spark_install(version = "2.4")`.
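A minimal sketch of the install step (the second call simply confirms what is
now installed locally):
```{r eval=FALSE}
# install Spark 2.4 for local use (a one-time step)
spark_install(version = "2.4")

# list the locally installed Spark versions
spark_installed_versions()
```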
---
## Configure and connect
```{r eval=FALSE}
# add some custom configurations
conf <- list(
  sparklyr.cores.local = 4,
  `sparklyr.shell.driver-memory` = "16G",
  spark.memory.fraction = 0.5
)
```
- `sparklyr.cores.local` - defaults to using all of the available cores
- `sparklyr.shell.driver-memory` - limit is the amount of RAM available in the
  computer minus what would be needed for OS operations
- `spark.memory.fraction` - default is set to 60% of the requested memory
  per executor
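As an alternative to building the list by hand, `spark_config()` returns Spark's
default configuration as a list you can inspect and extend (a minimal sketch;
the object name `conf_default` is just illustrative):
```{r eval=FALSE}
# start from the defaults and override only what you need
conf_default <- spark_config()
conf_default$`sparklyr.shell.driver-memory` <- "16G"
conf_default$spark.memory.fraction <- 0.5
```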
```{r echo=FALSE}
Sys.setenv(JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home")
```
```{r eval=FALSE}
# create a spark connection
sc <- spark_connect(master = "local",
                    version = "2.4.0",
                    config = conf)
```
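Once connected, a quick sanity check, and a clean way to end the session when
you are done (a minimal sketch):
```{r eval=FALSE}
# confirm the connection and the Spark version it is running
spark_version(sc)

# close the connection when finished
spark_disconnect(sc)
```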
---
## What is `sparklyr`?
Package `sparklyr` provides an R interface for Spark.
- Use `dplyr` to translate R code into Spark SQL (see the sketch below)
- Work with Spark's MLlib
- Interact with a stream of data
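For example, `dplyr` verbs applied to a Spark table are translated into Spark
SQL and executed by Spark, not in R. A minimal sketch, assuming a Spark table
such as the `bike_2017` table created later in these slides:
```{r eval=FALSE}
# dplyr verbs on a tbl_spark build a Spark SQL query lazily;
# show_query() prints the generated SQL rather than collecting results
bike_2017 %>%
  filter(Duration > 600) %>%
  count(Member_type) %>%
  show_query()
```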
---
## Family of `sparklyr` functions
| Sparklyr family of functions | Description |
|-----------------------------:|:------------|
| `spark_*()` | functions to manage and configure Spark connections; functions to read and write data |
| `sdf_*()` | functions for manipulating Spark DataFrames |
| `ft_*()` | feature transformers for manipulating individual features |
| `ml_*()` | machine learning algorithms - K-Means, GLM, Survival Regression, PCA, Naive-Bayes, and more |
| `stream_*()` | functions for handling stream data |
---
class: inverse, center, middle
# Machine learning
---
## Logistic regression
Our goal will be to classify member type based on some of the predictors
in the 2017 Capital Bikeshare data. First, let's create a Spark table object.
```{r eval=FALSE}
bike_2017 <- spark_read_csv(sc, name = "cbs_bike_2017",
                            path = "~/.public_html/data/bike/cbs_2017.csv")
```
--
```{r eval=FALSE}
> glimpse(bike_2017)
Rows: ??
Columns: 9
Database: spark_connection
$ Duration 221, 1676, 1356, 1327, 1636, 1603, 473, 200, 748, 912,…
$ Start_date 2017-01-01 00:00:41, 2017-01-01 00:06:53, 2017-01-01 …
$ End_date 2017-01-01 00:04:23, 2017-01-01 00:34:49, 2017-01-01 …
$ Start_station_number 31634, 31258, 31289, 31289, 31258, 31258, 31611, 31104…
$ Start_station "3rd & Tingey St SE", "Lincoln Memorial", "Henry Bacon…
$ End_station_number 31208, 31270, 31222, 31222, 31270, 31270, 31616, 31121…
$ End_station "M St & New Jersey Ave SE", "8th & D St NW", "New York…
$ Bike_number "W00869", "W00894", "W21945", "W20012", "W22786", "W20…
$ Member_type "Member", "Casual", "Casual", "Casual", "Casual", "Cas…
```
Go to http://www2.stat.duke.edu/~sms185/data/bike/cbs_2017.csv to download the
data.
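If you download the file to your own machine, the same call works with a local
path (a sketch; the path shown is hypothetical, so adjust it to wherever you
saved the file):
```{r eval=FALSE}
# read a local copy of the data instead
bike_2017 <- spark_read_csv(sc, name = "cbs_bike_2017",
                            path = "cbs_2017.csv")
```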
---
## Data partitions
Create training and test data.
```{r eval=FALSE}
partitions <- bike_2017 %>%
  mutate(Member_type = as.numeric(Member_type == "Member")) %>%
  group_by(Member_type) %>%
  sdf_random_split(train = .8, test = .2, seed = 4718) #<<
```
`sdf_random_split()` returns an R list of `tbl_spark` objects.
--
**Where is `partitions` in the Connections tab?**
--
It exists, but if you want it to appear as a table available to Spark SQL, you
need to register it with `sdf_register()`.
---
```{r eval=FALSE}
bike_train <- partitions$train
bike_test <- partitions$test
```
--
Register these (not strictly necessary, but it makes them visible to Spark SQL).
```{r eval=FALSE}
sdf_register(bike_train, name = "train")
sdf_register(bike_test, name = "test")
```
You should now see `train` and `test` in your connections tab.
---
Verify that our partitions are properly stratified.
```{r eval=FALSE}
bike_train %>%
count(Member_type) %>%
mutate(prop = n / sum(n, na.rm = T))
```
```{r eval=FALSE}
# Source: spark> [?? x 3]
Member_type n prop
1 0 786021 0.261
2 1 2220182 0.739
```
```{r eval=FALSE}
bike_test %>%
count(Member_type) %>%
mutate(prop = n / sum(n, na.rm = T))
```
```{r eval=FALSE}
# Source: spark> [?? x 3]
Member_type n prop
1 0 195777 0.260
2 1 555797 0.740
```
---
## Model fit
```{r eval=FALSE}
bike_logistic <- bike_train %>%
  ml_logistic_regression(formula = Member_type ~ Duration) #<<
```
```{r eval=FALSE}
Formula: Member_type ~ Duration
Coefficients:
(Intercept) Duration
2.496742551 -0.001325904
```
---
## Model predictions
```{r eval=FALSE}
logistic_pred <- ml_predict(bike_logistic, bike_test)
logistic_pred
```
```{r eval=FALSE}
# Source: spark [?? x 7]
features label rawPrediction probability prediction probability_0 probability_1
1 1 1 0.0820 0.918
2 1 1 0.0820 0.918
3 1 1 0.0820 0.918
4 0 1 0.0820 0.918
5 1 1 0.0820 0.918
6 1 1 0.0820 0.918
7 0 1 0.0820 0.918
8 1 1 0.0820 0.918
9 1 1 0.0820 0.918
10 1 1 0.0820 0.918
# … with more rows
```
---
## Model evaluation
To compute the AUROC
```{r eval=FALSE}
ml_binary_classification_evaluator(logistic_pred)
```
```{r eval=FALSE}
[1] 0.8264824
```
--
To compute the confusion matrix
```{r eval=FALSE}
table(pull(logistic_pred, label), pull(logistic_pred, prediction))
```
```{r eval=FALSE}
0 1
0 68991 126786
1 14492 541305
```
---
## Model evaluation metrics
For the `ml_*_evaluator()` functions, the following metrics are supported
(a usage sketch follows the list):
- Binary classification: areaUnderROC (default) or areaUnderPR
  (not available in Spark 2.X)
- Multiclass classification: f1 (default), precision, recall, weightedPrecision,
  weightedRecall, or accuracy; for Spark 2.X: f1 (default), weightedPrecision,
  weightedRecall, or accuracy
- Regression: rmse (root mean squared error, default),
  mse (mean squared error), r2, or mae (mean absolute error)
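Each evaluator exposes the metric through its `metric_name` argument;
`ml_multiclass_classification_evaluator()` and `ml_regression_evaluator()` work
the same way. A minimal sketch using the predictions from the previous slides:
```{r eval=FALSE}
# request a specific metric by name (areaUnderROC is the default)
ml_binary_classification_evaluator(logistic_pred,
                                   metric_name = "areaUnderROC")
```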
---
## Random forest
```{r eval=FALSE}
rf_model <- bike_train %>%
  ml_random_forest(Member_type ~ Duration,
                   type = "classification", seed = 402081)
```
--
```{r eval=FALSE}
rf_pred <- ml_predict(rf_model, bike_test)
ml_binary_classification_evaluator(rf_pred)
```
```{r eval=FALSE}
[1] 0.7450385
```
--
```{r eval=FALSE}
table(pull(rf_pred, label), pull(rf_pred, prediction))
```
```{r eval=FALSE}
0 1
0 90965 104812
1 31504 524293
```
---
## SVM
```{r eval=FALSE}
svm_model <- bike_train %>%
  ml_linear_svc(Member_type ~ Duration)
```
--
```{r eval=FALSE}
svm_pred <- ml_predict(svm_model, bike_test)
ml_binary_classification_evaluator(svm_pred)
```
```{r eval=FALSE}
[1] 0.8264824
```
--
```{r eval=FALSE}
table(pull(svm_pred, label), pull(svm_pred, prediction))
```
```{r eval=FALSE}
0 1
0 21914 173863
1 1710 554087
```
---
## ML classification summary
| **Method** | **Area Under ROC** |
|------------------------------:|--------------------|
| Logistic regression | 0.826 |
| Random forest classification | 0.745 |
| Linear support vector machine | 0.826 |
Which simple model should we go with?
---
## Exercise
Using a small dataset (`mtcars`), explore `ml_logistic_regression()` and
`ml_generalized_linear_regression()` with family "binomial".
Fit some models with various continuous
and categorical predictors. Do you get the same result?
Also, compare `ml_linear_regression()` and `ml_generalized_linear_regression()`
with the default family gaussian.
```{r eval=FALSE}
cars <- sdf_copy_to(sc, mtcars, "cars")
```
---
class: inverse, center, middle
# ML Pipelines
---
## What is an `ml_pipeline`?
Spark’s ML Pipelines provide a way to easily combine multiple transformations
and algorithms into a single workflow, or pipeline.
Some Spark terminology:
- **Transformer**: A Transformer is an algorithm which can transform one
DataFrame into another DataFrame.
- **Estimator**: An Estimator is an algorithm which can be fit on a DataFrame
to produce a Transformer.
- **Pipeline**: A Pipeline chains multiple Transformers and Estimators
together to specify a machine learning workflow.
---
## ML pipelines for reproducibility
Let's create an ML pipeline to classify whether flights departing from NC
airports in February 2020 were delayed.
```{r eval=FALSE}
flights <- spark_read_csv(sc, name = "nc_flights_feb_2020",
                          path = "~/.public_html/data/flights/nc_flights_feb_20.csv")
```
Go to http://www2.stat.duke.edu/~sms185/data/flights/nc_flights_feb_20.csv to
download the
data.
---
```{r eval=FALSE}
df <- flights %>%
  mutate(DEP_DELAY = as.numeric(DEP_DELAY),
         ARR_DELAY = as.numeric(ARR_DELAY),
         MONTH = as.character(MONTH),
         DAY_OF_WEEK = as.character(DAY_OF_WEEK)
  ) %>%
  filter(!is.na(DEP_DELAY)) %>%
  select(DEP_DELAY, CRS_DEP_TIME, MONTH, DAY_OF_WEEK, DISTANCE)
df
```
.tiny[
```{r eval=FALSE}
# Source: spark> [?? x 5]
DEP_DELAY CRS_DEP_TIME MONTH DAY_OF_WEEK DISTANCE
1 -7 830 2 5 365
2 7 835 2 6 365
3 -8 830 2 7 365
4 -10 830 2 1 365
5 -3 830 2 2 365
6 -12 830 2 3 365
7 -4 830 2 4 365
8 -9 830 2 5 365
9 -8 835 2 6 365
10 -10 830 2 7 365
# … with more rows
```
]
---
```{r eval=FALSE}
flights_pipe <- ml_pipeline(sc) %>% #<<
  ft_dplyr_transformer(tbl = df) %>% #<<
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression()
```
```{r eval=FALSE}
Pipeline (Estimator) with 5 stages
Stages
|--1 SQLTransformer (Transformer) #<<
|
| (Parameters -- Column Names)
```
--
`ft_dplyr_transformer()` extracts the dplyr transformations used to
generate object `tbl` as a SQL statement and then passes it on
to `ft_sql_transformer()`. The result is an `ml_pipeline` object.
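To preview the SQL that the SQLTransformer stage will capture, you can render
the dplyr query directly (a sketch):
```{r eval=FALSE}
# show the Spark SQL generated from the dplyr transformations in df
df %>% show_query()
```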
---
```{r eval=FALSE}
flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY", #<<
               output_col = "DELAYED", #<<
               threshold = 15) %>% #<<
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression()
```
```{r eval=FALSE}
|--2 Binarizer (Transformer) #<<
|
| (Parameters -- Column Names)
| input_col: DEP_DELAY
| output_col: DELAYED
```
---
```{r eval=FALSE}
flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME", #<<
                output_col = "HOURS", #<<
                splits = seq(0, 2400, 400)) %>% #<<
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression()
```
```{r eval=FALSE}
|--3 Bucketizer (Transformer) #<<
|
| (Parameters -- Column Names)
| input_col: CRS_DEP_TIME
| output_col: HOURS
```
---
```{r eval=FALSE}
flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>% #<<
  ml_logistic_regression()
```
```{r eval=FALSE}
|--4 RFormula (Estimator) #<<
|
| (Parameters -- Column Names)
| features_col: features
| label_col: label
| (Parameters)
| force_index_label: FALSE
| formula: DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE
| handle_invalid: error
| stringIndexerOrderType: frequencyDesc
```
---
```{r eval=FALSE}
flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression() #<<
```
```{r eval=FALSE}
|--5 LogisticRegression (Estimator) #<<
|
| (Parameters -- Column Names)
| features_col: features
| label_col: label
| prediction_col: prediction
| probability_col: probability
| raw_prediction_col: rawPrediction
| (Parameters)
| aggregation_depth: 2
| elastic_net_param: 0
| family: auto
| fit_intercept: TRUE
| max_iter: 100
| reg_param: 0
| standardization: TRUE
| threshold: 0.5
| tol: 1e-06
```
---
## Full pipeline
.pull-left[
.tiny[
```{r eval=FALSE}
Pipeline (Estimator) with 5 stages
Stages
|--1 SQLTransformer (Transformer)
|
| (Parameters -- Column Names)
|--2 Binarizer (Transformer)
|
| (Parameters -- Column Names)
| input_col: DEP_DELAY
| output_col: DELAYED
|--3 Bucketizer (Transformer)
|
| (Parameters -- Column Names)
| input_col: CRS_DEP_TIME
| output_col: HOURS
|--4 RFormula (Estimator)
|
| (Parameters -- Column Names)
| features_col: features
| label_col: label
| (Parameters)
| force_index_label: FALSE
| formula: DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE
| handle_invalid: error
| stringIndexerOrderType: frequencyDesc
```
]
]
.pull-right[
.tiny[
```{r eval=FALSE}
|--5 LogisticRegression (Estimator)
|
| (Parameters -- Column Names)
| features_col: features
| label_col: label
| prediction_col: prediction
| probability_col: probability
| raw_prediction_col: rawPrediction
| (Parameters)
| aggregation_depth: 2
| elastic_net_param: 0
| family: auto
| fit_intercept: TRUE
| max_iter: 100
| reg_param: 0
| standardization: TRUE
| threshold: 0.5
| tol: 1e-06
```
]
]
---
## What can we do with this pipeline?
1. Easily fit the pipeline to data with `ml_fit()`.
2. Make predictions with a fitted pipeline and `ml_transform()`.
3. Save pipelines to disk with `ml_save()`; saved pipelines can be loaded back
into `sparklyr` with `ml_load()` or read by the Scala and PySpark APIs
(see the sketch below).
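A minimal sketch of that workflow, assuming the `flights_pipe` and `flights`
objects from the previous slides (the save path is hypothetical):
```{r eval=FALSE}
# fit every stage of the pipeline on the raw flights table
fitted_pipe <- ml_fit(flights_pipe, flights)

# apply the fitted pipeline to data to generate predictions
ml_transform(fitted_pipe, flights)

# persist the fitted pipeline and load it back later (or from Scala/PySpark)
ml_save(fitted_pipe, path = "flights_pipeline_model", overwrite = TRUE)
reloaded_pipe <- ml_load(sc, path = "flights_pipeline_model")
```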
---
## Exercise
Use `bike_2017` to create an `ml_pipeline` object. Consider classification
with member type as the response. Also, consider creating buckets for
duration and a binary variable for round trips (bike starts and ends at the
same location).
---
## Distributed R with `spark_apply()`
Apply an R function to each partition (or group) of a Spark DataFrame. Your
function must return an R data frame, or something coercible to one, which
`spark_apply()` combines into a new Spark DataFrame.
```{r eval=FALSE}
spark_apply(
  bike_train,
  function(x) broom::tidy(lm(Duration ~ Member_type, data = x)),
  group_by = "Start_station"
)
```
.tiny[
```{r eval=FALSE}
# Source: spark> [?? x 6]
Start_station term estimate std_error statistic p_value
1 10th St & L'Enfant Plaza SW (Intercept) 2746. 55.4 49.6 0.
2 10th St & L'Enfant Plaza SW Member_type -1921. 67.9 -28.3 6.07e-164
3 28th St S & S Meade St (Intercept) 2965. 141. 21.0 8.41e- 82
4 28th St S & S Meade St Member_type -1999. 166. -12.0 2.37e- 31
5 18th & C St NW (Intercept) 2329. 38.3 60.7 0.
6 18th & C St NW Member_type -1433. 51.8 -27.7 3.40e-159
7 Georgia Ave & Spring St (Intercept) 2840. 273. 10.4 2.80e- 23
8 Georgia Ave & Spring St Member_type -1856. 310. -5.99 3.78e- 9
9 19th & Savannah St SE (Intercept) 1180. 127. 9.28 7.50e- 4
10 19th & Savannah St SE Member_type -449. 220. -2.04 1.11e- 1
# … with more rows
```
]
---
## References
- https://spark.rstudio.com/
- http://spark.apache.org/docs/latest/api/R/index.html
- http://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/databricks/spark-intro.pdf
- OST_R | BTS | Transtats. (2020). Transtats.bts.gov.
https://www.transtats.bts.gov/tables.asp?Table_ID=236&SYS_Table_Name=T_ONTIME_REPORTING