class: center, middle, inverse, title-slide

# Spark & sparklyr part II
## Statistical Computing & Programming
### Shawn Santo
### 06-22-20

---

## Supplementary materials

Companion videos

- [Data manipulation and partition](https://warpwire.duke.edu/w/S9sDAA/)
- [ML models](https://warpwire.duke.edu/w/TdsDAA/)
- [ML pipelines](https://warpwire.duke.edu/w/T9sDAA/)

Additional resources

- [A Gentle Introduction to Apache Spark](http://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/databricks/spark-intro.pdf)

---

class: inverse, center, middle

# Recall

---

## What is Apache Spark?

- As described by Databricks, "Spark is a unified computing engine and a set
  of libraries for parallel data processing on computing clusters".

<br/><br/><br/>

- Spark's goal is to support data analytics tasks within a single ecosystem:
  data loading, SQL queries, machine learning, and streaming computations.

<br/><br/><br/>

- Spark is written in Scala and runs on Java. However, Spark can be used from
  R, Python, SQL, Scala, or Java.

---

## Key features

- In-memory computation

- Fast and scalable
  - efficiently scale up from one to many thousands of compute nodes

- Access data on a multitude of platforms
  - SQL and NoSQL databases
  - Cloud storage
  - Hadoop Distributed File System

- Real-time stream processing

- Libraries
  - Spark SQL
  - MLlib
  - Spark streaming
  - GraphX

---

## Install

We'll be able to install Spark and set up a connection through the helper
functions in package `sparklyr`. More on this in a moment.

```r
library(sparklyr)
```

```r
sparklyr::spark_available_versions()
```

```
#>   spark
#> 1   1.6
#> 2   2.0
#> 3   2.1
#> 4   2.2
#> 5   2.3
#> 6   2.4
```

--

<br/><br/>

Let's install version 2.4 of Spark for use with a local Spark connection via
`spark_install(version = "2.4")`.

---

## Configure and connect

```r
# add some custom configurations
conf <- list(
  sparklyr.cores.local = 4,
  `sparklyr.shell.driver-memory` = "16G",
  spark.memory.fraction = 0.5
)
```

`sparklyr.cores.local` - defaults to using all of the available cores

`sparklyr.shell.driver-memory` - limit is the amount of RAM available in the
computer minus what would be needed for OS operations

`spark.memory.fraction` - default is set to 60% of the requested memory per
executor

```r
# create a spark connection
sc <- spark_connect(master = "local", version = "2.4.0", config = conf)
```

---

## What is `sparklyr`?

Package `sparklyr` provides an R interface for Spark.

- Use `dplyr` to translate R code into Spark SQL
- Work with Spark's MLlib
- Interact with a stream of data
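As a quick sketch of the `dplyr` translation (the `cars_demo` table name here
is just illustrative):

```r
# copy a small local data frame to Spark, then inspect the SQL that
# dplyr generates for a simple aggregation
cars_demo <- sdf_copy_to(sc, mtcars, "cars_demo", overwrite = TRUE)

cars_demo %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  dplyr::show_query()
```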
---

## Family of `sparklyr` functions

| Sparklyr family of functions | Description                                                                                      |
|-----------------------------:|:-------------------------------------------------------------------------------------------------|
| `spark_*()`                  | functions to manage and configure spark connections; <br>functions to read and write data       |
| `sdf_*()`                    | functions for manipulating SparkDataFrames                                                      |
| `ft_*()`                     | feature transformers for manipulating individual features                                      |
| `ml_*()`                     | machine learning algorithms - K-Means, GLM, Survival Regression, <br>PCA, Naive-Bayes, and more |
| `stream_*()`                 | functions for handling stream data                                                              |

---

class: inverse, center, middle

# Machine learning

---

## Logistic regression

Our goal will be to classify member type based on some of the predictors in
the 2017 Capital Bikeshare Data.

First, let's create a spark table object.

```r
bike_2017 <- spark_read_csv(sc, name = "cbs_bike_2017",
                            path = "~/.public_html/data/bike/cbs_2017.csv")
```

--

```r
> glimpse(bike_2017)
Rows: ??
Columns: 9
Database: spark_connection
$ Duration             <int> 221, 1676, 1356, 1327, 1636, 1603, 473, 200, 748, 912,…
$ Start_date           <dttm> 2017-01-01 00:00:41, 2017-01-01 00:06:53, 2017-01-01 …
$ End_date             <dttm> 2017-01-01 00:04:23, 2017-01-01 00:34:49, 2017-01-01 …
$ Start_station_number <int> 31634, 31258, 31289, 31289, 31258, 31258, 31611, 31104…
$ Start_station        <chr> "3rd & Tingey St SE", "Lincoln Memorial", "Henry Bacon…
$ End_station_number   <int> 31208, 31270, 31222, 31222, 31270, 31270, 31616, 31121…
$ End_station          <chr> "M St & New Jersey Ave SE", "8th & D St NW", "New York…
$ Bike_number          <chr> "W00869", "W00894", "W21945", "W20012", "W22786", "W20…
$ Member_type          <chr> "Member", "Casual", "Casual", "Casual", "Casual", "Cas…
```

Go to http://www2.stat.duke.edu/~sms185/data/bike/cbs_2017.csv to download
the data.

---

## Data partitions

Create training and test data.

```r
partitions <- bike_2017 %>%
  mutate(Member_type = as.numeric(Member_type == "Member")) %>%
  group_by(Member_type) %>%
* sdf_random_split(train = .8, test = .2, seed = 4718)
```

`sdf_random_split()` creates an R list of tbl_sparks.

--

<br/><br/>

**Where is `partitions` under the Connections tab?**

--

It exists, but if you want to see it as a table for Spark SQL you need to
register it with `sdf_register()`.

---

```r
bike_train <- partitions$train
bike_test <- partitions$test
```

--

Register these (not strictly necessary).

```r
sdf_register(bike_train, name = "train")
sdf_register(bike_test, name = "test")
```

You should now see `train` and `test` in your connections tab.

---

Verify our partition is properly stratified.

```r
bike_train %>%
  count(Member_type) %>%
  mutate(prop = n / sum(n, na.rm = T))
```

```r
# Source: spark<?> [?? x 3]
  Member_type       n  prop
        <dbl>   <dbl> <dbl>
1           0  786021 0.261
2           1 2220182 0.739
```

```r
bike_test %>%
  count(Member_type) %>%
  mutate(prop = n / sum(n, na.rm = T))
```

```r
# Source: spark<?> [?? x 3]
  Member_type      n  prop
        <dbl>  <dbl> <dbl>
1           0 195777 0.260
2           1 555797 0.740
```

---

## Model fit

```r
bike_logistic <- bike_train %>%
* ml_logistic_regression(formula = Member_type ~ Duration)
```

```r
Formula: Member_type ~ Duration

Coefficients:
 (Intercept)     Duration 
 2.496742551 -0.001325904 
```

---

## Model predictions

```r
logistic_pred <- ml_predict(bike_logistic, bike_test)

logistic_pred
```

```r
# Source: spark<prediction_logistic_tbl> [?? x 7]
   features  label rawPrediction probability prediction probability_0 probability_1
   <list>    <dbl> <list>        <list>           <dbl>         <dbl>         <dbl>
 1 <dbl [1]>     1 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
 2 <dbl [1]>     1 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
 3 <dbl [1]>     1 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
 4 <dbl [1]>     0 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
 5 <dbl [1]>     1 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
 6 <dbl [1]>     1 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
 7 <dbl [1]>     0 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
 8 <dbl [1]>     1 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
 9 <dbl [1]>     1 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
10 <dbl [1]>     1 <dbl [2]>     <dbl [2]>            1        0.0820         0.918
# … with more rows
```

---

## Model evaluation

To compute the AUROC

```r
ml_binary_classification_evaluator(logistic_pred)
```

```r
[1] 0.8264824
```

--

To compute the confusion matrix

```r
table(pull(logistic_pred, label), pull(logistic_pred, prediction))
```

```r
        0      1
  0 68991 126786
  1 14492 541305
```

---

## Model evaluation metrics

For these `ml_*_evaluator()` functions the following metrics are supported

- Binary Classification: areaUnderROC (default) or areaUnderPR (not available
  in Spark 2.X.)
- Multiclass Classification: f1 (default), precision, recall,
  weightedPrecision, weightedRecall or accuracy; for Spark 2.X: f1 (default),
  weightedPrecision, weightedRecall or accuracy
- Regression: rmse (root mean squared error, default), mse (mean squared
  error), r2, or mae (mean absolute error)
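Each evaluator has a `metric_name` argument to select one of these. A quick
sketch using `logistic_pred` from earlier (this assumes the default `label`
and `prediction` columns produced by `ml_predict()`):

```r
# overall accuracy via the multiclass evaluator (binary labels work too)
ml_multiclass_classification_evaluator(logistic_pred,
                                       metric_name = "accuracy")

# weighted precision for the same predictions
ml_multiclass_classification_evaluator(logistic_pred,
                                       metric_name = "weightedPrecision")
```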
---

## Random forest

```r
rf_model <- bike_train %>%
  ml_random_forest(Member_type ~ Duration, type = "classification",
                   seed = 402081)
```

--

```r
rf_pred <- ml_predict(rf_model, bike_test)

ml_binary_classification_evaluator(rf_pred)
```

```r
[1] 0.7450385
```

--

```r
table(pull(rf_pred, label), pull(rf_pred, prediction))
```

```r
        0      1
  0 90965 104812
  1 31504 524293
```

---

## SVM

```r
svm_model <- bike_train %>%
  ml_linear_svc(Member_type ~ Duration)
```

--

```r
svm_pred <- ml_predict(svm_model, bike_test)

ml_binary_classification_evaluator(svm_pred)
```

```r
[1] 0.8264824
```

--

```r
table(pull(svm_pred, label), pull(svm_pred, prediction))
```

```r
        0      1
  0 21914 173863
  1  1710 554087
```

---

## ML classification summary

| **Method**                     | **Area Under ROC** |
|-------------------------------:|--------------------|
| Logistic regression            | 0.826              |
| Random forest classification   | 0.745              |
| Linear support vector machine  | 0.826              |

<br/><br/>

Which simple model should we go with?

---

## Exercise

Using a small dataset (`mtcars`), explore `ml_logistic_regression()` and
`ml_generalized_linear_regression()` with family "binomial". Fit some models
with various continuous and categorical predictors. Do you get the same
result?

Also, compare `ml_linear_regression()` and `ml_generalized_linear_regression()`
with the default family gaussian.

```r
cars <- sdf_copy_to(sc, mtcars, "cars")
```
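A possible starting point (the response and predictors here are just
illustrative choices):

```r
# logistic regression and a binomial GLM on the same formula;
# compare the fitted coefficients
ml_logistic_regression(cars, am ~ mpg + hp)
ml_generalized_linear_regression(cars, am ~ mpg + hp, family = "binomial")
```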
---

class: inverse, center, middle

# ML Pipelines

---

## What is an `ml_pipeline`?

Spark's ML Pipelines provide a way to easily combine multiple transformations
and algorithms into a single workflow, or pipeline.

Some Spark terminology:

- **Transformer**: A Transformer is an algorithm which can transform one
  DataFrame into another DataFrame.

- **Estimator**: An Estimator is an algorithm which can be fit on a DataFrame
  to produce a Transformer.

- **Pipeline**: A Pipeline chains multiple Transformers and Estimators
  together to specify a machine learning workflow.

---

## ML pipelines for reproducibility

Let's create an ML pipeline to classify whether a flight is delayed in
February 2020 for all NC airports.

```r
flights <- spark_read_csv(sc, name = "nc_flights_feb_2020",
                          path = "~/.public_html/data/flights/nc_flights_feb_20.csv")
```

Go to http://www2.stat.duke.edu/~sms185/data/flights/nc_flights_feb_20.csv to
download the data.

---

```r
df <- flights %>%
  mutate(DEP_DELAY = as.numeric(DEP_DELAY),
         ARR_DELAY = as.numeric(ARR_DELAY),
         MONTH = as.character(MONTH),
         DAY_OF_WEEK = as.character(DAY_OF_WEEK)
  ) %>%
  filter(!is.na(DEP_DELAY)) %>%
  select(DEP_DELAY, CRS_DEP_TIME, MONTH, DAY_OF_WEEK, DISTANCE)

df
```

.tiny[
```r
# Source: spark<?> [?? x 5]
   DEP_DELAY CRS_DEP_TIME MONTH DAY_OF_WEEK DISTANCE
       <dbl>        <int> <chr> <chr>          <dbl>
 1        -7          830 2     5                365
 2         7          835 2     6                365
 3        -8          830 2     7                365
 4       -10          830 2     1                365
 5        -3          830 2     2                365
 6       -12          830 2     3                365
 7        -4          830 2     4                365
 8        -9          830 2     5                365
 9        -8          835 2     6                365
10       -10          830 2     7                365
# … with more rows
```
]

---

```r
*flights_pipe <- ml_pipeline(sc) %>%
* ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression()
```

```r
Pipeline (Estimator) with 5 stages
<pipeline_187aa28dcf960> 
  Stages 
* |--1 SQLTransformer (Transformer)
  |    <dplyr_transformer_187aaca3f397> 
  |     (Parameters -- Column Names)
```

--

<br/>

`ft_dplyr_transformer()` extracts the dplyr transformations used to generate
object `tbl` as a SQL statement, then passes it on to `ft_sql_transformer()`.
The result is an `ml_pipeline` object.

---

```r
flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
* ft_binarizer(input_col = "DEP_DELAY",
*              output_col = "DELAYED",
*              threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression()
```

```r
* |--2 Binarizer (Transformer)
  |    <binarizer_187aa5412bd32> 
  |     (Parameters -- Column Names)
  |      input_col: DEP_DELAY
  |      output_col: DELAYED
```

---

```r
flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
* ft_bucketizer(input_col = "CRS_DEP_TIME",
*               output_col = "HOURS",
*               splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression()
```

```r
* |--3 Bucketizer (Transformer)
  |    <bucketizer_187aa90f07cf> 
  |     (Parameters -- Column Names)
  |      input_col: CRS_DEP_TIME
  |      output_col: HOURS
```

---

```r
flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
* ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression()
```

```r
* |--4 RFormula (Estimator)
  |    <r_formula_187aa79a9bb9b> 
  |     (Parameters -- Column Names)
  |      features_col: features
  |      label_col: label
  |     (Parameters)
  |      force_index_label: FALSE
  |      formula: DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE
  |      handle_invalid: error
  |      stringIndexerOrderType: frequencyDesc
```

---

```r
flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
* ml_logistic_regression()
```

```r
* |--5 LogisticRegression (Estimator)
  |    <logistic_regression_187aa3ccd7a92> 
  |     (Parameters -- Column Names)
  |      features_col: features
  |      label_col: label
  |      prediction_col: prediction
  |      probability_col: probability
  |      raw_prediction_col: rawPrediction
  |     (Parameters)
  |      aggregation_depth: 2
  |      elastic_net_param: 0
  |      family: auto
  |      fit_intercept: TRUE
  |      max_iter: 100
  |      reg_param: 0
  |      standardization: TRUE
  |      threshold: 0.5
  |      tol: 1e-06
```

---

## Full pipeline

.pull-left[
.tiny[
```r
Pipeline (Estimator) with 5 stages
<pipeline_187aa28dcf960> 
  Stages 
  |--1 SQLTransformer (Transformer)
  |    <dplyr_transformer_187aaca3f397> 
  |     (Parameters -- Column Names)
  |--2 Binarizer (Transformer)
  |    <binarizer_187aa5412bd32> 
  |     (Parameters -- Column Names)
  |      input_col: DEP_DELAY
  |      output_col: DELAYED
  |--3 Bucketizer (Transformer)
  |    <bucketizer_187aa90f07cf> 
  |     (Parameters -- Column Names)
  |      input_col: CRS_DEP_TIME
  |      output_col: HOURS
  |--4 RFormula (Estimator)
  |    <r_formula_187aa79a9bb9b> 
  |     (Parameters -- Column Names)
  |      features_col: features
  |      label_col: label
  |     (Parameters)
  |      force_index_label: FALSE
  |      formula: DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE
  |      handle_invalid: error
  |      stringIndexerOrderType: frequencyDesc
```
]
]

.pull-right[
.tiny[
```r
  |--5 LogisticRegression (Estimator)
  |    <logistic_regression_187aa3ccd7a92> 
  |     (Parameters -- Column Names)
  |      features_col: features
  |      label_col: label
  |      prediction_col: prediction
  |      probability_col: probability
  |      raw_prediction_col: rawPrediction
  |     (Parameters)
  |      aggregation_depth: 2
  |      elastic_net_param: 0
  |      family: auto
  |      fit_intercept: TRUE
  |      max_iter: 100
  |      reg_param: 0
  |      standardization: TRUE
  |      threshold: 0.5
  |      tol: 1e-06
```
]
]

---

## What can we do with this pipeline?

1. Easily fit data with `ml_fit()`.

2. Make predictions with a fitted pipeline and `ml_transform()`.

3. Save pipelines with `ml_save()`; saved pipelines can be read back into
   `sparklyr` (with `ml_load()`) or by the Scala or PySpark APIs.
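A minimal sketch using `flights_pipe` and the `flights` table from earlier
(the save path is illustrative):

```r
# fit the whole pipeline (dplyr transformer, feature transformers, model)
fitted_pipe <- ml_fit(flights_pipe, flights)

# apply the fitted pipeline to data to get predictions
ml_transform(fitted_pipe, flights) %>%
  select(DELAYED, prediction) %>%
  head()

# persist the fitted pipeline; reload later with ml_load(sc, "flights_model")
ml_save(fitted_pipe, path = "flights_model", overwrite = TRUE)
```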
---

## Exercise

Use `bike_2017` to create an `ml_pipeline` object. Consider classification
with member type as the response. Also, consider creating buckets for
duration and a binary variable for round trips (bike starts and ends at the
same location).

---

## Distributed R with `spark_apply()`

Apply an R function to a Spark DataFrame. Your function is applied to each
partition (or group) as an R data frame and must return an R data frame, or
something coercible to one.

```r
spark_apply(
  bike_train,
  function(x) broom::tidy(lm(Duration ~ Member_type, data = x)),
  group_by = "Start_station"
)
```

.tiny[
```r
# Source: spark<?> [?? x 6]
   Start_station               term        estimate std_error statistic   p_value
   <chr>                       <chr>          <dbl>     <dbl>     <dbl>     <dbl>
 1 10th St & L'Enfant Plaza SW (Intercept)    2746.      55.4     49.6  0.       
 2 10th St & L'Enfant Plaza SW Member_type   -1921.      67.9    -28.3  6.07e-164
 3 28th St S & S Meade St      (Intercept)    2965.     141.      21.0  8.41e- 82
 4 28th St S & S Meade St      Member_type   -1999.     166.     -12.0  2.37e- 31
 5 18th & C St NW              (Intercept)    2329.      38.3     60.7  0.       
 6 18th & C St NW              Member_type   -1433.      51.8    -27.7  3.40e-159
 7 Georgia Ave & Spring St     (Intercept)    2840.     273.      10.4  2.80e- 23
 8 Georgia Ave & Spring St     Member_type   -1856.     310.      -5.99 3.78e-  9
 9 19th & Savannah St SE       (Intercept)    1180.     127.       9.28 7.50e-  4
10 19th & Savannah St SE       Member_type    -449.     220.      -2.04 1.11e-  1
# … with more rows
```
]

---

## References

- https://spark.rstudio.com/
- http://spark.apache.org/docs/latest/api/R/index.html
- http://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/databricks/spark-intro.pdf
- OST_R | BTS | Transtats. (2020). Transtats.bts.gov.
  https://www.transtats.bts.gov/tables.asp?Table_ID=236&SYS_Table_Name=T_ONTIME_REPORTING
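---

## Cleanup

When you are done working, disconnect from the local Spark instance to free
its resources.

```r
# close the connection created at the start of these slides
spark_disconnect(sc)
```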