---
title: "Spark & sparklyr part I"
subtitle: "Statistical Computing & Programming"
author: "Shawn Santo"
institute: ""
date: "06-19-20"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
editor_options:
  chunk_output_type: console
---
```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
                      comment = "#>", highlight = TRUE,
                      fig.align = "center")
library(sparklyr)
library(tidyverse)
```
## Supplementary materials
Companion videos
- [Introduction to Spark](https://warpwire.duke.edu/w/xdkDAA/)
- [Getting started with `sparklyr`](https://warpwire.duke.edu/w/ydkDAA/)
- [Fixing variable names](https://warpwire.duke.edu/w/y9kDAA/)
- [Summaries and analysis](https://warpwire.duke.edu/w/zdkDAA/)
Additional resources
- [`sparklyr`: R interface for Apache Spark](https://spark.rstudio.com/)
- [R Front End for Apache Spark](http://spark.apache.org/docs/latest/api/R/index.html)
---
class: inverse, center, middle
# Spark
---
## What is Apache Spark?
- As described by Databricks, "Spark is a unified computing engine and a set
of libraries for parallel data processing on computing clusters".
- Spark's goal is to support data analytics tasks within a single ecosystem:
data loading, SQL queries, machine learning, and streaming computations.
- Spark is written in Scala and runs on the Java Virtual Machine (JVM).
  However, Spark can be used from R, Python, SQL, Scala, or Java.
---
## Spark ecosystem
![](images/spark-ecosystem.png)
---
## Key features
- In-memory computation
- Fast and scalable
  - efficiently scale up from one to many thousands of compute nodes
- Access data on a multitude of platforms
  - SQL and NoSQL databases
  - Cloud storage
  - Hadoop Distributed File System
- Real-time stream processing
- Libraries
  - Spark SQL
  - MLlib
  - Spark Streaming
  - GraphX
---
class: inverse, center, middle
# Installing and connecting to Spark
---
## Install
We'll be able to install Spark and set up a connection through the helper
functions in package `sparklyr`. More on this in a moment.
```{r}
library(sparklyr)
```
```{r}
sparklyr::spark_available_versions()
```
--
Let's install version 2.4 of Spark for use with a local Spark connection
via `spark_install(version = "2.4")`, as shown below.
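
The installation only needs to happen once; a minimal call looks like this
(the download may take a few minutes).

```{r eval=FALSE}
# one-time local installation of Spark 2.4
spark_install(version = "2.4")
```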
---
## Configure and connect
```{r eval=FALSE}
# add some custom configurations
conf <- list(
  sparklyr.cores.local = 4,
  `sparklyr.shell.driver-memory` = "16G",
  spark.memory.fraction = 0.5
)
```
- `sparklyr.cores.local` - defaults to using all of the available cores
- `sparklyr.shell.driver-memory` - limit is the amount of RAM available in the
  computer minus what would be needed for OS operations
- `spark.memory.fraction` - default is set to 60% of the requested memory
  per executor
```{r echo=FALSE}
Sys.setenv(JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home")
```
```{r eval=FALSE}
# create a spark connection
sc <- spark_connect(master = "local",
                    version = "2.4.0",
                    config = conf)
```
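
When you are finished working, the connection can be closed with
`spark_disconnect()`.

```{r eval=FALSE}
# close the Spark connection when done
spark_disconnect(sc)
```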
---
class: inverse, center, middle
# Working with `sparklyr`
---
## What is `sparklyr`?
Package `sparklyr` provides an R interface for Spark.
- Use `dplyr` to translate R code into Spark SQL
- Work with Spark's MLlib
- Interact with a stream of data
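
As a minimal sketch of the first point (assuming the local connection `sc`
from the previous slide), copy a small local data frame to Spark and inspect
the SQL that `dplyr` generates:

```{r eval=FALSE}
# copy a small local data frame into Spark
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark")

# dplyr verbs are translated to Spark SQL and run inside Spark
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  show_query()
```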
---
## Adding data
https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Put the January 2009 Yellow Cab data in your working directory or another
location where it can be accessed.
.tiny[
```{r eval=FALSE}
cab <- spark_read_csv(sc, name = "cab",
                      path = "~/.public_html/data/taxi/yellow_tripdata_2009-01.csv")
```
]
Data can also be read into Spark DataFrames with `spark_read_json()`
and `spark_read_parquet()`.
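
For reference, the JSON and Parquet readers follow the same pattern as
`spark_read_csv()`; the paths below are hypothetical.

```{r eval=FALSE}
# hypothetical file locations, same interface as spark_read_csv()
trips_json    <- spark_read_json(sc, name = "trips_json",
                                 path = "data/trips.json")
trips_parquet <- spark_read_parquet(sc, name = "trips_parquet",
                                    path = "data/trips_parquet")
```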
Go to http://www2.stat.duke.edu/~sms185/data/taxi/yellow_tripdata_2009-01.csv to
download the data.
---
.tiny[
```{r eval=FALSE}
glimpse(cab)
```
```{r eval=FALSE}
Observations: ??
Variables: 18
Database: spark_connection
$ vendor_name "VTS", "VTS", "VTS", "DDS", "DDS", "DDS", "DDS", "VTS"…
$ Trip_Pickup_DateTime 2009-01-04 07:52:00, 2009-01-04 08:31:00, 2009-01-03 …
$ Trip_Dropoff_DateTime 2009-01-04 08:02:00, 2009-01-04 08:38:00, 2009-01-03 …
$ Passenger_Count 1, 3, 5, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, …
$ Trip_Distance 2.63, 4.55, 10.35, 5.00, 0.40, 1.20, 0.40, 1.72, 1.60,…
$ Start_Lon -73.99196, -73.98210, -74.00259, -73.97427, -74.00158,…
$ Start_Lat 40.72157, 40.73629, 40.73975, 40.79095, 40.71938, 40.7…
$ Rate_Code NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ store_and_forward NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ End_Lon -73.99380, -73.95585, -73.86998, -73.99656, -74.00838,…
$ End_Lat 40.69592, 40.76803, 40.77023, 40.73185, 40.72035, 40.7…
$ Payment_Type "CASH", "Credit", "Credit", "CREDIT", "CASH", "CASH", …
$ Fare_Amt 8.9, 12.1, 23.7, 14.9, 3.7, 6.1, 5.7, 6.1, 8.7, 5.9, 2…
$ surcharge 0.5, 0.5, 0.0, 0.5, 0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0,…
$ mta_tax NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Tip_Amt 0.00, 2.00, 4.74, 3.05, 0.00, 0.00, 1.00, 0.00, 1.30, …
$ Tolls_Amt 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Total_Amt 9.40, 14.60, 28.44, 18.45, 3.70, 6.60, 6.70, 6.60, 10.…
```
]
It looks like we have some variable naming inconsistencies we should clean
up.
---
## Try some cleaning
Clean the names up with `janitor::clean_names()`.
```{r eval=FALSE}
janitor::clean_names(cab)
```
```{r eval=FALSE}
Error in clean_names.default(cab) :
clean_names() must be called on a data.frame.
Consider janitor::make_clean_names() for other
cases of manipulating vectors of names.
```
What's happening?
--
```{r eval=FALSE}
names(cab)
[1] "src" "ops"
```
Object `cab` is not a local data frame. It is a remote table reference: a list
containing the connection (`src`) and the accumulated query operations (`ops`).
What can we do?
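
One option, hinted at by the error message, is to build cleaned names from the
existing ones (a sketch; assumes `janitor` is installed):

```{r eval=FALSE}
# generate lower, snake_case names from the current column names
janitor::make_clean_names(colnames(cab))
```

The next slides take a similar, more targeted approach.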
---
## Fix names
Function `colnames()` seems to work.
```{r eval=FALSE}
colnames(cab)
```
```{r eval=FALSE}
[1] "vendor_name" "Trip_Pickup_DateTime" "Trip_Dropoff_DateTime" "Passenger_Count"
[5] "Trip_Distance" "Start_Lon" "Start_Lat" "Rate_Code"
[9] "store_and_forward" "End_Lon" "End_Lat" "Payment_Type"
[13] "Fare_Amt" "surcharge" "mta_tax" "Tip_Amt"
[17] "Tolls_Amt" "Total_Amt"
```
--
Create a function, `fix_names()`, that gives our `cab` object a more uniform
naming structure.
```{r}
fix_names <- function(x) {
  colnames(x) %>%                               # current column names
    tolower() %>%                               # make them all lower case
    stringr::str_remove(pattern = "trip_") %>%  # drop the "trip_" prefix
    setNames(x, .)                              # set the cleaned names on x
}
```
---
```{r eval=FALSE}
cab <- fix_names(cab)
```
--
```{r eval=FALSE}
colnames(cab)
```
```{r eval=FALSE}
[1] "vendor_name" "pickup_datetime" "dropoff_datetime"
[4] "passenger_count" "distance" "start_lon"
[7] "start_lat" "rate_code" "store_and_forward"
[10] "end_lon" "end_lat" "payment_type"
[13] "fare_amt" "surcharge" "mta_tax"
[16] "tip_amt" "tolls_amt" "total_amt"
```
---
## Fix `payment_type`
.tiny[
```{r eval=FALSE}
cab %>% group_by(payment_type) %>% summarise(count = n())
```
```{r eval=FALSE}
# Source: spark<?> [?? x 2] #<<
payment_type count
1 No Charge 40118
2 CASH 6024471
3 Credit 2865982
4 Cash 4995101
5 Dispute 8050
6 CREDIT 158691
```
]
--
.tiny[
```{r eval=FALSE}
cab %>%
mutate(payment_type = tolower(payment_type)) %>%
show_query() #<<
```
```{r eval=FALSE}
SELECT `vendor_name`, `pickup_datetime`, `dropoff_datetime`,
`passenger_count`, `distance`, `start_lon`, `start_lat`, `rate_code`,
`store_and_forward`, `end_lon`, `end_lat`, LOWER(`payment_type`) AS
`payment_type`, `fare_amt`, `surcharge`, `mta_tax`, `tip_amt`,
`tolls_amt`, `total_amt`
FROM `sparklyr_tmp_614420ac929`
```
]
---
Transform all payment types to lower case and register the result as another
Spark DataFrame.
```{r eval=FALSE}
cab %>%
mutate(payment_type = tolower(payment_type)) %>%
sdf_register("cab_clean_pymt") #<<
```
```{r eval=FALSE}
cab_clean_pymt <- sdf_load_table(sc, "cab_clean_pymt")
```
```{r eval=FALSE}
src_tbls(sc)
```
```{r eval=FALSE}
[1] "cab" "cab_clean_pymt"
```
--
Using the new Spark DataFrame, bring everything back to R with `collect()`.
```{r eval=FALSE}
cab_clean_pymt %>%
group_by(payment_type) %>%
summarise(count = n()) %>%
collect() #<<
```
```{r eval=FALSE}
# A tibble: 4 x 2
payment_type count
1 cash 11019572
2 no charge 40118
3 dispute 8050
4 credit 3024673
```
---
## Summaries and analysis
Let's compute some summary information about taxi trips.
```{r eval=FALSE}
cab_clean_pymt %>%
  select(passenger_count, distance, tip_amt, fare_amt, total_amt) %>%
  mutate(
    cost_per_passenger = total_amt / passenger_count,
    tip_pct = tip_amt / fare_amt,
    cost_per_mile = fare_amt / distance
  )
```
--
.tiny[
```{r eval=FALSE}
# Source: spark<?> [?? x 8]
passenger_count distance tip_amt fare_amt total_amt
1 1 2.63 0 8.9 9.4
2 3 4.55 2 12.1 14.6
3 5 10.4 4.74 23.7 28.4
4 1 5 3.05 14.9 18.4
5 1 0.4 0 3.7 3.7
6 2 1.2 0 6.1 6.6
7 1 0.4 1 5.7 6.7
8 1 1.72 0 6.1 6.6
9 1 1.6 1.3 8.7 10
10 1 0.7 0 5.9 5.9
# … with more rows, and 3 more variables: cost_per_passenger <dbl>,
#   tip_pct <dbl>, cost_per_mile <dbl>
```
]
---
What query is being sent to Spark?
```{r eval=FALSE}
cab_clean_pymt %>%
  select(passenger_count, distance, tip_amt, fare_amt, total_amt) %>%
  mutate(
    cost_per_passenger = total_amt / passenger_count,
    tip_pct = tip_amt / fare_amt,
    cost_per_mile = fare_amt / distance
  ) %>%
  show_query() #<<
```
--
```{r eval=FALSE}
SELECT `passenger_count`, `distance`, `tip_amt`, `fare_amt`,
`total_amt`,
`total_amt` / `passenger_count` AS `cost_per_passenger`,
`tip_amt` / `fare_amt` AS `tip_pct`,
`fare_amt` / `distance` AS `cost_per_mile`
FROM `sparklyr_tmp_a2a7a060b11`
```
---
Next, let's parse `pickup_datetime` to get the hour, the day (as a name), and
the month (as a name). We'll also include the tip percentage and the fare cost
per mile. Lastly, we'll compute some summary measures.
See http://spark.apache.org/docs/latest/api/R/index.html for how to
use `date_format()`.
```{r eval=FALSE}
cab_summary <- cab_clean_pymt %>%
  select(pickup_datetime, dropoff_datetime, distance,
         fare_amt, tip_amt, total_amt) %>%
  mutate(
    pickup_hour = hour(pickup_datetime),
    pickup_day = date_format(pickup_datetime, "EEE"),
    pickup_month = date_format(pickup_datetime, "MMM"),
    tip_pct = tip_amt / fare_amt,
    fare_per_mile = fare_amt / distance
  ) %>%
  group_by(pickup_hour, pickup_day) %>%
  summarise(
    avg_dist = mean(distance),
    avg_fare = mean(fare_amt),
    avg_tip_pct = mean(tip_pct),
    avg_fare_per_mile = mean(fare_per_mile)
  ) %>%
  collect()
```
---
```{r eval=FALSE}
cab_summary
```
Note that this is now a local R object (a tibble), since we used `collect()`.
.tiny[
```{r eval=FALSE}
# A tibble: 168 x 6 #<<
pickup_hour pickup_day avg_dist avg_fare avg_tip_pct avg_fare_per_mile
1 1 Thu 3.08 10.5 0.0448 5.33
2 7 Sun 3.68 11.6 0.0401 5.32
3 8 Sat 2.82 9.37 0.0377 5.00
4 6 Tue 2.85 9.33 0.0375 5.29
5 11 Fri 2.33 9.23 0.0400 6.25
6 0 Wed 3.16 10.4 0.0530 5.38
7 6 Mon 3.12 9.95 0.0377 5.16
8 17 Mon 2.41 9.36 0.0440 5.91
9 16 Thu 2.34 9.85 0.0408 13.0
10 10 Wed 2.13 8.98 0.0457 6.49
# … with 158 more rows
```
]
---
We can now use `cab_summary` just as we would any other object in R.
```{r eval=FALSE}
cab_summary %>%
  ggplot(aes(x = pickup_hour, y = avg_tip_pct)) +
  geom_line() +
  facet_wrap(~factor(pickup_day, levels = c("Sun", "Mon", "Tue", "Wed",
                                            "Thu", "Fri", "Sat"))) +
  labs(x = "Pick-up hour", y = "Average tip percentage") +
  theme_bw()
```
---
class: center, middle
![](images/taxi_plot.png)
---
## Other functions
- Perform joins with the `*_join()` family of functions.
- Sampling can be done with `sample_n()` and `sample_frac()`.
- Write the results of your analysis into persistent storage with
`spark_write_parquet()`, `spark_write_csv()`, or `spark_write_json()`.
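
For instance, a small sketch combining sampling and writing (this assumes the
connection and the `cab_clean_pymt` Spark DataFrame from earlier; the output
path is hypothetical):

```{r eval=FALSE}
# sample 1% of the trips in Spark, then write the result to disk as CSV
cab_clean_pymt %>%
  sample_frac(0.01) %>%
  spark_write_csv(path = "yellow_tripdata_sample_csv")
```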
---
## Family of `sparklyr` functions
| Sparklyr family of functions | Description                                                                                 |
|-----------------------------:|:--------------------------------------------------------------------------------------------|
| `spark_*()`                  | functions to manage and configure Spark connections; functions to read and write data        |
| `sdf_*()`                    | functions for manipulating Spark DataFrames                                                   |
| `ft_*()`                     | feature transformers for manipulating individual features                                    |
| `ml_*()`                     | machine learning algorithms: K-means, GLM, survival regression, PCA, naive Bayes, and more   |
| `stream_*()`                 | functions for handling stream data                                                           |
---
## Exercise
Navigate to the [BTS Reporting Carrier On-Time Performance (1987-present)](https://www.transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data) and click on "Download" under Reporting Carrier On-Time Performance
(1987-present).
1. Download the 2001 airline data. Choose a subset of variables as detailed
   [here](http://stat-computing.org/dataexpo/2009/the-data.html).
2. Read it into a Spark DataFrame.
3. Drop columns that have all `NA` values, add names for month and day of
week, and try to fix the plane tail number.
4. Summarize the data based on on-time performance for flights before, on,
   and after September 11.
---
## References
- https://spark.rstudio.com/
- http://spark.apache.org/docs/latest/api/R/index.html
- http://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/databricks/spark-intro.pdf