---
title: "Spark & sparklyr part II"
subtitle: "Programming for Statistical Science"
author: "Shawn Santo"
institute: ""
date: ""
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
editor_options:
  chunk_output_type: console
---
```{r include=FALSE}
knitr::opts_chunk$set(eval = FALSE, echo = TRUE, message = FALSE,
                      warning = FALSE, comment = "#>", highlight = TRUE,
                      fig.align = "center")
```
## Supplementary materials
Full video lecture available in Zoom Cloud Recordings
Additional resources
- [`sparklyr`: R interface for Apache Spark](https://spark.rstudio.com/)
- [R Front End for Apache Spark](http://spark.apache.org/docs/latest/api/R/index.html)
- [Mastering Spark with R](https://therinspark.com)
---
class: inverse, center, middle
# Recall
---
## The Spark ecosystem
![](images/spark_ecosystem.png)
---
## What is `sparklyr`?
Package `sparklyr` provides an R interface for Spark. It works with any version
of Spark.
- Use `dplyr` to translate R code into Spark SQL
- Work with Spark's MLlib
- Interact with a stream of data
--
The interface between R and Spark is young. If you know Scala, a great project
would be to contribute to this R and Spark interaction by making Spark libraries
available as an R package.
---
## Workflow
*Source*: https://therinspark.com/
---
class: inverse, center, middle
# Preliminaries
---
## Configure and connect
```{r}
library(sparklyr)
library(tidyverse)
library(future)
# add some custom configurations
conf <- list(
  sparklyr.cores.local = 4,
  `sparklyr.shell.driver-memory` = "16G",
  spark.memory.fraction = 0.5
)
```
- `sparklyr.cores.local`: defaults to using all of the available cores
- `sparklyr.shell.driver-memory`: the limit is the amount of RAM available on
the computer minus what the OS needs
- `spark.memory.fraction`: defaults to 60% of the requested memory
per executor
```{r}
# create a spark connection
sc <- spark_connect(master = "local", version = "3.0", config = conf)
```
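The values above depend on the machine. As a rough sketch, the core count can be derived from the hardware rather than hard-coded. The name `conf_auto` and the choice to leave one core free are assumptions made for illustration, not `sparklyr` recommendations.

```{r}
# number of logical cores; detectCores() may return NA on some platforms
n_cores <- parallel::detectCores()
if (is.na(n_cores)) n_cores <- 1L

# assumption: leave one core free for the OS, but always use at least one
conf_auto <- list(
  sparklyr.cores.local = max(1L, n_cores - 1L),
  `sparklyr.shell.driver-memory` = "16G",
  spark.memory.fraction = 0.5
)
```

Passing `config = conf_auto` to `spark_connect()` would then use these settings.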
---
class: inverse, center, middle
# Spark Streaming
---
## What is Spark Streaming?
>"Spark Streaming makes it easy to build scalable fault-tolerant streaming
applications."
Streaming data:
- Financial asset prices (stocks, futures, cryptocurrency, etc.)
- Twitter feed
- Purchase orders on Amazon
Think of streaming data as real-time data. Streams are most relevant when we
want to process and analyze this data in real time.
---
## The role of `sparklyr`
`sparklyr` provides an R interface for interacting with Spark Streaming by
allowing you to
- run `dplyr`, SQL, and pipeline machine learning models against a stream of
data;
- read in many file formats (CSV, text, JSON, parquet, etc.) from a stream
source;
- write stream results in the file formats specified above;
- integrate with Shiny so that your app can get the contents of a
stream.
---
## Spark Streaming process
Streams in Spark follow a **source** (think reading), **transformation**, and
**sink** (think writing) process.
--
**Source:**
There exists a set of `stream_read_*()` functions in `sparklyr` for reading
the specified file type in as a Spark DataFrame stream.
--
**Transformation:**
Spark (via `sparklyr`) can then perform data wrangling, manipulations, and joins
with other streaming or static data, machine learning pipeline predictions, and
other R manipulations.
--
**Sink:**
There exists a set of `stream_write_*()` functions in `sparklyr` for writing
a Spark DataFrame stream as the specified file type.
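The three steps can be sketched as one pipeline. This is a minimal illustration, not a prescribed pattern: the helper `start_filter_stream()`, its arguments, and the `price` column are hypothetical, and it assumes the input folder already holds at least one CSV file from which the columns can be inferred.

```{r}
library(sparklyr)
library(dplyr)

# hypothetical helper: defining it does not start anything;
# calling it with an open connection `sc` starts the stream
start_filter_stream <- function(sc, in_dir, out_dir, min_price = 1000) {
  stream_read_csv(sc, path = in_dir) %>%    # source: read CSV files as a stream
    filter(price > !!min_price) %>%         # transformation: keep some rows
    stream_write_parquet(path = out_dir)    # sink: write the result as parquet
}
```

A call such as `start_filter_stream(sc, "input_source/", "output_source/")` would return the stream object, which can later be passed to `stream_stop()`.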
---
## Toy example
Let's leave out the transformation step and simply define a streaming process
that reads files from a folder `input_source/` and immediately writes them to a
folder `output_source/`.
```{r}
dir.create("input_source/")
dir.create("output_source/")
stream <- stream_read_text(sc, path = "input_source/") %>%
  stream_write_text(path = "output_source/")
```
--
Generate 100 test files to check that they are read from and written to the
correct directories. Function `stream_view()` launches a Shiny gadget
to visualize the given stream, showing the rows per second (rps) being
read and written.
```{r}
stream_generate_test(interval = .2, iterations = 100,
                     path = "input_source/")
stream_view(stream)
```
--
Stop the stream and remove the `input_source/` and `output_source/` directories.
```{r}
stream_stop(stream)
unlink("input_source/", recursive = TRUE)
unlink("output_source/", recursive = TRUE)
```
---
## Stream viewer
---
## Toy example details
```{r}
stream <- stream_read_text(sc, path = "input_source/") %>%
  stream_write_text(path = "output_source/") #<<
```
The output writer is what starts the streaming job. It will start monitoring
the input folder, and then write the new results in the `output_source/` folder.
--
The stream query defaults to micro-batches running every 5 seconds. This
can be adjusted with `stream_trigger_interval()` and
`stream_trigger_continuous()`.
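As a sketch of adjusting the trigger, the toy stream could be given an explicit one-second interval through the `trigger` argument of the writer. The helper `start_copy_stream()` and the object name `every_second` are hypothetical; the folders are those from the toy example.

```{r}
library(sparklyr)
library(dplyr)

# trigger that fires a micro-batch every 1000 ms (1 second)
every_second <- stream_trigger_interval(interval = 1000)

# hypothetical helper: the toy stream with an explicit trigger;
# calling it with an open connection `sc` starts the job
start_copy_stream <- function(sc, trigger) {
  stream_read_text(sc, path = "input_source/") %>%
    stream_write_text(path = "output_source/", trigger = trigger)
}
```

Calling `start_copy_stream(sc, every_second)` would start the same job as above, only with the interval stated explicitly.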
---
## Example with transformations
Using the tibble `diamonds` from `ggplot2`, let's create a stream, do some
aggregation, and output the process to memory as a Spark DataFrame. Using Spark
memory as the target will allow for aggregation to happen during processing.
*On all but Kafka, aggregation is not allowed for any file output.*
```{r}
dir.create("input_source/")
stream_generate_test(df = diamonds, path = "input_source/",
                     iterations = 1)
```
```{r}
stream <- stream_read_csv(sc, path = "input_source/") %>%
  select(price) %>%
  stream_watermark() %>%    # add a timestamp
  group_by(timestamp) %>%   # do a grouping by the timestamp
  summarise(
    min_price  = min(price, na.rm = TRUE),
    max_price  = max(price, na.rm = TRUE),
    mean_price = mean(price, na.rm = TRUE),
    count      = n()
  ) %>%
  stream_write_memory(name = "diamonds_sdf")
```
Object `diamonds_sdf` will be a Spark DataFrame to which our summarized
streaming computations are written.
---
## Example with transformations
Generate some test data using `diamonds`.
```{r}
stream_generate_test(df = diamonds, path = "input_source/", iterations = 10)
```
We can periodically check the results.
```{r}
tbl(sc, "diamonds_sdf")
```
--
Stop the stream and remove the `input_source/` directory.
```{r}
stream_stop(stream)
unlink("input_source/", recursive = TRUE)
```
---
## Shiny and streaming
Shiny’s reactive framework is well suited to support streaming information,
which you can use to display real-time data from Spark using `reactiveSpark()`.
It takes a Spark DataFrame (or an object coercible to one) and
returns a reactive data source. You can use it much as you would use
reactive tibble objects.
--
To demonstrate the functionality of `reactiveSpark()`, we'll again use the
NYC yellow taxi trip data from January 2009.
https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
```{r eval=FALSE}
taxi_path <- str_c("/home/fac/sms185/.public_html/data/taxi/",
                   "yellow_tripdata_2009-01.csv")
taxi_tbl <- spark_read_csv(sc, name = "yellow_taxi_2009",
                           path = taxi_path)
```
---
## Data preview
.small[
```{r eval=FALSE}
glimpse(taxi_tbl)
```
```{r eval=FALSE}
*Rows: ??
Columns: 18
*Database: spark_connection
$ vendor_name "VTS", "VTS", "VTS", "DDS", "DDS", "DDS", "DDS", "V…
$ Trip_Pickup_DateTime 2009-01-04 02:52:00, 2009-01-04 03:31:00, 2009-01-…
$ Trip_Dropoff_DateTime 2009-01-04 03:02:00, 2009-01-04 03:38:00, 2009-01-…
$ Passenger_Count 1, 3, 5, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, …
$ Trip_Distance 2.63, 4.55, 10.35, 5.00, 0.40, 1.20, 0.40, 1.72, 1.…
$ Start_Lon -73.99196, -73.98210, -74.00259, -73.97427, -74.001…
$ Start_Lat 40.72157, 40.73629, 40.73975, 40.79095, 40.71938, 4…
$ Rate_Code "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA…
$ store_and_forward "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA…
$ End_Lon -73.99380, -73.95585, -73.86998, -73.99656, -74.008…
$ End_Lat 40.69592, 40.76803, 40.77023, 40.73185, 40.72035, 4…
$ Payment_Type "CASH", "Credit", "Credit", "CREDIT", "CASH", "CASH…
$ Fare_Amt 8.9, 12.1, 23.7, 14.9, 3.7, 6.1, 5.7, 6.1, 8.7, 5.9…
$ surcharge 0.5, 0.5, 0.0, 0.5, 0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0…
$ mta_tax "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA…
$ Tip_Amt 0.00, 2.00, 4.74, 3.05, 0.00, 0.00, 1.00, 0.00, 1.3…
$ Tolls_Amt 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Total_Amt 9.40, 14.60, 28.44, 18.45, 3.70, 6.60, 6.70, 6.60, …
```
]
---
## Sample Taxi data
Define a bounding box for NYC.
```{r}
min_lat <- 40.5774
max_lat <- 40.9176
min_lon <- -74.15
max_lon <- -73.7004
```
--
Take a sample of about 10% of the trips, where the trip start is within our
bounding box defined above.
```{r}
taxi <- taxi_tbl %>%
  sample_frac(size = 0.1) %>%
  collect() %>%
  janitor::clean_names() %>%
  filter(start_lon >= min_lon, start_lon <= max_lon,
         start_lat >= min_lat, start_lat <= max_lat)
```
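To sanity-check the bounding box logic, here is a small base R sketch. The helper `in_nyc_box()` is hypothetical, and the two test coordinates are approximate values chosen only for illustration.

```{r}
# bounding box from the slide above
min_lat <- 40.5774
max_lat <- 40.9176
min_lon <- -74.15
max_lon <- -73.7004

# hypothetical helper: TRUE when a point falls inside the box
in_nyc_box <- function(lon, lat) {
  lon >= min_lon & lon <= max_lon & lat >= min_lat & lat <= max_lat
}

# approximate coordinates, for illustration only
pts <- data.frame(
  place = c("Times Square", "Boston"),
  lon   = c(-73.9855, -71.0589),
  lat   = c(40.7580, 42.3601)
)
pts$in_box <- in_nyc_box(pts$lon, pts$lat)
pts
```

The first point lies inside the box and the second does not, which is the same test the `filter()` call applies to every sampled trip.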
---
## Streaming Shiny gadget
```{r}
library(shiny)
```
```{r}
unlink("shiny-stream", recursive = TRUE)
dir.create("shiny-stream", showWarnings = FALSE)
```
--
We'll generate the test data with our own code.
```{r}
library(tidyverse)
# write one row of x to its own CSV file, then pause
write_stream_csv <- function(x, row, path = "shiny-stream/", pause = 2) {
  x %>%
    slice(row) %>%
    write_csv(file = str_c(path, "stream_", row, ".csv"))
  Sys.sleep(pause)
}
trips <- sample(seq_len(nrow(taxi)))  # random ordering of the row indices
walk(trips, write_stream_csv, x = taxi)
```
Run this as a local background job from a script file. This way you can
launch the Shiny App (on the next slide) in RStudio.
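One possible way to start such a job from the console is through the `rstudioapi` package; this is a sketch under assumptions, not the only route. The file name `stream_job.R` and the job name are hypothetical, and the script is assumed to contain the code above.

```{r}
# hypothetical file name; the script should contain the code above
job_script <- "stream_job.R"

# only attempt this inside RStudio, with rstudioapi installed
if (requireNamespace("rstudioapi", quietly = TRUE) &&
    rstudioapi::isAvailable() && file.exists(job_script)) {
  rstudioapi::jobRunScript(
    path = job_script,
    name = "taxi stream",
    workingDir = getwd(),
    importEnv = TRUE   # give the job a copy of the global environment (e.g. taxi)
  )
}
```

Setting `importEnv = TRUE` is a design choice here so that the job can see the `taxi` tibble created earlier in the session.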
---
## Streaming Shiny gadget
Once the local job starts running, launch the app to see how the plot updates
as more simulated taxi trips arrive.
```{r}
ui <- function() {
  plotOutput("taxi_plot")
}
server <- function(input, output, session) {
  taxi_stream <- stream_read_csv(sc, path = "shiny-stream") %>%
    reactiveSpark() #<<
  output$taxi_plot <- renderPlot({
    ggplot(taxi_stream(), aes(y = start_lat, x = start_lon)) +
      geom_point(alpha = 0.3) +
      labs(y = "Latitude", x = "Longitude") +
      theme_bw(base_size = 16)
  })
}
runGadget(ui, server)
```
---
## References
1. A Gentle Introduction to Apache Spark. (2020).
http://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/databricks/spark-intro.pdf.
2. Luraschi, J., Kuo, K., & Ruiz, E. (2020). Mastering Spark with R. https://therinspark.com/.
3. R Front End for Apache Spark. (2020).
http://spark.apache.org/docs/latest/api/R/index.html.
4. sparklyr. (2020). https://spark.rstudio.com/.