class: center, middle, inverse, title-slide

# Spark & sparklyr part II
## Programming for Statistical Science
### Shawn Santo

---

## Supplementary materials

Full video lecture available in Zoom Cloud Recordings

Additional resources

- [`sparklyr`: R interface for Apache Spark](https://spark.rstudio.com/)
- [R Front End for Apache Spark](http://spark.apache.org/docs/latest/api/R/index.html)
- [Mastering Spark with R](https://therinspark.com)

---

class: inverse, center, middle

# Recall

---

## What is Apache Spark?

- As described by Databricks, "Spark is a unified computing engine and a set of libraries for parallel data processing on computing clusters".

--

- Spark's goal is to support data analytics tasks within a single ecosystem: data loading, SQL queries, machine learning, and streaming computations.

--

- Spark is written in Scala and runs on the Java Virtual Machine. However, Spark can be used from R, Python, SQL, Scala, or Java.

---

## The Spark ecosystem



---

## Spark's key features

- In-memory computation

- Fast and scalable
    - Efficiently scale up from one to many thousands of compute nodes

- Access data on a multitude of platforms
    - SQL and NoSQL databases
    - Cloud storage
    - Hadoop Distributed File System

- Real-time stream processing

- Libraries
    - Spark SQL
    - MLlib
    - Spark Streaming
    - GraphX

---

## What is `sparklyr`?

Package `sparklyr` provides an R interface for Spark. It works with any
version of Spark.

- Use `dplyr` to translate R code into Spark SQL
- Work with Spark's MLlib
- Interact with a stream of data

<center>
<img src="images/spark_ecosystem.png" height=200 width=400>
</center>

--

The interface between R and Spark is young. If you know Scala, contributing to
this R and Spark interface by making more Spark libraries available as R
packages would be a great project.

---

class: inverse, center, middle

# Connecting to Spark

---

## Configure and connect

```r
library(tidyverse)
library(sparklyr)
```

```r
# add some custom configurations
conf <- list(
  sparklyr.cores.local = 4,
  `sparklyr.shell.driver-memory` = "16G",
  spark.memory.fraction = 0.5
)
```

```r
# create a spark connection
sc <- spark_connect(master = "local", version = "3.1", config = conf)
```

---

class: inverse, center, middle

# R functions and Spark

---

## Distributed R

We've seen that in our data wrangling we can use `dplyr`, some base R
functions, `sparklyr` functions, and Hive functions. If none of these options
covers what you need, it is possible to apply an R function to a Spark
DataFrame.

```r
diamonds_tbl <- copy_to(sc, diamonds)
```

```r
diamonds_tbl %>%
  select(carat, price) %>%
  scale()
```

```r
Error: Unable to retrieve a spark_connection from object of class NULL
```

---

## Last resort: `spark_apply()`

```r
start <- Sys.time()
diamonds_tbl %>%
  select(carat, price) %>%
* spark_apply(function(x) scale(x))
end <- Sys.time()
end - start
```

```r
Time difference of 2.62 mins
```

--

```r
start <- Sys.time()
diamonds_tbl %>%
  select(carat, price) %>%
  mutate(carat = (carat - mean(carat, na.rm = TRUE)) / sd(carat, na.rm = TRUE),
         price = as.double(price),
         price = (price - mean(price, na.rm = TRUE)) / sd(price, na.rm = TRUE))
end <- Sys.time()
end - start
```

```r
Time difference of 0.58 secs
```

---

## Why so slow?

Since we are applying an R function, the data is not processed by Spark.
What happens:

1. chunks of the data are moved from Spark to R
2. the data is converted to an appropriate R format (a data frame)
3. the R function is applied
4. the results are converted back to a Spark-compatible format and sent back to Spark

--

If you can, try to use `dplyr` or code Spark can understand.
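--

One way to check that a wrangling step stays inside Spark is to ask `dplyr`
for the SQL it will send. A minimal sketch, assuming the `diamonds_tbl` from
above:

```r
# sketch: if dplyr can render the operation as SQL, the computation
# runs in Spark and no rows are shipped to R
diamonds_tbl %>%
  select(carat, price) %>%
  mutate(price = as.double(price),
         price = (price - mean(price, na.rm = TRUE)) / sd(price, na.rm = TRUE)) %>%
  show_query()
```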
---

## Group DataFrame partitions

```r
diamonds_tbl %>%
  spark_apply(
    function(x) summary(lm(price ~ carat, x))$r.squared,
    names = "r.squared",
    group_by = "cut"
  )
```

```r
# Source: spark<?> [?? x 2]
  cut       r.squared
  <chr>         <dbl>
1 Premium       0.856
2 Ideal         0.867
3 Good          0.851
4 Fair          0.738
5 Very Good     0.858
```

--

Check that this is correct.

???

```r
library(broom)

diamonds %>%
  group_by(cut) %>%
  group_modify(~glance(lm(price ~ carat, .))["r.squared"])
```

---

class: inverse, center, middle

# ML Pipelines

---

## What is an `ml_pipeline`?

Spark's ML Pipelines provide a way to easily combine multiple transformations
and algorithms into a single workflow, or pipeline.

Some Spark terminology:

- **Transformer**: a transformer is an algorithm which can transform one DataFrame into another DataFrame

- **Estimator**: an estimator is an algorithm which can be fit on a DataFrame to produce a Transformer

- **Pipeline**: a pipeline chains multiple Transformers and Estimators together to specify a machine learning workflow

- **Pipeline model**: a pipeline that has been trained on data so all of its components have been converted to transformers

---

## Example: estimator

```r
standardizer <- ft_standard_scaler(sc,
                                   input_col = "predictors",
                                   output_col = "predictors_standardized",
                                   with_mean = TRUE)
standardizer
```

```r
StandardScaler (Estimator)
<standard_scaler__cad4bfc6_f41a_4bd4_bd6b_90cd54d4c071>
 (Parameters -- Column Names)
  input_col: predictors
  output_col: predictors_standardized
 (Parameters)
  with_mean: TRUE
  with_std: TRUE
```

---

## Example: transformer

```r
random_df <- copy_to(sc, data.frame(value = rpois(100000, 9))) %>%
  ft_vector_assembler(input_cols = "value", output_col = "predictors")
```

```r
standardizer_algo <- ml_fit(standardizer, random_df)
standardizer_algo
```

```r
StandardScalerModel (Transformer)
<standard_scaler__cad4bfc6_f41a_4bd4_bd6b_90cd54d4c071>
 (Parameters -- Column Names)
  input_col: predictors
  output_col: predictors_standardized
 (Transformer Info)
*  mean:  num 9
*  std:  num 3.01
```

???

Many algorithms and feature transformers require that the input be a vector
column.

---

## Example: transformer

We can now feed the transformer some data. This could be our `random_df` or a
new dataset (think train / test).

```r
standardizer_algo %>%
  ml_transform(random_df) %>%
  glimpse()
```

--

```r
Rows: ??
Columns: 3
Database: spark_connection
$ value                   <int> 6, 12, 9, 7, 5, 8, 13, 10, 12, …
$ predictors              <list> 6, 12, 9, 7, 5, 8, 13, 10, 12,…
$ predictors_standardized <list> -0.9982689, 0.9979628, -0.0001…
```

---

## NC flights data

Let's create an ML pipeline to classify whether a flight departing from a
North Carolina airport in February 2020 is delayed.

```r
url <- str_c("http://www2.stat.duke.edu/~sms185/data/",
             "flights/nc_flights_feb_20.csv")

download.file(url = url, destfile = "data/nc_flights.csv")
```

```r
nc_flights_tbl <- spark_read_csv(sc, name = "nc_flights",
                                 path = "data/nc_flights.csv")
```

Data is available from the [Bureau of Transportation Statistics](https://www.bts.gov).

---

```r
df <- nc_flights_tbl %>%
  mutate(DEP_DELAY   = as.numeric(DEP_DELAY),
         ARR_DELAY   = as.numeric(ARR_DELAY),
         MONTH       = as.character(MONTH),
         DAY_OF_WEEK = as.character(DAY_OF_WEEK)
  ) %>%
  filter(!is.na(DEP_DELAY)) %>%
  select(DEP_DELAY, CRS_DEP_TIME, MONTH, DAY_OF_WEEK, DISTANCE)

df
```

```r
# Source: spark<?> [?? x 5]
   DEP_DELAY CRS_DEP_TIME MONTH DAY_OF_WEEK DISTANCE
       <dbl>        <int> <chr> <chr>          <dbl>
 1        -7          830 2     5                365
 2         7          835 2     6                365
 3        -8          830 2     7                365
 4       -10          830 2     1                365
 5        -3          830 2     2                365
# … with more rows
```
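--

Before building the pipeline, it can help to see how often departures exceed
the 15-minute cutoff used below. A minimal sketch with `dplyr` on the Spark
table (the computation stays in Spark):

```r
# sketch: proportion of flights departing more than 15 minutes late
df %>%
  mutate(delayed = ifelse(DEP_DELAY > 15, 1, 0)) %>%
  summarise(prop_delayed = mean(delayed, na.rm = TRUE))
```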
---

## Pipeline

```r
*nc_flights_pipe <- ml_pipeline(sc) %>%
* ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression()
```

```r
Pipeline (Estimator) with 5 stages
<pipeline_187aa28dcf960>
  Stages
* |--1 SQLTransformer (Transformer)
  |    <dplyr_transformer_187aaca3f397>
  |     (Parameters -- Column Names)
```

--

<br/>

`ft_dplyr_transformer()` extracts the dplyr transformations used to generate
the object `tbl` as a SQL statement and then passes it on to
`ft_sql_transformer()`. The result is an `ml_pipeline` object.

---

```r
nc_flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
* ft_binarizer(input_col = "DEP_DELAY",
*              output_col = "DELAYED",
*              threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression()
```

```r
* |--2 Binarizer (Transformer)
  |    <binarizer_187aa5412bd32>
  |     (Parameters -- Column Names)
  |      input_col: DEP_DELAY
  |      output_col: DELAYED
```

---

```r
nc_flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
* ft_bucketizer(input_col = "CRS_DEP_TIME",
*               output_col = "HOURS",
*               splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression()
```

```r
* |--3 Bucketizer (Transformer)
  |    <bucketizer_187aa90f07cf>
  |     (Parameters -- Column Names)
  |      input_col: CRS_DEP_TIME
  |      output_col: HOURS
```

---

```r
nc_flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
* ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
  ml_logistic_regression()
```

```r
* |--4 RFormula (Estimator)
  |    <r_formula_187aa79a9bb9b>
  |     (Parameters -- Column Names)
  |      features_col: features
  |      label_col: label
  |     (Parameters)
  |      force_index_label: FALSE
  |      formula: DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE
  |      handle_invalid: error
  |      stringIndexerOrderType: frequencyDesc
```

---

```r
nc_flights_pipe <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(tbl = df) %>%
  ft_binarizer(input_col = "DEP_DELAY",
               output_col = "DELAYED",
               threshold = 15) %>%
  ft_bucketizer(input_col = "CRS_DEP_TIME",
                output_col = "HOURS",
                splits = seq(0, 2400, 400)) %>%
  ft_r_formula(DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE) %>%
* ml_logistic_regression()
```

```r
* |--5 LogisticRegression (Estimator)
  |    <logistic_regression_187aa3ccd7a92>
  |     (Parameters -- Column Names)
  |      features_col: features
  |      label_col: label
  |      prediction_col: prediction
  |      probability_col: probability
  |      raw_prediction_col: rawPrediction
  |     (Parameters)
  |      aggregation_depth: 2
  |      elastic_net_param: 0
  |      family: auto
  |      fit_intercept: TRUE
  |      max_iter: 100
  |      reg_param: 0
  |      standardization: TRUE
  |      threshold: 0.5
  |      tol: 1e-06
```

---

## Printed pipeline

.pull-left[
.tiny[
```r
Pipeline (Estimator) with 5 stages
<pipeline_187aa28dcf960>
  Stages
  |--1 SQLTransformer (Transformer)
  |    <dplyr_transformer_187aaca3f397>
  |     (Parameters -- Column Names)
  |--2 Binarizer (Transformer)
  |    <binarizer_187aa5412bd32>
  |     (Parameters -- Column Names)
  |      input_col: DEP_DELAY
  |      output_col: DELAYED
  |--3 Bucketizer (Transformer)
  |    <bucketizer_187aa90f07cf>
  |     (Parameters -- Column Names)
  |      input_col: CRS_DEP_TIME
  |      output_col: HOURS
  |--4 RFormula (Estimator)
  |    <r_formula_187aa79a9bb9b>
  |     (Parameters -- Column Names)
  |      features_col: features
  |      label_col: label
  |     (Parameters)
  |      force_index_label: FALSE
  |      formula: DELAYED ~ DAY_OF_WEEK + HOURS + DISTANCE
  |      handle_invalid: error
  |      stringIndexerOrderType: frequencyDesc
```
]
]

.pull-right[
.tiny[
```r
  |--5 LogisticRegression (Estimator)
  |    <logistic_regression_187aa3ccd7a92>
  |     (Parameters -- Column Names)
  |      features_col: features
  |      label_col: label
  |      prediction_col: prediction
  |      probability_col: probability
  |      raw_prediction_col: rawPrediction
  |     (Parameters)
  |      aggregation_depth: 2
  |      elastic_net_param: 0
  |      family: auto
  |      fit_intercept: TRUE
  |      max_iter: 100
  |      reg_param: 0
  |      standardization: TRUE
  |      threshold: 0.5
  |      tol: 1e-06
```
]
]

---

## What can we do with this pipeline?

1. Easily fit data with `ml_fit()`.

2. Make predictions with a fitted pipeline and `ml_transform()`.

3. Save pipelines to disk with `ml_save()`; saved pipelines can be read back
   into `sparklyr` with `ml_load()` or loaded by the Scala and PySpark APIs.

---

## Pipeline model

Partition the data into train and test sets.

```r
nc_flights_partition <- nc_flights_tbl %>%
  sdf_random_split(training = 0.80, testing = 0.20)
```

--

Train the model:

```r
fitted_pipeline <- ml_fit(
  nc_flights_pipe,
  nc_flights_partition$training
)
```

---

## Predictions

```r
predictions <- ml_transform(
  fitted_pipeline,
  nc_flights_partition$training
)
```

```r
sdf_crosstab(predictions, "label", "prediction") %>%
  arrange(label_prediction)
```

```r
# Source:     spark<?> [?? x 3]
# Ordered by: label_prediction
  label_prediction `0.0` `1.0`
  <chr>            <dbl> <dbl>
1 0.0              32193   177
2 1.0               7244   211
```
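--

The fitted pipeline can also score the held-out partition. A minimal sketch,
assuming the objects above (`label` and `prediction` are both 0/1 columns):

```r
# sketch: apply the fitted pipeline to the testing partition and
# compute a simple accuracy inside Spark
test_predictions <- ml_transform(fitted_pipeline,
                                 nc_flights_partition$testing)

test_predictions %>%
  summarise(accuracy = mean(ifelse(label == prediction, 1, 0), na.rm = TRUE))
```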
---

## Save pipeline objects

Save the pipeline:

```r
ml_save(x = nc_flights_pipe, path = "nc_flights_pipeline")
```

Save the pipeline model (fitted pipeline with data):

```r
ml_save(x = fitted_pipeline, path = "nc_flights_model")
```

--

The `ml_load()` command can be used to re-load these objects. You could then
create a new pipeline model with new training data or make new predictions
with the fitted pipeline model.

---

## Exercise

Use `bike_tbl` to create an `ml_pipeline` object. Consider classification with
member type as the response. Also, consider creating buckets for duration and
a binary variable for round trips (bike starts and ends at the same location).

```r
download.file(url = "http://www2.stat.duke.edu/~sms185/data/bike/cbs_2017.csv",
              destfile = "data/cbs_bike_2017.csv")
```

```r
bike_tbl <- spark_read_csv(sc, path = "data/cbs_bike_2017.csv")
```

---

## References

1. A Gentle Introduction to Apache Spark. (2021). http://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/databricks/spark-intro.pdf.
2. Javier Luraschi, E. (2021). Mastering Spark with R. https://therinspark.com/.
3. OST_R | BTS | Transtats. (2020). Transtats.bts.gov. https://www.bts.gov.
4. R Front End for Apache Spark. (2021). http://spark.apache.org/docs/latest/api/R/index.html.
5. sparklyr. (2021). https://spark.rstudio.com/.