--- title: "Integration: R and C++" subtitle: "Programming for Statistical Science" author: "Shawn Santo" institute: "" date: "" output: xaringan::moon_reader: css: "slides.css" lib_dir: libs nature: highlightStyle: github highlightLines: false countIncrementalSlides: false editor_options: chunk_output_type: console --- ```{r include=FALSE} knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, comment = "#>", highlight = TRUE, fig.width = 9, fig.height = 6, fig.align = "center") ``` ## Supplementary materials Full video lecture available in Zoom Cloud Recordings Additional resources - `reticulate` [vignette](https://rstudio.github.io/reticulate/) --- ## Getting started Install and load the `Rcpp` package. ```{r} library(Rcpp) library(tidyverse) library(bench) ``` You'll also need a working C++ compiler. A **compiler** is a software program that converts computer programming code written by a human programmer into binary code (machine code) that can be understood and executed by a CPU. - Windows: install Rtools - Mac: install Xcode - Linux: `sudo apt-get install r-base-dev`
Everything is already installed and configured on `Rook`. Use that for today if you don't have a compiler on your machine. --- ## When to consider using C++ - Loops that can't be easily vectorized in R. Computations must be done in a sequential fashion. - Recursive functions. - Tasks or problems that involve advanced data structures or algorithms. C++ has efficient implementations of core algorithms through the standard template library (STL). --- class: inverse, center, middle # Writing C++ functions in R --- ## Scalar input, scalar output Let's write a C++ function that computes the absolute value of a scalar. ```{r} cppFunction( 'double abs_c(double x) { if (x >= 0) { return x; } else { return -1 * x; } }') ``` ```{r results='hold'} abs_c(x = 10L) abs_c(x = 3.1) abs_c(x = -11) ``` -- It seems to work as expected. --- ## What's happening? Using your computer's compiler, `Rcpp` will compile the C++ code you pass into `cppFunction()` and construct an R function that connects to the compiled C++ function.
-- ```{r} abs_c ``` This is an R function that uses `.Call()` to invoke the underlying C++ function. --- ## Breaking down `abs_c()` ```c++ double abs_c(double x) { if (x >= 0) { return x; } else { return -1 * x; } } ``` What differences do you notice between R and C++? -- - We don't use the keyword `function()`. - We must declare the output type and input type. - An explicit return is required. Recall that this is optional in R. - Each statement ends with a semicolon. --- ## Performance comparison: `abs()` v. `abs_c()` ```{r} bench::mark( abs(10), abs_c(10), abs(-10), abs_c(-10), min_time = Inf, iterations = 10000 )[, 1:6] ``` --- ## Vector input, vector output Let's vectorize our C++ absolute value function. ```{r} cppFunction( 'NumericVector vabs_c(NumericVector x) { // get length of vector x int n = x.size(); for (int i = 0; i < n; ++i) { if (x[i] < 0) { x[i] *= -1; } } return x; }' ) ``` -- ```{r} x <- c(-4, 1, 0, -9.1, 102) vabs_c(x) vabs_c(rnorm(n = 5, mean = -10, sd = 1)) ``` ??? ```{r eval=FALSE} cppFunction( 'int show_i(NumericVector x) { // get length of vector x int n = x.size(); for (int i = 0; i < n; ++i) { Rcout << i << std::endl; } return 0; }' ) ``` --- ## Breaking down `vabs_c()` .small[ ```c++ NumericVector vabs_c(NumericVector x) { // get length of vector x int n = x.size(); for (int i = 0; i < n; ++i) { if (x[i] < 0) { x[i] *= -1; } } return x; } ``` ] What differences do you notice between R and C++? -- - We must declare the type of new variables defined in our function. - The `for` loop syntax is different - initialize `i` to be 0 with `int i = 0`, loop as long as `i < n`, increment `i` by 1 after each iteration. - Indexing in C++ begins at 0, not 1 like in R. - The `=` is used for assignment, not `<-`. - Modify in-place operators such as `*=` exist. - C++ is an object-oriented language (`x.size()`). --- ## Performance comparison: `abs()` v. `vabs_c()` ```{r} x <- rnorm(n = 1e6) bench::mark( abs(x), vabs_c(x), relative = TRUE )[, 1:6] ``` --
Why do you think `vabs_c()` is so much faster? --- ## Objects in R, `Rcpp`, and C++ .small-text[ | Value | R vector | Rcpp vector | Rcpp matrix | Rcpp scalar | C++ scalar | |:--------:|:---------:|:-----------------------------:|:-----------------------------:|:-----------:|:----------:| | Logical | logical | LogicalVector | LogicalMatrix | - | bool | | Integer | integer | IntegerVector | IntegerMatrix | - | int | | Real | numeric | NumericVector | NumericMatrix | - | double | | Complex | complex | ComplexVector | ComplexMatrix | Rcomplex | complex | | String | character | CharacterVector(StringVector) | CharacterMatrix(StringMatrix) | String | string | | Date | Date | DateVector | - | Date | - | | Datetime | POSIXct | DatetimeVector | - | Datetime | time_t | ]
The `Rcpp` package provides these classes for collecting the various objects back to R and passing R objects to the compiled C++ function. --- ## Exercise Write a C++ function that computes the sum of the squared deviations about a specific value. An R function to do this can be defined as follows. ```{r} ssd <- function(x, x0) { sum((x - x0) ^ 2) } ``` Assume `x` is a vector and `x0` is a "scalar". Compare the performance between `ssd()` and your C++ function.
*Hint:* `pow()` in C++ is used for exponentiation. ??? ## Solution ```{r eval=FALSE} cppFunction( 'double ssd_c(NumericVector x, double x0) { int n = x.size(); double result = 0; for (int i = 0; i < n; ++i) { result += pow(x[i] - x0, 2); } return result; }' ) ``` ```{r eval=FALSE} x <- rnorm(100) ssd(x, x0 = 0) ssd_c(x, x0 = 0) ``` ```{r eval=FALSE} x <- rnorm(n = 1e6) bench::mark( ssd(x, 0), ssd_c(x, 0), relative = TRUE )[, 1:6] ``` --- ## What happens if we pass in the wrong type? ```{r} cppFunction( 'double ssd_c_new(IntegerVector x, double x0) { int n = x.size(); double result = 0; for (int i = 0; i < n; ++i) { result += pow(x[i] - x0, 2); } return result; }' ) ``` -- ```{r results='hold'} x <- rnorm(10) z <- ssd_c_new(x, x0 = 0) z typeof(z) ssd_c_new(x, x0 = pi) ``` --- class: inverse, center, middle # Source C++ into R --- ## Source C++ For more complicated problems you'll probably want to write a separate C++ script file for your functions. You can then compile the C++ code and make the functions available in your R environment with `sourceCpp()`.
Start your `.cpp` script file with the following. ```{Rcpp, eval=FALSE} # include using namespace Rcpp; ``` --
Before each function include `// [[Rcpp::export]]`. This will ensure it gets exported to your R environment. ```{Rcpp, eval=FALSE} # include using namespace Rcpp; // [[Rcpp::export]] ``` --- ## Example script Write a function that performs a linear search for a target value `x0`. Save this in a script file named `search.cpp`. ```{Rcpp, eval=FALSE} # include using namespace Rcpp; // [[Rcpp::export]] int search(NumericVector x, double x0) { int n = x.size(); for (int i = 0; i < n; ++i) { if (x[i] == x0) { return i + 1; } } return 0; } ``` -- Export `search()` so it is available in R. ```{r} sourceCpp(file = "search.cpp") ``` --- ## Comparison: `search()` versus `which()` and `==` ```{r results='hold'} x <- sample(1:10000) search(x, x0 = 9124) which(x == 9124) ``` -- ```{r} mark( search(x, x0 = 9124), which(x == 9124), )[, 1:6] ``` -- What is the computation complexity of our `search()` function? --- ```{r} time_search <- function(n) { system_time({ search(sample(1:n), x0 = n / 2) }) } ``` ```{r cache=TRUE} result <- map(10000 * c(1:9), ~replicate(1000, time_search(.x))) ``` -- ```{r echo=FALSE} map_df(result, rowMeans) %>% mutate(n = 10000 * c(1:9)) %>% ggplot(aes(x = as.factor(n), y = real)) + geom_line(aes(group = 1)) + geom_point(size = 2, color = "red") + labs(x = "Size of vector (n)", y = "Mean real time", title = "search(sample(1:n), x0 = n / 2)") + theme_bw(base_size = 16) ``` --- ## R code in `.cpp` files Using special C++ comment blocks, R code can be embed in the C++ file. For example, consider the file `search2.cpp`. .pull-left.tiny[ ```{Rcpp eval=FALSE} # include using namespace Rcpp; // [[Rcpp::export]] int search2(NumericVector x, double x0) { int n = x.size(); for (int i = 0; i < n; ++i) { if (x[i] == x0) { return i + 1; } } return 0; } /*** R x <- sample(1:10000) search2(x, x0 = 9124) which(x == 9124) */ ``` ] .pull-right.small[ ```{r} sourceCpp("search2.cpp") ``` ] --- ## Exercise Convert the R `diff()` function into C++. Only consider a numeric vector as an input and the lag argument. Write this function in a `.cpp` file and source it into R. Some examples of R's `diff()` function: ```{r} set.seed(24581) (x <- sample(1:10)) diff(x, lag = 1) diff(x, lag = 3) diff(x, lag = 4) ``` ??? ## Solution ```{Rcpp, eval=FALSE} # include using namespace Rcpp; // [[Rcpp::export]] NumericVector diff_c(NumericVector x, int lag) { int n = x.size() - lag; NumericVector result(n); for (int i = 0; i < n; ++i) { result[i] = x[i + lag] - x[i]; } return result; } ``` ```{r eval=FALSE} sourceCpp("diff_c.cpp") ``` --- ## `Rcpp` Sugar `Rcpp` Sugar provides a convenient way to work with C++ functions in a similar way as to how R offers vectorized operations. Some functions have identical R equivalents. Sugar functions can be roughly broken down into: - Arithmetic and logical operators: +, *, -, /, pow, <, <=, >, >=, ==, !=, ! - Logical summary functions: any(), all(), is true(), is false() - Vector views: they provide a "view" of a vector, head(), tail(), and so on - Math functions: abs(), acos(), asin(), atan(), beta(), ceil(), ceiling(), choose(), cos(), exp(), factorial(), floor() and so on - Scalar summaries: mean(), min(), max(), sum(), sd(), and var() - Vector summaries: cumsum(), diff(), pmin(), and pmax() - Finding values: match(), self match(), which max(), which min() - Dealing with duplicates: duplicated(), unique() - Standard statistical distribution functions *Source:* http://heather.cs.ucdavis.edu/~matloff/158/RcppTutorial.pdf
[Unofficial Rcpp API documentation](https://thecoatlessprofessor.com/programming/cpp/unofficial-rcpp-api-documentation/) --- ## Example with `Rcpp` sugar ```{Rcpp eval=FALSE} # include using namespace Rcpp; // [[Rcpp::export]] NumericMatrix bootstrap_c(NumericVector data, int b) { // create a b x 2 matrix NumericMatrix boot_stats(b, 2); // data sample size int n = data.size(); for (int i = 0; i < b; ++i) { NumericVector boot_sample = sample(data, n, true); // matrices also start indexing at 0 boot_stats(i, 0) = mean(boot_sample); boot_stats(i, 1) = sd(boot_sample); } return boot_stats; } ``` --- ## Finishing up - Homework 6 is due Monday, November 16 at 11:59pm ET. Don't forget to complete Task 3. - The project is due Monday, November 23 at 11:59am ET. Not late submissions will be accepted. I'll begin grading shortly after noon that day. - The final lab is on Monday. - TAs will have regularly scheduled office hours for the rest of the semester. - In addition to Monday and Friday office hours, I'll add the following for next week: - Tuesday, 11/17 from 9:00am - 10:00am ET - Thursday, 11/19 from 11:00am - 12:00pm ET - **Please submit a course evaluation.** --- ## References 1. Francois, D. (2020). Rcpp: Seamless R and C++ Integration. http://www.rcpp.org/. 2. Wickham, H. (2019). Advanced R. https://adv-r.hadley.nz/. 3. Tsuda, M. (2020). Rcpp for everyone. https://teuder.github.io/rcpp4everyone_en/.