---
title: "Integration: R and C++"
subtitle: "Programming for Statistical Science"
author: "Shawn Santo"
institute: ""
date: ""
output:
xaringan::moon_reader:
css: "slides.css"
lib_dir: libs
nature:
highlightStyle: github
highlightLines: false
countIncrementalSlides: false
editor_options:
chunk_output_type: console
---
```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
comment = "#>", highlight = TRUE,
fig.width = 9, fig.height = 6,
fig.align = "center")
```
## Supplementary materials
Full video lecture available in Zoom Cloud Recordings
Additional resources
- `reticulate` [vignette](https://rstudio.github.io/reticulate/)
---
## Getting started
Install and load the `Rcpp` package.
```{r}
library(Rcpp)
library(tidyverse)
library(bench)
```
You'll also need a working C++ compiler. A **compiler** is a software program
that converts computer programming code written by a human programmer into
binary code (machine code) that can be understood and executed by a CPU.
- Windows: install Rtools
- Mac: install Xcode
- Linux: `sudo apt-get install r-base-dev`
Everything is already installed and configured on `Rook`. Use that for today
if you don't have a compiler on your machine.
---
## When to consider using C++
- Loops that can't be easily vectorized in R. Computations must be done in a
sequential fashion.
- Recursive functions.
- Tasks or problems that involve advanced data structures or algorithms. C++ has
efficient implementations of core algorithms through the standard template
library (STL).
---
class: inverse, center, middle
# Writing C++ functions in R
---
## Scalar input, scalar output
Let's write a C++ function that computes the absolute value of a scalar.
```{r}
cppFunction(
'double abs_c(double x) {
if (x >= 0) {
return x;
} else {
return -1 * x;
}
}')
```
```{r results='hold'}
abs_c(x = 10L)
abs_c(x = 3.1)
abs_c(x = -11)
```
--
It seems to work as expected.
---
## What's happening?
Using your computer's compiler, `Rcpp` will compile the C++ code you pass into
`cppFunction()` and construct an R function that connects to the compiled C++
function.
--
```{r}
abs_c
```
This is an R function that uses `.Call()` to invoke the underlying C++ function.
---
## Breaking down `abs_c()`
```c++
double abs_c(double x) {
if (x >= 0) {
return x;
} else {
return -1 * x;
}
}
```
What differences do you notice between R and C++?
--
- We don't use the keyword `function()`.
- We must declare the output type and input type.
- An explicit return is required. Recall that this is optional in R.
- Each statement ends with a semicolon.
---
## Performance comparison: `abs()` v. `abs_c()`
```{r}
bench::mark(
abs(10),
abs_c(10),
abs(-10),
abs_c(-10),
min_time = Inf,
iterations = 10000
)[, 1:6]
```
---
## Vector input, vector output
Let's vectorize our C++ absolute value function.
```{r}
cppFunction(
'NumericVector vabs_c(NumericVector x) {
// get length of vector x
int n = x.size();
for (int i = 0; i < n; ++i) {
if (x[i] < 0) {
x[i] *= -1;
}
}
return x;
}'
)
```
--
```{r}
x <- c(-4, 1, 0, -9.1, 102)
vabs_c(x)
vabs_c(rnorm(n = 5, mean = -10, sd = 1))
```
???
```{r eval=FALSE}
cppFunction(
'int show_i(NumericVector x) {
// get length of vector x
int n = x.size();
for (int i = 0; i < n; ++i) {
Rcout << i << std::endl;
}
return 0;
}'
)
```
---
## Breaking down `vabs_c()`
.small[
```c++
NumericVector vabs_c(NumericVector x) {
// get length of vector x
int n = x.size();
for (int i = 0; i < n; ++i) {
if (x[i] < 0) {
x[i] *= -1;
}
}
return x;
}
```
]
What differences do you notice between R and C++?
--
- We must declare the type of new variables defined in our function.
- The `for` loop syntax is different - initialize `i` to be 0 with `int i = 0`,
loop as long as `i < n`, increment `i` by 1 after each iteration.
- Indexing in C++ begins at 0, not 1 like in R.
- The `=` is used for assignment, not `<-`.
- Modify in-place operators such as `*=` exist.
- C++ is an object-oriented language (`x.size()`).
---
## Performance comparison: `abs()` v. `vabs_c()`
```{r}
x <- rnorm(n = 1e6)
bench::mark(
abs(x),
vabs_c(x),
relative = TRUE
)[, 1:6]
```
--
Why do you think `vabs_c()` is so much faster?
---
## Objects in R, `Rcpp`, and C++
.small-text[
| Value | R vector | Rcpp vector | Rcpp matrix | Rcpp scalar | C++ scalar |
|:--------:|:---------:|:-----------------------------:|:-----------------------------:|:-----------:|:----------:|
| Logical | logical | LogicalVector | LogicalMatrix | - | bool |
| Integer | integer | IntegerVector | IntegerMatrix | - | int |
| Real | numeric | NumericVector | NumericMatrix | - | double |
| Complex | complex | ComplexVector | ComplexMatrix | Rcomplex | complex |
| String | character | CharacterVector(StringVector) | CharacterMatrix(StringMatrix) | String | string |
| Date | Date | DateVector | - | Date | - |
| Datetime | POSIXct | DatetimeVector | - | Datetime | time_t |
]
The `Rcpp` package provides these classes for collecting the various
objects back to R and passing R objects to the compiled C++ function.
---
## Exercise
Write a C++ function that computes the sum of the squared deviations about
a specific value. An R function to do this can be defined as follows.
```{r}
ssd <- function(x, x0) {
sum((x - x0) ^ 2)
}
```
Assume `x` is a vector and `x0` is a "scalar". Compare the performance between
`ssd()` and your C++ function.
*Hint:* `pow()` in C++ is used for exponentiation.
???
## Solution
```{r eval=FALSE}
cppFunction(
'double ssd_c(NumericVector x, double x0) {
int n = x.size();
double result = 0;
for (int i = 0; i < n; ++i) {
result += pow(x[i] - x0, 2);
}
return result;
}'
)
```
```{r eval=FALSE}
x <- rnorm(100)
ssd(x, x0 = 0)
ssd_c(x, x0 = 0)
```
```{r eval=FALSE}
x <- rnorm(n = 1e6)
bench::mark(
ssd(x, 0),
ssd_c(x, 0),
relative = TRUE
)[, 1:6]
```
---
## What happens if we pass in the wrong type?
```{r}
cppFunction(
'double ssd_c_new(IntegerVector x, double x0) {
int n = x.size();
double result = 0;
for (int i = 0; i < n; ++i) {
result += pow(x[i] - x0, 2);
}
return result;
}'
)
```
--
```{r results='hold'}
x <- rnorm(10)
z <- ssd_c_new(x, x0 = 0)
z
typeof(z)
ssd_c_new(x, x0 = pi)
```
---
class: inverse, center, middle
# Source C++ into R
---
## Source C++
For more complicated problems you'll probably want to write a separate C++
script file for your functions. You can then compile the C++ code and make the
functions available in your R environment with `sourceCpp()`.
Start your `.cpp` script file with the following.
```{Rcpp, eval=FALSE}
# include
using namespace Rcpp;
```
--
Before each function include `// [[Rcpp::export]]`. This will ensure it gets
exported to your R environment.
```{Rcpp, eval=FALSE}
# include
using namespace Rcpp;
// [[Rcpp::export]]
```
---
## Example script
Write a function that performs a linear search for a target value `x0`. Save
this in a script file named `search.cpp`.
```{Rcpp, eval=FALSE}
# include
using namespace Rcpp;
// [[Rcpp::export]]
int search(NumericVector x, double x0) {
int n = x.size();
for (int i = 0; i < n; ++i) {
if (x[i] == x0) {
return i + 1;
}
}
return 0;
}
```
--
Export `search()` so it is available in R.
```{r}
sourceCpp(file = "search.cpp")
```
---
## Comparison: `search()` versus `which()` and `==`
```{r results='hold'}
x <- sample(1:10000)
search(x, x0 = 9124)
which(x == 9124)
```
--
```{r}
mark(
search(x, x0 = 9124),
which(x == 9124),
)[, 1:6]
```
--
What is the computation complexity of our `search()` function?
---
```{r}
time_search <- function(n) {
system_time({
search(sample(1:n), x0 = n / 2)
})
}
```
```{r cache=TRUE}
result <- map(10000 * c(1:9), ~replicate(1000, time_search(.x)))
```
--
```{r echo=FALSE}
map_df(result, rowMeans) %>%
mutate(n = 10000 * c(1:9)) %>%
ggplot(aes(x = as.factor(n), y = real)) +
geom_line(aes(group = 1)) +
geom_point(size = 2, color = "red") +
labs(x = "Size of vector (n)", y = "Mean real time",
title = "search(sample(1:n), x0 = n / 2)") +
theme_bw(base_size = 16)
```
---
## R code in `.cpp` files
Using special C++ comment blocks, R code can be embed in the C++ file. For
example, consider the file `search2.cpp`.
.pull-left.tiny[
```{Rcpp eval=FALSE}
# include
using namespace Rcpp;
// [[Rcpp::export]]
int search2(NumericVector x, double x0) {
int n = x.size();
for (int i = 0; i < n; ++i) {
if (x[i] == x0) {
return i + 1;
}
}
return 0;
}
/*** R
x <- sample(1:10000)
search2(x, x0 = 9124)
which(x == 9124)
*/
```
]
.pull-right.small[
```{r}
sourceCpp("search2.cpp")
```
]
---
## Exercise
Convert the R `diff()` function into C++. Only consider a numeric vector as an
input and the lag argument. Write this function in a `.cpp` file and source
it into R.
Some examples of R's `diff()` function:
```{r}
set.seed(24581)
(x <- sample(1:10))
diff(x, lag = 1)
diff(x, lag = 3)
diff(x, lag = 4)
```
???
## Solution
```{Rcpp, eval=FALSE}
# include
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector diff_c(NumericVector x, int lag) {
int n = x.size() - lag;
NumericVector result(n);
for (int i = 0; i < n; ++i) {
result[i] = x[i + lag] - x[i];
}
return result;
}
```
```{r eval=FALSE}
sourceCpp("diff_c.cpp")
```
---
## `Rcpp` Sugar
`Rcpp` Sugar provides a convenient way to work with C++ functions in a similar
way as to how R offers vectorized operations. Some functions have identical R
equivalents.
Sugar functions can be roughly broken down into:
- Arithmetic and logical operators: +, *, -, /, pow, <, <=, >, >=, ==, !=, !
- Logical summary functions: any(), all(), is true(), is false()
- Vector views: they provide a "view" of a vector, head(), tail(), and so on
- Math functions: abs(), acos(), asin(), atan(), beta(), ceil(), ceiling(),
choose(), cos(), exp(), factorial(), floor() and so on
- Scalar summaries: mean(), min(), max(), sum(), sd(), and var()
- Vector summaries: cumsum(), diff(), pmin(), and pmax()
- Finding values: match(), self match(), which max(), which min()
- Dealing with duplicates: duplicated(), unique()
- Standard statistical distribution functions
*Source:* http://heather.cs.ucdavis.edu/~matloff/158/RcppTutorial.pdf
[Unofficial Rcpp API documentation](https://thecoatlessprofessor.com/programming/cpp/unofficial-rcpp-api-documentation/)
---
## Example with `Rcpp` sugar
```{Rcpp eval=FALSE}
# include
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix bootstrap_c(NumericVector data, int b) {
// create a b x 2 matrix
NumericMatrix boot_stats(b, 2);
// data sample size
int n = data.size();
for (int i = 0; i < b; ++i) {
NumericVector boot_sample = sample(data, n, true);
// matrices also start indexing at 0
boot_stats(i, 0) = mean(boot_sample);
boot_stats(i, 1) = sd(boot_sample);
}
return boot_stats;
}
```
---
## Finishing up
- Homework 6 is due Monday, November 16 at 11:59pm ET. Don't forget to complete
Task 3.
- The project is due Monday, November 23 at 11:59am ET. Not late submissions
will be accepted. I'll begin grading shortly after noon that day.
- The final lab is on Monday.
- TAs will have regularly scheduled office hours for the rest of the semester.
- In addition to Monday and Friday office hours, I'll add the following for
next week:
- Tuesday, 11/17 from 9:00am - 10:00am ET
- Thursday, 11/19 from 11:00am - 12:00pm ET
- **Please submit a course evaluation.**
---
## References
1. Francois, D. (2020). Rcpp: Seamless R and C++ Integration.
http://www.rcpp.org/.
2. Wickham, H. (2019). Advanced R. https://adv-r.hadley.nz/.
3. Tsuda, M. (2020). Rcpp for everyone.
https://teuder.github.io/rcpp4everyone_en/.