---
title: Regular Expressions
author: "Colin Rundel"
date: "2019-02-21"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
exclude: true
```{r, message=FALSE, warning=FALSE, include=FALSE}
options(
  htmltools.dir.version = FALSE, # for blogdown
  width = 80,
  tibble.width = 80
)
knitr::opts_chunk$set(
  fig.align = "center"
)
htmltools::tagList(rmarkdown::html_dependency_font_awesome())
```
```{r setup, message=FALSE}
library(stringr)
```
---
class: middle
count: false
```{r echo=FALSE, fig.align="center", out.width="33%"}
knitr::include_graphics("https://github.com/rstudio/hex-stickers/raw/master/PNG/stringr.png")
```
---
## stringr
stringr is a package for handling character strings. It is designed to improve and simplify string handling relative to base R and to make it more similar to Python's string library. Most of the package's functions are wrappers around functions from the stringi package.
.small[
| Function | Description |
|:-------------|:------------------------------------|
|`str_detect` | Detects the presence or absence of a pattern in a string. |
|`str_locate` | Locates the first position of a pattern and returns a matrix with start and end. |
|`str_extract` | Extracts text corresponding to the first match. |
|`str_match` | Extracts capture groups formed by `()` from the first match. |
|`str_split` | Splits string into pieces and returns a list of character vectors. |
|`str_replace` | Replaces the first matched pattern and returns a character vector. |
|`str_remove` | Removes the first matched pattern and returns a character vector. |
]
Many of these functions have variants with an `_all` suffix (e.g. `str_replace_all`) that act on every occurrence of the pattern in a given string rather than only the first.
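---
## Example - First Match vs. `_all`

A minimal sketch of how the plain functions differ from their `_all` variants. The input string `"banana"` and the variable names are made up for illustration; expected results are shown as comments.

```r
library(stringr)

x <- "banana"

# Only the first "an" is replaced
first_only <- str_replace(x, "an", "--")
first_only                      # "b--ana"

# Every "an" is replaced
every_one <- str_replace_all(x, "an", "--")
every_one                       # "b----a"

# The same distinction applies to extraction
str_extract(x, "an")            # "an"
all_hits <- str_extract_all(x, "an")
all_hits[[1]]                   # "an" "an"
```

Note that the `_all` extraction returns a list with one element per input string, since each string can contain a different number of matches.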
---
class: middle
count: false
# Regular Expressions
---
## Simple Pattern Detection
```{r}
text = c("The","quick","brown","fox","jumps","over","the","lazy","dog")
```
--
```{r}
str_detect(text, "quick")
```
--
```{r}
str_detect(text, "o")
```
--
```{r}
str_detect(text, "row")
```
---
## Aside - Escape Characters
An escape character is a character that changes how the character(s) following it are interpreted. The details vary from language to language, but in most string implementations `\` is the escape character and it modifies the single character that follows it.
Some common examples:
.small[
| Literal | Character |
|:--------|:-----------------|
|`\'` | single quote |
|`\"` | double quote |
|`\\` | backslash |
|`\n` | new line |
|`\r` | carriage return |
|`\t` | tab |
|`\b` | backspace |
|`\f` | form feed |
]
---
## Examples
.pull-left[
```{r error=TRUE}
print("a\"b")
print("a\tb")
print("a\nb")
print("a\\b")
```
]
.pull-right[
```{r error=TRUE}
cat("a\"b")
cat("a\tb")
cat("a\nb")
cat("a\\b")
```
]
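---
## Example - Escapes Are Single Characters

One way to check how R interprets an escape sequence is to count characters: each escape in the strings from the previous slide is stored as a single character, even though it takes two keystrokes to type. This is a small sketch using only base R; the variable name `lens` is arbitrary.

```r
# The four strings printed on the previous slide
x <- c("a\"b", "a\tb", "a\nb", "a\\b")

# Each one holds exactly three characters: a, the escaped character, b
lens <- nchar(x)
lens                 # 3 3 3 3

# writeLines() shows the stored contents, like cat() with a trailing newline
writeLines("a\\b")   # prints a\b
```

`print()` shows the string as you would have to type it, while `cat()` and `writeLines()` show what the string actually contains.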
---
## RegEx Metacharacters
The power of regular expressions comes from their ability to use special metacharacters to modify how pattern matching is performed.
```regex
. ^ $ * + ? { } [ ] \ | ( )
```
--
Because of their special meaning, these characters cannot be matched directly. To match one literally you must escape it first (precede it with `\`). The problem is that regex escapes live on top of string escapes, so we need *two* levels of escaping.
--
| Character | Regex | R string literal |
|---------|-------|-----------|
| `.` | `\.` | `"\\."` |
| `?` | `\?` | `"\\?"` |
| `!` | `\!` | `"\\!"` |
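---
## Example - Escaping `.`

A short sketch of why the escape matters for `.`, which otherwise matches any character. The file names below are hypothetical and chosen only to show the difference.

```r
library(stringr)

files <- c("a.txt", "atxt", "a_txt")

# Unescaped: "." matches any single character, so all three match
unescaped <- str_detect(files, ".txt")
unescaped            # TRUE TRUE TRUE

# Escaped: "\\." in the R string is the regex \. and matches a literal period
escaped <- str_detect(files, "\\.txt")
escaped              # TRUE FALSE FALSE
```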
---
## Example
```{r error=TRUE}
str_detect("abc[def","\[")
```
--
```{r error=TRUE}
str_detect("abc[def","\\[")
```
--
How do we detect if a string contains a `\` character?
--
```{r}
cat("abc\\def\n")
```
--
```{r}
str_detect("abc\\def","\\\\")
```
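---
## Example - Skipping the Regex Layer

When the goal is only to find a literal piece of text, one option is to wrap the pattern in stringr's `fixed()`, which compares the pattern as plain text instead of interpreting it as a regular expression. Only the string-level escape is then needed. A sketch, with arbitrary variable names:

```r
library(stringr)

x <- "abc\\def"   # contains a single backslash

# Regex approach: four backslashes in the R source
regex_hit <- str_detect(x, "\\\\")
regex_hit         # TRUE

# Literal approach: fixed() removes the regex level of escaping
fixed_hit <- str_detect(x, fixed("\\"))
fixed_hit         # TRUE

# The same works for other metacharacters such as [
bracket_hit <- str_detect("abc[def", fixed("["))
bracket_hit       # TRUE
```

The trade-off is that `fixed()` patterns cannot use anchors, character classes, or any other regex feature.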
---
## XKCD's take
```{r echo=FALSE, fig.align="center"}
knitr::include_graphics("imgs/xkcd_backslashes.png")
```
---
## Anchors
Sometimes we want to specify that our pattern occurs at a particular location in a string. We indicate this using anchor metacharacters.
| Regex | Anchor |
|-------|:----------|
| `^` or `\A` | Start of string |
| `$` or `\Z` | End of string |
| `\b` | Word boundary |
| `\B` | Not word boundary |
---
## Anchor Examples
```{r}
text = "the quick brown fox jumps over the lazy dog"
```
--
```{r}
str_replace(text,"^the","---")
```
--
```{r}
str_replace(text,"^dog","---")
```
--
```{r}
str_replace(text,"the$","---")
```
--
```{r}
str_replace(text,"dog$","---")
```
---
## Anchor Examples
```{r}
text = "the quick brown fox jumps over the lazy dog"
```
--
```{r}
str_replace_all(text,"\\Brow\\B","---")
```
--
```{r}
str_replace_all(text,"\\brow\\b","---")
```
--
```{r}
str_replace_all(text,"\\bthe","---")
```
--
```{r}
str_replace_all(text,"the\\b","---")
```
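---
## Example - `\b` Needs Two Backslashes

The word boundary anchor collides with a string escape: inside an R string, `"\b"` is the backspace character from the escape table, not the regex anchor. The sketch below uses a shortened, made-up sentence to show what happens when the second backslash is forgotten.

```r
library(stringr)

text <- "the quick brown fox"

# "\b" is a backspace character, so this looks for backspace + "fox"
single <- str_detect(text, "\bfox")
single     # FALSE

# "\\b" reaches the regex engine as \b, the word boundary anchor
double <- str_detect(text, "\\bfox")
double     # TRUE
```

No error or warning is raised in the first case, which makes this mistake easy to miss.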
---
## More complex patterns
If there is more than one pattern we would like to match, we can use the or (`|`) metacharacter.
--
```{r}
str_replace_all(text,"the|dog","---")
```
--
```{r}
str_replace_all(text,"a|e|i|o|u","-")
```
--
```{r}
str_replace_all(text,"\\ba|e|i|o|u","-")
```
--
```{r}
str_replace_all(text,"\\b(a|e|i|o|u)","-")
```
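---
## Example - Grouping with `|`

The last two patterns differ because `|` binds more loosely than anything next to it: without parentheses, `\b` applies only to the first alternative. A small sketch on a made-up string, `"an idea"`, makes the difference easier to see.

```r
library(stringr)

x <- "an idea"

# Parsed as (\ba)|e|i|o|u: a at a word start, or e, i, o, u anywhere
ungrouped <- str_replace_all(x, "\\ba|e|i|o|u", "-")
ungrouped   # "-n -d-a"

# The same pattern with the implicit grouping written out
explicit <- str_replace_all(x, "(\\ba)|e|i|o|u", "-")
explicit    # "-n -d-a"

# Parentheses make \b apply to every alternative: vowels at a word start only
grouped <- str_replace_all(x, "\\b(a|e|i|o|u)", "-")
grouped     # "-n -dea"
```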
---
## Character Classes
When we want to match whole classes of characters at a time, there are a number of convenience patterns built in:
| Meta Char | Class | Description |
|:----:|:------------|:-|
| `.` | | Any character except new line (`\n`) |
| `\s` | `[:space:]` | White space |
| `\S` | | Not white space |
| `\d` | `[:digit:]` | Digit (0-9)|
| `\D` | | Not digit |
| `\w` | | Word (A-Z, a-z, 0-9, or _) |
| `\W` | | Not word |
| | `[:punct:]` | Punctuation |
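---
## Example - Character Classes

A brief sketch applying a few of the classes from the table to a made-up string. The named class is written inside a bracket expression (`[[:punct:]]`), which is the form used in the stringr documentation. As before, each regex backslash is doubled in the R string.

```r
library(stringr)

x <- "Call 555-1234, ext. 42!"

# \d+ : runs of one or more digits
digits <- str_extract_all(x, "\\d+")[[1]]
digits      # "555" "1234" "42"

# \s : each white space character
no_space <- str_replace_all(x, "\\s", "_")
no_space    # "Call_555-1234,_ext._42!"

# [[:punct:]] : punctuation characters
no_punct <- str_remove_all(x, "[[:punct:]]")
no_punct    # "Call 5551234 ext 42"
```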
---
## A hierarchical view