---
title: "Text Data and Regex"
subtitle: "Statistical Computing & Programming"
author: "Shawn Santo"
institute: ""
date: "05-29-20"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
editor_options:
  chunk_output_type: console
---
```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
                      comment = "#>", highlight = TRUE,
                      fig.align = "center")
```
## Supplementary materials
Companion videos
- [Introduction to `stringr`](https://warpwire.duke.edu/w/Uc4DAA/)
- [Escaping metacharacters](https://warpwire.duke.edu/w/T84DAA/)
- [More metacharacters and their functionality](https://warpwire.duke.edu/w/Tc4DAA/)
- [Quantifiers](https://warpwire.duke.edu/w/S84DAA/)
Additional resources
- `stringr` [vignette](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html)
- `stringr` [cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf)
- regex [guide](http://perso.ens-lyon.fr/lise.vaudor/Rfigures/Expressions_regulieres/regexp.png)
---
class: inverse, center, middle
# `stringr`
---
## Why `stringr`?
- Part of `tidyverse`
- Fast and consistent manipulation of string data
- Readable and consistent syntax
- `stringr` is built on top of `stringi`, so mastering it gives you a head start with `stringi` - http://www.gagolewski.com/software/stringi/
---
## Usage
- All functions in `stringr` start with `str_` and take a vector of strings
as the first argument.
- Most `stringr` functions work with regular expressions.
- Seven main verbs to work with strings.
.small-text[
| Function | Description |
|:-------------|:------------------------------------|
|`str_detect()` | Detect the presence or absence of a pattern in a string. |
|`str_count()` | Count the number of matches of a pattern in a string. |
|`str_locate()` | Locate the first position of a pattern and return a matrix with start and end. |
|`str_extract()` | Extracts text corresponding to the first match. |
|`str_match()` | Extracts capture groups formed by `()` from the first match. |
|`str_split()` | Splits string into pieces and returns a list of character vectors. |
|`str_replace()` | Replaces the first matched pattern and returns a character vector. |
]
Each has leading arguments `string` and `pattern`, and all of them are
vectorised over both.
Function assistance and visuals: `stringr` [cheat sheet](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf)
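
A minimal sketch of the seven verbs side by side. The vector `fruit` is a hypothetical example, not part of the original slides; the patterns are kept to plain characters and `.`.

```r
library(stringr)

# Hypothetical example vector
fruit <- c("apple pie", "banana", "cherry tart")

str_detect(fruit, "an")       # FALSE TRUE FALSE
str_count(fruit, "a")         # 1 3 1
str_locate(fruit, "a")        # matrix with start/end of the first match
str_extract(fruit, "a.")      # "ap" "an" "ar"
str_match(fruit, "(a)(.)")    # full match plus one column per () group
str_split(fruit, " ")         # list of character vectors
str_replace(fruit, "a", "A")  # "Apple pie" "bAnana" "cherry tArt"
```

Each of these verbs except `str_detect()` and `str_count()` also has an `_all` variant, for example `str_replace_all()`, that acts on every match rather than only the first.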
---
class: inverse, center, middle
# Regexes
---
## Simple cases
A regular expression, regex or regexp, is a sequence of characters that
defines a search pattern.
```{r}
library(tidyverse)
```
.small[
```{r}
twister <- "thirty-three thieves thought they thrilled the throne Thursday"
```
]
--
How many occurrences of `t` exist?
.small[
```{r}
str_count(string = twister, pattern = "t")
```
]
--
.pull-left[
How many of `t`, `th`, and `the` exist?
.small[
```{r}
str_count(twister, c("t", "th", "the"))
```
]
]
.pull-right[
Do these patterns exist?
.small[
```{r}
str_detect(twister, c("t", "th", "the"))
```
]
]
---
Separate our long string at each space.
```{r}
twister_split <- str_split(twister, " ") %>% unlist()
twister_split
```
--
Do these patterns exist?
```{r}
str_detect(twister_split, c("tho", "the"))
```
--
Replace certain occurrences.
.small[
```{r}
str_replace(twister_split, c("tho", "the"), replacement = c("bro", "Wil"))
```
]
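
Because both `string` and `pattern` are vectorised, a shorter `pattern` vector is recycled: the first word is checked against `"tho"`, the second against `"the"`, the third against `"tho"` again, and so on. The sketch below contrasts that with checking every word against every pattern; `sapply()` is just one way to do this, and the object names are illustrative.

```r
library(stringr)

twister_split <- c("thirty-three", "thieves", "thought", "they",
                   "thrilled", "the", "throne", "Thursday")

# Patterns are recycled element by element: tho, the, tho, the, ...
recycled <- str_detect(twister_split, c("tho", "the"))
recycled
# FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE

# Swapping the pattern order changes which word meets which pattern
swapped <- str_detect(twister_split, c("the", "tho"))
swapped
# all FALSE

# One column per pattern, one row per word
each_pattern <- sapply(c("tho", "the"),
                       function(p) str_detect(twister_split, p))
each_pattern
```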
---
## A step up in complexity
A `.` matches any character except a new line. It is one of a few
metacharacters: characters with a special meaning and function in a regex.
.small[
```{r}
twister <- "thirty-three thieves thought they thrilled the throne Thursday"
```
]
Does the pattern `.y.` exist?
```{r}
str_detect(twister, ".y.")
```
--
How many instances?
```{r}
str_count(twister, ".y.")
```
--
View in Viewer pane.
```{r}
str_view_all(twister, ".y.")
```
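
If the Viewer pane is not available, a rough alternative is to pull the matches out as text. This sketch uses `str_extract_all()`, which returns a list with one character vector per input string.

```r
library(stringr)

twister <- "thirty-three thieves thought they thrilled the throne Thursday"

# Every match needs one character on each side of the y,
# so the final y in "Thursday" does not qualify
matches <- str_extract_all(twister, ".y.")[[1]]
matches
# "ty-" "ey "
```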
---
## How do we match an actual `.`?
To match a metacharacter literally, you need to escape it.
Regexes use `\` as the escape character. So why doesn't this work?
```{r error=TRUE}
str_view_all("show.me.the.dots...", "\.")
```
---
## R escape characters
Some characters cannot be typed directly into an R string. An escape
character changes how the character(s) that follow it are interpreted.
The details vary from language to language, but in most string
implementations, R included, `\` is the escape character and it typically
modifies the single character that follows it.
Some common examples:
.small[
| Literal | Character |
|:--------|:-----------------|
|`\'` | single quote |
|`\"` | double quote |
|`\\` | backslash |
|`\n` | new line |
|`\r` | carriage return |
|`\t` | tab |
|`\b` | backspace |
|`\f` | form feed |
]
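
A small sketch of what an escape sequence actually stores. The variable name `x` is illustrative; the point is that `print()` shows the escaped form you would type, while `writeLines()` shows the characters the string really contains.

```r
x <- "a\tb"

n_chars <- nchar(x)   # 3: "a", a tab, and "b"
n_chars

print(x)              # displays the escaped form, as typed
writeLines(x)         # displays the rendered string, with a real tab

n_backslash <- nchar("\\")  # 1: two typed characters, one stored backslash
n_backslash
```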
???
Escaping a character not in the following table is an error.
| Literal | Character |
|:--------|:-----------------|
|`\n` | newline |
|`\r` | carriage return |
|`\t` | tab |
|`\b` | backspace |
|`\a` | alert (bell) |
|`\f` | form feed |
|`\v` | vertical tab |
|`\\` | backslash \ |
|`\'` | ASCII apostrophe ' |
|`\"` | ASCII quotation mark " |
|`` \` `` | ASCII grave accent (backtick) `` ` `` |
|`\nnn` | character with given octal code (1, 2 or 3 digits) |
|`\xnn` | character with given hex code (1 or 2 hex digits) |
|`\unnnn` | Unicode character with given code (1--4 hex digits) |
|`\Unnnnnnnn` | Unicode character with given code (1--8 hex digits) |
---
## Examples
.tiny[
```{r fig.width=9, fig.height=5.5}
mtcars %>%
ggplot(aes(x = factor(cyl), y = hp)) + ggpol::geom_boxjitter() +
labs(x = "Number \n of \n Cylinders", y = "\"Horse\" Power", #<<
title = "A \t boxjitter \t\t plot \n showing some escape \n characters") + #<<
theme_minimal(base_size = 18)
```
]
---
## Examples
```{r error=TRUE}
print("hello\world")
```
--
```{r error=TRUE}
cat("hello\world")
```
--
```{r}
print("hello\tworld")
```
--
```{r eval=FALSE}
cat("hello\bworld")
```
`#> hellworld`
---
.pull-left[
```{r error=TRUE}
print("hello\"world")
print("hello\tworld")
print("hello\nworld")
print("hello\\world")
```
]
.pull-right[
```{r error=TRUE}
cat("hello\"world")
cat("hello\tworld")
cat("hello\nworld")
cat("hello\\world")
```
]
---
## Returning to: how do we match a `.`?
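We need to escape the `\` itself. R turns the string `"\\."` into the two characters `\.`, which the regex engine then reads as a literal dot.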
```{r error=TRUE}
str_view_all("show.me.the.dots...", "\\.")
```
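
A sketch of the two layers of escaping, using counts instead of the Viewer. `fixed()` is a `stringr` helper that treats the pattern as plain text, which is an alternative when no regex features are needed.

```r
library(stringr)

dots <- "show.me.the.dots..."

writeLines("\\.")   # what the regex engine receives: \.

n_any     <- str_count(dots, ".")         # 19: an unescaped . matches every character
n_escaped <- str_count(dots, "\\.")       # 6: literal dots only
n_fixed   <- str_count(dots, fixed("."))  # 6: no regex interpretation at all
c(n_any, n_escaped, n_fixed)
```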
---
## Regex metacharacters
```regex
. ^ $ * + ? { } [ ] \ | ( )
```
Allow for more advanced forms of pattern matching.
As we saw with `.`, these cannot be matched directly. Thus, if you want to match
the literal `?` you will need to use `\\?`.
--
What do you need to match a literal `\` in regex pattern matching?
--
```{r}
str_view_all("find the \\ in this string", "\\\\")
```
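
The four backslashes can be checked step by step. This sketch assumes R 4.0.0 or later for the raw string syntax `r"(...)"`; on older versions only the escaped form works.

```r
library(stringr)

x <- "find the \\ in this string"
writeLines(x)                 # the string holds a single backslash

pattern_escaped <- "\\\\"     # R string of 2 characters: \\
pattern_raw     <- r"(\\)"    # same pattern as a raw string (R >= 4.0.0)

same_pattern <- identical(pattern_escaped, pattern_raw)
same_pattern                  # TRUE

n_backslashes <- str_count(x, pattern_escaped)
n_backslashes                 # 1
```

The regex engine sees `\\`, which it reads as one escaped, literal backslash.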
---
## Regex anchors
Sometimes we want to specify that our pattern occurs at a particular
location in a string. We indicate this using anchor metacharacters.
| Regex | Anchor |
|-------------|-----------------|
| `^` or `\A` | Start of string |
| `$` or `\Z` | End of string |
---
## Examples: metacharacters and anchors
```{r}
text <- "Who? What? where? When? WHY?"
```
```{r}
str_locate_all(text, "\\?")
```
--
```{r}
str_replace(text, "^W...", "****")
```
--
```{r}
str_replace(text, "W...$", "****")
```
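
Anchors apply to each string separately. The sketch below splits the same words into a vector (the name `words` is illustrative) so that every element has its own start and end.

```r
library(stringr)

words <- c("Who?", "What?", "where?", "When?", "WHY?")

starts_with_W <- str_detect(words, "^W")
starts_with_W   # TRUE TRUE FALSE TRUE TRUE

ends_with_q <- str_detect(words, "\\?$")
ends_with_q     # TRUE TRUE TRUE TRUE TRUE

# Both anchors together: W followed by exactly three more characters
four_chars <- str_detect(words, "^W...$")
four_chars      # TRUE FALSE FALSE FALSE TRUE
```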
---
## Character classes
Special patterns exist that match any one character from a whole class of characters.
| Meta Character | Class | Description |
|:--------------:|-------------|--------------------------------------|
| `.` | | Any character except new line (`\n`) |
| `\s` | `[:space:]` | White space (space, tab, newline) |
| `\S` | | Not white space |
| `\d` | `[:digit:]` | Digit (0-9) |
| `\D` | | Not digit |
| `\w` | | Word (A-Z, a-z, 0-9, or _) |
| `\W` | | Not word |
| | `[:punct:]` | Punctuation |
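
A short sketch of a few of these classes in use. The string `x` is a made-up example. Note two details: in an R string the backslash is doubled (`"\\d"`), and a bracketed class such as `[:punct:]` has to sit inside another pair of square brackets.

```r
library(stringr)

x <- "Room 12B, 3rd floor!"

digits <- str_extract_all(x, "\\d")[[1]]
digits        # "1" "2" "3"

n_space <- str_count(x, "\\s")
n_space       # 3

punct <- str_extract_all(x, "[[:punct:]]")[[1]]
punct         # "," "!"

# Replace every non-word character with an underscore
cleaned <- str_replace_all(x, "\\W", "_")
cleaned       # "Room_12B__3rd_floor_"
```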
---
## Character class overview