class: center, middle, inverse, title-slide

# Regular Expressions
### Colin Rundel
### 2018-10-01

---
exclude: true

```r
library(stringr)
```

---
class: middle
count: false

# stringr

---

## stringr

stringr is a string handling package written by Hadley Wickham that is designed to improve and simplify string handling in R. Most of its functions are wrappers for base R functions.

.small[
| Function     | Description                         |
|:-------------|:------------------------------------|
|`str_detect`  | Detect the presence or absence of a pattern in a string. |
|`str_locate`  | Locate the first position of a pattern and return a matrix with start and end. |
|`str_extract` | Extract the text corresponding to the first match. |
|`str_match`   | Extract capture groups formed by `()` from the first match. |
|`str_split`   | Split a string into pieces and return a list of character vectors. |
|`str_replace` | Replace the first matched pattern and return a character vector. |
]

<br />

Many of these functions have variants with an `_all` suffix (e.g. `str_replace_all`) which will match more than one occurrence of the pattern in a given string.

---
class: middle
count: false

# Regular Expressions

---

## Simple Pattern Detection

```r
text = c("The","quick","brown","fox","jumps","over","the","lazy","dog")
```

--

```r
str_detect(text, "quick")
```

```
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
```

--

```r
str_detect(text, "o")
```

```
## [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE
```

--

```r
str_detect(text, "row")
```

```
## [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
```

---

## Aside - Escape Characters

An escape character is a character which results in an alternative interpretation of the following character(s). These vary from language to language, but in most string implementations `\` is the escape character, and it changes the meaning of the single character that follows it.
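One consequence is worth seeing directly: an escape sequence takes two characters to type but is stored as a single character. The following is a minimal base R sketch of that idea; the vector name `escaped` is purely illustrative and not part of the original examples.

```r
# Each escape sequence is typed as two characters in the source code
# but stored as one character in the resulting string.
# `escaped` is a hypothetical name used only for this illustration.
escaped <- c(tab = "a\tb", newline = "a\nb", backslash = "a\\b", quote = "a\"b")

# Count the stored characters: "a", the escaped character, and "b".
lengths_stored <- nchar(escaped)
print(lengths_stored)

# writeLines() shows the stored characters rather than the escaped form.
writeLines(escaped[["backslash"]])
```

Every element has three stored characters, which is why a regular expression engine only ever sees one backslash when the R source contains `\\`.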
Some common examples:

.small[
| Literal | Character        |
|:--------|:-----------------|
|`\'`     | single quote     |
|`\"`     | double quote     |
|`\\`     | backslash        |
|`\n`     | new line         |
|`\r`     | carriage return  |
|`\t`     | tab              |
|`\b`     | backspace        |
|`\f`     | form feed        |
]

---

## Examples

.pull-left[
```r
print("a\"b")
```

```
## [1] "a\"b"
```

```r
print("a\tb")
```

```
## [1] "a\tb"
```

```r
print("a\nb")
```

```
## [1] "a\nb"
```

```r
print("a\\b")
```

```
## [1] "a\\b"
```
]

.pull-right[
```r
cat("a\"b")
```

```
## a"b
```

```r
cat("a\tb")
```

```
## a	b
```

```r
cat("a\nb")
```

```
## a
## b
```

```r
cat("a\\b")
```

```
## a\b
```
]

---

## RegEx Metacharacters

The power of regular expressions comes from their ability to use special metacharacters to modify how pattern matching is performed.

```regex
. ^ $ * + ? { } [ ] \ | ( )
```

--

Because of their special properties they cannot be matched directly; to match one you need to escape it first (precede it with `\`). The problem is that regex escapes live on top of string escapes, so we need to use *two* levels of escapes.

<br/>

--

| Pattern | Regex | Literal   |
|---------|-------|-----------|
| `.`     | `\.`  | `"\\."`   |
| `?`     | `\?`  | `"\\?"`   |
| `!`     | `\!`  | `"\\!"`   |

---

## Example

```r
str_detect("abc[def","\[")
```

```
## Error: '\[' is an unrecognized escape in character string starting ""\["
```

--

```r
str_detect("abc[def","\\[")
```

```
## [1] TRUE
```

--

How do we detect if a string contains a `\` character?

--

```r
cat("abc\\def\n")
```

```
## abc\def
```

--

```r
str_detect("abc\\def","\\\\")
```

```
## [1] TRUE
```

---

## XKCD's take

<img src="imgs/xkcd_backslashes.png" width="1523" style="display: block; margin: auto;" />

---

## Anchors

Sometimes we want to specify that our pattern occurs at a particular location in a string; we indicate this using anchor metacharacters.

<br />

| Regex | Anchor    |
|-------|:----------|
| `^` or `\A` | Start of string   |
| `$` or `\Z` | End of string     |
| `\b`        | Word boundary     |
| `\B`        | Not word boundary |

---

## Anchor Examples

```r
text = "the quick brown fox jumps over the lazy dog"
```

--

```r
str_replace(text,"^the","---")
```

```
## [1] "--- quick brown fox jumps over the lazy dog"
```

--

```r
str_replace(text,"^dog","---")
```

```
## [1] "the quick brown fox jumps over the lazy dog"
```

--

```r
str_replace(text,"the$","---")
```

```
## [1] "the quick brown fox jumps over the lazy dog"
```

--

```r
str_replace(text,"dog$","---")
```

```
## [1] "the quick brown fox jumps over the lazy ---"
```

---

## Anchor Examples

```r
text = "the quick brown fox jumps over the lazy dog"
```

--

```r
str_replace_all(text,"\\Brow\\B","---")
```

```
## [1] "the quick b---n fox jumps over the lazy dog"
```

--

```r
str_replace_all(text,"\\brow\\b","---")
```

```
## [1] "the quick brown fox jumps over the lazy dog"
```

--

```r
str_replace_all(text,"\\bthe","---")
```

```
## [1] "--- quick brown fox jumps over --- lazy dog"
```

--

```r
str_replace_all(text,"the\\b","---")
```

```
## [1] "--- quick brown fox jumps over --- lazy dog"
```

---

## More complex patterns

If there is more than one pattern we would like to match, we can use the or (`|`) metacharacter.

--

```r
str_replace_all(text,"the|dog","---")
```

```
## [1] "--- quick brown fox jumps over --- lazy ---"
```

--

```r
str_replace_all(text,"a|e|i|o|u","-")
```

```
## [1] "th- q--ck br-wn f-x j-mps -v-r th- l-zy d-g"
```

--

```r
str_replace_all(text,"\\ba|e|i|o|u","-")
```

```
## [1] "th- q--ck br-wn f-x j-mps -v-r th- lazy d-g"
```

--

```r
str_replace_all(text,"\\b(a|e|i|o|u)","-")
```

```
## [1] "the quick brown fox jumps -ver the lazy dog"
```

---

## Character Classes

When we want to match whole classes of characters at a time, there are a number of convenience patterns built in:

<br />

| Meta Char | Class | Description |
|:----:|:------------|:-|
| `.`  |             | Any character except new line (`\n`) |
| `\s` | `[:space:]` | White space |
| `\S` |             | Not white space |
| `\d` | `[:digit:]` | Digit (0-9) |
| `\D` |             | Not digit |
| `\w` |             | Word (A-Z, a-z, 0-9, or _) |
| `\W` |             | Not word |
|      | `[:punct:]` | Punctuation |

---

## A hierarchical view

<center>
<img src="imgs/regex_char_classes.png" width=450>
</center>

.small[
From http://perso.ens-lyon.fr/lise.vaudor/strings-et-expressions-regulieres/
]

---

## Example

How would we write a regular expression to match a telephone number with the form `(###) ###-####`?

```r
text = c("apple", "(219) 733-8965", "(329) 293-8753")
```

--

```r
str_detect(text, "(\d\d\d) \d\d\d-\d\d\d\d")
```

```
## Error: '\d' is an unrecognized escape in character string starting ""(\d"
```

--

```r
str_detect(text, "(\\d\\d\\d) \\d\\d\\d-\\d\\d\\d\\d")
```

```
## [1] FALSE FALSE FALSE
```

--

```r
str_detect(text, "\\(\\d\\d\\d\\) \\d\\d\\d-\\d\\d\\d\\d")
```

```
## [1] FALSE  TRUE  TRUE
```

---

## Classes and Ranges

We can also specify our own classes using the square bracket metacharacters.

<br />

| Class    | Type        |
|----------|:------------|
| `[abc]`  | Class (a or b or c) |
| `[^abc]` | Negated class (not a or b or c) |
| `[a-c]`  | Range (lower case letter from a to c) |
| `[A-C]`  | Range (upper case letter from A to C) |
| `[0-7]`  | Range (digit from 0 to 7) |

---

## Example

```r
text = c("apple", "(219) 733-8965", "(329) 293-8753")
```

--

```r
str_replace_all(text, "[aeiou]", "&")
```

```
## [1] "&ppl&"          "(219) 733-8965" "(329) 293-8753"
```

--

```r
str_replace_all(text, "[13579]", "*")
```

```
## [1] "apple"          "(2**) ***-8*6*" "(*2*) 2**-8***"
```

--

```r
str_replace_all(text, "[1-5a-ep]", "^")
```

```
## [1] "^^^l^"          "(^^9) 7^^-896^" "(^^9) ^9^-87^^"
```

---

## Exercises 1

For the following vector of randomly generated names, write a regular expression that

* detects if the person's first name starts with a vowel (a,e,i,o,u)

* detects if the person's last name starts with a vowel

* detects if either the person's first or last name starts with a vowel

* detects if neither the person's first nor last name starts with a vowel

```
c("Haven Giron", "Newton Domingo", "Kyana Morales", "Andre Brooks",
  "Jarvez Wilson", "Mario Kessenich", "Sahla al-Radi", "Trong Brown",
  "Sydney Bauer", "Kaleb Bradley", "Morgan Hansen", "Abigail Cho",
  "Destiny Stuckey", "Hafsa al-Hashmi", "Condeladio Owens", "Annnees el-Bahri",
  "Megan La", "Naseema el-Siddiqi", "Luisa Billie", "Anthony Nguyen"
)
```

---

## Quantifiers

Attached to literals or character classes, these allow a match to repeat
some number of times.

<br />

| Quantifier | Description |
|:-----------|:------------|
| `*`        | Match 0 or more |
| `+`        | Match 1 or more |
| `?`        | Match 0 or 1 |
| `{3}`      | Match exactly 3 |
| `{3,}`     | Match 3 or more |
| `{3,5}`    | Match 3, 4 or 5 |

---

## Example

How would we improve our previous regular expression for matching a telephone number with the form `(###) ###-####`?

```r
text = c("apple", "(219) 733-8965", "(329) 293-8753")
```

--

```r
str_detect(text, "\\(\\d\\d\\d\\) \\d\\d\\d-\\d\\d\\d\\d")
```

```
## [1] FALSE  TRUE  TRUE
```

--

```r
str_detect(text, "\\(\\d{3}\\) \\d{3}-\\d{4}")
```

```
## [1] FALSE  TRUE  TRUE
```

---

## Greedy vs ungreedy matching

What went wrong here?

```r
text = "<div class='main'> <div> <a href='here.pdf'>Here!</a> </div> </div>"
```

```r
str_extract(text, "<div>.*</div>")
```

```
## [1] "<div> <a href='here.pdf'>Here!</a> </div> </div>"
```

<br/>

--

If you add `?` after a quantifier, the matching will be *non-greedy* (find the shortest possible match, not the longest).

```r
str_extract(text, "<div>.*?</div>")
```

```
## [1] "<div> <a href='here.pdf'>Here!</a> </div>"
```

---

## Groups

Groups allow you to connect pieces of a regular expression for modification or capture.
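
To illustrate the "modification" half of that sentence, here is a minimal sketch of how a quantifier behaves with and without a group. The example strings in `x` are hypothetical and not taken from the other slides.

```r
library(stringr)

# Hypothetical example strings, used only for this illustration.
x <- c("ab", "abab", "abb")

# Without a group, + applies only to the preceding "b":
# this matches "a" followed by one or more "b".
no_group <- str_extract(x, "^ab+$")
print(no_group)

# With a group, + applies to the whole "ab" sequence:
# this matches one or more repetitions of "ab".
with_group <- str_extract(x, "^(ab)+$")
print(with_group)
```

The first pattern accepts `"ab"` and `"abb"` but not `"abab"`; the grouped pattern accepts `"ab"` and `"abab"` but not `"abb"`.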

<br />

| Group           | Description |
|-----------------|:------------|
| (a&#124;b)      | match literal "a" or "b", group either |
| `a(bc)?`        | match "a" or "abc", group bc or nothing |
| `(abc)def(hig)` | match "abcdefhig", group abc and hig |
| `(?:abc)`       | match "abc", non-capturing group |

---

## Example

```r
text = c("Bob Smith", "Alice Smith", "Apple")
```

--

```r
str_extract(text, "^[:alpha:]+")
```

```
## [1] "Bob"   "Alice" "Apple"
```

--

```r
str_match(text, "^([:alpha:]+) [:alpha:]+")
```

```
##      [,1]          [,2]
## [1,] "Bob Smith"   "Bob"
## [2,] "Alice Smith" "Alice"
## [3,] NA            NA
```

--

```r
str_match(text, "^([:alpha:]+) ([:alpha:]+)")
```

```
##      [,1]          [,2]    [,3]
## [1,] "Bob Smith"   "Bob"   "Smith"
## [2,] "Alice Smith" "Alice" "Smith"
## [3,] NA            NA      NA
```

---

## How not to use a RegEx

Validating an email address:

<br />

.small[
```
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")
@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[
(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
```
]

---

## Exercise 2

```r
text = c(
  "apple",
  "219 733 8965",
  "329-293-8753",
  "Work: (579) 499-7527; Home: (543) 355 3679"
)
```

* Write a regular expression that will extract *all* phone numbers contained in the vector above.

* Once that works, use groups to extract the area code separately from the rest of the phone number.

---

# Acknowledgments

---

## Acknowledgments

* Hadley Wickham - [stringr vignette](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html)

* David Child - [RegEx Cheat Sheet](http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/)

* [Regular-Expressions.info](http://www.regular-expressions.info/)