stringr is a string handling package written by Hadley Wickham that is designed to simplify and improve string handling in R. Most of its functions are wrappers: around base R functions in early releases, and around the stringi package since version 1.0.
Function | Description |
---|---|
`str_detect(string, pattern)` | Detect the presence or absence of a pattern in a string. |
`str_locate(string, pattern)` | Locate the first position of a pattern and return a matrix with start and end. |
`str_extract(string, pattern)` | Extract the text corresponding to the first match. |
`str_match(string, pattern)` | Extract capture groups formed by `()` from the first match. |
`str_split(string, pattern)` | Split a string into pieces and return a list of character vectors. |
`str_replace(string, pattern, replacement)` | Replace the first matched pattern and return a character vector. |
Many of these functions have variants with an `_all` suffix, which match every occurrence of the pattern in a given string rather than only the first.
text = c("The","quick","brown","fox","jumps","over","the","lazy","dog")
str_detect(text,"quick")
## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
str_detect(text,"o")
## [1] FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
str_detect(text,"row")
## [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
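To make the difference between a function and its `_all` variant concrete, here is a small sketch contrasting `str_extract()` with `str_extract_all()`. The variable name `sentence` is an illustrative choice and not part of the original examples.

```r
library(stringr)

# `sentence` is a hypothetical example string
sentence = "the quick brown fox jumps over the lazy dog"

# First match only: one element per input string
str_extract(sentence, "the")
## [1] "the"

# Every match: a list with one character vector per input string
str_extract_all(sentence, "the")
## [[1]]
## [1] "the" "the"
```

Note that the `_all` variant returns a list, since each input string can produce a different number of matches.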
An escape character is a character that changes the interpretation of the character(s) that follow it. These vary from language to language, but most often `\` is the escape character and it is followed by a single character.
Some common examples:
Escape | Meaning |
---|---|
`\'` | single quote |
`\"` | double quote |
`\\` | backslash |
`\n` | new line |
`\r` | carriage return |
`\t` | tab |
`\b` | backspace |
`\f` | form feed |
cat("a\"b")
## a"b
cat("a\tb")
## a b
cat("a\nb")
## a
## b
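One point worth checking for yourself is that an escape sequence is typed as two characters but stored as a single character. The sketch below uses only base R functions (`nchar()`, `print()`, `writeLines()`) to show the difference between how a string is written and what it contains.

```r
# The escape sequence \t counts as one character
nchar("a\tb")
## [1] 3

# print() shows the escaped form as you would type it
print("a\"b")
## [1] "a\"b"

# writeLines(), like cat(), shows the actual characters
writeLines("a\"b")
## a"b
```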
The power of regular expressions comes from the ability to use a number of special metacharacters that modify how pattern matching is performed.
. ^ $ * + ? { } [ ] \ | ( )
Because of their special properties, they cannot be matched directly; to match one of them you need to escape it first (precede it with `\`). One problem is that regex escapes live on top of regular string escapes, so two levels of escaping are needed.
Pattern | Regex | String |
---|---|---|
`.` | `\.` | `"\\."` |
`?` | `\?` | `"\\?"` |
`!` | `\!` | `"\\!"` |
str_detect("abc[def","\[")
## Error: '\[' is an unrecognized escape in character string starting ""\["
str_detect("abc[def","\\[")
## [1] TRUE
How do you detect if a string contains a `\` character?
cat("abc\\def\n")
## abc\def
str_detect("abc\\def","\\\\")
## [1] TRUE
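The two levels of escaping can be inspected directly. The following sketch, with the illustrative variable name `pattern`, shows that the R string `"\\."` holds the two-character regex `\.`, and how escaping changes what `.` matches.

```r
library(stringr)

# `pattern` is a hypothetical name; the string holds two characters: \ and .
pattern = "\\."
nchar(pattern)
## [1] 2

# writeLines() prints the regex as the regex engine receives it
writeLines(pattern)
## \.

# Unescaped, . matches any character
str_detect(c("a.b", "axb"), "a.b")
## [1] TRUE TRUE

# Escaped, it matches only a literal period
str_detect(c("a.b", "axb"), "a\\.b")
## [1] TRUE FALSE
```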
Sometimes we want to specify that our pattern occurs at a particular location; we can do this using anchors.
Anchor | Meaning |
---|---|
`^` or `\A` | Start of string |
`$` or `\Z` | End of string |
`\b` | Word boundary |
`\B` | Not word boundary |
text = "the quick brown fox jumps over the lazy dog"
str_replace(text,"^the","---")
## [1] "--- quick brown fox jumps over the lazy dog"
str_replace(text,"^dog","---")
## [1] "the quick brown fox jumps over the lazy dog"
str_replace(text,"the$","---")
## [1] "the quick brown fox jumps over the lazy dog"
str_replace(text,"dog$","---")
## [1] "the quick brown fox jumps over the lazy ---"
text = "the quick brown fox jumps over the lazy dog"
str_replace(text,"\\Brow\\B","---")
## [1] "the quick b---n fox jumps over the lazy dog"
str_replace(text,"\\bthe","---")
## [1] "--- quick brown fox jumps over the lazy dog"
str_replace(text,"the\\b","---")
## [1] "--- quick brown fox jumps over the lazy dog"
str_replace_all(text,"\\bthe","---")
## [1] "--- quick brown fox jumps over --- lazy dog"
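Anchors are also useful with `str_detect()` when a pattern should match only at the start, at the end, or as a whole word. The sketch below uses a made-up vector, `words`, chosen so that each anchor gives a different answer.

```r
library(stringr)

# `words` is a hypothetical example vector
words = c("the", "other", "theme", "bathe")

# Without anchors, "the" is found anywhere in the string
str_detect(words, "the")
## [1] TRUE TRUE TRUE TRUE

# Anchored to the start of the string
str_detect(words, "^the")
## [1] TRUE FALSE TRUE FALSE

# Anchored to the end of the string
str_detect(words, "the$")
## [1] TRUE FALSE FALSE TRUE

# Word boundaries on both sides: only the whole word matches
str_detect(words, "\\bthe\\b")
## [1] TRUE FALSE FALSE FALSE
```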
If there is more than one pattern we would like to match, we can use the or (`|`) metacharacter.
text = "the quick brown fox jumps over the lazy dog"
str_replace(text,"the|dog","---")
## [1] "--- quick brown fox jumps over the lazy dog"
str_replace(text,"the|row","---")
## [1] "--- quick brown fox jumps over the lazy dog"
str_replace_all(text,"the|dog","---")
## [1] "--- quick brown fox jumps over --- lazy ---"
str_replace_all(text,"the|row","---")
## [1] "--- quick b---n fox jumps over --- lazy dog"
When we want to match whole classes of characters at a time, there are a number of convenience patterns built in:
Pattern | Meaning |
---|---|
`.` | Any character except new line (`\n`) |
`\s` | White space |
`\S` | Not white space |
`\d` | Digit (0-9) |
`\D` | Not digit |
`\w` | Word (A-Z, a-z, 0-9, or _) |
`\W` | Not word |
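As a quick illustration of these shorthand classes, the sketch below applies several of them to a made-up string `x` containing letters, digits, a comma, a space, and a tab. Remember that each backslash has to be doubled inside an R string.

```r
library(stringr)

# `x` is a hypothetical example string; \t is a tab character
x = "Room 101,\tfloor 3"

# \s matches both the space and the tab
str_replace_all(x, "\\s", "_")
## [1] "Room_101,_floor_3"

# \d matches each digit
str_replace_all(x, "\\d", "#")
## [1] "Room ###,\tfloor #"

# \D matches everything that is not a digit
str_replace_all(x, "\\D", "")
## [1] "1013"

# \W matches everything that is not a word character
str_replace_all(x, "\\W", "")
## [1] "Room101floor3"
```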
From http://perso.ens-lyon.fr/lise.vaudor/strings-et-expressions-regulieres/
How would we write a regular expression to match a telephone number with the form `(###) ###-####`?
text = c("apple", "(219) 733-8965", "(329) 293-8753")
str_detect(text, "(\d\d\d) \d\d\d-\d\d\d\d")
## Error: '\d' is an unrecognized escape in character string starting ""(\d"
str_detect(text, "(\\d\\d\\d) \\d\\d\\d-\\d\\d\\d\\d")
## [1] FALSE FALSE FALSE
str_detect(text, "\\(\\d\\d\\d\\) \\d\\d\\d-\\d\\d\\d\\d")
## [1] FALSE TRUE TRUE
We can also specify our own character classes by constructing lists and ranges:
Pattern | Meaning |
---|---|
`[abc]` | List (a or b or c) |
`[^abc]` | Excluded list (not a or b or c) |
`[a-q]` | Range, lower case letter from a to q |
`[A-Q]` | Range, upper case letter from A to Q |
`[0-7]` | Digit from 0 to 7 |
text = c("apple", "(219) 733-8965", "(329) 293-8753")
str_replace_all(text, "[aeiou]", "")
## [1] "ppl" "(219) 733-8965" "(329) 293-8753"
str_replace_all(text, "[13579]", "*")
## [1] "apple" "(2**) ***-8*6*" "(*2*) 2**-8***"
str_replace_all(text, "[1-5a-ep]", "*")
## [1] "***l*" "(**9) 7**-896*" "(**9) *9*-87**"
For the following vector of randomly generated names, write regular expressions that:
detect if the person's first name starts with a vowel (a, e, i, o, u)
detect if the person's last name starts with a vowel
detect if either the person's first or last name starts with a vowel
detect if neither the person's first nor last name starts with a vowel
c("Haven Giron", "Newton Domingo", "Kyana Morales", "Andre Brooks", "Jarvez Wilson", "Mario Kessenich", "Sahla al-Radi", "Trong Brown", "Sydney Bauer", "Kaleb Bradley", "Morgan Hansen", "Abigail Cho", "Destiny Stuckey", "Hafsa al-Hashmi", "Condeladio Owens", "Annnees el-Bahri", "Megan La", "Naseema el-Siddiqi", "Luisa Billie", "Anthony Nguyen" )
Quantifiers are attached to literals, character classes, ranges, or groups to match repeats.
Quantifier | Meaning |
---|---|
`*` | Match 0 or more |
`+` | Match 1 or more |
`?` | Match 0 or 1 |
`{3}` | Match exactly 3 |
`{3,}` | Match 3 or more |
`{3,5}` | Match 3, 4 or 5 |
How would we improve our previous regular expression for matching a telephone number with the form `(###) ###-####`?
text = c("apple", "(219) 733-8965", "(329) 293-8753")
str_detect(text, "\\(\\d\\d\\d\\) \\d\\d\\d-\\d\\d\\d\\d")
## [1] FALSE TRUE TRUE
str_detect(text, "\\(\\d{3}\\) \\d{3}-\\d{4}")
## [1] FALSE TRUE TRUE
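The other quantifiers behave analogously. As a sketch, the made-up vector `x` below varies the number of times the letter u appears, so each quantifier accepts a different subset.

```r
library(stringr)

# `x` is a hypothetical example vector with zero, one, and two u's
x = c("color", "colour", "colouur")

# ? allows zero or one u
str_detect(x, "colou?r")
## [1] TRUE TRUE FALSE

# * allows zero or more
str_detect(x, "colou*r")
## [1] TRUE TRUE TRUE

# + requires at least one
str_detect(x, "colou+r")
## [1] FALSE TRUE TRUE

# {2} requires exactly two
str_detect(x, "colou{2}r")
## [1] FALSE FALSE TRUE
```

In each pattern the quantifier applies only to the `u` immediately before it; to repeat a longer sequence it would need to be wrapped in a group.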
What went wrong here?
text = "<div class='main'> <div> <a href='here.pdf'>Here!</a> </div> </div>"
str_extract(text, "<div>.*</div>")
## [1] "<div> <a href='here.pdf'>Here!</a> </div> </div>"
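Quantifiers are greedy by default: `.*` consumes as much as possible, so the match runs to the last `</div>` rather than the first. If you add `?` after a quantifier, the matching becomes lazy (non-greedy) and finds the shortest possible match instead of the longest.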
str_extract(text, "<div>.*?</div>")
## [1] "<div> <a href='here.pdf'>Here!</a> </div>"
Group together parts of a regular expression for modification or capture.
Pattern | Meaning |
---|---|
`(a\|b)` | Match literal a or b, group either |
`a(bc)?` | Match literal a or abc, group bc or "" |
`(?:abc)` | Non-capturing group |
`(abc)def(hig)` | Match abcdefhig, group abc and hig |
text = c("Bob Smith", "Alice Smith", "Apple")
str_extract(text, "^[:alpha:]+")
## [1] "Bob" "Alice" "Apple"
str_match(text, "^([:alpha:]+) [:alpha:]+")
##      [,1]          [,2]
## [1,] "Bob Smith"   "Bob"
## [2,] "Alice Smith" "Alice"
## [3,] NA            NA
str_match(text, "^([:alpha:]+) ([:alpha:]+)")
##      [,1]          [,2]    [,3]
## [1,] "Bob Smith"   "Bob"   "Smith"
## [2,] "Alice Smith" "Alice" "Smith"
## [3,] NA            NA      NA
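The non-capturing group `(?:...)` from the table has no example above, so here is a brief sketch of how it differs from an ordinary group in `str_match()`. The vector `people` is made up for illustration.

```r
library(stringr)

# `people` is a hypothetical example vector
people = c("Mr Smith", "Ms Jones")

# Both groups are captured: one column per group after the full match
str_match(people, "^(Mr|Ms) (\\w+)")
##      [,1]       [,2] [,3]
## [1,] "Mr Smith" "Mr" "Smith"
## [2,] "Ms Jones" "Ms" "Jones"

# (?:...) still groups the alternation but does not produce a column
str_match(people, "^(?:Mr|Ms) (\\w+)")
##      [,1]       [,2]
## [1,] "Mr Smith" "Smith"
## [2,] "Ms Jones" "Jones"
```

A non-capturing group is handy when parentheses are needed only to scope `|` or a quantifier and the matched text itself is not of interest.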
Validating an email address:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
text = c("apple", "219 733 8965", "329-293-8753", "Work: (579) 499-7527; Home: (543) 355 3679")
Write a regular expression that will extract all phone numbers contained in the vector above.
Once that works, use groups to extract the area code separately from the rest of the phone number.
Hadley Wickham - stringr vignette
David Child - RegEx Cheat Sheet