stringr is a string handling package written by Hadley Wickham, designed to simplify and improve string handling in R. Most of its functions are thin wrappers: originally around base R functions and, since version 1.0, around the stringi package.
| Function | Description |
|---|---|
| `str_detect(string, pattern)` | Detect the presence or absence of a pattern in a string. |
| `str_locate(string, pattern)` | Locate the first position of a pattern and return a matrix with start and end. |
| `str_extract(string, pattern)` | Extracts text corresponding to the first match. |
| `str_match(string, pattern)` | Extracts capture groups formed by () from the first match. |
| `str_split(string, pattern)` | Splits string into pieces and returns a list of character vectors. |
| `str_replace(string, pattern, replacement)` | Replaces the first matched pattern and returns a character vector. |
Many of these functions have variants with an `_all` suffix which will match more than one occurrence of the pattern in a given string.
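As a small illustration of the difference between the first-match functions and their `_all` variants, the sketch below runs both on a made-up string (`x` is a hypothetical example value, not part of the original notes). It assumes stringr is installed; the outputs in the comments are what a current stringr should return.

```r
library(stringr)

x <- "banana"  # hypothetical example string

# First-match versions stop after the first occurrence of "an"
str_locate(x, "an")          # matrix with start = 2, end = 3
str_extract(x, "an")         # "an"
str_replace(x, "an", "AN")   # "bANana"

# _all versions return every non-overlapping occurrence
str_extract_all(x, "an")         # list holding c("an", "an")
str_replace_all(x, "an", "AN")   # "bANANa"

# str_split always returns a list, one element per input string
str_split("a,b,c", ",")          # list holding c("a", "b", "c")
```

Note that the `_all` extractors return lists because each input string can produce a different number of matches.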
text = c("The","quick","brown","fox","jumps","over","the","lazy","dog")
str_detect(text,"quick")
## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
str_detect(text,"o")
## [1] FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
str_detect(text,"row")
## [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
An escape character is a character which results in an alternative interpretation of the following character(s). These vary from language to language but most often `\` is the escape character and it is followed by a single character.
Some common examples:
| Escape | Meaning |
|---|---|
| `\'` | single quote |
| `\"` | double quote |
| `\\` | backslash |
| `\n` | new line |
| `\r` | carriage return |
| `\t` | tab |
| `\b` | backspace |
| `\f` | form feed |
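To make the idea concrete, here is a minimal base R sketch (the example strings are made up) showing that an escape sequence is typed as two characters but stored as one, and that `print()` and `cat()` display it differently.

```r
# Each escape sequence is typed as two characters but stored as one
nchar("\n")    # 1
nchar("\\")    # 1
nchar("\"")    # 1

# print() shows the escaped form, cat() shows the characters themselves
print("a\tb\\c")   # [1] "a\tb\\c"
cat("a\tb\\c\n")   # a, a tab, then b\c
```

When checking what a pattern string really contains, `cat()` or `writeLines()` is usually the more useful of the two.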
The power of regular expressions comes from the ability to use a number of special metacharacters that modify how pattern matching is performed.
. ^ $ * + ? { } [ ] \ | ( )
Because of their special properties they cannot be matched directly; if you need to match one of them you must escape it first (precede it with `\`). One problem is that regex escapes live on top of regular string escapes, so the following causes an error:
str_detect("abc[def","\[")
We actually need to prefix `[` with `\\` because `\` is also an escape for normal character strings.
str_detect("abc[def","\\[")
## [1] TRUE
How do you detect if a string contains a `\` character?
cat("abc\\def\n")
## abc\def
str_detect("abc\\def","\\\\")
## [1] TRUE
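The same double escaping applies to every metacharacter, not just `[` and `\`. The sketch below uses `.` as an example with made-up file names; it also shows `fixed()`, a stringr helper that treats the pattern as literal text, as one possible way to avoid escaping altogether.

```r
library(stringr)

files <- c("data.csv", "datacsv")  # hypothetical file names

# An unescaped "." is a metacharacter and matches any character
str_detect(files, ".csv")          # TRUE TRUE

# Escaped once for the regex engine, once more for the R string
str_detect(files, "\\.csv")        # TRUE FALSE

# fixed() treats the whole pattern as literal text, so no escaping is needed
str_detect(files, fixed(".csv"))   # TRUE FALSE
```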
Sometimes we want to specify that our pattern occurs at a particular location; we can do this using anchors.
| Anchor | Meaning |
|---|---|
| `^` or `\A` | Start of string |
| `$` or `\Z` | End of string |
| `\b` | Word boundary |
| `\B` | Not word boundary |
text = "the quick brown fox jumps over the lazy dog"
str_replace(text,"^the","---")
## [1] "--- quick brown fox jumps over the lazy dog"
str_replace(text,"^dog","---")
## [1] "the quick brown fox jumps over the lazy dog"
str_replace(text,"the$","---")
## [1] "the quick brown fox jumps over the lazy dog"
str_replace(text,"dog$","---")
## [1] "the quick brown fox jumps over the lazy ---"
str_replace(text,"\\Athe","---")
## [1] "--- quick brown fox jumps over the lazy dog"
str_replace(text,"dog\\Z","---")
## [1] "the quick brown fox jumps over the lazy ---"
str_replace(text,"\\brow\\b","---")
## [1] "the quick brown fox jumps over the lazy dog"
str_replace(text,"\\Brow\\B","---")
## [1] "the quick b---n fox jumps over the lazy dog"
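Anchors are just as useful for detection as for replacement. As a rough sketch on a hypothetical vector of words, combining `^` and `$` forces the pattern to account for the whole string, and `\\b` restricts a match to a whole word.

```r
library(stringr)

words <- c("the", "then", "bathe", "other")  # hypothetical example words

str_detect(words, "^the")    # TRUE  TRUE  FALSE FALSE  (starts with "the")
str_detect(words, "the$")    # TRUE  FALSE TRUE  FALSE  (ends with "the")
str_detect(words, "^the$")   # TRUE  FALSE FALSE FALSE  (is exactly "the")

# Word boundaries: "the" must stand alone as a word
str_detect(c("in the end", "bathe"), "\\bthe\\b")   # TRUE FALSE
```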
When we want to match whole classes of characters at a time there are a number of convenience patterns built in:

| Pattern | Matches |
|---|---|
| `\s` | White space |
| `\S` | Not white space |
| `\d` | Digit (0-9) |
| `\D` | Not digit |
| `\w` | Word (A-Z, a-z, 0-9, or _) |
| `\W` | Not word |
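A short sketch of these classes in use, on a made-up string. Remember that each backslash has to be doubled inside an R string.

```r
library(stringr)

x <- "Order 66: 3 items_total"  # hypothetical example string

str_extract_all(x, "\\d")[[1]]   # "6" "6" "3"
str_replace_all(x, "\\s", "_")   # "Order_66:_3_items_total"

# "_" counts as a word character, so only the spaces and ":" are removed
str_replace_all(x, "\\W", "")    # "Order663items_total"
```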
We can also specify our own character groups through the construction of ranges:

| Pattern | Matches |
|---|---|
| `.` | Any character except new line (`\n`) |
| `[abc]` | Range (a or b or c) |
| `[^abc]` | Not (a or b or c) |
| `[a-q]` | Lower case letter from a to q |
| `[A-Q]` | Upper case letter from A to Q |
| `[0-7]` | Digit from 0 to 7 |
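The following sketch tries each kind of range on a small hypothetical vector. Matching is case sensitive by default, which is why `[a-q]` and `[A-Q]` give different answers.

```r
library(stringr)

x <- c("cat", "Dog", "x5", "rat")  # hypothetical example values

str_detect(x, "^[a-q]")    # TRUE  FALSE FALSE FALSE
str_detect(x, "^[A-Q]")    # FALSE TRUE  FALSE FALSE

# A leading ^ inside the brackets negates the set
str_detect(x, "[^a-z]")    # FALSE TRUE  TRUE  FALSE

str_extract(x, "[0-7]")    # NA NA "5" NA

# "." does not match a new line by default
str_detect("\n", ".")      # FALSE
```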
Quantifiers are attached to literals, character classes, ranges or groups to match repeats.

| Quantifier | Meaning |
|---|---|
| `*` | Match 0 or more |
| `+` | Match 1 or more |
| `?` | Match 0 or 1 |
| `{3}` | Match exactly 3 |
| `{3,}` | Match 3 or more |
| `{3,5}` | Match 3, 4 or 5 |
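To see how each quantifier behaves, the sketch below applies them to the literal `b` in strings containing 0, 1, 3 and 6 copies of it. The test vector is constructed for illustration only, and the anchors make sure the whole string has to match.

```r
library(stringr)

# "ac", "abc", "abbbc", "abbbbbbc": 0, 1, 3 and 6 b's between a and c
x <- paste0("a", strrep("b", c(0, 1, 3, 6)), "c")

str_detect(x, "^ab*c$")       # TRUE  TRUE  TRUE  TRUE
str_detect(x, "^ab+c$")       # FALSE TRUE  TRUE  TRUE
str_detect(x, "^ab?c$")       # TRUE  TRUE  FALSE FALSE
str_detect(x, "^ab{3}c$")     # FALSE FALSE TRUE  FALSE
str_detect(x, "^ab{3,}c$")    # FALSE FALSE TRUE  TRUE
str_detect(x, "^ab{3,5}c$")   # FALSE FALSE TRUE  FALSE
```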
Add a ? to a quantifier to make it ungreedy.
text = "<div class='main'> <div> <a href='here.pdf'>Here!</a> </div> </div>"
str_extract(text, "<div>.*</div>")
## [1] "<div> <a href='here.pdf'>Here!</a> </div> </div>"
str_extract(text, "<div>.*?</div>")
## [1] "<div> <a href='here.pdf'>Here!</a> </div>"
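The `?` suffix works on any quantifier, not only `*`. A small sketch on made-up strings:

```r
library(stringr)

# Greedy quantifiers take as much as they can, ungreedy ones as little
str_extract("aaaa", "a{2,3}")    # "aaa"
str_extract("aaaa", "a{2,3}?")   # "aa"
str_extract("aaaa", "a+")        # "aaaa"
str_extract("aaaa", "a+?")       # "a"

# Ungreedy matching is what lets _all find each tag pair separately
html <- "<b>x</b><b>y</b>"                # hypothetical snippet
str_extract_all(html, "<b>.*</b>")[[1]]   # "<b>x</b><b>y</b>"
str_extract_all(html, "<b>.*?</b>")[[1]]  # "<b>x</b>" "<b>y</b>"
```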
Groups bring together parts of a regular expression for modification or capture.

| Pattern | Meaning |
|---|---|
| `(a\|b)` | match literal a or b, group either |
| `a(bc)?` | match literal a or abc, group bc or “” |
| `(?:abc)` | Non-capturing group |
| `(abc)def(hig)` | match abcdefhig, group abc and hig |
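A minimal sketch of capturing versus non-capturing groups, using the `abcdefhig` string from the table. `str_match()` returns one column for the full match plus one column per capturing group, so swapping `(abc)` for `(?:abc)` removes a column. The variable names `m` and `n` are only for illustration.

```r
library(stringr)

# Two capturing groups: full match plus two group columns
m <- str_match("abcdefhig", "(abc)def(hig)")
m[1, ]     # "abcdefhig" "abc" "hig"

# (?:abc) still has to match, but is not reported as a group
n <- str_match("abcdefhig", "(?:abc)def(hig)")
n[1, ]     # "abcdefhig" "hig"

# Alternation inside a group: a or b, followed by a digit
str_extract(c("a1", "b2", "c3"), "(a|b)\\d")   # "a1" "b2" NA
```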
text = c("apple", "219 733 8965", "329-293-8753", "Work: 579-499-7527; Home: 543.355.3679")
phone = "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"
str_extract(text, phone)
## [1] NA "219 733 8965" "329-293-8753" "579-499-7527"
str_extract_all(text, phone)
## [[1]]
## character(0)
##
## [[2]]
## [1] "219 733 8965"
##
## [[3]]
## [1] "329-293-8753"
##
## [[4]]
## [1] "579-499-7527" "543.355.3679"
str_match(text[2:4], phone)
##      [,1]           [,2]  [,3]  [,4]
## [1,] "219 733 8965" "219" "733" "8965"
## [2,] "329-293-8753" "329" "293" "8753"
## [3,] "579-499-7527" "579" "499" "7527"
str_match_all(text[2:4], phone)
## [[1]]
##      [,1]           [,2]  [,3]  [,4]
## [1,] "219 733 8965" "219" "733" "8965"
##
## [[2]]
##      [,1]           [,2]  [,3]  [,4]
## [1,] "329-293-8753" "329" "293" "8753"
##
## [[3]]
##      [,1]           [,2]  [,3]  [,4]
## [1,] "579-499-7527" "579" "499" "7527"
## [2,] "543.355.3679" "543" "355" "3679"
Validating an email address:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
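In practice a much shorter pattern is often used as a rough check. The sketch below is a deliberately simplified stand-in, not the full expression above: `simple_email` and the test addresses are assumptions made for illustration, and the pattern will reject some valid addresses and accept some invalid ones. Because the patterns here use only lower case ranges, it wraps the pattern in stringr's `regex()` with `ignore_case = TRUE`.

```r
library(stringr)

# Simplified, hypothetical pattern: local part, "@", domain, "." and a TLD
simple_email <- "^[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,}$"

emails <- c("someone@example.com", "not an email",
            "A.B@sub.example.org", "missing@tld")

str_detect(emails, regex(simple_email, ignore_case = TRUE))
# TRUE FALSE TRUE FALSE
```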
For the following vector of randomly generated names, write a regular expression that,
detects if the person’s first name starts with a vowel (a,e,i,o,u)
detects if the person’s last name starts with a vowel
detects if either the person’s first or last name starts with a vowel
detects if neither the person’s first nor last name starts with a vowel
## c("Jarvez Nguyen", "Kaleb Cho", "Kyana Bradley", "Malik La",
## "Mario Morales", "Trong Nguyen", "Abigail Ohmie", "Anthony Kessenich",
## "Laura Gonzales", "Thomas Vue", "Nicolasa Soltero", "Sanjana Stuckey",
## "Destiny Langley", "Brianna Ortiz", "Condeladio Owens", "Joshua Wilson",
## "Abigail Adu", "Cassidy Chavez", "Megan Dorsey", "Maomao Brown"
## )
Write one or more regular expressions to extract the data contained in the XML file below:
<breakfast_menu>
<food>
<name>Belgian Waffles</name>
<price>$5.95</price>
<calories>650</calories>
<addon>
<name>Strawberries</name>
<price>$2.00</price>
<calories>250</calories>
</addon>
</food>
<food>
<name>French Toast</name>
<price>$4.50</price>
<calories>600</calories>
<addon>
<name>Strawberries</name>
<price>$2.00</price>
<calories>250</calories>
</addon>
</food>
</breakfast_menu>
Hadley Wickham - stringr vignette
David Child - RegEx Cheat Sheet