stringr is a package written by Hadley Wickham that is designed to improve and simplify string handling in R.
str_detect(string, pattern) | Detects the presence or absence of a pattern in a string.
str_locate(string, pattern) | Locates the first position of a pattern and returns a matrix with start and end.
str_extract(string, pattern) | Extracts text corresponding to the first match.
str_match(string, pattern) | Extracts capture groups formed by () from the first match.
str_split(string, pattern) | Splits string into pieces and returns a list of character vectors.
str_replace(string, pattern, replacement) | Replaces the first matched pattern and returns a character vector.
Many of these functions have variants with an _all suffix which will match more than one occurrence of the pattern in a given string.
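To illustrate the difference between the first-match functions and their _all variants, here is a minimal sketch. The sample string x and the variable names are made up for illustration.

```r
library(stringr)

x <- "the cat sat on the mat"

# First-match versions stop after the first occurrence
first_match   <- str_extract(x, "[a-z]at")     # "cat"
first_replace <- str_replace(x, "the", "a")    # "a cat sat on the mat"
first_loc     <- str_locate(x, "at")           # one-row matrix: start 6, end 7

# _all versions handle every occurrence; where the number of matches
# can vary they return a list with one element per input string
all_matches <- str_extract_all(x, "[a-z]at")[[1]]  # "cat" "sat" "mat"
all_replace <- str_replace_all(x, "the", "a")      # "a cat sat on a mat"
all_locs    <- str_locate_all(x, "at")[[1]]        # three-row matrix
```

Note the [[1]] on the list-returning variants: the input here has length one, so the list has a single element.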
text = c("The","quick","brown","fox","jumps","over","the","lazy","dog")
str_detect(text,"quick")
## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
str_detect(text,"o")
## [1] FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
str_detect(text,"row")
## [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
The power of regular expressions comes from the ability to use a number of special metacharacters that modify how the pattern matching is performed.
. ^ $ * + ? { } [ ] \ | ( )
Because of their special properties they cannot be matched directly. If you need to match one of them literally, you must escape it first (precede it with \). One problem is that the escape also has to survive R's own string parsing, as the following example shows:
str_detect("abc[def","[")
## Error in stri_detect_regex(string, pattern, opts_regex = attr(pattern, : Missing closing bracket on a bracket expression. (U_REGEX_MISSING_CLOSE_BRACKET)
str_detect("abc[def","\\[")
## [1] TRUE
Note that we actually need to prefix [ with \\ because \ is also an escape character in ordinary R strings.
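The double escaping can be checked directly. The sketch below (variable names are illustrative) prints the pattern as the regex engine receives it, and also shows stringr's fixed() modifier, which sidesteps escaping by treating the pattern as a literal string.

```r
library(stringr)

pattern <- "\\["

# writeLines() shows the string after R has processed its own escapes
writeLines(pattern)                 # prints \[
pattern_length <- nchar(pattern)    # 2: one backslash plus the bracket

# An unescaped . matches any character; an escaped one matches only a dot
any_char   <- str_detect("abc", ".")      # TRUE
no_dot     <- str_detect("abc", "\\.")    # FALSE
has_dot    <- str_detect("a.c", "\\.")    # TRUE

# fixed() matches the pattern literally, so no escaping is needed
literal_bracket <- str_detect("abc[def", fixed("["))   # TRUE
```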
^ or \A | Start of string
$ or \Z | End of string
\b | Word boundary
\B | Not word boundary
text = "The quick brown fox jumps over the lazy dog"
str_replace(text,"^The","-")
## [1] "- quick brown fox jumps over the lazy dog"
str_replace(text,"^dog","-")
## [1] "The quick brown fox jumps over the lazy dog"
str_replace(text,"The$","-")
## [1] "The quick brown fox jumps over the lazy dog"
str_replace(text,"dog$","-")
## [1] "The quick brown fox jumps over the lazy -"
str_replace(text,"\\AThe","-")
## [1] "- quick brown fox jumps over the lazy dog"
str_replace(text,"dog\\Z","-")
## [1] "The quick brown fox jumps over the lazy -"
str_replace(text,"\\bbrown\\b","-")
## [1] "The quick - fox jumps over the lazy dog"
str_replace(text,"\\Brow\\B","-")
## [1] "The quick b-n fox jumps over the lazy dog"
\s | White space
\S | Not white space
\d | Digit (0-9)
\D | Not digit
\w | Word character (A-Z, a-z, 0-9, or _)
\W | Not a word character
. | Any character except new line (\n)
[abc] | Any one of a, b or c
[^abc] | Not (a or b or c)
[a-q] | Lower case letter from a to q
[A-Q] | Upper case letter from A to Q
[0-7] | Digit from 0 to 7
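A few of the character classes above applied to a made-up string; this is only a sketch, and the string and variable names are illustrative.

```r
library(stringr)

x <- "Room 42, Floor 7"

digits      <- str_extract_all(x, "\\d")[[1]]     # "4" "2" "7"
capitals    <- str_extract_all(x, "[A-Z]")[[1]]   # "R" "F"
first_a_q   <- str_extract(x, "[a-q]")            # "o" (first lower case letter in a-q)
underscored <- str_replace_all(x, "\\s", "_")     # "Room_42,_Floor_7"
only_digits <- str_replace_all(x, "[^0-9]", "")   # "427"
word_chars  <- str_replace_all(x, "\\W", "")      # "Room42Floor7"
```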
Quantifiers are attached to literals, character classes, ranges or groups to match repeats.
* | Match 0 or more
+ | Match 1 or more
? | Match 0 or 1
{3} | Match exactly 3
{3,} | Match 3 or more
{3,5} | Match 3, 4 or 5
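The quantifiers can be compared side by side. In this sketch the patterns are anchored with ^ and $ so that the whole string has to match; the test vectors are made up for illustration.

```r
library(stringr)

x <- c("a", "aa", "aaa", "aaaa", "aaaaaa")

exactly_3  <- str_detect(x, "^a{3}$")     # FALSE FALSE TRUE FALSE FALSE
at_least_3 <- str_detect(x, "^a{3,}$")    # FALSE FALSE TRUE TRUE  TRUE
three_to_5 <- str_detect(x, "^a{3,5}$")   # FALSE FALSE TRUE TRUE  FALSE

y <- c("color", "colour", "colouur")

zero_or_one  <- str_detect(y, "^colou?r$")   # TRUE  TRUE FALSE
zero_or_more <- str_detect(y, "^colou*r$")   # TRUE  TRUE TRUE
one_or_more  <- str_detect(y, "^colou+r$")   # FALSE TRUE TRUE
```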
Add a ? to a quantifier to make it ungreedy.
text = "<div class='main'> <div> <a href='here.pdf'>Here!</a> </div> </div>"
str_extract(text, "<div>.*</div>")
## [1] "<div> <a href='here.pdf'>Here!</a> </div> </div>"
str_extract(text, "<div>.*?</div>")
## [1] "<div> <a href='here.pdf'>Here!</a> </div>"
Group together parts of a regular expression for modification or capture.
(a|b) | match literal a or b, group either
a(bc)? | match literal a or abc, group bc or ""
(?:abc) | Non-capturing group
(abc)def(hig) | match abcdefhig, group abc and hig
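As a small sketch of how capturing and non-capturing groups differ, str_match() returns one column for the full match plus one column per capturing group. The input strings and variable names here are illustrative.

```r
library(stringr)

# Two capturing groups: full match plus two group columns
both <- str_match("abcdefhig", "(abc)def(hig)")
# both[1, ] is "abcdefhig" "abc" "hig"

# Making the first group non-capturing drops its column
one <- str_match("abcdefhig", "(?:abc)def(hig)")
# one[1, ] is "abcdefhig" "hig"

# Alternation inside groups, applied to a vector of strings
alt <- str_match(c("cat", "dog"), "(c|d)(?:at|og)")
# alt[, 2] is "c" "d"
```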
text = c("apple", "219 733 8965", "329-293-8753", "Work: 579-499-7527; Home: 543.355.3679")
phone = "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"
str_extract(text, phone)
## [1] NA "219 733 8965" "329-293-8753" "579-499-7527"
str_extract_all(text, phone)
## [[1]]
## character(0)
##
## [[2]]
## [1] "219 733 8965"
##
## [[3]]
## [1] "329-293-8753"
##
## [[4]]
## [1] "579-499-7527" "543.355.3679"
str_match(text, phone)
##      [,1]           [,2]  [,3]  [,4]
## [1,] NA             NA    NA    NA
## [2,] "219 733 8965" "219" "733" "8965"
## [3,] "329-293-8753" "329" "293" "8753"
## [4,] "579-499-7527" "579" "499" "7527"
str_match_all(text, phone)
## [[1]]
##      [,1] [,2] [,3] [,4]
##
## [[2]]
##      [,1]           [,2]  [,3]  [,4]
## [1,] "219 733 8965" "219" "733" "8965"
##
## [[3]]
##      [,1]           [,2]  [,3]  [,4]
## [1,] "329-293-8753" "329" "293" "8753"
##
## [[4]]
##      [,1]           [,2]  [,3]  [,4]
## [1,] "579-499-7527" "579" "499" "7527"
## [2,] "543.355.3679" "543" "355" "3679"
Write one or more regular expressions to extract the data contained in the XML file below:
<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <calories>650</calories>
    <addon>
      <name>Strawberries</name>
      <price>$2.00</price>
      <calories>250</calories>
    </addon>
  </food>
  <food>
    <name>French Toast</name>
    <price>$4.50</price>
    <calories>600</calories>
    <addon>
      <name>Strawberries</name>
      <price>$2.00</price>
      <calories>250</calories>
    </addon>
  </food>
</breakfast_menu>
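One possible starting point, offered as a sketch rather than a full solution: a single pattern with three capture groups, applied with str_match_all(). The shortened xml string, the item pattern and the column names are assumptions made for illustration, and the pattern relies on name, price and calories always appearing in that order.

```r
library(stringr)

# A shortened copy of the menu, kept small for illustration
xml <- "<food>
  <name>Belgian Waffles</name>
  <price>$5.95</price>
  <calories>650</calories>
  <addon>
    <name>Strawberries</name>
    <price>$2.00</price>
    <calories>250</calories>
  </addon>
</food>"

# One capture group per field; \\s* absorbs the line breaks between tags
item <- paste0(
  "<name>(.*?)</name>\\s*",
  "<price>\\$([0-9.]+)</price>\\s*",
  "<calories>([0-9]+)</calories>"
)

m <- str_match_all(xml, item)[[1]]   # one row per name/price/calories triple

menu <- data.frame(
  name     = m[, 2],
  price    = as.numeric(m[, 3]),
  calories = as.integer(m[, 4])
)
```

This sketch does not distinguish a food from its addon; recovering that nesting is left as part of the exercise.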
Hadley Wickham - stringr vignette
David Child - RegEx Cheat Sheet