6.10 Regexp Pattern Matching
20180608 One of the most powerful string processing concepts is the regular expression. A regular expression is a sequence of characters that describes a pattern. The concept was formalised by the American mathematician Stephen Cole Kleene. A regular expression pattern can contain a combination of alphanumeric and special characters. It is a complex topic, and we take an introductory look at it here to craft regular expressions in R.
Visit RegExr, by G Skinner, to explore regular expressions interactively.
An important concept is that of metacharacters which have special meaning within a regular expression. Unlike other characters that are used to match themselves, metacharacters have a specific meaning beyond the character they represent. The following table contains a list of common metacharacters used in regular expressions.
[Table: common metacharacters used in regular expressions]

Such metacharacters are used to match different patterns, which can be found using base::grep(). According to <gnu.org/software/grep>, the name comes from g/re/p, a command in the line editor ed that globally searches for a regular expression and prints the matching lines.
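As a minimal sketch of how a few common metacharacters behave with base::grep(), consider the following. The sample vector `x` and the patterns are hypothetical and chosen only for illustration; they are not taken from the original text.

```r
# Hypothetical sample strings, used only for illustration.
x <- c("cat", "cot", "coat", "ct", "dog")

# "." matches any single character.
grep(pattern="c.t", x, value=TRUE)
## [1] "cat" "cot"

# "*" matches the preceding character zero or more times.
grep(pattern="co*t", x, value=TRUE)
## [1] "cot" "ct"

# "[ao]" matches exactly one of the listed characters.
grep(pattern="c[ao]t", x, value=TRUE)
## [1] "cat" "cot"

# "|" matches either of the two alternatives.
grep(pattern="dog|coat", x, value=TRUE)
## [1] "coat" "dog"
```

With value=TRUE, grep() returns the matching strings themselves rather than their positions in the vector.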
s <- c("hands", "data", "on", "data$cience", "handsondata$cience", "handson")
grep(pattern="^data", s, value=TRUE)
## [1] "data" "data$cience"
## [1] "on" "handson"
## [1] "handsondata$cience"
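The commands that produced the last two outputs above are not shown. As a sketch, the patterns below would reproduce them on the same vector; `on$` and `on.*data` are assumptions made for illustration and may differ from the patterns originally used.

```r
s <- c("hands", "data", "on", "data$cience", "handsondata$cience", "handson")

# Assumed pattern: "$" anchors the match to the end of the string,
# so only strings ending in "on" are returned.
grep(pattern="on$", s, value=TRUE)
## [1] "on" "handson"

# Assumed pattern: "on" followed somewhere later by "data";
# ".*" spans any characters (possibly none) in between.
grep(pattern="on.*data", s, value=TRUE)
## [1] "handsondata$cience"
```

The anchors `^` and `$` match positions rather than characters: the start and the end of the string respectively.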
In order to match a metacharacter in R we need to escape it with \\ (double backslash).
## [1] "data$cience" "handsondata$cience"
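The command behind the output above is not shown. A minimal sketch that produces it, assuming the metacharacter being escaped is the dollar sign, is given below. The fixed=TRUE variant is an alternative added for comparison, not something the original text uses.

```r
s <- c("hands", "data", "on", "data$cience", "handsondata$cience", "handson")

# Unescaped, "$" is the end-of-string anchor, so every string matches.
length(grep(pattern="$", s, value=TRUE))

# "\\$" in R source is the two-character regular expression \$,
# which matches a literal dollar sign.
grep(pattern="\\$", s, value=TRUE)
## [1] "data$cience" "handsondata$cience"

# Alternative: fixed=TRUE treats the pattern as a plain string,
# so no escaping is needed.
grep(pattern="$", s, value=TRUE, fixed=TRUE)
## [1] "data$cience" "handsondata$cience"
```

Two backslashes are needed because R itself interprets a backslash inside a string literal, so `"\\$"` reaches the regular expression engine as `\$`.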
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0