6.10 Regexp Pattern Matching

20180608 Regular expressions are one of the most powerful concepts in string processing. A regular expression is a sequence of characters that describes a pattern. The concept was formalised by the American mathematician Stephen Cole Kleene. A regular expression pattern can contain a combination of alphanumeric and special characters. It is a complex topic, and we take only an introductory look here at crafting regular expressions in R.

Visit RegExr, by G Skinner, to explore regular expressions interactively.

An important concept is that of metacharacters, which have a special meaning within a regular expression. Whereas ordinary characters simply match themselves, a metacharacter stands for something other than the literal character, such as a position in the string or a class of characters. The following table lists common metacharacters used in regular expressions.

[Table: common regular expression metacharacters]
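
As a rough illustration of how a handful of widely used metacharacters behave, the following sketch applies each one to a small vector of words. The vector w and the particular selection of metacharacters are assumptions made for this example and are not taken from the table.

```r
# Illustrative words only; w is a hypothetical vector for this sketch.
w <- c("cat", "cot", "coat", "ct", "cart", "scat")

grep("^c", w, value=TRUE)           # ^ start of string: everything except "scat"
grep("^c.t$", w, value=TRUE)        # . any single character: "cat" and "cot"
grep("^c[ao]t$", w, value=TRUE)     # [ ] one character from a set: "cat" and "cot"
grep("^co*t$", w, value=TRUE)       # * zero or more of the previous item: "cot" and "ct"
grep("^co+t$", w, value=TRUE)       # + one or more of the previous item: "cot"
grep("^coa?t$", w, value=TRUE)      # ? zero or one of the previous item: "cot" and "coat"
grep("^(cat|cart)$", w, value=TRUE) # | alternation within a ( ) group: "cat" and "cart"
```

Anchoring each pattern with ^ and $ forces the whole string to match, which makes the effect of the single metacharacter in the middle easier to see.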

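Such metacharacters are used to build patterns, and the strings matching a pattern can be found using base::grep(). According to <gnu.org/software/grep>, the name comes from g/re/p, a command of the line editor ed that globally searches for a regular expression and prints the matching lines. With value=TRUE, grep() returns the matching elements themselves rather than their positions in the vector.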

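s <- c("hands", "data", "on", "data$cience", "handsondata$cience", "handson")
# ^ anchors the pattern to the start of the string.
grep(pattern="^data", s, value=TRUE)
## [1] "data"        "data$cience"
# $ anchors the pattern to the end of the string.
grep(pattern="on$", s, value=TRUE)
## [1] "on"      "handson"
# ( ) groups characters and . matches any single character: "nd", any two characters, "nd".
grep(pattern="(nd)..(nd)", s, value=TRUE)
## [1] "handsondata$cience"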
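
The calls above all pass value=TRUE. As a small sketch of the alternatives, the same vector can be queried for positions with a plain grep() call, or for a logical vector with base::grepl(). The vector s is repeated here only so that the example runs on its own.

```r
s <- c("hands", "data", "on", "data$cience", "handsondata$cience", "handson")

# Without value=TRUE, grep() returns the positions of the matching elements.
grep(pattern="^data", s)

# grepl() returns one TRUE or FALSE per element, which suits filtering.
grepl(pattern="^data", s)

# Logical indexing recovers the same strings that value=TRUE returns.
s[grepl(pattern="^data", s)]
```

The logical form is often the more convenient one when the result is used to subset a vector or the rows of a data frame.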

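To match a metacharacter literally in R we need to escape it with \\ (double backslash). The regular expression itself requires only a single backslash before the metacharacter, but a backslash inside an R string must itself be escaped, so the pattern \$ is written as "\\$".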

grep(pattern="\\$", s, value=TRUE)
## [1] "data$cience"        "handsondata$cience"
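
Escaping with a double backslash is not the only way to match a literal metacharacter. The sketch below, which repeats s so that it runs on its own, compares it with two alternatives that base::grep() supports: placing the character inside a bracket expression, and passing fixed=TRUE so that the pattern is treated as plain text. It also shows what happens when the dollar sign is left unescaped.

```r
s <- c("hands", "data", "on", "data$cience", "handsondata$cience", "handson")

# The R string "\\$" holds two characters: a backslash and a dollar sign.
nchar("\\$")

# Three ways to match a literal dollar sign, all selecting the same elements.
grep(pattern="\\$", s, value=TRUE)           # escaped metacharacter
grep(pattern="[$]", s, value=TRUE)           # literal inside a bracket expression
grep(pattern="$", s, value=TRUE, fixed=TRUE) # pattern taken as plain text

# Unescaped, $ matches the end of every string, so every element is returned.
grep(pattern="$", s, value=TRUE)
```

Using fixed=TRUE is a reasonable choice when the whole pattern is literal text, since no metacharacter in it then needs escaping.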


Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0