10.27 Special Case Variable Name Transformations
20180721 When reviewing the variables of a dataset we often notice other changes that could be made to the variable names. This might be to simplify the variables or to clarify the meaning of the variable. The string processing functions provided by stringr (Wickham 2019b) come in handy for such processing.
In the following example we remove the prefix from each variable name, where the prefix consists of all characters up to and including the first underscore. This is useful where a dataset has prefixed each variable with a sequential number or some other code and we have no real use for such a prefix in our processing.
names(ds) %<>% str_replace("^[^_]*_", "")
This will take a variable name like ab123_tax_payable and convert it to tax_payable:
str_replace("ab123_tax_payable", "^[^_]*_", "")
## [1] "tax_payable"
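To see the same transformation applied to a whole dataset, the following is a minimal sketch that assumes a small, made-up data frame; the column names and values are hypothetical and serve only as illustration. It also makes explicit that the %<>% operator comes from the magrittr package, which needs to be attached alongside stringr.

```r
library(magrittr)  # provides the %<>% assignment pipe
library(stringr)   # provides str_replace()

# Hypothetical data frame, invented purely for illustration.
ds <- data.frame(ab123_tax_payable = c(100, 250),
                 x01_age           = c(34, 51),
                 income            = c(50000, 72000))

# Remove everything up to and including the first underscore.
names(ds) %<>% str_replace("^[^_]*_", "")

# The prefixed names lose their prefix; "income" has no underscore,
# so the pattern does not match and the name is left unchanged.
names(ds)
```

Under these assumptions the resulting names are tax_payable, age, and income.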
The odd looking characters in the argument to stringr::str_replace() are a regular expression. Regular expressions are a very powerful concept and can get quite complex. The reader is referred to the many resources on-line that cover regular expressions. The regular expression is a pattern used to match some part of the variable name. The pattern begins with ^ which anchors the match to the beginning of the variable name. This is followed by zero or more characters ([^_]*) that do not match the underscore (_). The * specifies that the preceding pattern can be repeated zero or more times. The preceding pattern here is actually a list of characters included between square brackets. Since this list begins with ^ the listed characters are excluded from the matching. That is, the pattern preceding the * will match any character that is not an underscore. The third component of the match is then an actual underscore. Combined, this regular expression matches any sequence (including an empty sequence) of characters (except for an underscore) that is at the beginning of the variable name and followed by an underscore.
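The components of the pattern can be explored one at a time. The sketch below uses stringr::str_extract() to show which part of a name each piece matches; the example names other than ab123_tax_payable are hypothetical, chosen to exercise the empty-prefix and no-underscore cases.

```r
library(stringr)

x <- "ab123_tax_payable"

# Anchor plus "zero or more characters that are not an underscore".
prefix_only <- str_extract(x, "^[^_]*")        # "ab123"

# The full pattern also consumes the first underscore.
prefix_full <- str_extract(x, "^[^_]*_")       # "ab123_"

# An empty sequence before the underscore still matches.
empty_prefix <- str_extract("_id", "^[^_]*_")  # "_"

# With no underscore in the name there is no match at all.
no_match <- str_extract("income", "^[^_]*_")   # NA
```

Because a name without an underscore produces no match, str_replace() leaves such names untouched rather than emptying them.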
The next argument to stringr::str_replace() is the replacement string. In this case we are replacing the matched pattern with an empty string.
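As a brief illustration of the replacement argument, the sketch below contrasts an empty replacement with a non-empty one. The replacement prefix "v_" is a hypothetical choice made only for this example. It also shows stringr::str_remove(), which for this purpose behaves the same as replacing the match with an empty string.

```r
library(stringr)

x <- "ab123_tax_payable"

# Empty replacement: the matched prefix is simply dropped.
dropped <- str_replace(x, "^[^_]*_", "")      # "tax_payable"

# Non-empty replacement: the matched prefix is swapped for new text.
# The "v_" prefix is hypothetical, for illustration only.
swapped <- str_replace(x, "^[^_]*_", "v_")    # "v_tax_payable"

# str_remove() is shorthand for replacing the match with "".
removed <- str_remove(x, "^[^_]*_")           # "tax_payable"
```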
The example here is just one of the many possible transformations we become used to in cleaning our datasets. The aim in transforming the variable names is to make them easier to use and to understand, both for ourselves and for others.
Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0.