Data Science Desktop Survival Guide
by Graham Williams

Special Case Variable Name Transformations

20180721 When reviewing the variables of a dataset we often notice other changes that could be made to the variable names. This might be to simplify the variables or to clarify the meaning of the variable. The string processing functions provided by stringr come in handy for such processing.

In the following example we remove the prefix from each variable name, where the prefix is taken to be all characters up to and including the first underscore. This is useful where a dataset has prefixed each variable with a sequential number or some other code and we have no real use for such a prefix in our processing.

names(ds) %<>% str_replace("^[^_]*_", "")

This will take a variable name like ab123_tax_payable and convert it to tax_payable.

str_replace("ab123_tax_payable", "^[^_]*_", "")
## [1] "tax_payable"
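To see the transformation applied to a whole dataset, the following is a minimal sketch. The data frame ds and its column names are made up for illustration; the magrittr package is assumed to be installed because it supplies the %<>% assignment pipe.

```r
library(stringr)
library(magrittr)  # Supplies the %<>% assignment pipe.

# Hypothetical dataset whose variable names carry a coded prefix.
ds <- data.frame(ab123_tax_payable = c(100, 250),
                 ab124_net_income  = c(900, 1200),
                 id                = c(1, 2))

# Strip everything up to and including the first underscore.
names(ds) %<>% str_replace("^[^_]*_", "")

print(names(ds))
## [1] "tax_payable" "net_income"  "id"
```

Note that id has no underscore, so it is left unchanged, and only the first underscore-delimited segment is removed from the other names. If two variables differ only in their prefix, this transformation would produce duplicate names, so it is worth checking the result.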

The odd-looking characters in the argument to stringr::str_replace() are a regular expression. Regular expressions are a very powerful concept and can get quite complex; the reader is referred to the many online resources that cover them. The regular expression is a pattern used to match some part of the variable name. The pattern begins with ^, which anchors the match to the beginning of the variable name. Next comes [^_], a list of characters enclosed in square brackets. Since this list begins with ^, the listed characters are excluded from the matching, so [^_] matches any single character that is not an underscore. The * that follows specifies that the preceding pattern can be repeated zero or more times. The third component of the pattern is a literal underscore. Combined, this regular expression matches any sequence (including an empty sequence) of characters other than an underscore that is at the beginning of the variable name and is followed by an underscore.
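One way to check our reading of a pattern is to ask stringr what it actually matches. The sketch below uses stringr::str_extract() for this; the sample names are invented for illustration.

```r
library(stringr)

pattern <- "^[^_]*_"

# The match is everything up to and including the first underscore.
m1 <- str_extract("ab123_tax_payable", pattern)
m1
## [1] "ab123_"

# An empty sequence before the underscore also matches.
m2 <- str_extract("_hidden", pattern)
m2
## [1] "_"

# With no underscore there is nothing to match, so NA is returned.
m3 <- str_extract("income", pattern)
m3
## [1] NA

# [^_]* cannot cross an underscore, so the match stops at the first one.
m4 <- str_replace("a_b_c", pattern, "")
m4
## [1] "b_c"
```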

The next argument to stringr::str_replace() is the replacement string. In this case we are replacing the matched pattern with an empty string.
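As a small sketch of how the replacement argument behaves, the following compares an empty replacement with stringr::str_remove(), which is shorthand for the same thing, and with a non-empty replacement. The vector x and the prefix fy18_ are hypothetical, chosen only for illustration.

```r
library(stringr)

x <- c("ab123_tax_payable", "ab124_net_income")

# Replace the matched prefix with an empty string, as in the text.
removed <- str_replace(x, "^[^_]*_", "")

# str_remove() is equivalent to replacing the match with "".
same <- str_remove(x, "^[^_]*_")

# A non-empty replacement swaps the old prefix for a new one.
relabelled <- str_replace(x, "^[^_]*_", "fy18_")

print(removed)
## [1] "tax_payable" "net_income"
print(identical(removed, same))
## [1] TRUE
print(relabelled)
## [1] "fy18_tax_payable" "fy18_net_income"
```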

The example here is just one of the many possible transformations we become used to when cleaning our datasets. The aim in transforming the variable names is to make them easier to use and to understand, both for ourselves and for others.


Copyright © 2000-2020 Togaware Pty Ltd. Creative Commons ShareAlike V4.