Data Science Desktop Survival Guide
by Graham Williams
Special Case Variable Name Transformations
20180721 When reviewing the variables of a dataset we often notice other changes that could be made to the variable names, perhaps to simplify the names or to clarify their meaning. The string processing functions provided by stringr come in handy for such processing.
In the following example we remove the prefix from each variable name, where the prefix consists of all characters up to and including the first underscore. This is useful where a dataset has prefixed each variable with a sequential number or some other code and we have no real use for such a prefix in our processing.
names(ds) %<>% str_replace("^[^_]*_", "")
This will take a variable name like ab123_tax_payable and convert it to tax_payable:
str_replace("ab123_tax_payable", "^[^_]*_", "")
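As a minimal sketch of the same transformation applied to a whole dataset, the following builds a small hypothetical data frame (the variable names and values are invented for illustration) and strips the prefixes. It uses a plain assignment rather than the magrittr %<>% operator so that stringr is the only package required.

```r
library(stringr)

# Hypothetical dataset whose variable names carry a coded prefix.
ds <- data.frame(ab123_tax_payable  = c(100, 250),
                 ab124_gross_income = c(1000, 2500),
                 id                 = c(1, 2))

# Same effect as names(ds) %<>% str_replace("^[^_]*_", "") but without magrittr.
names(ds) <- str_replace(names(ds), "^[^_]*_", "")

# Names without an underscore, like id, do not match and are left unchanged.
names(ds)
```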
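The odd looking characters in the pattern argument to stringr::str_replace() are a regular expression. Regular expressions are a very powerful concept and can get quite complex. The reader is referred to the many resources on-line that cover regular expressions. The regular expression is a pattern used to match some part of the variable name. The pattern here begins with ^, which anchors the match to the start of the name; [^_]* then matches any run of characters that are not an underscore; and the final _ matches the first underscore itself.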
The next argument to stringr::str_replace() is the replacement string. In this case we are replacing the matched pattern with an empty string.
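To see which part of a name the pattern matches, it can help to extract the match before replacing it. The sketch below is illustrative only: the variable names name, pattern, matched, cleaned and over are chosen for this example. It also shows why the anchored pattern suits stringr::str_replace(), which replaces only the first match, whereas an unanchored pattern passed to stringr::str_replace_all() would remove more than the prefix.

```r
library(stringr)

name    <- "ab123_tax_payable"
pattern <- "^[^_]*_"

# The part of the name that the pattern matches: the prefix and its underscore.
matched <- str_extract(name, pattern)

# Replacing that match with the empty string removes the prefix.
cleaned <- str_replace(name, pattern, "")

# Without the ^ anchor, replacing every match also strips "tax_".
over <- str_replace_all(name, "[^_]*_", "")

matched  # "ab123_"
cleaned  # "tax_payable"
over     # "payable"
```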
The example here is just one of the very many possible transformations we become used to in cleaning our datasets. The aim in transforming the variable names is to make them easier to use and to understand, both for ourselves and for others.