Normalise Variables

		Data Science Desktop Survival Guide by Graham Williams



20200912 To rename variables in a dataset we can use dplyr::rename_with(), which applies a function, like rattle::normVarNames(), to the variable names and replaces those names with the result of the function. A tidy alternative is janitor::clean_names() with the option numerals="right" to replicate rattle::normVarNames().
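As a minimal sketch of the dplyr::rename_with() approach, the snippet below applies rattle::normVarNames() to every column name of a small data frame. The data frame toy and its contents are hypothetical and only serve to illustrate the call. The dplyr and rattle packages are assumed to be installed.

```r
library(dplyr)   # Wrangling: rename_with().
library(rattle)  # Normalise names: normVarNames().

# Hypothetical toy data frame with inconsistently capitalised names.
toy <- data.frame(MinTemp    = c(8.0, 14.0),
                  WindDir9am = c("SW", "E"),
                  RISK_MM    = c(3.6, 0.0))

# Apply normVarNames() to all column names and keep the result.
renamed <- rename_with(toy, normVarNames)

names(renamed)
```

By default rename_with() applies the function to all columns. Its .cols argument can restrict the renaming to a selection of columns if only some names need normalising.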

The preferred variable naming style is suggested in Chapter 23: all variable names are lowercase with words separated by an underscore. This normalisation is useful when upper and lower case conventions are mixed inconsistently in names like Incm_tax_PyBl. Remembering how each name is capitalised while interactively exploring data with thousands of such variables can be quite a cognitive load. Yet we often see such variable names in practice, especially when we import data from databases, which are often case insensitive.
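One way to see which names in a dataset depart from this convention is to test them against a regular expression. The sketch below uses base R only. The vector vnames and the pattern are illustrative assumptions, not something prescribed by the guide.

```r
# Hypothetical variable names, some following the convention and some not.
vnames <- c("min_temp", "Incm_tax_PyBl", "wind_dir_9am", "RISK_MM")

# Assumed pattern: lowercase letters and digits, words separated by single underscores.
pattern <- "^[a-z][a-z0-9]*(_[a-z0-9]+)*$"

# Names that do not match the pattern are candidates for normalisation.
nonconforming <- vnames[!grepl(pattern, vnames)]

nonconforming
```

Here nonconforming contains "Incm_tax_PyBl" and "RISK_MM", the two names that include uppercase letters.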

The example below shows the transformation into the preferred normalised form.

# Normalise variable names.

library(janitor)  # Cleanup: clean_names().

library(magrittr) # Assignment pipe: %<>%.

names(ds)

##  [1] "Date"          "Location"      "MinTemp"       "MaxTemp"      
##  [5] "Rainfall"      "Evaporation"   "Sunshine"      "WindGustDir"  
##  [9] "WindGustSpeed" "WindDir9am"    "WindDir3pm"    "WindSpeed9am" 
## [13] "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"   "Pressure9am"  
## [17] "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"      
## [21] "Temp3pm"       "RainToday"     "RISK_MM"       "RainTomorrow"

ds %<>%
  clean_names(numerals="right")

names(ds)

##  [1] "date"            "location"        "min_temp"        "max_temp"   ...
##  [5] "rainfall"        "evaporation"     "sunshine"        "wind_gust_di...
##  [9] "wind_gust_speed" "wind_dir_9am"    "wind_dir_3pm"    "wind_speed_9...
## [13] "wind_speed_3pm"  "humidity_9am"    "humidity_3pm"    "pressure_9am...
## [17] "pressure_3pm"    "cloud_9am"       "cloud_3pm"       "temp_9am"   ...
## [21] "temp_3pm"        "rain_today"      "risk_mm"         "rain_tomorrow"

Notice the use of the assignment pipe here, as introduced in Chapter 3. Recall that the magrittr::%<>% operator pipes the dataset on the left-hand side to the function on the right-hand side and then returns the result to the left-hand side, overwriting the original contents of the memory referred to on the left-hand side.
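To make the behaviour of the assignment pipe concrete, the sketch below compares it with the ordinary pipe followed by an explicit assignment. The vectors x and y are hypothetical and chosen only for illustration.

```r
library(magrittr) # Pipes: %>% and %<>%.

x <- c(9, 16, 25)
y <- x

# Assignment pipe: pipe x into sqrt() and overwrite x with the result.
x %<>% sqrt()

# Equivalent long form: ordinary pipe plus explicit assignment.
y <- y %>% sqrt()

identical(x, y)
```

Both forms leave the same values in place, so identical() returns TRUE. The assignment pipe simply saves repeating the name of the object being updated.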

Copyright © 2000-2020 Togaware Pty Ltd. Creative Commons ShareAlike V4.