Data Science Desktop Survival Guide
by Graham Williams

Ignore IDs and Outputs

20180723 The identifiers and any risk variable (which is an output variable) should be ignored in any predictive modelling. Always watch out for treating output variables as inputs to modelling. This is a surprisingly common trap for beginners, because a model that is given the outcome as an input will appear far more accurate than it really is. We will build a vector of the names of the variables to ignore. Above we have already recorded the id variables and (optionally) the risk. Here we join them together into a new vector using base::union(), which performs a set union operation: it combines the two arguments and removes any repeated values.

# Initialise ignored variables: identifiers and risk.

ignore <- union(id, risk) %T>% print()
## [1] "date"     "location" "risk_mm"
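
To make the set union behaviour concrete, here is a minimal base R sketch. The values assigned to id and risk are assumptions chosen to match the output shown above, and customer_id is a hypothetical name used only for illustration.

```r
# Assumed values, chosen to match the output shown in the text.
id   <- c("date", "location")
risk <- "risk_mm"

# union() combines both vectors and drops repeats,
# keeping each name at its first position.
ignore <- union(id, risk)
print(ignore)
## [1] "date"     "location" "risk_mm"

# A name already present ("location") is not added a second time;
# the hypothetical "customer_id" is appended once.
extended <- union(ignore, c("location", "customer_id"))
print(extended)
## [1] "date"        "location"    "risk_mm"     "customer_id"
```

Because repeats are dropped, it is safe to call union() again later as more variables to ignore are discovered.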

We might also check for any variable that has a unique value for every observation. These are often identifiers and if so they are candidates for ignoring. We select the vars from the dataset and pipe them through base::sapply() to count the distinct values of each variable, then keep the names of those variables whose count equals the number of observations. In our case there are no further candidate identifiers, as indicated by the empty result, character(0).

# Heuristic for candidate identifiers to possibly ignore.

ds[vars] %>%
  sapply(function(x) x %>% unique() %>% length()) %>%
  equals(nrow(ds)) %>%
  which() %>%
  names() %T>%
  print() ->
ids
## character(0)

# Add them to the variables to be ignored for modelling.

ignore <- union(ignore, ids) %T>% print()
## [1] "date"     "location" "risk_mm"
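
Since the dataset used here has no further candidates, the heuristic can be seen in action on a small made-up data frame. This is only a sketch in base R: the data frame toy and its column names are hypothetical and are not part of the dataset used in this guide.

```r
# Hypothetical toy data: only customer_id differs in every row.
toy <- data.frame(
  customer_id = c("a1", "b2", "c3", "d4"),
  rainfall    = c(0.0, 1.2, 0.0, 3.4),
  wind_dir    = c("N", "S", "N", "E"),
  stringsAsFactors = FALSE
)

# Count the distinct values in each column.
n_unique <- sapply(toy, function(x) length(unique(x)))

# Keep the names of columns with as many distinct values as rows.
candidates <- names(which(n_unique == nrow(toy)))
print(candidates)
## [1] "customer_id"
```

The result is only a list of candidates. A continuous measurement can also happen to differ in every row, so each flagged variable is worth reviewing before it is added to the ignore vector.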


Hosted by Togaware, a pioneer of free and open source software since 1984.
Copyright © 2000-2020 Togaware Pty Ltd. Creative Commons ShareAlike V4.