Data Science Desktop Survival Guide
by Graham Williams
Ignore IDs and Outputs
20180723 The identifiers and any risk variable (which is an output variable) should be ignored in any predictive modelling. Always watch out for treating output variables as inputs to modelling—this is a surprisingly common trap for beginners. We will build a vector of the names of the variables to ignore. Above we have already recorded the id variables and (optionally) the risk. Here we join them together into a new vector using lubridate::union() which performs a set union operation—that is, it joins the two arguments together and removes any repeated variables.
# Initialise ignored variables: identifiers and risk.
ignore <- union(id, risk) %T>% print()
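To make the set union behaviour concrete, here is a minimal sketch using base R's union() on made-up variable names. The names below are hypothetical and chosen only so that one of them appears in both vectors; they are not the variables recorded earlier in the guide.

```r
# Hypothetical variable names, for illustration only.
id   <- c("patient_id", "visit_id")
risk <- c("risk_score", "patient_id")  # deliberately repeats patient_id

# union() joins the two vectors and drops the repeated name.
ignore <- union(id, risk)
print(ignore)
# [1] "patient_id" "visit_id"   "risk_score"

# If no risk variable was recorded (NULL), the result is just the ids.
print(union(id, NULL))
# [1] "patient_id" "visit_id"
```

The order of first appearance is preserved, so the identifiers stay at the front of the vector.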
We might also check for any variable that has a unique value for every observation. These are often identifiers and, if so, they are candidates for ignoring. We select the vars from the dataset and pipe them through base::sapply() to count the distinct values of each variable, keeping the names of those whose count equals the number of observations. In our case there are no further candidate identifiers, as indicated by the empty result, character(0).
# Heuristic for candidate identifiers to possibly ignore.
ds[vars] %>%
  sapply(function(x) x %>% unique() %>% length()) %>%
  equals(nrow(ds)) %>%
  which() %>%
  names() %T>%
  print() ->
ids
# Add them to the variables to be ignored for modelling.
ignore <- union(ignore, ids) %T>% print()
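The heuristic above can be tried on a small toy data frame. The following is a sketch in base R only (no pipes), assuming a made-up dataset whose column names are hypothetical and unrelated to the dataset used in this guide. It also shows why the result is only a list of candidates: a continuous measurement with no repeated values is flagged just like a true identifier.

```r
# Toy data frame (hypothetical; not the dataset used in the text).
ds <- data.frame(
  record_no = 1:5,                            # unique per row: a true identifier
  city      = c("A", "B", "A", "C", "B"),     # repeated values
  temp      = c(20.1, 18.4, 21.7, 25.0, 19.3),# unique by chance, not an identifier
  stringsAsFactors = FALSE
)
vars <- names(ds)

# Count the distinct values in each column.
n_unique <- sapply(ds[vars], function(x) length(unique(x)))

# Keep the names of columns whose distinct count equals the number of rows.
ids <- names(which(n_unique == nrow(ds)))
print(ids)
# [1] "record_no" "temp"
```

Here temp is flagged alongside record_no, so the candidates are best reviewed by hand before being added to the ignore vector.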