10.58 Ignore IDs and Outputs
20180723 The identifiers and any risk variable (which is an
output variable) should be ignored in any predictive modelling. Always
watch out for treating output variables as inputs to modelling—this
is a surprisingly common trap for beginners. We will build a vector of
the names of the variables to ignore. Above we have already recorded
the id variables and (optionally) the risk variable. Here we join them
together into a new vector using dplyr::union(), which performs a set
union operation: it joins the two arguments together and removes any
repeated variables.
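A minimal sketch of this step follows. It assumes that id holds the two identifier names and risk holds the risk variable name, consistent with the output shown below. The name ignore for the resulting vector is also an assumption made for illustration.

```r
library(dplyr)    # provides union()
library(magrittr) # provides the tee pipe %T>%

# Assumed values, standing in for the vectors recorded earlier.
id   <- c("date", "location")
risk <- "risk_mm"

# Join the identifiers and the risk variable, dropping any duplicates,
# and print the result as a side effect of the tee pipe.
ignore <- dplyr::union(id, risk) %T>% print()
```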
## [1] "date" "location" "risk_mm"
We might also check for any variable that has a unique value for every
observation. These are often identifiers and if so they are candidates
for ignoring. We select the vars from the dataset and pipe them
through to base::sapply() to find any variables having only
unique values. In our case there are no further candidate identifiers,
as indicated by the empty result, character(0).
# Heuristic for candidate identifiers to possibly ignore.
ds[vars] %>%
  sapply(function(x) x %>% unique() %>% length()) %>%
  equals(nrow(ds)) %>%
  which() %>%
  names() %T>%
  print() ->
ids
## character(0)
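Any candidates found can then be added to the variables to ignore. The following is a minimal sketch, assuming the vector of ignored variables is named ignore (a name chosen here for illustration) and that ids is empty, as found above.

```r
# Assumed state carried over from the earlier steps.
ignore <- c("date", "location", "risk_mm")
ids    <- character(0)

# Add any candidate identifiers to the variables to ignore.
# A set union with an empty vector leaves ignore unchanged.
ignore <- union(ignore, ids)
print(ignore)
```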
## [1] "date" "location" "risk_mm"
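Since the heuristic returns nothing for this dataset, a small case where it does find a candidate may help. This sketch uses a hypothetical toy data frame (the names toy, id and city are invented for illustration) and base R only.

```r
# Hypothetical toy data: 'id' is unique for every row, 'city' is not.
toy <- data.frame(
  id   = c(101, 102, 103, 104),
  city = c("Perth", "Perth", "Sydney", "Sydney"),
  stringsAsFactors = FALSE
)

# Count the distinct values in each column and keep the names of the
# columns that have as many distinct values as there are rows.
n_unique   <- sapply(toy, function(x) length(unique(x)))
candidates <- names(which(n_unique == nrow(toy)))
print(candidates)
## [1] "id"
```

A continuous numeric measurement can also be unique for every observation without being an identifier, so the result is only a list of candidates to review.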
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0