Data Science Desktop Survival Guide
by Graham Williams
Ignore Excessive Level Variables
20180723 Another issue we traditionally come across in our datasets is factors with very many levels. This is more common when we read data as factors rather than as character, and so this step depends on where the data has come from. Nonetheless, we might want to check for and ignore such variables.
# Identify a threshold above which we have too many levels.
levels.threshold <- 20
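# Identify variables that have too many levels. We assume here that
# every factor column of ds is a candidate.
too.many <- names(ds)[sapply(ds, is.factor)] %>%
  sapply(function(x) ds %>% pull(x) %>% levels() %>% length()) %>%
  '>'(levels.threshold) %>%
  which() %>%
  names() %T>%
  print()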
# Add them to the variables to be ignored for modelling.
ignore <- union(ignore, too.many) %T>% print()
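As a self-contained illustration of this check, the following base R sketch runs the same idea on a small made-up dataset. The data frame contents and the column names id, colour and value are hypothetical, invented only for this example, and base R's nlevels() stands in for the levels() %>% length() idiom so that no packages are needed.

```r
# Hypothetical toy dataset: id has 30 distinct levels, colour has 3.
ds <- data.frame(
  id     = factor(sprintf("id%02d", 1:30)),
  colour = factor(rep(c("red", "green", "blue"), times = 10)),
  value  = seq_len(30)
)

# Variables already marked to be ignored (empty here, for illustration).
ignore <- character(0)

# Threshold above which we consider a factor to have too many levels.
levels.threshold <- 20

# Restrict to the factor columns and count the levels of each.
is.fct       <- vapply(ds, is.factor, logical(1))
level.counts <- vapply(ds[is.fct], nlevels, integer(1))

# Keep the names of the factors whose level count exceeds the threshold.
too.many <- names(level.counts)[level.counts > levels.threshold]

# Add them to the variables to be ignored for modelling.
ignore <- union(ignore, too.many)
print(ignore)
```

Here only id exceeds the threshold, so it is the only variable added to ignore. Whether 20 is a sensible cut-off depends on the dataset and the modelling method, so the threshold is best treated as a tunable choice.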