Data Science Desktop Survival Guide
by Graham Williams
Ignore Excessive Level Variables
20180723 Another issue we traditionally come across in our datasets is factors with very many levels. This is more common when we read data as factors rather than as character, and so whether this step is needed depends on where the data has come from and how it was read in. Nonetheless, we might want to check for and ignore such variables.
# Identify a threshold at or above which we have too many levels.
levels.threshold <- 20

# Identify variables that have too many levels.
ds[vars] %>%
  sapply(is.factor) %>%
  which() %>%
  names() %>%
  sapply(function(x) ds %>% pull(x) %>% levels() %>% length()) %>%
  '>='(levels.threshold) %>%
  which() %>%
  names() %T>%
  print() ->
too.many

# Add them to the variables to be ignored for modelling.
ignore <- union(ignore, too.many) %T>% print()
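
To see the check at work without any packages, here is a minimal base R sketch of the same idea. The toy dataset and its column names (gender, suburb, income) are made up for illustration, as is the helper name nlevels.by.var. Only ds, vars, ignore, levels.threshold and too.many follow the names used above.

```r
# Hypothetical toy dataset: a 2-level factor, a 25-level factor and a numeric.
ds <- data.frame(
  gender = factor(rep(c("f", "m"), 50)),
  suburb = factor(rep(paste0("s", 1:25), 4)),
  income = seq_len(100) / 10
)

# Assumed starting state: consider every column, ignore nothing yet.
vars   <- names(ds)
ignore <- character(0)

# Same threshold as above: flag factors with this many levels or more.
levels.threshold <- 20

# Find the factor columns, then count the levels of each one.
is.fac         <- vapply(ds[vars], is.factor, logical(1))
nlevels.by.var <- vapply(ds[vars][is.fac], nlevels, integer(1))

# Keep the names of the factors that reach the threshold.
too.many <- names(nlevels.by.var)[nlevels.by.var >= levels.threshold]

# Add them to the variables to be ignored for modelling.
ignore <- union(ignore, too.many)

print(nlevels.by.var)
print(ignore)
```

With this toy data gender has 2 levels and suburb has 25, so only suburb ends up in ignore. The numeric income column is never counted because it is not a factor.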