10.45 Ignore Excessive Level Variables

20180723 Another issue we traditionally come across in our datasets is factors with very many levels. This is more common when we read character data in as factors rather than as character strings, so whether this step is needed depends on where the data has come from. Nonetheless, we might want to check for and ignore such variables.
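As a minimal sketch of why the data source matters, the same small CSV can be read both ways. The CSV text and the names csv, as.fac, and as.chr are made up for illustration and are not part of the dataset used in this chapter. Note that since R 4.0.0 the default for stringsAsFactors is FALSE, so factors with many levels are now less likely to arise by accident.

```r
# Hypothetical inline CSV, used only to illustrate the two ways of reading.
csv <- "name,score\nalice,1\nbob,2\ncarol,3"

# Read the character column as a factor (the default before R 4.0.0).
as.fac <- read.csv(text = csv, stringsAsFactors = TRUE)

# Read the character column as character (the default from R 4.0.0).
as.chr <- read.csv(text = csv, stringsAsFactors = FALSE)

class(as.fac$name)    # "factor"
class(as.chr$name)    # "character"
nlevels(as.fac$name)  # 3, one level per distinct name
```

An identifier-like column read as a factor ends up with roughly one level per row, which is the kind of variable we want to detect here.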

# Identify a threshold at or above which we have too many levels.

levels.threshold <- 20

# Identify variables that have too many levels.

ds[vars] %>%
  sapply(is.factor) %>%
  which() %>%
  names() %>%
  sapply(function(x) ds %>% pull(x) %>% levels() %>% length()) %>%
  '>='(levels.threshold) %>%
  which() %>%
  names() %T>%
  print() ->
too.many
## character(0)

# Add them to the variables to be ignored for modelling.

ignore <- union(ignore, too.many) %T>% print()
## [1] "date"     "location" "risk_mm"
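For this dataset no factor reaches the threshold, so too.many is empty and ignore is unchanged. To see the check flag a variable, here is a small base R sketch on a made-up data frame. The data frame toy and its columns are hypothetical, and the sketch uses nlevels() directly rather than the pipeline above, so it should be read as an equivalent illustration under those assumptions rather than as the code used for ds.

```r
# Hypothetical toy dataset: names and values are illustrative only.
toy <- data.frame(
  id     = factor(sprintf("id%02d", 1:30)),             # 30 distinct levels
  colour = factor(rep(c("red", "green", "blue"), 10)),  # 3 levels
  value  = 1:30                                         # not a factor
)

levels.threshold <- 20

# Identify the factor columns and count the levels of each.
is.fac <- sapply(toy, is.factor)
nlev   <- sapply(toy[is.fac], nlevels)

# Keep the names of the factors at or above the threshold.
too.many <- names(nlev)[nlev >= levels.threshold]
print(too.many)
## [1] "id"
```

Only id is reported, since colour has just 3 levels and value is not a factor.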


Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0.