Data Science Desktop Survival Guide
by Graham Williams
Ignore Missing
20180723 We next remove any variable where all of the values are missing. The weather dataset has no such variables, but other datasets with thousands of variables may well include some. Here we first count the number of missing values for each variable and then list the names of those variables whose count equals the number of observations, that is, the variables that have no values at all.
# Identify variables with only missing values.
ds[vars] %>%
  sapply(function(x) x %>% is.na() %>% sum()) %>%
  equals(nrow(ds)) %>%
  which() %>%
  names() %T>%
  print() ->
missing
# Add them to the variables to be ignored for modelling.
ignore <- union(ignore, missing) %T>% print()
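As a minimal sketch of the same idea on data we can inspect by eye, the following uses a small made-up data frame (the names toy, toy_vars, na_counts and all_missing are hypothetical and not part of the weather dataset). It swaps the pipeline for the base R function colSums(), which is an alternative formulation rather than the approach used above.

```r
# Hypothetical toy data frame standing in for ds; not the weather dataset.
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, NA, NA),
  c = c("x", "y", NA, "z"),
  stringsAsFactors = FALSE
)
toy_vars <- names(toy)

# Count the missing values in each variable.
na_counts <- colSums(is.na(toy[toy_vars]))

# A variable is entirely missing when its count equals the number of rows.
all_missing <- names(which(na_counts == nrow(toy)))
print(all_missing)
```

Only b is reported here, since a and c each have a single missing value.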
It is also useful to identify those variables which are very sparse, that is, those that have mostly missing values. We can choose a threshold for the proportion of missing values above which a variable is ignored as unlikely to add much value to our analysis. For example, we may want to ignore variables with more than 70% of the values missing:
# Identify a threshold above which proportion missing is fatal.
missing.threshold <- 0.7

# Identify variables that are mostly missing.
ds[vars] %>%
  sapply(function(x) x %>% is.na() %>% sum()) %>%
  '>'(missing.threshold*nrow(ds)) %>%
  which() %>%
  names() %T>%
  print() ->
mostly
# Add them to the variables to be ignored for modelling.
ignore <- union(ignore, mostly) %T>% print()
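The threshold test can also be expressed directly as a proportion. The sketch below is an illustration only, using a made-up data frame and hypothetical names (toy, threshold, prop_missing, mostly_missing, ignore_toy), and it uses base R colMeans() in place of the count-based pipeline above. Comparing the proportion missing to the threshold is equivalent to comparing the count to the threshold multiplied by the number of rows.

```r
# Hypothetical toy data frame; not the weather dataset.
toy <- data.frame(
  a = c(1, NA, 3, 4, 5),      # 1 of 5 values missing
  b = c(NA, NA, NA, NA, 5),   # 4 of 5 values missing
  c = c(NA, NA, NA, 4, 5),    # 3 of 5 values missing
  d = c(NA, NA, NA, NA, NA)   # all values missing
)

# Assumed threshold, matching the 70% used in the text.
threshold <- 0.7

# The mean of a logical matrix column is the proportion of TRUE values,
# so this gives the proportion of missing values per variable.
prop_missing <- colMeans(is.na(toy))

# Variables whose proportion missing exceeds the threshold.
mostly_missing <- names(which(prop_missing > threshold))
print(mostly_missing)

# union() drops duplicates, so a variable already flagged as entirely
# missing is not listed twice.
ignore_toy <- union("d", mostly_missing)
print(ignore_toy)
```

Note that an entirely missing variable such as d also exceeds the threshold, which is why union() rather than c() is the safer way to combine the two lists.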