Data Science Desktop Survival Guide
by Graham Williams
Correlated Numeric Variables
20200814 It is often useful to identify highly correlated variables. Such variables typically record the same information in different ways and commonly arise when we combine data from different sources.
The correlation is calculated by dplyr::select()ing the numeric columns from the dataset and passing that through to stats::cor(). This matrix of pairwise correlations is based on only the complete observations so that observations with missing values are ignored.
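As a minimal sketch of this first step, the following uses a small made-up dataset (the name ds and its columns are purely illustrative). It assumes use="complete.obs" is the setting that gives the complete-observations behaviour described above, and compares the result with dropping incomplete rows by hand.

```r
library(dplyr)

# Hypothetical toy dataset; names and values are illustrative only.
ds <- data.frame(
  x     = c(1, 2, 3, 4, NA),
  y     = c(2, 4, 6, 8, 10),
  z     = c(5, 3, 4, 1, 2),
  label = c("p", "q", "p", "q", "p")
)

# Keep the numeric columns and correlate using complete observations only.
m <- ds %>%
  select(where(is.numeric)) %>%
  cor(use = "complete.obs")

# The same matrix, obtained by removing incomplete rows before calling cor().
m_manual <- cor(na.omit(ds[c("x", "y", "z")]))

print(m)
```

The fifth row has a missing x and so contributes to none of the correlations, including the one between y and z.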
We set the upper triangle of the correlation matrix to NA since its values mirror those in the lower triangle and are thus redundant. We also set diag=TRUE so that the diagonal is set to NA as well, since each variable is always perfectly correlated with itself.
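The masking step can be illustrated on its own with a small invented correlation matrix (the values and the names a, b, c are assumptions for illustration). The sketch also shows that base::ifelse() takes its shape from the logical mask, so the row and column names are lost along the way.

```r
# Toy symmetric correlation matrix; values invented for illustration.
m <- matrix(c( 1.0, 0.8, -0.3,
               0.8, 1.0,  0.1,
              -0.3, 0.1,  1.0),
            nrow = 3,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# TRUE on and above the diagonal, FALSE strictly below it.
mask <- upper.tri(m, diag = TRUE)

# Replace the masked cells with NA; only the lower triangle survives.
lower <- ifelse(mask, NA, m)

print(lower)
print(dimnames(lower))  # dimnames are dropped because the mask has none
```

Only the three values strictly below the diagonal remain, one for each distinct pair of variables.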
The processing continues by making all values positive using base::abs(). With conversion to base::data.frame() then to dplyr::as_tibble() the dataset column names need to be reset appropriately using magrittr::set_colnames(). We dplyr::mutate() the dataset to add a new column, var1, reshape the dataset using tidyr::gather() and then omit missing correlations using stats::na.omit(). Finally the rows are dplyr::arrange()'d with the highest absolute correlations appearing first.
# For the numeric variables generate a table of correlations
ifelse(upper.tri(., diag=TRUE), NA, .) %>%
gather(var2, cor, -var1) %>%
That could do with some work!
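The full sequence of steps described above might be assembled along the following lines. This is a sketch under stated assumptions rather than the original listing: the toy dataset ds, the vector numc of numeric column names, the result name tc, the use="complete.obs" setting and the choice to fill var1 with the variable names are all assumptions made for illustration.

```r
library(dplyr)     # select(), mutate(), arrange(), as_tibble(), %>%
library(tidyr)     # gather()
library(magrittr)  # set_colnames()

# Hypothetical toy dataset; names and values are illustrative only.
ds <- data.frame(
  height_cm = c(150, 160, 170, 180, 190, NA),
  height_in = c(59.1, 63.0, 66.9, 70.9, 74.8, 78.7),
  weight_kg = c(52, 70, 61, 90, 80, 85),
  group     = c("a", "b", "a", "b", "a", "b")
)

# Names of the numeric columns (assumed helper), used to restore the
# column names and to label the rows.
numc <- ds %>% select(where(is.numeric)) %>% names()

tc <- ds %>%
  select(all_of(numc)) %>%
  cor(use = "complete.obs") %>%                 # complete observations only
  ifelse(upper.tri(., diag = TRUE), NA, .) %>%  # keep the lower triangle
  abs() %>%
  data.frame() %>%
  as_tibble() %>%
  set_colnames(numc) %>%                        # names were lost by ifelse()
  mutate(var1 = numc) %>%                       # label each row
  gather(var2, cor, -var1) %>%                  # one row per variable pair
  na.omit() %>%
  arrange(desc(cor))                            # strongest correlations first

print(tc)
```

On this toy data the two height columns, which record the same measurement in different units, appear in the first row. Note that tidyr::gather() is superseded in current tidyr releases, where tidyr::pivot_longer() is the suggested replacement.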