Data Science Desktop Survival Guide
by Graham Williams


Correlated Numeric Variables

20200814 It is often useful to identify highly correlated variables. Such variables typically record the same information in different ways and commonly arise when we combine data from different sources.

The correlations are calculated by dplyr::select()ing the numeric columns from the dataset and passing them through to stats::cor(). With use="complete.obs" the matrix of pairwise correlations is based only on the complete observations, so any observation with a missing value is ignored.
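
As a minimal sketch of what use="complete.obs" does, the toy data frame below (hypothetical, not from the book's dataset) has one missing value. The result should match what we get by dropping the incomplete row ourselves before calling stats::cor().

```r
# Toy data (hypothetical): the second row has a missing value in y.
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2, NA, 6, 8, 10),
                  z = c(5, 3, 4, 1, 2))

# use="complete.obs" drops every row containing any NA before
# computing the pairwise correlations.
m <- cor(toy, use = "complete.obs")

# The same matrix, obtained by removing incomplete rows first.
m_manual <- cor(toy[complete.cases(toy), ])

print(round(m, 3))
```

Note that the whole row is dropped, so the values of x and z in the second row do not contribute to any of the correlations.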

We set the upper triangle of the correlation matrix to NA as it mirrors the values in the lower triangle and is thus redundant. Setting diag=TRUE in base::upper.tri() also sets the diagonal to NA, since each variable is always perfectly correlated with itself.
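
The masking step can be illustrated on a small matrix. This is only a sketch: it uses three columns of the built-in mtcars dataset rather than the book's ds, and indexed assignment rather than base::ifelse().

```r
# A small symmetric correlation matrix from a built-in dataset.
m <- cor(mtcars[, c("mpg", "wt", "hp")])

# upper.tri(m, diag=TRUE) is TRUE on and above the diagonal.
mask <- upper.tri(m, diag = TRUE)

# Blank out the redundant cells, keeping only the lower triangle.
m_lower <- m
m_lower[mask] <- NA

print(round(m_lower, 2))
```

Indexed assignment keeps the row and column names of the matrix, whereas base::ifelse() returns a result shaped like the logical mask, which carries no names.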

The processing continues by making all values positive using base::abs(). The base::ifelse() step drops the row and column names of the matrix, so after conversion with base::data.frame() and then dplyr::as_tibble() the column names need to be reset using magrittr::set_colnames(). We add a new column identifying the row variable using dplyr::mutate(), reshape the dataset into long form using tidyr::gather(), and then omit missing correlations using stats::na.omit(). Finally the rows are dplyr::arrange()'d with the highest absolute correlations appearing first.

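# For the numeric variables generate a table of correlations.
# Assumes dplyr, tidyr and magrittr are attached, and that ds (the
# dataset) and numc (the names of its numeric columns) are defined.

ds %>%
  select(all_of(numc)) %>%
  cor(use="complete.obs") %>%
  ifelse(upper.tri(., diag=TRUE), NA, .) %>%  # Keep the lower triangle only.
  abs() %>%
  data.frame() %>%
  as_tibble() %>%
  set_colnames(numc) %>%                      # Names were dropped by ifelse().
  mutate(var1=numc) %>%
  gather(var2, cor, -var1) %>%
  na.omit() %>%
  arrange(-abs(cor)) %>%
  print()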
## # A tibble: 120 x 3
##    var1           var2              cor
##    <chr>          <chr>           <dbl>
##  1 temp_3pm       max_temp        0.984
##  2 pressure_3pm   pressure_9am    0.962
##  3 temp_9am       min_temp        0.908

That could do with some work!
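
One possible tidy-up, offered as a sketch rather than as the book's method, is to pull the lower-triangle cells out of the matrix directly in base R. The helper name cor_pairs() is hypothetical, and the built-in mtcars dataset stands in for ds.

```r
# Hypothetical helper: list each pair of variables once, ordered by
# absolute correlation. Assumes every column of df is numeric.
cor_pairs <- function(df) {
  m   <- abs(cor(df, use = "complete.obs"))
  # Row/column positions of the cells strictly below the diagonal.
  idx <- which(lower.tri(m), arr.ind = TRUE)
  out <- data.frame(var1 = rownames(m)[idx[, "row"]],
                    var2 = colnames(m)[idx[, "col"]],
                    cor  = m[idx],
                    stringsAsFactors = FALSE)
  out <- out[order(-out$cor), ]   # Highest absolute correlation first.
  rownames(out) <- NULL
  out
}

cp <- cor_pairs(mtcars)
print(head(cp, 3))
```

Because only the lower triangle is ever extracted, there is no need to reset column names or to drop missing values afterwards.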

TODO Explore corrr:

ds %>%
  correlate() %>%
  shave()

ds %>%
  correlate() %>%
  rearrange()

ds %>%
  corrr::correlate() %>%
  shave() %>%
  stretch() %>%
  filter(abs(r) > 0.90)

ds %>%
  correlate()

ds %>%
  correlate() %>%
  network_plot()    # Fail

Hosted by Togaware, a pioneer of free and open source software since 1984.
Copyright © 2000-2020 Togaware Pty Ltd. Creative Commons ShareAlike V4.