Data Science Desktop Survival Guide by Graham Williams

## Correlated Numeric Variables

20200814 It is often useful to identify highly correlated variables. Such variables tend to record the same information in different ways, and they commonly arise when we combine data from different sources.

The correlations are calculated by dplyr::select()ing the numeric columns (those named in numc) from the dataset and passing them through to stats::cor(). With use="complete.obs" the resulting matrix of pairwise correlations is based on only the complete observations, so any observation with a missing value in one of the selected columns is ignored.
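As a minimal sketch of what use="complete.obs" does, the following uses only base R and the built-in airquality dataset as a stand-in for ds (that choice is an assumption for illustration, as are the object names). It compares the default behaviour of stats::cor() with the complete-observations variant.

```r
# airquality ships with R; its Ozone and Solar.R columns contain NAs.
# It stands in here for the book's ds (an assumption for illustration).
num <- airquality[sapply(airquality, is.numeric)]

# Default use = "everything": any pair involving a column with NAs is NA.
cm_all <- cor(num)

# use = "complete.obs": rows with any missing value are dropped first.
cm <- cor(num, use = "complete.obs")

# The same matrix results from removing incomplete rows explicitly.
cm_manual <- cor(na.omit(num))

print(round(cm, 2))
```

Because every correlation is then computed from the same set of rows, the entries of the matrix are directly comparable with one another.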

We set the upper triangle of the correlation matrix to NA since its values mirror those in the lower triangle and are thus redundant. Passing diag=TRUE to base::upper.tri() also sets the diagonal to NA, since each variable is always perfectly correlated with itself.
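The masking step can be seen in isolation on a small matrix. This is only a sketch: the three mtcars columns are chosen arbitrarily for illustration, and it uses index assignment rather than base::ifelse().

```r
# A 3 x 3 correlation matrix from three arbitrarily chosen mtcars columns.
m <- cor(mtcars[, c("mpg", "disp", "hp")])

# upper.tri() returns a logical matrix that is TRUE above the diagonal
# and, with diag = TRUE, on the diagonal as well.
m[upper.tri(m, diag = TRUE)] <- NA

# Only the three lower-triangle correlations remain.
print(round(m, 3))
```

Index assignment keeps the row and column names of the matrix, whereas base::ifelse() returns a result shaped like its test argument, which carries no names.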

The processing continues by making all values positive using base::abs(). After conversion to a base::data.frame() and then with dplyr::as_tibble(), the column names need to be reset using magrittr::set_colnames() because the masked matrix no longer carries the variable names. We add a new column naming the row variable using dplyr::mutate(), reshape the dataset from wide to long using tidyr::gather(), and then omit the missing correlations using stats::na.omit(). Finally the rows are dplyr::arrange()'d with the highest absolute correlations appearing first.

```r
# For the numeric variables generate a table of correlations

ds %>%
  select(all_of(numc)) %>%
  cor(use="complete.obs") %>%
  ifelse(upper.tri(., diag=TRUE), NA, .) %>%
  abs() %>%
  data.frame() %>%
  as_tibble() %>%
  set_colnames(numc) %>%
  mutate(var1=numc) %>%
  gather(var2, cor, -var1) %>%
  na.omit() %>%
  arrange(-abs(cor)) %T>%
  print() ->
mc
```

```
## # A tibble: 120 x 3
##   var1         var2           cor
##   <chr>        <chr>        <dbl>
## 1 temp_3pm     max_temp     0.984
## 2 pressure_3pm pressure_9am 0.962
## 3 temp_9am     min_temp     0.908
....
```

That could do with some work!
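One possible tidy-up is sketched below. It is an alternative rather than the approach used above: it masks the matrix by index assignment so the variable names survive, and it swaps tidyr::gather() for tidyr::pivot_longer(), which the tidyr documentation recommends for new code. The built-in mtcars dataset stands in for ds, which is an assumption for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# mtcars stands in for the book's ds; every column is numeric.
m <- abs(cor(mtcars, use = "complete.obs"))

# Mask the upper triangle and diagonal; indexing keeps the dimnames.
m[upper.tri(m, diag = TRUE)] <- NA

mc <- m %>%
  as.data.frame() %>%
  rownames_to_column("var1") %>%
  pivot_longer(-var1,
               names_to = "var2",
               values_to = "cor",
               values_drop_na = TRUE) %>%
  arrange(desc(cor))

print(mc)
```

Because the names are kept, the set_colnames() and mutate() steps are no longer needed, and values_drop_na=TRUE takes the place of the separate na.omit() step.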

TODO Explore corrr:

```r
ds %>%
  correlate() %>%
  shave() %>%
  fashion()

ds %>%
  correlate() %>%
  rearrange() %>%
  rplot()

ds %>%
  corrr::correlate() %>%
  shave() %>%
  stretch() %>%
  filter(abs(r) > 0.90)

ds %>%
  correlate() %>%
  focus()

ds %>%
  correlate() %>%
  network_plot()    # Fail
```
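As a starting point for that exploration, here is a minimal runnable sketch of the filtering variant. It assumes the corrr and dplyr packages are installed and uses the built-in mtcars dataset in place of ds; the 0.85 threshold is arbitrary. Note that corrr::correlate() defaults to use="pairwise.complete.obs", so use="complete.obs" is passed explicitly to match the approach above.

```r
library(dplyr)
library(corrr)

# mtcars stands in for the book's ds (an assumption for illustration).
high <- mtcars %>%
  select(where(is.numeric)) %>%                    # keep numeric columns only
  correlate(use = "complete.obs", quiet = TRUE) %>%
  shave() %>%                                      # upper triangle to NA
  stretch(na.rm = TRUE) %>%                        # long format: x, y, r
  filter(abs(r) > 0.85) %>%                        # arbitrary threshold
  arrange(desc(abs(r)))

print(high)
```

Unlike the table built earlier, r keeps its sign here, so strongly negative correlations can be told apart from strongly positive ones.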