Data Science Desktop Survival Guide
by Graham Williams
Introducing Template Variables
20180721 A reference to the original dataset can be created using a template (or generic) variable. The new variable will be called ds (short for dataset).
# Reference the dataset through a generic variable.
ds <- weatherAUS
Both ds and weatherAUS now reference the same dataset in the computer's memory. R avoids making copies of datasets unnecessarily, so a simple assignment does not create a new copy. Only when we modify ds does R copy what has changed, and those modifications affect only the data referenced by ds, leaving weatherAUS as it was. Effectively, an extra copy of the dataset starts to grow in memory as we change the data from its original form: extra memory is used to store the columns that differ between the two datasets.
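As a minimal sketch of this behaviour, we can use a small made-up data frame (original, a stand-in for weatherAUS, with the hypothetical columns temp and rain) and check that changing ds leaves the original untouched. This shows only the observable effect, not the memory sharing itself.

```r
# A small stand-in data frame (hypothetical values, not weatherAUS).
original <- data.frame(temp = c(12.5, 18.1, 22.3),
                       rain = c(0.0, 3.2, 1.1))

# A simple assignment: ds now refers to the same data as original.
ds <- original

# Modifying ds affects only the data referenced by ds.
ds$temp <- ds$temp * 2

# The original data is unchanged; only ds holds the doubled values.
print(original$temp)
print(ds$temp)
```

The rain column was never modified, so there is no need for R to store a second copy of it.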
From here on we no longer refer to the dataset as weatherAUS but as ds. This allows the following analyses and processing to be rather generic, turning the R code into a template that requires only minor modification when used with a different dataset assigned to ds.
Often we will find that we can simply load a different dataset into memory, store it as ds and the remaining steps of our analyses and processing will essentially work unchanged.
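# Prepare for a templated analysis and processing.
library(magrittr)  # Provides the %<>% assignment pipe.
library(janitor)   # Provides clean_names().
dsname <- "weatherAUS"
ds <- get(dsname)
# Normalise the variable names.
ds %<>% clean_names(numerals="right")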
We are being a little tricky here: we record the dataset name in the variable dsname and then use the function base::get() to look up the dataset by that name and link it to the generic variable ds. We could simply assign the data to ds directly, as we saw above. Either way the generic variable ds refers to the same dataset. The use of base::get() allows us to be a little more generic in our template, since only the string stored in dsname needs to change for a new dataset. The final line uses janitor::clean_names() to normalise the variable names.
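To illustrate why the name lookup helps, the following sketch runs the same steps over two datasets that ship with R (iris and mtcars, chosen here only for illustration) simply by changing the string in dsname. The dims list is a hypothetical container for the results.

```r
# Hypothetical container for the results of the templated steps.
dims <- list()

for (dsname in c("iris", "mtcars")) {
  # Look up the dataset by name and link it to the generic variable.
  ds <- get(dsname)

  # The templated steps refer only to ds, never to a specific dataset.
  dims[[dsname]] <- dim(ds)
  cat(dsname, "has", nrow(ds), "rows and", ncol(ds), "columns\n")
}
```

Nothing inside the loop body names a particular dataset, which is what makes the code reusable as a template.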
The use of generic variables within a template for the tasks we perform on each new dataset has obvious advantages but we need to be careful. A disadvantage is that we may be working with several datasets and accidentally overwrite a previously processed dataset referenced using the same generic variable (ds). Processing a dataset might take some time and so accidentally losing the result is not an attractive proposition. Care needs to be taken to avoid this.
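One way to guard against this, offered as a sketch rather than a prescription, is to keep the processed dataset under its own name with base::assign(), or to save it to a file with base::saveRDS(), before reusing ds. The mtcars dataset, the ratio column, the _processed suffix and the temporary file are all assumptions made for this example.

```r
# Process a dataset through the generic variable.
dsname <- "mtcars"
ds <- get(dsname)
ds$ratio <- ds$mpg / ds$wt  # Stand-in for some lengthy processing.

# Keep the result under a dataset-specific name (hypothetical suffix).
assign(paste0(dsname, "_processed"), ds)

# Also save it to a file so it survives beyond the session.
fname <- file.path(tempdir(), paste0(dsname, "_processed.rds"))
saveRDS(ds, fname)

# Reusing ds for another dataset no longer loses the earlier work.
ds <- get("iris")
restored <- readRDS(fname)
```

Here mtcars_processed and restored both still hold the processed data after ds has been reassigned.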