Data Science Desktop Survival Guide
by Graham Williams
Save the Dataset
For a large dataset we may want to save it to a binary RData file once we have wrangled it into the right shape and collected the metadata. Loading a binary dataset is generally quicker than loading a CSV file. As an illustration, a CSV file with 2 million observations and 800 variables can take 30 minutes to read with utils::read.csv(), 5 minutes to save with base::save(), and 30 seconds to load with base::load().
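Timings like these depend heavily on the hardware and the data. As a rough sketch of how such a comparison might be measured, the following uses a small synthetic data frame. The object toy, its size, and the temporary file names are hypothetical and are not part of the dataset used in this guide.

```r
# Build a small synthetic data frame (hypothetical, for illustration only).
set.seed(42)
toy <- as.data.frame(matrix(rnorm(5e4 * 20), ncol = 20))

# Temporary files for the two formats.
csv_file   <- tempfile(fileext = ".csv")
rdata_file <- tempfile(fileext = ".RData")

# Write the same data as CSV and as binary RData.
write.csv(toy, csv_file, row.names = FALSE)
save(toy, file = rdata_file)

# Time reading each format back into R.
csv_time   <- system.time(from_csv <- read.csv(csv_file))
rdata_time <- system.time(loaded <- load(rdata_file))

# Report elapsed seconds for each approach.
print(csv_time["elapsed"])
print(rdata_time["elapsed"])
```

On a dataset this small the difference may be modest. The gap tends to widen as the number of observations and variables grows.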
# Timestamp for the dataset.
dsdate <- "_" %s+% format(Sys.Date(), "%y%m%d") %T>% print()
# Filename for the saved dataset
dsrdata <- dsname %s+% dsdate %s+% ".RData" %T>% print()
# Save relevant R objects to binary RData file.
save(ds, dsname, dspath, dsdate, nobs,
     vars, target, risk, id, ignore, omit,
     inputi, inputs, numi, numc, cati, catc,
     file=dsrdata)
Notice that in addition to the dataset (ds) we also store the collection of metadata. This begins with items such as the name of the dataset, the source file path, the date we obtained the dataset, the number of observations, the variables of interest, the target variable, the name of the risk variable (if any), the identifiers, the variables to ignore, and the observations to omit. We continue with the indices of the input variables and their names, the indices of the numeric variables and their names, and the indices of the categoric variables and their names.
load(dsrdata) %>% print()
We pipe the result of the call to base::load() into print() to ensure the result of the function call is printed. A call to base::load() returns its result, a character vector naming the objects it has restored, invisibly since we are primarily interested in its side-effect. The side-effect is to read the R binary data from disk and to make it available within our current R session.
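The invisible return value and the side-effect can both be seen in a small round trip. This is a minimal sketch using hypothetical toy objects in place of the real dataset and metadata. Loading into a separate environment through the envir= argument of base::load() is an addition not used in the text above, shown because it avoids overwriting objects of the same name in the current session.

```r
# Hypothetical toy objects standing in for the dataset and some metadata.
ds     <- data.frame(id = 1:3, x = c(0.5, 1.2, 3.4), y = c("a", "b", "a"))
dsname <- "toy"
nobs   <- nrow(ds)

# Save the objects to a temporary binary RData file.
toy_file <- tempfile(fileext = ".RData")
save(ds, dsname, nobs, file = toy_file)

# Called on its own, load() restores the objects but prints nothing,
# because its return value is invisible.
load(toy_file)

# Capturing the return value gives the names of the restored objects.
restored <- load(toy_file)
print(restored)

# Restore into a fresh environment so that existing objects called
# ds, dsname, or nobs in the current session are left untouched.
e <- new.env()
load(toy_file, envir = e)
print(ls(e))
```

Restoring into a separate environment is useful when we want to inspect a saved dataset without disturbing the one we are currently working on.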