10.70 Save the Dataset

For a large dataset we may want to save it to a binary RData file once we have wrangled it into the right shape and collected the metadata. Loading a binary dataset is generally quicker than loading a CSV file. For example, a CSV file with 2 million observations and 800 variables can take 30 minutes to read with utils::read.csv(), 5 minutes to save with base::save(), and 30 seconds to load with base::load().
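The relative cost of the three operations can be checked with a rough timing sketch. The data frame `demo` and the temporary file names below are hypothetical stand-ins rather than part of the weatherAUS workflow, and the timings will vary with hardware and data size. On a dataset this small the differences may be modest.

```r
# Hypothetical synthetic data standing in for a large dataset.
set.seed(42)
n    <- 1e5
demo <- data.frame(id=seq_len(n),
                   x=rnorm(n),
                   g=sample(letters, n, replace=TRUE))

# Temporary files so nothing is written to the working directory.
csvfile   <- tempfile(fileext=".csv")
rdatafile <- tempfile(fileext=".RData")

write.csv(demo, csvfile, row.names=FALSE)

# Elapsed seconds to read the CSV, save the binary file, and load it back.
tcsv  <- unname(system.time(fromcsv <- read.csv(csvfile))["elapsed"])
tsave <- unname(system.time(save(demo, file=rdatafile))["elapsed"])
tload <- unname(system.time(loaded <- load(rdatafile))["elapsed"])

print(c(read.csv=tcsv, save=tsave, load=tload))
```

Only the elapsed times are compared here. A fuller benchmark would repeat each step several times and use a dataset closer in size to the one of interest.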

# Timestamp for the dataset.

dsdate  <- "_" %s+% format(Sys.Date(), "%y%m%d") %T>% print()

## [1] "_260522"

# Filename for the saved dataset.

dsrdata <- dsname %s+% dsdate %s+% ".RData" %T>% print()

## [1] "weatherAUS_260522.RData"
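The `%s+%` operator comes from the stringi package and `%T>%` from magrittr. Where those packages are not attached, the same filename can be built with base R alone. This is a minimal sketch, assuming `dsname` holds "weatherAUS" as the printed filename suggests.

```r
# Assumed value, inferred from the printed filename in the text.
dsname <- "weatherAUS"

# Two-digit year, month, and day, prefixed with an underscore.
dsdate  <- paste0("_", format(Sys.Date(), "%y%m%d"))

# Concatenate the dataset name, timestamp, and file extension.
dsrdata <- paste0(dsname, dsdate, ".RData")

print(dsrdata)
```

The printed value depends on the date on which the code is run.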

# Save relevant R objects to binary RData file.

save(ds, dsname, dspath, dsdate, nobs,
     vars, target, risk, id, ignore, omit,
     inputi, inputs, numi, numc, cati, catc,
     file=dsrdata)

Notice that in addition to the dataset (ds) we also store the collection of metadata. This begins with items such as the name of the dataset, the source file path, the date we obtained the dataset, the number of observations, the variables of interest, the target variable, the name of the risk variable (if any), the identifiers, the variables to ignore, and the observations to omit. We continue with the indices of the input variables and their names, the indices of the numeric variables and their names, and the indices of the categoric variables and their names.

Each time we wish to use the dataset we can now simply base::load() it into R. The value that is invisibly returned by base::load() is a vector naming the R objects loaded from the binary RData file.

load(dsrdata) %>% print()

##  [1] "ds"     "dsname" "dspath" "dsdate" "nobs"   "vars"   "target" "risk"  
##  [9] "id"     "ignore" "omit"   "inputi" "inputs" "numi"   "numc"   "cati"  
## [17] "catc"

We pipe the result of the call to base::load() into print() to ensure the result of the function call is displayed. Surrounding the call with round brackets, as in (load(dsrdata)), would have the same effect. A call to base::load() returns its result invisibly since we are primarily interested in its side-effect. The side-effect is to read the binary R data from disk and to make it available within our current R session.
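The invisible return value and the side-effect can both be seen in a small self-contained sketch. The objects and the temporary file below are hypothetical stand-ins for the dataset and its metadata. Loading into a separate environment, as done here, is optional and simply keeps the example from overwriting objects in the current session.

```r
# Hypothetical stand-ins for the dataset and some of its metadata.
ds     <- data.frame(x=1:3, y=c("a", "b", "c"))
dsname <- "demo"
nobs   <- nrow(ds)

fname <- tempfile(fileext=".RData")
save(ds, dsname, nobs, file=fname)

# Load into a fresh environment rather than the current workspace.
e <- new.env()

# withVisible() reports both the returned value and whether it would print.
res <- withVisible(load(fname, envir=e))

print(res$visible)  # Whether load() would have printed its result.
print(res$value)    # Names of the objects that were restored.
print(ls(e))        # The side-effect: the objects now exist in e.
```

Wrapping the call in round brackets or piping it to print() makes the otherwise invisible value visible.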

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0