Data Science Desktop Survival Guide
by Graham Williams
In much of our modelling we will randomly sample datasets, and that sampling relies on a random number sequence. Such a sequence can be made repeatable by initialising the random number generator with a “randomly” selected seed. We do this so that we can replicate the examples presented throughout this book. We will shortly identify a random training dataset as a subset of the whole dataset. To ensure the same random subset is selected each time, we initialise the random number generator with a specific seed using base::set.seed(). For no particular reason we choose 42 as the seed.
seed <- 42
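As a minimal sketch of how the seed makes a random subset repeatable, the following passes the seed to base::set.seed() before each call to base::sample(). The dataset size nobs and the 70% training proportion are assumptions chosen purely for illustration, as are the variable names train and train_again.

```r
# Initialise the random number generator with the chosen seed.
seed <- 42
set.seed(seed)

# Hypothetical number of observations, for illustration only.
nobs <- 100

# Randomly select 70% of the row indices as a training subset.
train <- sample(nobs, round(0.7 * nobs))

# Re-initialising with the same seed and sampling again
# selects exactly the same subset.
set.seed(seed)
train_again <- sample(nobs, round(0.7 * nobs))

# TRUE when both subsets contain the same indices in the same order.
identical(train, train_again)
```

Without the second call to set.seed() the generator would simply continue its sequence, and the two subsets would in general differ.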
It is worth noting that many model builders use heuristics to search for a good model rather than the best model. This is often necessary because finding the best model means searching an enormous space of all possible models, and the computational requirements will generally be prohibitive, potentially running to years of computer time. Our algorithms use heuristics to reduce the computational requirements to something feasible. Such heuristics often involve some level of random decision making in choosing which paths to follow in the search. Setting the seed for the random number generator to a known initial value ensures that we can replicate the model building later.
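To illustrate, here is a small sketch using stats::kmeans(), a model builder that starts from randomly chosen cluster centres and then searches heuristically for a good, not necessarily best, clustering. The choice of kmeans(), the built-in iris dataset, and three clusters are assumptions made only for this illustration and are not prescribed by the text above.

```r
# The built-in iris dataset stands in for any numeric dataset here.
ds <- iris[, 1:4]

# Build a cluster model after setting the seed. kmeans() picks its
# initial centres at random, so the result depends on the seed.
set.seed(42)
m1 <- kmeans(ds, centers = 3)

# Rebuild the model later with the same seed.
set.seed(42)
m2 <- kmeans(ds, centers = 3)

# TRUE when both runs assign every observation to the same cluster.
identical(m1$cluster, m2$cluster)
```

With the seed reset before each build, both runs start from the same random initial centres and so follow the same search path.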