8.12 Train, Tune, and Test Datasets

The final data wrangling step is to partition the dataset into three separate datasets: a training dataset, a tuning dataset, and a test dataset. The training dataset will be used to fit a model to the data. The tuning dataset is used to tune the parameters of the model building process. Whilst these observations are not directly modelled they do guide the model fitting process. The test dataset is then only used to assess the performance of the final fitted and tuned model. This provides an unbiased estimate of the performance of the final model on new observations.

To build the datasets we use a random selection process to partition the dataset into three subsets. The first is a 70% random sample used for building the model (the training dataset), the second is a 15% random sample used for tuning the model (the tuning dataset), and the third is the remaining 15%, used to evaluate the performance of the model (the test dataset).
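
The sampling below assumes the usual setup from the earlier data wrangling: ds is the wrangled dataset, the magrittr and dplyr packages are attached, and nobs records the number of observations. A minimal sketch of that setup follows; the seed value is arbitrary and different seeds will produce different random partitions.

library(magrittr)   # Pipe operators %>% and %T>%.
library(dplyr)      # slice() and pull() for extracting rows and columns.

set.seed(42)        # Arbitrary seed so the random partition is reproducible.

nobs <- nrow(ds)    # Number of observations available for partitioning.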

PTR <- 0.7   # Proportion for training
PTU <- 0.15  # Proportion for tuning
PTE <- 0.15  # Proportion for testing

# Random 70% sample of the observation indices for training.
tr <- sample(nobs, PTR*nobs) %T>%
  {length(.) %>% print()}
## [1] 70
# A further 15% sample of the remaining indices for tuning.
tu <- nobs %>% seq_len() %>% setdiff(tr) %>% sample(PTU*nobs) %T>%
  {length(.) %>% print()}
## [1] 15
# The remaining 15% of the indices for testing.
te <- nobs %>% seq_len() %>% setdiff(tr) %>% setdiff(tu) %T>%
  {length(.) %>% print()}
## [1] 15
head(tr)
## [1]  49  65  25  74  18 100
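
Since tu and te are drawn only from indices not already selected, the three index sets should be disjoint and together cover every observation. A quick sanity check, which returns TRUE when the partitioning is correct:

# The combined indices, once sorted, should be exactly 1, 2, ..., nobs.
c(tr, tu, te) %>% sort() %>% identical(seq_len(nobs))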

Any model building we do will be based on the 70% training dataset. Our model may then be quite good at predicting these observations. The model's performance on the data on which it was trained will be a very optimistic (or biased) estimate of its true performance on other datasets. We might thus ask how the model will perform when we use it to predict outcomes for other, as yet unseen, observations.

The test dataset is a hold-out dataset in that it has not been used at all for building the model. When we apply the model to this dataset we would expect its performance to be somewhat lower (e.g., a higher error rate). This is what we will generally observe, and we will see this in the following sections.

The overall error rate measured on the training dataset will be seen to be less than the error rate calculated on the test dataset. The error rate (or any other performance measure) calculated on the test dataset is closer to what we will obtain in practice once we begin to use the model. It is an unbiased estimate of the true performance of the model.

We also record the actual target values and the risks. These will be used in the evaluation of the performance of the models.

target.tr <- ds %>% slice(tr) %>% pull(target)
target.tu <- ds %>% slice(tu) %>% pull(target)
target.te <- ds %>% slice(te) %>% pull(target)

risk.tr   <- ds %>% slice(tr) %>% pull(risk)
risk.tu   <- ds %>% slice(tu) %>% pull(risk)
risk.te   <- ds %>% slice(te) %>% pull(risk)
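
To illustrate the optimism of the training error discussed above, the following is a minimal sketch rather than the actual modelling code of later sections. It assumes the rpart package is installed, that target is a categorical variable (a factor), and that ds contains only the target and the input variables (otherwise restrict the columns first).

library(rpart)    # Recursive partitioning to build a decision tree.

# Fit a decision tree using only the training observations.
model <- ds %>% slice(tr) %>% rpart(target ~ ., data=.)

# Error rate on the training data: an optimistic (biased) estimate.
predict(model, ds %>% slice(tr), type="class") %>% {mean(. != target.tr)}

# Error rate on the hold-out test data: an unbiased estimate.
predict(model, ds %>% slice(te), type="class") %>% {mean(. != target.te)}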

