Data Science Desktop Survival Guide
by Graham Williams


Model Building

20200607 We now build, fit, or train a model. Most machine learning algorithms are available in R. We begin with a simple favourite, the decision tree algorithm, using rpart::rpart(). We record the choice of algorithm in the generic variables mtype (the type of the model) and mdesc (a human readable description of the model type).

mtype <- "rpart"
mdesc <- "decision tree"

The model is built from the training dplyr::slice() of the dataset, using dplyr::select() with tidyselect::all_of() to choose the variables. The training slice is identified by the row numbers stored in tr, and the variables by the column names stored in vars. This training dataset is piped on to rpart::rpart() together with the specification of the model to be built, as stored in form. Using generic variables allows us to change the formula, the dataset, the observations, and the variables used in building the model while retaining the same programming code. The resulting model is saved into the variable model.

ds %>%
  select(all_of(vars)) %>%
  slice(tr) %>%
  rpart(form, .) ->
model
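
The generic variables ds, vars, tr and form are defined elsewhere in the guide. As a minimal sketch of how the same pipeline can be run end to end, the following uses the small kyphosis dataset that ships with rpart as a stand-in for ds. The choice of dataset, the target name, the 70% training proportion and the random seed are all assumptions made for illustration and are not taken from the text.

```r
library(dplyr)   # select(), slice(), all_of() and the %>% pipe
library(rpart)   # rpart() and the kyphosis example dataset

# Hypothetical stand-ins for the generic variables used in the text.
ds     <- kyphosis                              # assumed dataset (81 observations)
target <- "Kyphosis"                            # assumed target variable
vars   <- c(target, "Age", "Number", "Start")   # column names to model with
form   <- formula(paste(target, "~ ."))         # target against all other vars

# Assumed 70% random training slice, stored as row numbers.
set.seed(42)
tr <- sample(nrow(ds), floor(0.7 * nrow(ds)))

# The same pipeline as in the text. The dot places the piped data
# in the data argument of rpart() rather than the first argument.
ds %>%
  select(all_of(vars)) %>%
  slice(tr) %>%
  rpart(form, .) ->
model

class(model)
```

Because only the generic variables change, swapping in a different dataset, formula or training slice leaves the pipeline itself untouched.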

To view the model, simply reference the generic variable model on the command line. This asks R to base::print() the model.

model

## n= 123722 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##  1) root 123722 25921 No (0.7904900 0.2095100)  
##    2) humidity_3pm< 71.5 104352 14394 No (0.8620630 0.1379370) *
##    3) humidity_3pm>=71.5 19370  7843 Yes (0.4049045 0.5950955)  
##      6) humidity_3pm< 83.5 11433  5331 No (0.5337182 0.4662818)  
##       12) wind_gust_speed< 42 6885  2533 No (0.6320988 0.3679012) *
##       13) wind_gust_speed>=42 4548  1750 Yes (0.3847845 0.6152155) *
##      7) humidity_3pm>=83.5 7937  1741 Yes (0.2193524 0.7806476) *
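
As a small aid to reading this output, the header line names the columns: for each node, n is the number of observations reaching it, loss is the number of those that do not belong to the predicted class yval, and yprob gives the class proportions (here in the order No, Yes). The sketch below simply re-does that arithmetic with counts copied from the printed tree. The variable names are made up for illustration.

```r
# Counts copied from the printed tree above (hypothetical variable names).
n_root <- 123722; loss_root <- 25921   # node 1, predicted class No
n_2    <- 104352; loss_2    <- 14394   # node 2, predicted class No
n_3    <- 19370                        # node 3

# loss / n is the proportion of observations that disagree with yval,
# which for these No nodes is the Yes proportion shown in yprob.
p_root <- round(loss_root / n_root, 4)
p_2    <- round(loss_2 / n_2, 4)
print(c(root = p_root, node2 = p_2))   # 0.2095 and 0.1379

# The two children of a split partition the observations of the parent.
print(n_2 + n_3 == n_root)             # TRUE
```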

This textual version of the model provides the basic structure of the tree. We present the details in Chapter 18. Different model builders will base::print() different information.

This is our first predictive model. Be sure to spend some time to understand and reflect on the knowledge that the model is exposing.

Copyright © 2000-2020 Togaware Pty Ltd. Creative Commons ShareAlike V4.