Data Science Desktop Survival Guide
by Graham Williams

Model Building

20200607 We now build a model, a step also referred to as fitting or training a model. Implementations of most machine learning algorithms are available in R. We begin with a simple favourite, the decision tree algorithm, using rpart::rpart(). We record the choice of algorithm in the generic variables mtype (the type of the model) and mdesc (a human readable description of the model type).

mtype <- "rpart"
mdesc <- "decision tree"

The model is built from a subset of the observations in the dataset ds. The training subset is identified by the row numbers stored in tr and the column names stored in vars. This training dataset is piped on to rpart::rpart() together with a specification of the model to be built, stored in form. Using generic variables here allows us to change the formula, the dataset, the observations and the variables used in building the model yet retain the same programming code. The resulting model is saved into the variable model.

# Keep the modelling columns and the training rows, then fit the tree.
ds %>%
  dplyr::select(all_of(vars)) %>%
  slice(tr) %>%
  rpart(form, .) ->
model
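
For readers who want to try the template end to end, the following is a minimal sketch of how the generic variables might be set up. It is not the setup used for the weather data in this chapter: the kyphosis dataset (shipped with rpart), the helper variable target, the 70% split and the seed are all assumptions made for illustration. It uses base R subsetting, which selects the same rows and columns as the pipeline.

```r
library(rpart)

mtype <- "rpart"
mdesc <- "decision tree"

# Stand-in dataset (assumption): kyphosis ships with the rpart package.
ds     <- kyphosis
target <- "Kyphosis"                      # hypothetical helper variable
vars   <- c(target, "Age", "Number", "Start")

# Model the target using all of the other selected variables.
form <- formula(paste(target, "~ ."))

# Row numbers of a random 70% training subset (seed chosen arbitrarily).
set.seed(42)
tr <- sample(nrow(ds), floor(0.7 * nrow(ds)))

# Base R equivalent of the select/slice pipeline.
model <- rpart(form, data = ds[tr, vars])
```

Changing ds, vars, tr or form changes what is modelled, while the final line that builds the model stays the same.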

To view the model, simply reference the generic variable model on the command line. This asks R to base::print() the model.

## n= 123722 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##  1) root 123722 26006 No (0.7898029 0.2101971)  
##    2) humidity_3pm< 71.5 104384 14403 No (0.8620191 0.1379809) *
##    3) humidity_3pm>=71.5 19338  7735 Yes (0.3999897 0.6000103)  
##      6) humidity_3pm< 83.5 11352  5369 No (0.5270437 0.4729563)  
##       12) rain_today=No 6560  2416 No (0.6317073 0.3682927) *
##       13) rain_today=Yes 4792  1839 Yes (0.3837646 0.6162354) *
##      7) humidity_3pm>=83.5 7986  1752 Yes (0.2193839 0.7806161) *

This textual version of the model provides the basic structure of the tree. We present the details in Chapter 18. Different model builders will base::print() different information. Here we see a brief overview of the model build.
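
As a rough guide to reading the printed columns: n is the number of training observations reaching a node, and loss is the number of those that do not belong to the node's predicted class (yval). With two classes, loss divided by n therefore gives the probability reported for the other class in (yprob). The sketch below copies the counts from the printed tree into a data frame (the names nodes, leaf and error are introduced here for illustration) and checks this arithmetic.

```r
# Counts copied from the printed tree: one row per node.
nodes <- data.frame(
  node = c(1, 2, 3, 6, 12, 13, 7),
  n    = c(123722, 104384, 19338, 11352, 6560, 4792, 7986),
  loss = c(26006, 14403, 7735, 5369, 2416, 1839, 1752),
  leaf = c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)
)

# Proportion of observations at each node that are not in the predicted class.
nodes$error <- nodes$loss / nodes$n
print(nodes)

# The terminal nodes partition the training data, so their counts
# add up to the count at the root node.
leaves_cover_root <- sum(nodes$n[nodes$leaf]) == nodes$n[1]
print(leaves_cover_root)
```

The error values can be compared against the (yprob) pairs in the printed tree: for each node the value corresponds to the class that was not predicted.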

This then is our first predictive model.

Copyright © 2000-2020 Togaware Pty Ltd. Creative Commons ShareAlike V4.