## 27.19 Kuhn Checklist

Max Kuhn, author of caret, developed a checklist and posted it to the R developers mailing list in January 2012. I have paraphrased some of the points here and embellished them a little with my own views, but they are mostly in sync with Kuhn’s.

• Extend the work of others and avoid redundancy. Reuse others’ functions, with due credit, to add any missing features.
• For a categorical model builder, ensure the target is a factor (like Yes/No) rather than an integer (like 1/0). The factor levels should be recorded in the resulting model object, and the stats::predict() function should return predicted classes as factors with the same levels in the same order. Support a type= argument to switch between predicted classes and class probabilities. Use type="prob" for probabilities.
• Implement a separate stats::predict() method named predict.class(), where class is the class of the object returned by the model builder. Do not use special functions like modelPredict().
• Provide both a formula interface, as in foo(y ~ x, data=ds), and a non-formula interface, as in foo(x, y), to the function. “Formula methods are really inefficient at this time for large dimensional data but are fantastically convenient. There are some good reasons to not use formulas, such as functions that do not use a design matrix (e.g., party::cforest()) or need factors to be handled in a non-standard way (e.g., Cubist::cubist()).”
• “Don’t require a test set when model building.”
• If not all variables are used in the resulting model, allow stats::predict() to be given only the required subset of variables rather than all the original variables, and reference variables by name rather than by position.
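
As a rough illustration of how several of these points fit together, here is a minimal sketch of a toy model builder. It is not Kuhn's code, nor that of any package: foo() is the placeholder name used above, and the nearest-centroid classifier, the distance-based pseudo-probabilities, and the stored fields (centroids, levels, vars) are all assumptions made for the example. It handles numeric predictors only.

```r
# Hypothetical toy model builder: a nearest-centroid classifier.
foo <- function(x, ...) UseMethod("foo")

# Non-formula interface: x holds numeric predictors, y is a factor.
foo.default <- function(x, y, ...) {
  stopifnot(is.factor(y))              # insist on a factor target, not 1/0
  x <- as.data.frame(x)
  # One row of predictor means per class.
  centroids <- do.call(rbind, lapply(levels(y), function(l)
    colMeans(x[y == l, , drop = FALSE])))
  rownames(centroids) <- levels(y)
  # Record the levels and variable names in the model object.
  structure(list(centroids = centroids,
                 levels    = levels(y),
                 vars      = colnames(x)),
            class = "foo")
}

# Formula interface: build the model frame, then delegate.
foo.formula <- function(x, data, ...) {
  mf <- model.frame(x, data)
  foo.default(mf[, -1, drop = FALSE], model.response(mf))
}

# A regular predict() method rather than a special modelPredict().
predict.foo <- function(object, newdata, type = c("class", "prob"), ...) {
  type <- match.arg(type)
  # Select the required variables by name: extra columns and
  # column order in newdata do not matter.
  x <- as.matrix(as.data.frame(newdata)[, object$vars, drop = FALSE])
  # Squared Euclidean distance from each row to each class centroid.
  d <- sapply(object$levels, function(l)
    rowSums(sweep(x, 2, object$centroids[l, ], "-")^2))
  d <- matrix(d, nrow = nrow(x), dimnames = list(NULL, object$levels))
  if (type == "prob") {
    # Illustrative pseudo-probabilities: softmax of negative distance.
    p <- exp(-(d - apply(d, 1, min)))
    return(as.data.frame(p / rowSums(p)))
  }
  # Predicted classes as a factor with the original levels and ordering.
  factor(object$levels[max.col(-d, ties.method = "first")],
         levels = object$levels)
}

# Usage: both interfaces, no test set needed to build the model.
model  <- foo(Species ~ Sepal.Length + Petal.Length, data = iris)
model2 <- foo(iris[, c("Sepal.Length", "Petal.Length")], iris$Species)
cl <- predict(model, iris)
pr <- predict(model, iris[, c("Petal.Length", "Sepal.Length")], type = "prob")
```

Because predict.foo() looks up columns through the names stored in the model, the second call works even though only two columns are supplied, and in a different order from the one used to build the model.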
