Confusion Matrix

The accuracy and the error rate are rather blunt measures of performance. They are a useful starting point for a sense of how well the model performs, but more detail is required. A confusion matrix allows us to review how well the model performs against each of the actual classes.

round(100*table(actual_te, predict_te, dnn=c("Actual", "Predicted"))
      /length(actual_te))
##       Predicted
## Actual No Yes
##    No  76   3
##    Yes 14   7
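
As a cross check, the overall accuracy and error rate can be recovered from this same cross tabulation: the diagonal cells (76% and 7% here) are the correct predictions, giving 83% accuracy and so a 17% error rate. A minimal sketch, assuming actual_te and predict_te as above and no missing predictions:

tbl <- table(actual_te, predict_te, dnn=c("Actual", "Predicted"))
# The diagonal cells count the correct predictions.
accuracy <- sum(diag(tbl))/sum(tbl)
error    <- 1 - accuracy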

This breakdown is useful since the consequences of a wrong prediction differ between the two kinds of error. For example, if it is incorrectly predicted that it will not rain tomorrow and I decide not to carry an umbrella with me, then the consequence is that I will get wet. We might experience this as a more severe consequence than the situation where rain is incorrectly predicted and so we unnecessarily carry an umbrella with us all day.
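
The two kinds of error can be quantified separately from the same table. Below is a sketch computing the false negative rate (rain predicted as no rain, the costly error above) and the false positive rate (the umbrella carried in vain). The names fnr and fpr are illustrative rather than from the original code, and the No/Yes factor levels are assumed as above.

tbl <- table(actual_te, predict_te)
# Proportion of actual Yes (rain) days that were predicted as No.
fnr <- tbl["Yes", "No"]/sum(tbl["Yes", ])
# Proportion of actual No days that were predicted as Yes.
fpr <- tbl["No", "Yes"]/sum(tbl["No", ])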

We again compare this to the performance on the training dataset, noting that the model performs better there, at least when predicting that it will not rain tomorrow.

round(100*table(actual_tr, predict_tr, dnn=c("Actual", "Predicted"))
      /length(actual_tr))

Notice in particular the false negative rate: these are the errors that carry the higher consequence, getting uncomfortably wet if it does actually rain. The performance as measured over the te dataset is likely to be more indicative of the actual model performance, and the false negative rate is the one we would rather minimize.

We will be generating confusion matrices quite regularly, based on the predicted classes (counting just those that are not missing, using base::is.na()) and the target classes. This is another candidate for wrapping up into a function, saving us from having to remember the detail of the command and from typing it out each time.

con <- function(predicted, actual)
{
  # Cross tabulate the actual classes against the predicted classes.
  tbl <- table(actual, predicted, dnn=c("Actual", "Predicted"))
  # Express the counts as percentages of the non-missing predictions.
  tbl <- round(100*tbl/sum(!is.na(predicted)))
  return(tbl)
}

We can simply call this function with the appropriate arguments to have the confusion matrix printed.

con(predict_tr, actual_tr)
##       Predicted
## Actual No Yes
##    No  76   3
##    Yes 14   7
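
Note the denominator in con(): we divide by the count of non-missing predictions rather than by length(actual), so observations without a prediction do not deflate the percentages. When there are no missing predictions, as here, applying con() to the te dataset reproduces the matrix computed at the start of this section:

con(predict_te, actual_te)
##       Predicted
## Actual No Yes
##    No  76   3
##    Yes 14   7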

