Data Science Desktop Survival Guide
by Graham Williams
The accuracy and error rate are rather blunt measures of performance. They are a good starting point for getting a sense of how good the model is, but more is required. A confusion matrix allows us to review how well the model's predictions match the actual classes, class by class.
round(100*table(actual_te, predict_te, dnn=c("Actual", "Predicted"))/length(predict_te))
This is useful since the consequences of a wrong decision differ depending on the decision. For example, if it is incorrectly predicted that it will not rain tomorrow and we decide not to carry an umbrella, then the consequence is that we will get wet. We might experience this as a more severe consequence than the situation where it is incorrectly predicted that it will rain and so we unnecessarily carry an umbrella with us all day.
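As a minimal sketch of how the two kinds of error can be read off a confusion matrix, we can build one from a small made-up set of observations. The vectors actual and predicted below are hypothetical stand-ins for the real test data, and each cell is expressed as a percentage of all observations.

```r
# Hypothetical toy data: ten days of actual outcomes and model predictions.
actual    <- factor(rep(c("No", "Yes"), times=c(6, 4)))
predicted <- factor(c(rep("No", 5), "Yes", "No", "No", "Yes", "Yes"))

# Cross-tabulate and express each cell as a percentage of all observations.
cm <- round(100*table(actual, predicted, dnn=c("Actual", "Predicted"))/length(predicted))
print(cm)

# Rows are the actual classes and columns are the predicted classes.
fn <- cm["Yes", "No"]  # Rain was not predicted but it rained: we get wet.
fp <- cm["No", "Yes"]  # Rain was predicted but it stayed dry: umbrella carried needlessly.
```

Here fn and fp are the two off-diagonal cells, and it is the first of these that corresponds to the more uncomfortable mistake in the umbrella example.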
We again compare this to the performance on the training dataset and note that the model performs better there, at least when predicting that it will not rain tomorrow.
round(100*table(actual_tr, predict_tr, dnn=c("Actual", "Predicted"))/length(predict_tr))
Notice that the false negative rate (the errors that have a higher consequence: getting uncomfortably wet if it does actually rain) is reduced from 14% to 13%. The performance as measured over the te dataset is likely to be more indicative of the actual model performance, and the false negative rate is one that we would rather minimize.
We will be generating confusion matrices quite regularly based on the predicted classes (counting just those that are not missing using base::is.na()) and the target classes. This is another candidate for wrapping up into a function to save us having to remember the command in detail and also to save us having to type out the command each time.
con <- function(predicted, actual)
{
  tbl <- table(actual, predicted, dnn=c("Actual", "Predicted"))
  tbl <- round(100*tbl/sum(!is.na(predicted)))
  tbl  # Return the table of percentages.
}
We can simply call this function with the appropriate arguments to have the confusion matrix printed.
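The following sketch illustrates such a call on made-up data. The function is restated so that the example runs on its own, and the vectors actual and predicted are hypothetical, with one missing prediction included to show why the divisor counts only the non-missing predictions.

```r
# The function described in the text, restated so this sketch is self-contained.
con <- function(predicted, actual)
{
  tbl <- table(actual, predicted, dnn=c("Actual", "Predicted"))
  tbl <- round(100*tbl/sum(!is.na(predicted)))
  tbl
}

# Hypothetical toy data: the fifth prediction is missing (NA).
actual    <- factor(c("No", "No", "No", "Yes", "Yes"))
predicted <- factor(c("No", "No", "Yes", "No", NA), levels=c("No", "Yes"))

# table() drops the observation with the missing prediction by default,
# so the percentages are taken over the four remaining observations.
cm <- con(predicted, actual)
print(cm)
```

Because both the table and the divisor ignore the missing prediction, the cells still sum to 100. Dividing by the full length of predicted would instead leave the percentages short of 100.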