Accuracy and Error Rate

Data Science Desktop Survival Guide by Graham Williams

From the two vectors predict_te and actual_te we can calculate the overall accuracy of the predictions over the test dataset. This is simply the number of times the prediction agrees with the actual class, divided by the size of the test dataset (which is the same as the length of actual_te).

acc_te <- sum(predict_te == actual_te, na.rm=TRUE)/length(actual_te)
round(100*acc_te, 2)

## [1] 83.65

Here we can see that the model has an overall accuracy of 83.65%. That is a relatively high accuracy for a typical model build.
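As a minimal sketch of the same calculation on data small enough to check by hand, assume two made-up vectors, toy_actual and toy_predict. These names and values are hypothetical and are not part of the dataset used above. Five of the six predictions agree with the actual class, so we expect an accuracy of 5/6.

```r
# Hypothetical toy data: six actual classes and six predictions.
toy_actual  <- c("yes", "no", "no",  "yes", "no", "no")
toy_predict <- c("yes", "no", "yes", "yes", "no", "no")

# Count the agreements and divide by the number of observations.
toy_acc <- sum(toy_predict == toy_actual, na.rm=TRUE)/length(toy_actual)
round(100*toy_acc, 2)

## [1] 83.33

# With no missing values the mean of the logical vector is the same number.
mean(toy_predict == toy_actual) == toy_acc

## [1] TRUE
```

The comparison returns a logical vector, and base::sum() counts each TRUE as 1, which is why summing the comparison gives the number of correct predictions.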

We can also calculate the overall error rate in a similar fashion. Some Data Scientists prefer to talk in terms of the error rate rather than the accuracy:

err_te <- sum(predict_te != actual_te, na.rm=TRUE)/length(actual_te)
round(100*err_te, 2)

## [1] 16.35

Thus our decision tree model has an overall error rate of 16.35%.
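The accuracy and the error rate reported above sum to 100%, which is what we expect when no comparison is missing. Because both calculations use na.rm=TRUE in the numerator but the full length in the denominator, a missing prediction would be counted as neither correct nor incorrect, and the two rates would then sum to less than 1. The following sketch illustrates this with made-up vectors (toy_actual and toy_predict are hypothetical), together with one possible alternative that restricts the calculation to the complete cases.

```r
# Hypothetical toy data with one missing prediction.
toy_actual  <- c("yes", "no", "no", "yes", "no")
toy_predict <- c("yes", "no", NA,   "no",  "no")

toy_acc <- sum(toy_predict == toy_actual, na.rm=TRUE)/length(toy_actual)
toy_err <- sum(toy_predict != toy_actual, na.rm=TRUE)/length(toy_actual)

# The missing prediction appears in neither numerator.
toy_acc + toy_err

## [1] 0.8

# One alternative (an assumption, not the approach used above):
# keep only the observations where both values are present.
ok     <- !is.na(toy_predict) & !is.na(toy_actual)
acc_ok <- sum(toy_predict[ok] == toy_actual[ok])/sum(ok)
err_ok <- 1 - acc_ok
c(acc_ok, err_ok)

## [1] 0.75 0.25
```

Which denominator is appropriate depends on whether a missing prediction should count against the model. The sketch only shows that the choice changes the reported figures.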

Notice also that we have now twice converted a proportion (generally a number between 0 and 1) into a percentage (generally a number between 0 and 100) by multiplying the proportion by 100 and then rounding it to 2 decimal places with base::round(). We will no doubt want to do this regularly (if we find percentages more quickly accessible than proportions), so it is a candidate for packaging up as a function. To do so we use base::function() and provide it with a single argument: the number we wish to convert to a percentage.

per <- function(n) { p <- round(100*n, 2); return(p) }

We can now use this as a convenience:

per(acc_te)

## [1] 83.65

per(err_te)

## [1] 16.35
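Since base::round() is vectorised, a function of this kind can also convert several proportions at once. As a sketch of one possible variation, the hypothetical per_digits() below adds a second argument for the number of decimal places, defaulting to 2. The name and the extra argument are assumptions for illustration. It is given a different name so that it does not replace per().

```r
# Hypothetical variant of per() with an optional number of decimal places.
per_digits <- function(n, digits=2) round(100*n, digits)

# A vector of proportions is converted in a single call.
per_digits(c(0.8365, 0.1635))

## [1] 83.65 16.35

# Round to whole percentages instead.
per_digits(0.8365, digits=0)

## [1] 84
```

The value of the last expression evaluated in a function is what the function returns, so the explicit return() is optional here.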

A model will typically appear more accurate on the training dataset, since that is the data from which it was built. To compare the two we can repeat the above calculations on the training dataset:

acc_tr <- sum(predict_tr == actual_tr, na.rm=TRUE)/length(actual_tr)
per(acc_tr)

## [1] 83.5

err_tr <- sum(predict_tr != actual_tr, na.rm=TRUE)/length(actual_tr)
per(err_tr)

## [1] 16.5

The overall accuracy over the training dataset is 83.5% compared to the 83.65% accuracy calculated over the test dataset. Similarly the overall error rate is 16.5% on the training dataset compared to the test error rate of 16.35%. The difference for this small dataset is small. Note that here the accuracy is marginally lower on the training dataset, whereas we would ordinarily expect the training dataset to give the more optimistic measure.
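Having now repeated the same pair of calculations for two datasets, the calculations themselves are also a candidate for packaging up as a function. The following is a minimal sketch, assuming a hypothetical helper named rates() and made-up toy vectors standing in for the training and test predictions. None of these names or values come from the model above.

```r
# Hypothetical helper: overall accuracy and error rate as percentages.
rates <- function(predicted, actual) {
  n <- length(actual)
  c(accuracy = round(100*sum(predicted == actual, na.rm=TRUE)/n, 2),
    error    = round(100*sum(predicted != actual, na.rm=TRUE)/n, 2))
}

# Made-up vectors: 7 of 8 correct on training, 3 of 4 correct on test.
toy_actual_tr  <- c("yes", "no", "no", "yes", "no", "no",  "yes", "no")
toy_predict_tr <- c("yes", "no", "no", "yes", "no", "yes", "yes", "no")
toy_actual_te  <- c("no", "yes", "no", "no")
toy_predict_te <- c("no", "no",  "no", "no")

# One row per dataset, one column per measure.
toy_rates <- rbind(training = rates(toy_predict_tr, toy_actual_tr),
                   test     = rates(toy_predict_te, toy_actual_te))
toy_rates
```

For these toy vectors the training row holds 87.5 and 12.5 and the test row holds 75 and 25, which places the two datasets side by side for comparison.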

Copyright © 2000-2020 Togaware Pty Ltd. Creative Commons ShareAlike V4.