Evaluation metrics for machine learning

This page provides an introduction to evaluation metrics for machine learning.

1. Commonly used metrics

(1) (Overall) Accuracy

The (overall) accuracy is the ratio of the number of correctly classified test samples to the total number of test samples.

(2) Average/mean accuracy

The accuracy of class N is the ratio of the number of correctly classified test samples labeled N to the total number of test samples labeled N. The mean (average) accuracy is the average of these per-class accuracies over all classes.

The following figures come from an example confusion matrix with six classes and 2000 test samples per class, and show how the per-class accuracy and the overall accuracy are calculated:

Accuracy of class N: 1971 / (1971 + 19 + 1 + 8 + 0 + 1) = 98.55%

Overall accuracy: (1971 + 1940 + 1891 + 1786 + 1958 + 1926) / (2000 + 2000 + 2000 + 2000 + 2000 + 2000) = 95.60%
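
The arithmetic above can be reproduced with a short Python sketch. It uses only the counts quoted in the text (the six diagonal entries and 2000 samples per class). The variable names are illustrative, and it assumes that class N is the first class in the list.

```python
# Correctly classified samples per class (the diagonal of the example
# confusion matrix) and the number of test samples per class, as quoted above.
correct_per_class = [1971, 1940, 1891, 1786, 1958, 1926]
samples_per_class = [2000] * 6

# Per-class accuracy: correct samples of a class / all samples of that class.
per_class_accuracy = [c / n for c, n in zip(correct_per_class, samples_per_class)]

# Overall accuracy: all correct samples / all samples.
overall_accuracy = sum(correct_per_class) / sum(samples_per_class)

# Mean accuracy: unweighted average of the per-class accuracies.
mean_accuracy = sum(per_class_accuracy) / len(per_class_accuracy)

print(f"{per_class_accuracy[0]:.2%}")  # 98.55%
print(f"{overall_accuracy:.2%}")       # 95.60%
```

Because every class has the same number of test samples here, the mean accuracy equals the overall accuracy. With imbalanced classes the two values generally differ.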

(3) Precision

(4) Recall

(5) F1 score

The F1 score (also called F-score or F-measure) is a measure of a test’s accuracy. It combines the precision p and the recall r of the test into a single score: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive).

The F1 score is the harmonic mean of the precision and recall, F1 = 2pr / (p + r). It reaches its best value at 1 (perfect precision and recall) and its worst at 0.
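
As a minimal sketch of these definitions, the function below computes precision, recall, and F1 from counts of true positives, false positives, and false negatives. The function name, the example counts, and the choice to return 0.0 when a denominator is zero are assumptions made for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from binary classification counts.

    tp: correct positive results
    fp: negatives wrongly returned as positive
    fn: positives the classifier missed
    """
    # Precision: correct positives / all positives returned by the classifier.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: correct positives / all samples that should have been positive.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall (0.0 if both are zero,
    # which is a convention chosen here, not part of the definition).
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.8 0.6667 0.7273
```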

(6) Confusion matrix

2. Other metrics

(1) Cohen’s kappa

Cohen’s kappa is a statistic that measures inter-rater agreement for qualitative (categorical) items. It is a score that expresses the level of agreement between two annotators on a classification problem. 

It is generally thought to be a more robust measure than a simple percent-agreement calculation, because κ takes into account the possibility of agreement occurring by chance. There is some controversy surrounding Cohen’s kappa because indices of agreement are difficult to interpret.
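
Cohen’s kappa is commonly written as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator’s label frequencies. The sketch below implements that formula in plain Python. The function name, the example labels, and the handling of the degenerate case p_e = 1 are assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' labels over the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal frequencies, summed over all labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label for every item; kappa is
        # undefined (0/0). Returning 1.0 is an assumption made in this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical annotations of four items by two annotators.
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.5
```

In this example the annotators agree on 3 of 4 items (p_o = 0.75), but half of that agreement is expected by chance (p_e = 0.5), so κ is 0.5 rather than 0.75.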

(2) Matthews correlation coefficient

The Matthews correlation coefficient (MCC), introduced by biochemist Brian W. Matthews in 1975,[1] is used in machine learning as a measure of the quality of binary (two-class) classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes.[2] The MCC is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 indicates a prediction no better than random, and −1 indicates total disagreement between prediction and observation.

Some implementations, such as scikit-learn's matthews_corrcoef, support both binary and multi-class labels. Only in the binary case can the coefficient be expressed directly in terms of true and false positives and negatives.
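
For the binary case, the MCC is usually computed from the four confusion-matrix counts as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The following sketch implements that formula. The function name and example counts are hypothetical, and returning 0.0 when the denominator is zero is a common convention assumed here, not part of the definition.

```python
import math


def matthews_corrcoef_binary(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # A whole row or column of the confusion matrix is empty, so the
        # coefficient is undefined; 0.0 is the convention assumed here.
        return 0.0
    return numerator / denominator


print(matthews_corrcoef_binary(tp=5, tn=5, fp=0, fn=0))  # 1.0  (perfect prediction)
print(matthews_corrcoef_binary(tp=0, tn=0, fp=5, fn=5))  # -1.0 (total disagreement)
```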

(3) ROC curve



References and further reading list: