This page provides an introduction to evaluation metrics for machine learning.

**1. Commonly used metrics**

**(1) (Overall) Accuracy**

The (overall) accuracy is the ratio of the number of correctly classified test samples to the total number of test samples.

**(2) Average/mean accuracy**

The mean accuracy of class *N* is the ratio of the number of correctly classified test samples labeled as *N* to the total number of test samples labeled as *N*.

Example calculation of class mean accuracy and overall accuracy from a confusion matrix with 6 classes and 2,000 test samples per class:

Mean accuracy of class *N*: *1971 / (1971 + 19 + 1 + 8 + 0 + 1) = 1971 / 2000 = 98.55%*

Overall accuracy: *(1971 + 1940 + 1891 + 1786 + 1958 + 1926) / (2000 + 2000 + 2000 + 2000 + 2000 + 2000) = 11472 / 12000 = 95.60%*
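The same arithmetic can be reproduced in code. A minimal sketch (using NumPy; the class-*N* row is taken from the example above, and the helper `accuracies` is a hypothetical convenience function, not part of any library):

```python
import numpy as np

# Row of the example confusion matrix for class N (true label N, counts per predicted class);
# the correct predictions (1971) are assumed to sit in the first position.
row_n = np.array([1971, 19, 1, 8, 0, 1])

# Mean accuracy of class N: correct predictions for N divided by all samples labeled N.
mean_acc_n = row_n[0] / row_n.sum()
print(f"Mean accuracy of class N: {mean_acc_n:.2%}")  # 98.55%

# For a full confusion matrix cm (rows = true classes, columns = predicted classes):
#   overall accuracy    = trace(cm) / total number of samples
#   per-class accuracy  = diagonal entries / row sums
def accuracies(cm: np.ndarray):
    per_class = np.diag(cm) / cm.sum(axis=1)
    overall = np.trace(cm) / cm.sum()
    return overall, per_class
```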

**(3) Precision**

**(4) Recall**

**(5) F1 score**

**F1 score** (also called **F-score** or **F-measure**) is a measure of a test’s accuracy. It considers both the precision *p* and the recall *r* of the test to compute the score: *p* is the number of correct positive results divided by the number of all positive results returned by the classifier, and *r* is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive).

The F1 score is the harmonic mean of precision and recall: *F1 = 2pr / (p + r)*. It reaches its best value at 1 (perfect precision and recall) and worst at 0.
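A minimal sketch of computing precision, recall, and the F1 score with scikit-learn (the toy labels below are invented purely for illustration):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy binary labels, illustrative only.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)   # correct positives / all predicted positives
r = recall_score(y_true, y_pred)      # correct positives / all actual positives
f1 = f1_score(y_true, y_pred)         # harmonic mean: 2 * p * r / (p + r)

print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# For multi-class problems, pass average='macro', 'micro', or 'weighted'.
```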

**(6) Confusion matrix**
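A minimal sketch of building a confusion matrix with scikit-learn (the labels are illustrative only):

```python
from sklearn.metrics import confusion_matrix

# Illustrative multi-class labels.
y_true = ["cat", "dog", "cat", "bird", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "bird", "dog", "cat"]

# Rows are true classes, columns are predicted classes (ordered as in `labels`).
cm = confusion_matrix(y_true, y_pred, labels=["bird", "cat", "dog"])
print(cm)
# [[1 1 0]
#  [0 1 1]
#  [0 0 2]]
```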

**2. Other metrics**

**(1) Cohen’s kappa**

Cohen’s kappa is a statistic that measures inter-rater agreement for qualitative (categorical) items. It is a score that expresses the level of agreement between two annotators on a classification problem.

It is generally thought to be a more robust measure than simple percent agreement calculation, as *κ* takes into account the possibility of the agreement occurring by chance. There is controversy surrounding Cohen’s kappa due to the difficulty in interpreting indices of agreement.
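A minimal sketch of computing Cohen's kappa with scikit-learn (the two annotators' labels below are invented for illustration):

```python
from sklearn.metrics import cohen_kappa_score

# Labels assigned by two annotators (or a model vs. ground truth); illustrative only.
annotator_a = [0, 1, 1, 0, 1, 0, 1, 1]
annotator_b = [0, 1, 0, 0, 1, 0, 1, 0]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")  # 1.0 = perfect agreement, 0 = chance-level agreement
```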

**(2) Matthews correlation coefficient**

The **Matthews correlation coefficient** (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications; it was introduced by biochemist Brian W. Matthews in 1975 [1]. It takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even when the classes are of very different sizes [2]. The MCC is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 is no better than random prediction, and −1 indicates total disagreement between prediction and observation.

Binary and multi-class labels are supported; only in the binary case does the MCC relate directly to true and false positives and negatives.
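A minimal sketch of computing the MCC with scikit-learn (labels are illustrative; `matthews_corrcoef` also accepts multi-class labels):

```python
from sklearn.metrics import matthews_corrcoef

# Illustrative binary labels.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

mcc = matthews_corrcoef(y_true, y_pred)
print(f"MCC: {mcc:.3f}")  # +1 = perfect prediction, 0 = random, -1 = total disagreement
```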

**(3) ROC curve**
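A minimal sketch of computing an ROC curve and its AUC with scikit-learn (the labels and scores below are invented; `roc_curve` expects continuous scores or probabilities for the positive class, not hard labels):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative binary ground truth and predicted scores for the positive class.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # false/true positive rates per threshold
auc = roc_auc_score(y_true, y_score)
print(f"AUC: {auc:.3f}")
```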

**References and further reading list:**

https://en.wikipedia.org/wiki/Cohen%27s_kappa

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html

https://en.wikipedia.org/wiki/Matthews_correlation_coefficient

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html

https://en.wikipedia.org/wiki/Receiver_operating_characteristic

https://en.wikipedia.org/wiki/Precision_and_recall

https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

https://en.wikipedia.org/wiki/F1_score

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

https://en.wikipedia.org/wiki/Confusion_matrix

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html