Classification accuracy is an example of a single-number evaluation metric: You run your classifier on the dev set (or test set), and get back a single number telling you what fraction of examples it classified correctly. According to this metric, if classifier A obtains 97% accuracy, and classifier B obtains 90% accuracy, then we judge classifier A to be superior.
In contrast, Precision and Recall[3] are not a single-number evaluation metric: they give two numbers for assessing your classifier. Having multiple-number evaluation metrics makes it harder to compare algorithms. Suppose your algorithms perform as follows:
Classifier | Precision | Recall |
---|---|---|
A | 95% | 90% |
B | 98% | 85% |
Here, neither classifier is obviously superior, so it doesn’t immediately guide you toward picking one.
During development, your team will try a lot of ideas about algorithm architecture, model parameters, choice of features, etc. Having a single-number evaluation metric such as accuracy allows you to sort all your models according to their performance on this metric, and quickly decide what is working best.
If you really care about both Precision and Recall, I recommend using one of the standard ways to combine them into a single number. For example, one could take the average of precision and recall, to end up with a single number. Alternatively, you can compute the “F1 score,” which is a modified way of computing their average, and works better than simply taking the mean.[4]
Classifier | Precision | Recall | F1 score |
---|---|---|---|
A | 95% | 90% | 92.4% |
B | 98% | 85% | 91.0% |
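
To make the combination concrete, here is a minimal Python sketch (not from the text; the function and variable names are illustrative) that computes both the simple average and the F1 score from the Precision and Recall values in the table above, using the harmonic-mean formula from footnote [4]:

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall (footnote [4]):
    # F1 = 2 / (1/precision + 1/recall)
    return 2 / (1 / precision + 1 / recall)

# Precision and recall for classifiers A and B from the table above.
classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85)}

for name, (p, r) in classifiers.items():
    simple_mean = (p + r) / 2   # one possible single-number metric
    f1 = f1_score(p, r)         # usually works better than the simple mean
    print(f"{name}: mean={simple_mean:.1%}, F1={f1:.1%}")
# A: mean=92.5%, F1=92.4%
# B: mean=91.5%, F1=91.0%
```

The F1 values match the table, and because the harmonic mean penalizes a large gap between Precision and Recall more than the simple mean does, classifier B's 98%/85% split pulls its F1 down further than its plain average.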
Having a single-number evaluation metric speeds up your ability to make a decision when you are selecting among a large number of classifiers. It gives a clear preference ranking among all of them, and therefore a clear direction for progress.
As a final example, suppose you are separately tracking the accuracy of your cat classifier in four key markets: (i) US, (ii) China, (iii) India, and (iv) Other. This gives four metrics. By taking an average or weighted average of these four numbers, you end up with a single number metric. Taking an average or weighted average is one of the most common ways to combine multiple metrics into one.
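
As a minimal sketch of that idea (the per-market accuracies and weights below are made up for illustration, not taken from the text):

```python
# Hypothetical per-market accuracies and traffic weights (illustrative only).
accuracies = {"US": 0.97, "China": 0.95, "India": 0.92, "Other": 0.90}
weights    = {"US": 0.40, "China": 0.30, "India": 0.20, "Other": 0.10}

# Unweighted average: treat every market equally.
simple_average = sum(accuracies.values()) / len(accuracies)

# Weighted average: weight each market by, e.g., its share of traffic.
weighted_average = sum(weights[m] * accuracies[m] for m in accuracies)

print(f"simple average:   {simple_average:.1%}")    # roughly 93.5%
print(f"weighted average: {weighted_average:.1%}")  # roughly 94.7%
```

Either way, the four market-level metrics collapse into one number that you can use to rank models.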
FOOTNOTES:
[3] The Precision of a cat classifier is the fraction of images in the dev (or test) set it labeled as cats that really are cats. Its Recall is the percentage of all cat images in the dev (or test) set that it correctly labeled as a cat. There is often a tradeoff between having high precision and high recall.
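
For illustration, a minimal sketch of these two definitions, assuming the dev-set labels and the classifier's predictions are stored as simple Python lists (the names and data below are hypothetical):

```python
# Illustrative only: dev-set ground truth and classifier predictions,
# where 1 = "cat" and 0 = "not cat".
true_labels = [1, 1, 1, 0, 0, 1, 0, 1]
predictions = [1, 1, 0, 0, 1, 1, 0, 1]

predicted_cat  = sum(p == 1 for p in predictions)
actual_cat     = sum(t == 1 for t in true_labels)
true_positives = sum(p == 1 and t == 1 for p, t in zip(predictions, true_labels))

precision = true_positives / predicted_cat  # fraction of predicted cats that really are cats
recall    = true_positives / actual_cat     # fraction of actual cats the classifier found

print(f"precision={precision:.0%}, recall={recall:.0%}")  # precision=80%, recall=80%
```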
[4] If you want to learn more about the F1 score, see https://en.wikipedia.org/wiki/F1_score. It is the “harmonic mean” between Precision and Recall, and is calculated as 2/((1/Precision)+(1/Recall)).