Classification problems predict discrete values. In this post, we will analyze a classification metric, the F1 score, which uses the harmonic mean as its backbone. Before we start, please make sure you understand terms like false positive, true negative, recall, and precision. If you need a refresher, check out my previous post; if you are comfortable with them, keep reading.
So Why Use F1-Score?
Implementing and interpreting both precision (P) and recall (R) for every classification model is cumbersome; the F1 score condenses the two into a single number. One might think that simply taking the arithmetic mean of the two metrics would solve the problem. Let's work through an example together and then revisit that question.
Let us analyze a transaction dataset. We have 100 transactions, 5 of which are fraudulent, and a naive model that flags every single transaction as fraud. Its confusion-matrix counts would be as follows:
1. TP = 5
2. FP = 95
3. TN = 0
4. FN = 0
Based on the above numbers, precision = TP / (TP + FP) = 5 / 100 = 0.05 (5%), and recall = TP / (TP + FN) = 5 / 5 = 1.0 (100%).
Now if we take the arithmetic mean of the two, we get a model that scored 52.5%, which makes a model that learned nothing look merely mediocre. Hopefully, you can see why this is misleading.
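As a quick sanity check, here is that arithmetic in Python (a minimal sketch using the counts above):

# Confusion-matrix counts for a model that flags every transaction as fraud
tp, fp, tn, fn = 5, 95, 0, 0

precision = tp / (tp + fp)  # 5 / 100 = 0.05
recall = tp / (tp + fn)     # 5 / 5   = 1.0

print((precision + recall) / 2)  # 0.525 -- looks like an "average" model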
Let us take a different approach: instead of the arithmetic mean, we can try the harmonic mean. The harmonic mean is defined as the "reciprocal of the arithmetic mean of the reciprocals". Another way to look at it: when X and Y are on opposite ends, e.g. 5 and 100, the arithmetic mean (52.5) is pulled toward the bigger value, while the harmonic mean (2 × 5 × 100 / 105 ≈ 9.52) is pulled toward the smaller one. Let me show you some math, and hopefully that paints a clearer picture.
from scipy import stats
import numpy as np

# Base sample: a set of values that all sit in roughly the same range
values = [37, 35, 40, 35, 29, 51, 31, 33, 34, 30, 29, 33, 37, 36]

print('Harmonic mean of: ' + str(values))
print(stats.hmean(values))
print()

# Prepend one large outlier and compare both means
print('Adding a large number(1000)')
print('Harmonic mean of: ' + str([1000] + values))
print(stats.hmean([1000] + values))
print()
print('Regular Mean of: ' + str([1000] + values))
print(np.mean([1000] + values))
print()
print()

# Prepend one small outlier and compare both means
print('Adding a small number(12)')
print('Harmonic mean of: ' + str([12] + values))
print(stats.hmean([12] + values))
print()
print('Regular Mean of: ' + str([12] + values))
print(np.mean([12] + values))
Harmonic mean of: [37, 35, 40, 35, 29, 51, 31, 33, 34, 30, 29, 33, 37, 36]
34.300503611618986
Adding a large number(1000)
Harmonic mean of: [1000, 37, 35, 40, 35, 29, 51, 31, 33, 34, 30, 29, 33, 37, 36]
36.66071950232791
Regular Mean of: [1000, 37, 35, 40, 35, 29, 51, 31, 33, 34, 30, 29, 33, 37, 36]
99.33333333333333
Adding a small number(12)
Harmonic mean of: [12, 37, 35, 40, 35, 29, 51, 31, 33, 34, 30, 29, 33, 37, 36]
30.51940326329871
Regular Mean of: [12, 37, 35, 40, 35, 29, 51, 31, 33, 34, 30, 29, 33, 37, 36]
33.46666666666667
You can see that across the three examples above, the harmonic mean stayed between roughly 30 and 37 even after we added 1000 to the list, while the regular mean got inflated to 99.33. In short, the harmonic mean behaves like a weighted average that favors the smaller numbers and largely ignores big outliers. The F1 score is simply the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
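Plugging our fraud-detection numbers into this formula shows the difference (a quick sketch reusing the precision and recall computed earlier):

precision, recall = 0.05, 1.0

f1 = 2 * (precision * recall) / (precision + recall)
print(f1)  # ~0.095 -- the harmonic mean correctly exposes the useless model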
Hopefully, you now see why the harmonic mean is better suited than the regular mean for combining precision and recall. Let us compute the F1 score, then.
from sklearn.metrics import f1_score

# y_test and predicted come from the model trained in my previous post
f1s = f1_score(y_test, predicted, average='micro')
f1s
0.9842442146725751
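A word of caution about the average parameter: with average='micro', scikit-learn computes the F1 score from global TP/FP/FN counts, which for a single-label problem works out to plain accuracy, so on an imbalanced dataset it can look deceptively high. Here is a minimal sketch with made-up labels for the all-fraud model from earlier, illustrating the difference:

from sklearn.metrics import f1_score

# Hypothetical labels: 5 fraud cases out of 100 transactions (1 = fraud)
y_true = [0] * 95 + [1] * 5
y_pred = [1] * 100  # a model that flags everything as fraud

print(f1_score(y_true, y_pred, average='binary'))  # ~0.095, dominated by the low precision
print(f1_score(y_true, y_pred, average='micro'))   # 0.05, equal to accuracy here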
Hmm, not the best demonstration of the F1 score, but I will surely update this post once I find a better dataset to show it off. I have tried to explain all the terms that help you understand the F1 score, starting from the confusion matrix through precision and recall. Let me know if you have any questions.
Note: I am using the same dataset I used in my previous blog post about classification metrics.