In deep learning, one of the tradeoffs we consider when developing algorithms is that between precision and recall.
Precision and recall are a simple yet useful way to measure the quality of predictions. Let’s create a classification scenario and see how we can apply precision and recall numbers, along with an F Score, to determine how effective my algorithm is.
Let’s say we are predicting the movement of a market price, whereby we predict y = 1 for the price increasing and y = 0 for the price decreasing. Traders are using my predictions in their strategies, so it is very important that I give them reliable data.
In order to do this, my model takes in thousands of market features and outputs the probability of the price increasing. This value is my h(x) value - the result of my hypothesis given the input x.
Furthermore, I may wish to introduce bias into my hypothesis to suit both bullish traders and bearish traders.
Bullish traders only want to bet on the market increasing, and only when they are very confident, so my hypothesis could be set to predict a positive result only when it is 80% or more confident that the price will appreciate.
This bias will minimise the traders’ losses, but will also result in some missed opportunities. My algorithm may output a 65% probability of the price increasing, and the price may indeed go on to rise, but I will have classified this as a negative prediction because it falls below my 80% threshold.
In this case, I have produced a false negative: the predicted class was negative (price decreasing), but the actual class was positive, as the price actually increased.
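As a minimal sketch of that thresholding step (the function name and the 0.8 default are illustrative, not from any particular library):

```python
# Hypothetical sketch: convert the model's probability output h(x) into a
# class prediction using the bullish-biased 0.8 threshold described above.
def classify(h_x: float, threshold: float = 0.8) -> int:
    """Predict 1 (price increase) only if the model is at least `threshold` confident."""
    return 1 if h_x >= threshold else 0

# A 65% probability falls below the threshold and is classified as negative,
# even if the price later rises - producing a false negative.
print(classify(0.65))  # -> 0
print(classify(0.83))  # -> 1
```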
Using this terminology, we can break our results into 4 categories:
Result | Actual class 1 | Actual class 0
---|---|---
Predicted class 1 | True Positive | False Positive
Predicted class 0 | False Negative | True Negative
By storing the number of true positives, false positives, true negatives and false negatives derived from my model’s predictions, I can then calculate my precision and recall values.
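As a rough sketch of how those four counts could be tallied from paired predictions and actual outcomes (the list and function names are illustrative):

```python
def confusion_counts(predictions, actuals):
    """Tally true/false positives/negatives from parallel lists of 0/1 labels."""
    tp = fp = tn = fn = 0
    for pred, actual in zip(predictions, actuals):
        if pred == 1 and actual == 1:
            tp += 1
        elif pred == 1 and actual == 0:
            fp += 1
        elif pred == 0 and actual == 1:
            fn += 1
        else:  # pred == 0 and actual == 0
            tn += 1
    return tp, fp, tn, fn
```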
Precision - the proportion of positive predictions that were correct.
Calculation: true positives / number of predicted positives.
Recall - the proportion of actual positive cases that were correctly identified.
Calculation: true positives / number of actual positives.
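Continuing the sketch above, both values follow directly from the four counts:

```python
def precision(tp: int, fp: int) -> float:
    """True positives divided by all predicted positives."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp: int, fn: int) -> float:
    """True positives divided by all actual positives."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0
```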
Both precision and recall output a value between 0 and 1. What we would like is both a high precision and a high recall, but this is very rarely the case. By testing a range of models and plotting their precision and recall values, you can see the shape of the tradeoff curve and get an idea of the balance you should be aiming for.
One way to tell if our algorithm is biased towards a positive class is if we have a very low precision, but a very high recall.
For example, if we are in a bullish market where the price increases almost every day, I could just return 1 for every prediction (without running my neural network at all) and potentially get a better precision than if I used my prediction model!
By saying y = 1 on every prediction, without further processing, I will get a perfect recall, but a lower precision, as there will inevitably be some false positives.
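For instance, using the helper functions sketched above with some made-up labels for ten bullish days:

```python
# Hypothetical data: the price actually rose on 8 of 10 days.
actuals     = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
predictions = [1] * len(actuals)  # naively predict y = 1 every time

tp, fp, tn, fn = confusion_counts(predictions, actuals)
print(precision(tp, fp))  # 0.8 - dragged down by the 2 false positives
print(recall(tp, fn))     # 1.0 - every actual increase was "caught"
```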
This leads us on to the final piece of our quality testing, where we use our precision and recall values to calculate an F Score - a score between 0 and 1 that measures how effective our algorithm is.
The term F Score, also referred to as the F1 score, is simply the favoured term the deep learning field has adopted for this calculation, and the most popular form is:
F Score = 2 * (P * R) / (P + R)
This calculation, where P is our precision and R is our recall, takes biased hypotheses into account: a hypothesis with a very low precision or a very low recall will produce a low score. What we are after is a high F Score; the closer to 1, the better our hypothesis.
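A minimal helper following the formula above, applied to the always-predict-1 baseline from earlier (the numbers are the made-up ones from that sketch):

```python
def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision (p) and recall (r)."""
    return 2 * (p * r) / (p + r) if (p + r) > 0 else 0.0

print(f_score(0.8, 1.0))  # ~0.889 for the always-predict-1 baseline above
```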
The F Score is the harmonic mean of precision and recall, and should not be confused with accuracy.
It is very common to apply this methodology to many prediction models with various weightings, to ultimately test which one produces the most reliable and accurate predictions.