AI in Finance: how to finally start to believe your backtests

Dear all,

Hope these article links are useful as a reference before conducting backtesting.


Thanks for the links. They show how easily one falls into the curve-fitting trap. This quote is worth repeating: “We rely on a single outcome of infinity from a tremendously complex system to test a trading algorithm, which is insane by itself.”

We’ve seen much more complex strategies, with multiple stocks and many more factors, fail miserably out of sample. This is a BIG problem for us since it leads to user cancellations.

I also really like the Matthews correlation coefficient, or MCC, which the author seems to rely on heavily to evaluate the strategies. The confusion matrix (what a great name) is straightforward to compute when you have actual vs. predicted values. It’s also very intuitive, with values from -1 to +1. We’ll definitely add it as one of the statistics in our upcoming AI/ML factors.

MCC for Ranking

It would be interesting to calculate the MCC coefficient for our ranking systems. We have very little right now to evaluate a ranking system. Mostly it’s just the annualized performance of the ranks grouped in buckets. Users simply alter weights to achieve a “better looking” bucketized performance (btw, we’re working on improving this right now, so this comes at the perfect time).

But how would we calculate the confusion matrix with ranks and no target to aim at? What is a True/False Positive/Negative for a rank? Perhaps we can calculate multiple MCCs, which can help two-fold:

  • Measure the accuracy of a rank
  • Determine the ideal rebalance frequency for a strategy

I’ll illustrate with an example where we calculate three MCCs for 1w, 4w, 13w:

  1. For every rank data point (a stock's rank on a particular date) we calculate the future 1w, 4w, 13w performance relative to the benchmark.
  2. We then populate the 1w, 4w, 13w confusion matrices as follows:
    • Rank > 80
      • True Positive if stock outperforms
      • False Positive if stock underperforms
    • Rank < 20
      • True Negative if stock underperforms
      • False Negative if stock outperforms
    • Rank between 20 and 80
      • Throw away
  3. We calculate MCCs for 1w, 4w, 13w.

With these three MCCs you could determine if a) the ranking system is accurate and b) for which time horizon. You would then set the portfolio strategy rebalance frequency to match the highest MCC.
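The steps above can be sketched in a few lines of Python. This is a minimal illustration, not P123’s implementation: the function name and data points are hypothetical, and it assumes low ranks contribute true/false negatives so that the MCC is defined.

```python
import math

def rank_mcc(points, hi=80, lo=20):
    """MCC for a ranking system over one horizon.

    points: list of (rank, excess_return) pairs, where excess_return
    is the stock's future return relative to the benchmark.
    Ranks between lo and hi are thrown away, per the scheme above.
    """
    tp = fp = tn = fn = 0
    for rank, excess in points:
        if rank > hi:                  # predicted winner
            if excess > 0: tp += 1     # it outperformed
            else:          fp += 1
        elif rank < lo:                # predicted loser
            if excess < 0: tn += 1     # it underperformed
            else:          fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical 1w data: (rank, 1w excess return vs. benchmark)
data = [(90, 0.02), (95, 0.01), (85, -0.01),   # 2 TP, 1 FP
        (10, -0.03), (5, -0.02), (15, 0.01),   # 2 TN, 1 FN
        (50, 0.05)]                            # thrown away
print(round(rank_mcc(data), 4))  # 0.3333
```

Running the same data against 4w and 13w excess returns would give the other two MCCs.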

Naturally, this being financial data full of noise, the bar for a good MCC would be quite low. Perhaps even values as small as 0.05 can create winning strategies.

And, since not everyone wants to short, another refinement is to calculate MCCs for long, short and long/short systems.




I am glad that you find the links, especially the MCC, interesting and potentially useful.

This is the first time I have learned about the MCC, from these links, and I agree that its simple range (+1 to -1) makes it suitable for P123, both as one of the statistics in the upcoming AI/ML factors and for evaluating the existing ranking systems.

Your line of thinking of using the MCC to measure the accuracy of a rank and determine the rebalance frequency makes sense.

As I am also new to the MCC, I have nothing more to suggest at this point.

Perhaps there are other P123 members who can contribute to this further.


We’ve seen much more complex strategies, with multiple stocks and many more factors, fail miserably out of sample. This is a BIG problem for us since it leads to user cancellations.

@Marco: I have ideas about some tweaks to the site that will reduce oos failures.

My problem is that my ideas don’t generate excitement. People don’t realize the extent of the problem, and/or how my ideas help solve it.

I think that my track record and credibility are pretty good, even if people don’t always understand my thinking.

If you are seriously interested (and it sounds like you might be motivated), then I can work on presenting them to you. Otherwise, I wouldn’t want to spend the time working on it only to not see it come to fruition.


For now, let’s limit this discussion to 1w future performance.

Are you proposing summarizing all the future performance into one MCC value (perhaps augmented with a box plot)? That is, an MCC that covers multiple years.

Or computing multiple MCCs over time? Maybe on a weekly basis and presenting that as a time-series?

I think I would prefer the latter. For the compounding game that we’re playing, time-series behavior can provide much insight. I’ve learned that from the simulation rolling test tool.

From some quick Google research, it appears that the MCC is identical to the Pearson correlation coefficient estimated for two binary variables. In that case, the MCC should be widely accepted here.
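That identity is easy to check numerically. A quick sketch with made-up binary vectors (both functions written out so nothing is hidden in a library):

```python
import math

def mcc(y_true, y_pred):
    """MCC from the four confusion-matrix cells."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

y_true = [1, 1, 0, 1, 0, 0, 1, 0]  # actual outcomes (made up)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # predicted classes (made up)
print(round(mcc(y_true, y_pred), 6), round(pearson(y_true, y_pred), 6))  # 0.5 0.5
```

The two numbers agree for any pair of binary vectors, which is the phi-coefficient identity.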

I’m not sure how the ML system will work but from the example provided by @marco I can imagine that predicted values are ranks (or predicted ranks) from 0 to 1, while the actual values are binary outcomes (outperforms, underperforms).

Then we can treat this as an imbalanced classification problem (similar to modelling probability of default in banks) and simplify things a bit.

Positive if the stock outperforms by X%: assign 1.
Negative if the stock does not outperform by X%: assign 0.

We compare predicted ranks with the binary outcome. Only a threshold for X is needed. Then we can use metrics such as ROC AUC or the cumulative accuracy profile.
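For reference, ROC AUC can be computed without any library via the Mann-Whitney rank formula. A sketch with hypothetical ranks and outcomes (it assumes no tied scores):

```python
def roc_auc(y_true, scores):
    """AUC via the Mann-Whitney U statistic (no ties assumed)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # Sum of 1-based sort ranks of the positive observations
    rank_sum = sum(r + 1 for r, i in enumerate(order) if y_true[i] == 1)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical: predicted ranks in [0, 1] vs. binary outperformance
y = [0, 0, 1, 0, 1, 1]
ranks = [0.10, 0.45, 0.40, 0.25, 0.80, 0.95]
print(roc_auc(y, ranks))  # 8/9: one negative is ranked above one positive
```

An AUC of 0.5 is random; the threshold X only affects the binary labels, not the formula.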

In addition, I would like to see the VIF (variance inflation factor) for selected models, and maybe some sort of optimiser which decreases the max VIF while keeping ROC AUC at a similar level.
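VIF is just 1/(1 - R²) from regressing each factor on the others. A sketch with numpy and made-up factor data, where one factor is deliberately built as a near-copy of the sum of the other two:

```python
import numpy as np

def vifs(X):
    """Variance inflation factor for each column of X (n_samples x n_factors)."""
    out = []
    n, k = X.shape
    for j in range(k):
        y = X[:, j]
        # Regress factor j on an intercept plus all the other factors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + b + 0.05 * rng.normal(size=200)  # nearly collinear with a and b
X = np.column_stack([a, b, c])
print([round(v, 1) for v in vifs(X)])    # all three show very large VIFs
```

A common rule of thumb treats VIF above 5-10 as a sign of problematic multicollinearity, which is what an optimiser like the one proposed above would push down.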



I am about to reveal my ignorance, so maybe you folks can provide me with links to more info (*), but I am really bothered by the paper on the Matthews correlation coefficient. The paper gives an example of a predictor model that gets 91 out of 100 predictions right, and 9 wrong, and yet the rating given to the prediction system is 0.14, which as I interpret it is only slightly better than flipping a coin. Really? If I had a gambling game and a predictor model that was right 91% of the time, then I would assume I could make a killing, assuming the payoff versus the cost of each bet were equal. So what am I missing here?


(*) Such links may be to @Jrinne’s comment about the Oracle of the Matrix, where he says to a bewildered investor, “There, there, you don’t have to worry your little head over these statistics. Here, have a cookie.” (Mr. @Jrinne, sir, I could not find that posting to get the exact wording, so apologies if I misremembered it.)

Can someone explain to me how this is more helpful than out-of-sample t-stats? It appears a lot of information is being lost using MCC for what amounts to simply measuring the mean squared error of predicted stock returns.


That’s a good question. I played around with redistributing the number of correct predictions. In one case, I set TP=46 and TN=45. So the number of correct predictions is still 91. The MCC is now 0.91.

If this is applied to a ranking system, that system had better be able to discern not just true positives but also true negatives. Admittedly, that’s still an incomplete picture.

Yeah, the paper example is poor. It depends on what field you are using it for. For example, in medicine Type II errors (False Negatives) are very bad.

Let’s take a realistic example of what a good MCC value is for investing. Say you trade 420 times, going long on positive signals and short on negative signals, with this confusion matrix:

TP = 110
TN = 110
FP = 100
FN = 100

Win/Loss = (TP+TN)/(FP+FN) = 220/200 = 1.1

A 1.1 Win/Loss is very good, wouldn’t you agree?

And the MCC is only 0.0476.
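Plugging those numbers into the MCC formula confirms the point:

```python
import math

tp, tn, fp, fn = 110, 110, 100, 100

win_loss = (tp + tn) / (fp + fn)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(round(win_loss, 2), round(mcc, 4))  # 1.1 0.0476
```

So a win/loss ratio that a trader would happily take corresponds to an MCC barely above zero.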

Perhaps because outliers are a big problem in finance? Or simply that too much science in finance leads to overconfidence which leads to disasters?

Simplifying always feels like the right solution. Case in point, everyone loves the Piotroski F-score, and it’s just a simple addition of 9 binary conditions.

@pitmaster Thanks. We definitely need to add more insights to our ranking performance tests.


Sorry - not tracking: How does this statistic reduce the outlier problem in live trading? Also, how does this statistic solve the problem of overconfidence?

I think it gives a more balanced view. For example, if the rank performance tells you the top decile did a 20% annualized return and the bottom decile did 5%, you’d think you have a terrific long-short strategy and leverage up. But what if the top decile was due to a curve fit of some outsized winners?

I think there’s a tendency in P123 to use a much larger number of buckets (like 200) and zero in on the very top bucket (the top 0.5%), since you will only be buying a handful of stocks. That’s curve fitting, and a P123 subscriber who does that will not last.

A statistic that looks at ranks above 80 and below 20 feels right. Maybe even the top 50 and bottom 50 percentiles. Perhaps a terrible score indicating pure randomness will give you pause.

Not sure what the right answer is. All I know is that most systems suffer from curve fitting.


If I may suggest: the r² of models across all buckets, on alpha, with out-of-sample testing. I think that is the best bet.

Using all statistics, not just a handful, is also likely better. More statistics pointing in the same direction should lead to better out of sample performance.

Hi Marco, thanks for sharing your thoughts. Another area of curve fitting I worry about is constraints within the screen itself. Most of my current screens select from a ranking system with Rank > 95 ±, but then add criteria on top. Any criteria I add to the rank are also a source of potential curve fitting. For example, a recent screen I worked on added 2 constraints to a ranking system: 1) require the stock to have underperformed over the past 3 yrs (FRank < 50) and 2) exclude net insider selling over the past 12 or 6 months.

The idea is to select from the top 5% of stocks in the ranking system. So I currently start with 107 stocks and of those:

  1. restrict to poor price performance over 3 yrs (FRank < 50), which currently leaves about 27. It’s possible this step is curve fitting despite the concept making sense. FRank < 50 performs about 7pp better than FRank > 50 for stocks selected by this ranking system (a value/quality-weighted ranking system). I’m almost sure that if the criterion of poor price performance over the past 3 yrs hadn’t back-tested well, I would’ve dropped the idea right there.
  2. Additionally, since it seemed like I had enough surviving companies, I also wanted to add a constraint to avoid net insider selling for companies with poor price performance. Doing that added about 3pp to the results and currently reduces surviving companies from 27 to 17. Again, what I’m doing seems reasonable as a way of probing the data, but on the other hand I am certain that if adding that constraint had reduced results I wouldn’t have used it. I naturally think insider selling should be worse than insider buying in these types of stocks, but I’m pretty sure I’ve seen papers indicating insiders might not be as good at timing purchases as we might expect. In any event, constraint inclusion is empirical, and I wonder how much curve fitting I added even in this simple example?

Even though I start with the top 5% of stocks in my ranking system and with a seemingly sensible idea, to a) buy stocks with depressed price action that b) have insider buying or don’t have insider selling, it’s very possible I’m doing some curve fitting.

I’m pretty sure I tested this on more stocks, maybe Rank > 90 and Rank > 80. I tend to start wide and narrow down at later stages. But the truth is that if it didn’t backtest well I would’ve discarded both ideas at the outset, and even though the screen makes sense I still worry about curve fitting. Every step of the way there’s the danger of curve fitting, but at the same time I have to expect there’s some signal in what I’m seeing and accept Type I and Type II error concerns.

Not sure where I’m going with this, but ultimately it seems I blend this in with several other approaches and hope I’m capturing more signal than noise despite a degree of curve-fitting that seems part of the process.


After I posted my previous post and had gone to bed but not gone to sleep (and not having read any of the subsequent posts), I thought of the flaw in my thinking. I will provide a medical example to show what’s wrong with my previous reasoning.

Suppose a medical researcher developed a medical test to detect Maple Syrup Urine disease, which frequently occurs in Mennonite communities. (You can read about it in the delightful book When a Gene Makes You Smell Like a Fish by Lisa Seachrist Chiu.)

Suppose that the researcher goes to an Old Mennonite community and chooses 100 test subjects. Suppose also that of the sample, 95 test subjects actually have the genetic condition while 5 do not. The researcher runs his test, finding 90 of the actual cases and missing the other 5. The test also falsely implicates 4 others as having the condition who do not, and correctly diagnoses the remaining 1 test subject as not having the condition. He concludes (using my own faulty reasoning) that his test method is a great success.

Suppose that he decides to test 100 other subjects. In this second sample, suppose 4 have the condition, 96 do not. The researcher finds that his test correctly reports that 4 (3.8) subjects have the condition, falsely reports that 77 (76.8) subjects have the condition, and correctly reports that 19 (19.2) do not have the condition. So 4+77 subjects are told they have the condition, when only 4 actually do.
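Running both samples through the MCC makes the point concrete. A quick check, with the second sample’s counts rounded to whole subjects; note how close the first sample’s score lands to the 0.14 quoted from the paper:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom

# First sample: 90 TP, 1 TN, 4 FP, 5 FN -- 91% "correct"
print(round(mcc(90, 1, 4, 5), 3))   # 0.135
# Second sample: 4 TP, 19 TN, 77 FP, 0 FN
print(round(mcc(4, 19, 77, 0), 3))  # 0.099
```

Despite the impressive-sounding 91% accuracy, the MCC sees that the test has almost no ability to identify the negative class, and the second sample shows what that costs when the condition becomes scarce.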

If the two examples were two different runnings of two different populations of a trading system, and the first run gives the results in the development phase (where the condition to be exploited is rich), and the second run gives the results of using the trading system where the condition is scarce, the trader is likely to be greatly disappointed in the results.

So now I have a better understanding of the Matthews correlation coefficient and its results. Thanks to everyone else for your thoughtful posts.

I will close with this story. I formerly worked with a man at Raytheon in Dallas, Texas, which bought Texas Instruments’ Defense Division in 1997. He originally started work with Texas Instruments when it was a relatively small company, before its big growth. He invested a large portion of his income in TI stock, which proved a big win for him, so he was a multimillionaire by the time he retired. He and I would sometimes talk about investing in the stock market, and whenever I suggested or made an investment he thought was dumb, he would quote his dad, who told him, “I buy you books, I send you to school, and what do you do? You eat the book covers.” Well, my previous post was like eating the book covers.


That would be great. I’ll send you an email.

Great. I started to work on it. It will take some time.

Marco, I sent you an email with some ideas.

Interesting thread. Thank you @ustonapc and @marco for introducing the MCC to the forum, and also the idea of using classifiers. People will have to decide whether to use a classifier or a regressor if they want to do machine learning with P123’s AI/ML or use the DataMiner downloads.

I wish to add one new idea about classifiers to this thread. For classifiers you generally have 2 options for the predictions. You can have a binary prediction for a class, with the output 1 or 0 (predicted to be in the class or not). You can also get a probability for a class, e.g., the probability of ticker XYZ outperforming the benchmark over the next week or month is 0.67.

My new point for this thread is that probability outputs can form the basis of a ranking system. You can rank the stocks according to the probability of the ticker outperforming, buying the stocks that have the greatest probability of outperforming. This should integrate well with what P123 does now, I think.

Bottom line for brevity: it works with results that are on par with ranking the predicted returns using regressors based on my limited experience.
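The idea of turning class probabilities into a ranking can be sketched in a few lines; the tickers and probabilities here are made up:

```python
# Hypothetical classifier output: P(outperform next period) per ticker
probs = {"AAA": 0.67, "BBB": 0.41, "CCC": 0.73, "DDD": 0.55, "EEE": 0.49}

# Rank stocks by predicted probability, highest first,
# and buy the top n -- exactly like ranking predicted returns.
ranked = sorted(probs, key=probs.get, reverse=True)
top2 = ranked[:2]
print(ranked)  # ['CCC', 'AAA', 'DDD', 'EEE', 'BBB']
print(top2)    # ['CCC', 'AAA']
```

Since only the ordering matters for the ranking, any monotone transform of the probabilities would produce the same portfolio.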

BTW, if you use probability predictions the MCC will no longer work directly. You may want to look at the Brier Skill Score as a metric when using the probability output (ask ChatGPT).
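For the curious, a sketch of the Brier Skill Score, which compares the Brier score of the forecasts against a naive base-rate reference forecast (all data made up):

```python
def brier(probs, outcomes):
    """Brier score: mean squared error of probability forecasts."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_skill_score(probs, outcomes):
    base_rate = sum(outcomes) / len(outcomes)        # "climatology" reference
    bs_ref = brier([base_rate] * len(outcomes), outcomes)
    return 1 - brier(probs, outcomes) / bs_ref       # 1 = perfect, <= 0 = no skill

outcomes = [1, 0, 1, 1, 0]            # did the stock outperform?
probs = [0.9, 0.2, 0.7, 0.6, 0.3]     # predicted probabilities
print(round(brier_skill_score(probs, outcomes), 3))  # 0.675
```

A score above zero means the probabilities carry more information than just always predicting the base rate.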

Here’s a way we could keep the returns; I haven’t tested it yet.
Would it make sense?

Here's the formula: performance_metric = (strategy_return - fp_return) / (max_return - fn_return)


  • strategy_return: mean return of the selected stocks (y_pred)
  • fp_return: mean return of the false positives (selected stocks not in the actual top 100)
  • max_return: mean return of the actual top 100 stocks (y_test)
  • fn_return: mean return of the false negatives (actual top 100 stocks not selected)

We get the fp and fn sets from the difference between y_pred and y_test.
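That formula can be sketched directly, assuming y_pred and y_test are sets of tickers and we have a dict of realized returns; all names and data here are hypothetical:

```python
def perf_metric(y_pred, y_test, returns):
    """(strategy_return - fp_return) / (max_return - fn_return), per the formula above."""
    mean = lambda s: sum(returns[t] for t in s) / len(s) if s else 0.0
    fp = y_pred - y_test            # selected, but not in the actual top
    fn = y_test - y_pred            # in the actual top, but not selected
    strategy_return = mean(y_pred)  # mean return of the selected stocks
    fp_return = mean(fp)
    max_return = mean(y_test)       # mean return of the actual top stocks
    fn_return = mean(fn)
    return (strategy_return - fp_return) / (max_return - fn_return)

returns = {"A": 0.10, "B": 0.08, "C": 0.02, "D": -0.01}
y_pred = {"A", "B", "C"}   # stocks the model selected
y_test = {"A", "B", "D"}   # actual top performers (hypothetical)
print(round(perf_metric(y_pred, y_test, returns), 3))  # 0.7
```

One thing to watch: the denominator can be zero or negative when the false negatives outperform the actual top set on average, so the metric would need a guard before being used in practice.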