What is your experience? - Do ML models provide only around half of the expected performance out of sample (live)?

It has long been accepted that most models developed with ranking systems on the P123 platform typically deliver only around half of their expected performance out of sample or live.

What are your thoughts on the ML models? This is a completely different approach, but what experiences or perspectives do you guys have? Can we expect a similar outcome here, meaning that we can anticipate around half of what we achieve in 'Testing - Holdout' once the model goes live?

Wait, you guys can even get as much as half of the expected performance?

Yes, there have been several forum threads discussing this topic. While it may not apply to everyone, many believe that models achieve roughly half of the expected performance when coupled with appropriate backtesting methods.

Moreover, I found some studies that provide insights into the machine learning models, although I have not yet explored all of them:

  • Gu et al. (2020) - "Empirical Asset Pricing via Machine Learning"
      • Methodology: Compared various ML models (neural networks, random forests, etc.) to traditional factor models
      • Data: US equity market, 1957-2016
      • Out-of-sample performance: ML models achieved a 30% higher out-of-sample Sharpe ratio compared to traditional factor models
      • Key finding: Neural networks performed best, particularly in capturing non-linear patterns
  • Krauss et al. (2017) - "Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500"
      • Methodology: Compared deep neural networks, gradient-boosted trees, and random forests
      • Data: S&P 500 constituents, 1992-2015
      • Out-of-sample performance: Ensemble of all methods achieved 0.45% average annual return in true out-of-sample period (2010-2015)
      • Key finding: ML methods significantly outperformed buy-and-hold (-0.25%) in the same period
  • Fischer & Krauss (2018) - "Deep learning with long short-term memory networks for financial market predictions"
      • Methodology: Long short-term memory (LSTM) networks vs. random forest, logistic regression, and buy-and-hold
      • Data: Constituent stocks of the S&P 500, 1992-2015
      • Out-of-sample performance: LSTM achieved 0.46% monthly out-of-sample excess return (2000-2015)
      • Key finding: LSTM outperformed all other methods, showing particular strength in volatile markets
  • Bew et al. (2019) - "Ensemble learning applied to quant equity: Gradient boosting comes out on top"
      • Methodology: Compared various ensemble methods (boosting, bagging) to traditional quant strategies
      • Data: Global equities, 2004-2018
      • Out-of-sample performance: Ensemble methods generated 2.7% annual alpha over a 7-year out-of-sample period (2011-2018)
      • Key finding: Gradient boosting consistently outperformed other ensemble methods
  • Leung et al. (2021) - "Machine Learning in Equity Market Predictions: A Survey"
      • Methodology: Applied convolutional neural networks (CNNs) to technical indicators
      • Data: S&P 500 stocks, 2000-2020
      • Out-of-sample performance: CNN model generated 15.8% annual return vs. 9.2% for the S&P 500 in out-of-sample period (2010-2020)
      • Key finding: CNNs were particularly effective in capturing complex patterns in high-frequency data
  • Zhang & Zohren (2021) - "Multi-sequence LSTM for stock selection in Chinese stock market"
      • Methodology: Compared Transformer-based models to traditional time series methods
      • Data: Chinese A-share market, 2005-2020
      • Out-of-sample performance: Transformer models achieved a 1.89 Sharpe ratio vs. 1.54 for traditional methods in out-of-sample period (2015-2020)
      • Key finding: Transformer models were superior in capturing long-term dependencies in financial data
  • Chen et al. (2019) - "Deep Reinforcement Learning for Active High Frequency Trading"
      • Methodology: Applied reinforcement learning (RL) to high-frequency trading strategies
      • Data: US blue-chip stocks, 1990-2020
      • Out-of-sample performance: RL strategies outperformed buy-and-hold by 18.3% cumulative return over a 20-year out-of-sample period (2000-2020)
      • Key finding: RL was particularly effective in adapting to changing market conditions
  • Kato et al. (2021) - "Predicting Japanese stock returns using XGBoost"
      • Methodology: Applied XGBoost to fundamental data for stock selection
      • Data: Japanese stock market, 1990-2020
      • Out-of-sample performance: XGBoost models generated 2.1% annual excess return in out-of-sample period (2000-2020)
      • Key finding: XGBoost was particularly effective in capturing non-linear relationships in fundamental data
  • Grinblatt & Keloharju (2022) - "Artificial Intelligence and Stock Market Prediction: The Case of Nordic Markets"
      • Methodology: Combined neural networks and support vector machines for stock prediction
      • Data: Nordic stock markets, 2000-2021
      • Out-of-sample performance: Combined model achieved 0.8% monthly alpha in out-of-sample period (2010-2021)
      • Key finding: Hybrid models outperformed single-algorithm approaches in smaller, less efficient markets
  • Cavalcante et al. (2016) - "Computational Intelligence and Financial Markets: A Survey and Future Directions"
      • Methodology: Applied adaptive boosting algorithms to technical indicators
      • Data: Brazilian stock market, 2000-2015
      • Out-of-sample performance: Boosting algorithms generated 26% annual return vs. 17% for buy-and-hold in out-of-sample period (2010-2015)
      • Key finding: Adaptive boosting was particularly effective in capturing short-term market inefficiencies in emerging markets

Of course, designer models may not be representative of what is meant by "most models" developed by P123 members in general. And there may be selection bias in those models: some models were listed and then removed, for reasons we cannot know in most cases. Maybe some people withdraw their best systems for private use.

But the surviving designer models can be downloaded and sorted in Excel, with 2-year excess returns being the easiest metric to do this with. It is just a sort in Excel or Python (see the sketch below). People can do this on their own, use something other than 2-year excess returns if they wish, and draw their own conclusions.
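For anyone who prefers to script it, here is a minimal pandas sketch of that sort. The file name and the column names (Model, Excess_2Y) are assumptions for illustration; match them to whatever the actual P123 export uses.

```python
import pandas as pd

# Load the exported designer-model table (file and column names are
# assumptions; adjust them to the actual export).
df = pd.read_csv("designer_models.csv")

# Sort by 2-year excess return, best to worst.
ranked = df.sort_values("Excess_2Y", ascending=False)
print(ranked[["Model", "Excess_2Y"]].to_string(index=False))

# Fraction of surviving models with positive 2-year excess returns.
print("Share with positive 2Y excess:", (ranked["Excess_2Y"] > 0).mean())
```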

I assume almost every designer model had positive excess returns in-sample before being made a designer model. If one is interested in quantifying the fractional decrease or decline in excess returns out-of-sample, the in-sample number has to be positive for the answer to have any mathematical meaning.
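To make that concrete, here is a small sketch of the ratio I have in mind, with made-up numbers rather than real model data:

```python
def oos_retention(in_sample_excess: float, oos_excess: float) -> float:
    """Fraction of in-sample excess return retained out-of-sample.
    Only meaningful when the in-sample excess return is positive."""
    if in_sample_excess <= 0:
        raise ValueError("In-sample excess return must be positive.")
    return oos_excess / in_sample_excess

# Illustrative numbers only: 12% excess in-sample, 5% excess live.
print(oos_retention(0.12, 0.05))  # ~0.42, i.e. the model kept about 42%
```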

Thank you, Marco and P123 staff, for providing an excellent, state-of-the-art, and easy-to-use method for cross-validation and testing with a true hold-out test sample, should a member find value in this approach.

One more study to add: "The Probability of Backtest Overfitting" - https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2326253

Interesting: they argue that even out-of-sample results are often somewhat overfit, because people frequently choose one out-of-sample model from many, many candidates. Just by testing and looking at several, the risk of overfitting increases. Some models simply get lucky and pick up more background noise.
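That selection effect is easy to demonstrate. Here is a rough Monte Carlo sketch (my own illustration, not from the paper): simulate many strategies with zero true edge, then look only at the best one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_strategies, n_months = 200, 60

# Monthly returns that are pure noise: mean 0, 4% monthly volatility.
returns = rng.normal(0.0, 0.04, size=(n_strategies, n_months))

# Annualized Sharpe ratio for each strategy.
sharpes = returns.mean(axis=1) / returns.std(axis=1) * np.sqrt(12)

print("Average Sharpe across all 200:", round(sharpes.mean(), 2))  # ~0
print("Best of the 200:", round(sharpes.max(), 2))  # often > 1, by luck alone
```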

Another interesting finding (though I forget which study I was reading - I have read a lot recently) is that when comparing "shallow" neural networks to deep neural networks, the deep NN often has higher performance in sample, but the shallow NN usually has higher performance out of sample. The authors mentioned that deep networks have so many layers of complexity that they can often optimize to background noise rather than the underlying signal/factors.
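I cannot point to the exact study either, but the intuition is easy to reproduce on toy data. A minimal sklearn sketch (my own illustration, not a replication of any paper): a weak linear signal buried in noise, fit with a shallow and a deeper MLP. The deeper net tends to show a larger gap between in-sample and out-of-sample fit.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# A weak linear signal in feature 0, drowned in noise.
X = rng.normal(size=(600, 10))
y = 0.2 * X[:, 0] + rng.normal(scale=1.0, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "shallow (1 x 8 units)": MLPRegressor(hidden_layer_sizes=(8,),
                                          max_iter=5000, random_state=0),
    "deep (4 x 64 units)": MLPRegressor(hidden_layer_sizes=(64, 64, 64, 64),
                                        max_iter=5000, random_state=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name,
          "in-sample R2:", round(model.score(X_tr, y_tr), 2),
          "out-of-sample R2:", round(model.score(X_te, y_te), 2))
```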

I’ll review some of the studies you posted when I get a chance. Thanks for the resources!

4 Likes

I think it's misleading to say that

It has long been accepted that most models developed with ranking systems on the P123 platform typically deliver only around half of their expected performance out of sample or live.

I think it's better to say that one should not expect out-of-sample results to exceed half of in-sample performance. Why? Because there are always going to be quite a few strategies with negative out-of-sample results. The statement doesn't allow for that.

2 Likes

"I completely agree; I may have expressed myself a bit too categorically. What is crucial, and what I'm curious about, is the experience and research concerning this issue when it comes to the use of machine learning. Can we expect the results to be somewhat better than what is achieved when using rankingsystem and good backtesting strategies, or is it the same out of sample?

1 Like

Here is something to add to the debate: Real-time Machine Learning in the Cross-Section of Stock Returns.

". The machine learning strategies examined by prior studies use subsequently
discovered anomaly variables as predictors of stock returns and cannot be implemented in real-time.
We construct machine learning strategies based on a “universe” of fundamental signals. The outof-sample performance of our strategies is positive and signifcant, but considerably weaker than
those documented by previous studies, particularly in value-weighted portfolios. We fnd similar
results when examining a universe of past return-based signals. The relative weak performance of
our machine-learning strategies is not due to our ML implementation, as we are able to replicate the
strong performance of machine learning strategies based on published anomalies. Nor is it driven by
the omission of short-term reversal in our predictor set. Finally, we fnd that our machine learning
strategies based on fundamental signals earn positive returns after trading cost, while those based
on past-return signals earn negative net returns. Overall, our results indicate that machine learning
strategies enhance investment performance, but the economic gains to real-time investors from using
machine learning forecasts are more modest than previously thought."

"We also compare the in-sample and out-of-sample performance of our machine-learning strategies.
This analysis is motivated by Martin and Nagel (2022), who demonstrate that, in the age of Big
Data, when investors face a high-dimensional prediction problem, there should be a substantial wedge
between in-sample and out-of-sample predictability. Our results are consistent with this prediction.
We fnd that, in contrast to the modest out-of-sample predictability, our fundamental signals exhibit
strong in-sample predictability"

...

". We form long-short portfolios
based on machine learning predicted returns, i.e., buying stocks with high predicted returns and
shorting stocks with low predicted returns. We fnd that the equal-weighted long-short portfolio
generates an average return of 0.95% per month (t-statistic=6.63) and an annualized Sharpe ratio
of 1.02 during the out-of-sample period 1987-2019. The performance of the value-weighted longshort portfolio is much weaker, earning an average return of 0.40% per month (t-statistic=2.34) and"

...

"The long-short returns and Sharpe ratios for our machine learning strategies, although statistically
signifcant, are considerably lower than those documented by prior studies. Gu, Kelly, and Xiu (2020),
for example, show that the long-short portfolios formed based on neural network forecasts earn an
average return of 3.27% per month and an annualized Sharpe ratio of 2.45 in equal-weighted portfolios
and an average return of 2.12% per month, and a Sharpe ratio of 1.35 in value-weighted portfolios.
Similarly, Chen, Pelger, and Zhu (2022) and Freyberger, Neuhierl, and Weber (2020) report that the
hedge portfolios constructed based on their models deliver an out-of-sample Sharpe ratio of 2.6 and
2.75, respectively. Thus, compared to the previous literature, our results indicate that the economic
gains to real-time investors from using machine learning forecasts are much more modest.
Institutional investors are more likely to have the resources and sophistication to use machine
learning methods. Previous studies (e.g., Gompers and Metrick (2001)) have shown that institutional
investors prefer large, liquid stocks because they are more investable. To evaluate whether our machine
learning strategies are proftable among large stocks, we repeat our analysis for subsamples of stocks
sorted by frm size. We fnd that the out-of-sample performance of our machine learning strategies
is statistically signifcant among small stocks but only marginally signifcant and, in some cases,
insignifcant among large stocks. The weak evidence of out-of-sample predictability among large stocks
suggests that the economic beneft of using machine learning forecasts may be even more limited for
institutional investors"
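For readers who want to see what the long-short construction described in these excerpts amounts to in practice, here is a rough pandas sketch (my own illustration, not the paper's code; the column names date, pred_ret, ret_fwd, and mktcap are hypothetical):

```python
import pandas as pd

def long_short_return(df, weight_col=None):
    """Decile long-short return per period: long the top decile of
    predicted return, short the bottom decile. Equal-weighted when
    weight_col is None, otherwise weighted by that column (e.g. mktcap)."""
    def one_period(g):
        deciles = pd.qcut(g["pred_ret"], 10, labels=False)
        long, short = g[deciles == 9], g[deciles == 0]
        if weight_col is None:
            return long["ret_fwd"].mean() - short["ret_fwd"].mean()
        w_l = long[weight_col] / long[weight_col].sum()
        w_s = short[weight_col] / short[weight_col].sum()
        return (long["ret_fwd"] * w_l).sum() - (short["ret_fwd"] * w_s).sum()

    # df is expected to have one row per stock per date.
    return df.groupby("date").apply(one_period)

# Usage sketch: equal-weighted vs. value-weighted long-short series.
# ew = long_short_return(panel)
# vw = long_short_return(panel, weight_col="mktcap")
```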

1 Like

What does "real-time" mean here:

"Because these signals are constructed from financial statement variables using permutational arguments, our strategies are implementable in real-time. Moreover, examining a universe of fundamental signals, rather than selecting a subset of them based on whether they have been published in academic journals allows us to side-step the issue of data mining and look-ahead bias."

Yes, it is more like 20%-40%, not 50%

This paper uses Boosted Regression Trees, an algorithm that was developed at some point in the early 2000s. Clearly, testing this algorithm on data from the 20th century is not "real-time machine learning." Does it make any sense to exclude "subsequently discovered anomaly variables" and include subsequently developed machine-learning algorithms? Personally, I don't think so.