Why such a large discrepancy between backtesting and real-life results?

I have tried to look more into the publicly available simulations and DM models, probably over 150 different models in all. A common thread is that most of them deliver only half or less of their backtested returns out of sample.

Does anyone have any thoughts on why this might be:

*Is it overfitting, unrealistic transaction costs (slippage), timing luck, a factor whose predictive ability has been arbitraged away, too strong a tilt toward a factor that has been out of favor, changes in the underlying universe, or overweighting of a cap size or industry?*

I looked into some of the studies that have been conducted, but they do not provide any obvious answer as to whether their findings transfer to DM models or publicly available simulations:

  1. “The Limits of Quantitative Modeling Techniques and the Future of Computational Trading” by Marcos Lopez de Prado, 2018 - The study discusses the limitations of backtesting in quantitative finance and highlights the importance of incorporating uncertainty into models. It argues that models that fail to account for uncertainty will produce inaccurate results.
  2. “Backtesting Trading Strategies with R” by Tim Trice, 2019 - The study examines the process of backtesting trading strategies using R and highlights the pitfalls of overfitting models to historical data. It recommends using out-of-sample testing to improve the accuracy of models.
  3. “Backtesting Investment Strategies with R” by Gero Weichert, 2018 - The study explores the challenges of backtesting investment strategies using R and discusses the importance of data quality and model assumptions. It recommends a thorough validation process to ensure the accuracy of models.
  4. “Backtesting and Simulation of High Frequency Trading Strategies” by D. Easley, M. Lopez de Prado, and M. O’Hara, 2012 - The study examines the difficulties of backtesting high-frequency trading strategies and highlights the importance of accurate market data. It recommends using more sophisticated market models to account for market dynamics.
  5. “The Risks of Backtesting” by Paul Barnes, 2018 - The study examines the risks associated with backtesting and argues that historical data is not always representative of today’s market conditions. It recommends incorporating economic intuition into models to improve accuracy.
  6. “The Pitfalls of Backtesting” by Richard Martin, 2017 - The study explores the limitations of backtesting and highlights the challenges of selecting appropriate historical data. It recommends using multiple datasets to validate models and account for market volatility.
  7. “The Dangers of Backtesting” by David Easley, 2017 - The study examines the risks associated with backtesting and argues that models can be overly optimistic due to overfitting. It recommends using robust statistical methods to account for uncertainty.
  8. “The Challenges of Backtesting” by Thomas Wiecki, 2018 - The study explores the challenges of backtesting and highlights the importance of data quality and model assumptions. It recommends using a variety of techniques to validate models and improve accuracy.
  9. “The Flaws of Backtesting” by Matthew Dixon and Kerem Tomak, 2018 - The study examines the flaws of backtesting and highlights the challenges of selecting appropriate historical data. It recommends using robust statistical methods to account for uncertainty and avoiding overfitting.
  10. “The Limitations of Backtesting” by Michael Dempster, 2018 - The study explores the limitations of backtesting and highlights the challenges of incorporating market dynamics into models. It recommends using more sophisticated models to account for changes in market conditions.

There’s a good discussion of this in *Is There a Replication Crisis in Finance?* by Theis Ingerslev Jensen, Bryan T. Kelly, and Lasse Heje Pedersen (available on SSRN). Here are the main points.

  1. You should look at alpha, not raw returns.
  2. You should use a Bayesian framework. “Our Bayesian framework shows that, given a prior belief of zero alpha but an OLS alpha (α̂) that is positive, then our posterior belief about alpha lies somewhere between zero and α̂. Hence, a positive but attenuated post-publication alpha is the expected outcome based on Bayesian learning, rather than a sign of non-reproducibility.”

See especially section 1 of the paper, which explains the Bayes case extremely well. Read it closely and carefully and you’ll be convinced that out-of-sample returns that match in-sample returns are wildly improbable.
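To make the shrinkage concrete, here's a minimal sketch of the normal-normal update the paper's argument rests on. The prior scale tau and the numbers at the bottom are illustrative assumptions, not figures from the paper.

```python
import numpy as np

# Conjugate normal-normal update: prior alpha ~ N(0, tau^2), and the
# backtest delivers an OLS estimate alpha_hat with standard error se.
def posterior_alpha(alpha_hat, se, tau):
    """Posterior mean and std of alpha under a skeptical N(0, tau^2) prior."""
    w = tau**2 / (tau**2 + se**2)  # weight placed on the data, between 0 and 1
    post_mean = w * alpha_hat      # always lies between 0 and alpha_hat
    post_std = np.sqrt(w) * se     # posterior is tighter than the data alone
    return post_mean, post_std

# Illustrative numbers: a 6% in-sample alpha with a 3% standard error and a
# 3% prior scale gets shrunk to 3% -- i.e., "half or less" out of sample.
print(posterior_alpha(0.06, 0.03, 0.03))
```

Note that the "half or less" pattern you observed is exactly what a skeptical prior predicts whenever the prior scale and the estimate's standard error are comparable.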

In addition to all of the above, publicly available simulations and DM models have no quality control and are not subjected to any rigor when backtesting, which makes them far more prone to overfitting. Every successful ranking system is overfit to some degree to the data available. But many of the systems you're looking at might have fallen apart over the period tested if backtested with slightly different parameters (e.g., a larger number of holdings, rebalancing a week earlier or later, a partial universe). See https://blog.portfolio123.com/?s=stress+test. If you were to look only at strategies that have been successfully stress-tested over an in-sample period of at least eight years, you might see better results.
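To give a feel for what such a stress test could look like in code, here's a rough sketch. `run_backtest` is a hypothetical stand-in for whatever backtesting engine you use, and the parameter grids and the 50% tolerance are arbitrary choices.

```python
from itertools import product

def stress_test(run_backtest, base_return, tolerance=0.5):
    """Re-run a strategy under perturbed settings; flag variants that fall apart.

    run_backtest is assumed to accept these keyword arguments and return an
    annualized return -- adapt it to your own engine.
    """
    holdings = [10, 15, 20, 25]                        # vary number of positions
    offsets_weeks = [-1, 0, 1]                         # shift the rebalance date
    universes = ["full", "first_half", "second_half"]  # split the universe

    failures = []
    for n, off, uni in product(holdings, offsets_weeks, universes):
        r = run_backtest(num_holdings=n, rebalance_offset_weeks=off, universe=uni)
        if r < tolerance * base_return:  # variant lost more than half the return
            failures.append({"holdings": n, "offset": off, "universe": uni, "return": r})
    return failures
```

A strategy that produces an empty failure list across all of these variants is at least not a pure artifact of one parameter choice.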

Again, though, the best explanation here is the Bayesian one.


YEEEEEESSSS Yuval!!! JASP works for this. Personally, I am working on getting my old MacBook Pro to run PyMC3 in my lifetime (a Python Bayesian library that Tony has helped me with). Thank you, Tony!
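Here's the kind of toy model I'm trying to get running, in case it's useful to anyone: simulated monthly excess returns with a skeptical prior centered at zero alpha. The prior scales are illustrative assumptions only.

```python
import numpy as np
import pymc3 as pm

# Simulated data: 10 years of monthly excess returns (toy numbers).
rng = np.random.default_rng(0)
excess_returns = rng.normal(0.004, 0.04, size=120)

with pm.Model():
    alpha = pm.Normal("alpha", mu=0.0, sigma=0.005)  # skeptical zero-alpha prior
    sigma = pm.HalfNormal("sigma", sigma=0.1)        # monthly volatility
    pm.Normal("obs", mu=alpha, sigma=sigma, observed=excess_returns)
    trace = pm.sample(2000, tune=1000, return_inferencedata=True)

# The posterior mean of alpha lands between zero and the sample mean,
# which is the shrinkage Yuval described above.
print(float(trace.posterior["alpha"].mean()))
```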

I think Yuval has mentioned other great concepts on this, like regression toward the mean and/or mean reversion. But actually they are not entirely separate ideas from Bayesian analysis, as Yuval well knows, I think.

Any type of shrinkage works when applicable, e.g., as found in ridge regression, I think. And the Ledoit-Wolf covariance matrix belongs in the discussion on the “Portfolio Risk Control brainwashing” thread. A quick sketch of both is below.
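Both are a couple of lines in scikit-learn for anyone who wants to play with them. The data below is simulated, and the dimensions and ridge penalty are arbitrary assumptions.

```python
import numpy as np
from sklearn.covariance import LedoitWolf
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)

# Ledoit-Wolf: shrink a noisy sample covariance matrix toward a structured target.
returns = rng.normal(0.0, 0.02, size=(250, 30))  # 250 days x 30 assets (toy)
lw = LedoitWolf().fit(returns)
print("Ledoit-Wolf shrinkage intensity:", round(lw.shrinkage_, 3))

# Ridge: shrink noisy regression coefficients toward zero.
X = rng.normal(size=(250, 10))                   # toy factor exposures
y = X @ rng.normal(size=10) + rng.normal(size=250)
ridge = Ridge(alpha=10.0).fit(X, y)              # larger alpha = more shrinkage
print("Ridge coefficients:", ridge.coef_.round(2))
```

Both pull noisy sample estimates toward a simpler target, which is the same intuition as the Bayesian prior above.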

Thank you for the feedback.

I will look into the Bayesian framework. As I understand the method so far, it seems to require a lot of ongoing work: with each batch of new data, the investor's prior distribution (which represents our initial assumptions about the probability of a given outcome) has to be updated via Bayes' theorem.
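As far as I can tell, though, the update itself is only a few lines once it is written down. Here is my rough sketch for a normal prior, assuming a known observation noise sigma for simplicity: each period's posterior simply becomes the next period's prior.

```python
# Sequential Bayesian updating with a normal prior and known noise (assumed).
def update(prior_mu, prior_var, obs, sigma=0.04):
    obs_var = sigma**2
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mu = post_var * (prior_mu / prior_var + obs / obs_var)
    return post_mu, post_var

mu, var = 0.0, 0.005**2                # skeptical prior: zero monthly alpha
for r in [0.01, -0.02, 0.015, 0.005]:  # toy stream of new monthly returns
    mu, var = update(mu, var, r)       # yesterday's posterior is today's prior
print(mu, var**0.5)
```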

Thank you for the link to your stress test post. I don’t think there are many publicly available simulations that would pass such a stress test.
