Stock picking with machine learning


I don’t understand why you are using such a long (3-year) embargo. I thought that an embargo equal to the planned trading frequency, which is typically the horizon of the future price target used in training, would be enough. I understand that using k-fold with factors like a 3-year average ROI will cause a small amount of data leakage, but the factor importance of these is small in my models.

I’m not trying to be critical. You’re obviously ahead of me on the curve; I’m just trying to understand what you are doing.

Hi Bob,

I think there are several good ways to answer that starting with what you might need to know most.

  1. You probably do not need an embargo. You can probably just use k-fold cross-validation or time-series validation if you prefer that method (or both). I use either or both at times.
  2. I think what you describe (a horizon equal to the planned trading frequency) is exactly right if you are talking about the target. An embargo is a different thing. But what you are doing is what I would use as a target, if that is what you are thinking of, so you are on the right track there.
  3. Just in case you want a direct answer to your question, as clearly as I am able to make it: using an embargo can be considered simply a modification of k-fold cross-validation. It was developed by de Prado and is discussed in his book, Advances in Financial Machine Learning.

So the term used in the book is “information leakage.” I struggled with this for the longest time. It was VERY DIFFICULT for me.

The advantage of the way ChatGPT does it is that it explains it in several different ways. Perhaps, one will resonate with your present experience. I am happy to try to expand on this but ChatGPT simply says it better:

"Marcos López de Prado, in his work on financial machine learning, emphasizes the importance of using an embargo period when performing k-fold cross-validation in financial modeling. An embargo period refers to a time gap between the training set and the testing set.

Here are the reasons why an embargo period is important:

  1. Non-Overlapping Samples: Financial time series data is often autocorrelated, meaning that observations close in time tend to be similar. If a model is trained on a dataset that includes information very close in time to the test set, it can inadvertently “peek” into the future. This leads to overfitting as the model may just be learning the noise specific to the dataset rather than any true underlying patterns.

(Note: This first reason is the reason most similar to what is in de Prado’s book. It is, perhaps, the most difficult to understand and he manages to make it even more difficult, IMHO)

  2. Out-of-Sample Testing: By including an embargo period, we ensure that the model is truly tested on out-of-sample data which it has not seen before. This is crucial for assessing the model’s ability to generalize to new, unseen data.

  3. Avoiding Look-Ahead Bias: Look-ahead bias occurs when a model uses information that would not have been available at the time of prediction. An embargo period helps to prevent this by ensuring that there is a clear separation between the training data and the data used to simulate the “future” where predictions will be made.

  4. Market Regime Changes: Financial markets can go through different regimes or periods where the underlying behavior of assets can change significantly. An embargo can help to ensure that the model is robust to these shifts by not allowing the model to become too finely tuned to a specific regime that is present in the training data.

In practice, the implementation of an embargo period means leaving out a gap between the end of the validation dataset and the beginning of the training dataset during the cross-validation process. The length of the embargo period can be determined based on the autocorrelation structure of the data or based on domain knowledge of the asset or instrument being modeled."

BTW, the embargo probably does not need to be that long if you end up using it. Three years is pretty long and was chosen more as a convenience (it leaves out one DataFrame from the DataMiner downloads). Hope that helps some.
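
The gap described in the quote above can be sketched in a few lines of plain Python. This is only a minimal illustration, not de Prado's code and not what the DataMiner workflow does: the function name `embargoed_kfold` and the `embargo` and `purge` parameters are hypothetical, and it assumes rows are sorted by time with one row per period. Training rows are dropped if they sit within `purge` rows before a test block or `embargo` rows after it.

```python
def embargoed_kfold(n_samples, n_splits=5, embargo=0, purge=0):
    """Yield (train_idx, test_idx) lists for time-ordered rows 0..n_samples-1.

    Each test fold is one contiguous block. A training row is excluded when it
    lies within `purge` rows before the block or `embargo` rows after it.
    (Hypothetical helper for illustration only.)
    """
    if not 2 <= n_splits <= n_samples:
        raise ValueError("need 2 <= n_splits <= n_samples")
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for k in range(n_splits):
        # Spread any remainder over the first `extra` folds.
        stop = start + base + (1 if k < extra else 0)
        lo = max(0, start - purge)           # first row excluded from training
        hi = min(n_samples, stop + embargo)  # first row allowed back in
        train = [i for i in range(n_samples) if i < lo or i >= hi]
        test = list(range(start, stop))
        yield train, test
        start = stop


# Example: 20 periods, 4 folds, 2-period embargo, 1-period purge.
for train, test in embargoed_kfold(20, n_splits=4, embargo=2, purge=1):
    print("test rows", test[0], "to", test[-1], "| training rows kept:", len(train))
```

Setting `embargo=0` and `purge=0` gives ordinary (unshuffled) k-fold, which makes it easy to compare results with and without the gap.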




Thanks for the definition of your embargo. I’m going to have to start using ChatGPT; it actually does seem to come up with some clear definitions, along with some off-the-wall answers. I have de Prado’s AiFML book, but his explanations in chapter 12 are not the easiest to understand. The Hudson & Thames YouTube video “Cross-Validation in Finance and Backtesting” and QuantInsti’s blog post with Python code examples, “Cross Validation in Finance: Purging, Embargoing, Combinatorial”, were helpful for me. But I still haven’t gotten an ML model using it working yet, and like you I find a lot of this very difficult!

Thanks again.


Hi Jim,

What would the return be without the ML, just classic factor investing with equal weights?


Hi Carsten,

I do not have a comparable study of that using the same cross-validation method or the same periods.

Equal-weighting my factors seems to do pretty well. But it seems so random to me. The weight of growth factors, for example, is totally dependent on the number of growth factors I have looked at. If I keep adding growth factors, when is it too much weight on growth? Maybe I force myself to spend equal time looking at each type of factor?

So I don’t usually use it for a comparison anymore. But it did pretty well when I looked at it. It was not the best model when I looked at it.
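
To make the point about growth weight concrete, here is a small arithmetic sketch. The factor names and categories are made up for illustration and are not anyone's actual ranking system; `category_weights` is a hypothetical helper. With equal weight per factor, a category's share of the total is simply its factor count divided by the total count, so every extra growth factor shifts weight toward growth.

```python
from collections import Counter


def category_weights(factor_to_category):
    """Share of total weight per category when every factor is weighted equally."""
    counts = Counter(factor_to_category.values())
    total = len(factor_to_category)
    return {category: n / total for category, n in counts.items()}


# Hypothetical factors, for illustration only.
factors = {
    "eps_growth": "growth",
    "sales_growth": "growth",
    "earnings_yield": "value",
    "roe": "quality",
}
print(category_weights(factors))

# Adding one more growth factor raises the growth share without any
# deliberate decision to overweight growth.
factors["fcf_growth"] = "growth"
print(category_weights(factors))
```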



Hi Jim,

Let me put it in other words…

I guess you build a ranking system that you later download for ML.
Do you know how much you get if you import the ranking system into a simulation?



If I understood your question, here is a walk-forward test of what I am doing now, including selecting factors with an algorithm as part of the training to reduce look-ahead bias:

[Image: Screenshot 2024-02-11 at 3.58.50 PM]

I do not want to get into a competition with my funded out-of-sample results. But I looked at the statistical significance of my median-performing model this weekend. Let me give you the results without a lot of comment.

If you are a fan of using Bayesian statistics to avoid p-hacking, then the result is not quite in the category of “worth mentioning” with the limited data. It is what it is.

As a reminder, according to Jeffreys’ scale the result is still “not worth more than a bare mention,” but the Bayes factor (BF) is close to three. Time will tell what happens with more out-of-sample data:

  • 1 < BF < 3: Not worth more than a bare mention
  • 3 < BF < 10: Substantial evidence
  • 10 < BF < 30: Strong evidence
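
For anyone who wants to apply these bands to their own results, a tiny lookup sketch follows. It encodes only the three bands listed above; the function name `jeffreys_label` is hypothetical, and treating each band as including its lower bound (so a BF of exactly 3 counts as “substantial”) is an assumption, since the list uses strict inequalities.

```python
# Upper bounds and labels for the three bands quoted in the post.
JEFFREYS_BANDS = [
    (3.0, "Not worth more than a bare mention"),
    (10.0, "Substantial evidence"),
    (30.0, "Strong evidence"),
]


def jeffreys_label(bayes_factor):
    """Return the band label for a Bayes factor, or None if outside 1 < BF < 30."""
    if bayes_factor <= 1.0:
        return None  # below the bands listed in the post
    for upper, label in JEFFREYS_BANDS:
        if bayes_factor < upper:
            return label
    return None  # 30 or above: not covered by the bands listed in the post


print(jeffreys_label(2.9))
```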



Hi Jim,


Independent of the Bayesian statistics, the ML version always does better than 2x the original version.

Second, if one uses the factors without ML they will not make a great strategy :wink:

Now I do not understand at all why I can’t get an ML signal from the Small Cap Focus to improve it at least to that same level… :slight_smile:

On another topic, regarding Spearman correlation:
I found in the book by Stefan Jansen how to evaluate whether a factor predicts forward returns. It looks like the “standard” is the Spearman correlation (AKA the information coefficient, IC). “An IC of 0.05 or even 0.1 allows for significant outperformance…” This IC is built into Alphalens from Quantopian. I just peeked into the code on GitHub, and it looks like they use stats.spearmanr for the IC.
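
As a rough sketch of what that calculation looks like for a single date, the snippet below calls `scipy.stats.spearmanr` directly. The wrapper name `information_coefficient` and the sample numbers are made up for illustration; this is not the Alphalens code, which as far as I can tell computes the same statistic date by date across the universe.

```python
from scipy import stats


def information_coefficient(factor_values, forward_returns):
    """Spearman rank correlation between factor values and forward returns."""
    rho, _p_value = stats.spearmanr(factor_values, forward_returns)
    return float(rho)


# Hypothetical cross-section: one factor value and one forward return per stock.
factor = [0.2, 1.5, -0.7, 0.9, 2.1]
forward_returns = [0.01, 0.04, -0.02, 0.00, 0.03]
print(information_coefficient(factor, forward_returns))
```

Because the statistic uses ranks only, any monotonic transformation of the factor (or of the returns) leaves the IC unchanged.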



It’s been discussed a few times. But future enhancements of the “classic P123” will have to be carefully designed to combat curve-fitting head on. It’s a big problem with our tools now. You have to know a great deal to use them properly, and with many cumbersome steps.

We have to put the user experience first. The latest upgrade to the Rank Performance tool is a step in the right direction.

We’ll see after the ML release. Perhaps you will not want it so much and convert your ranking system to an ML model.

It’s uncanny how well the paper’s analysis aligns with P123’s capabilities. Any consideration about inviting the authors to use the P123 ML tools for critical feedback?

The one thing that’s missing from this paper is any comparison between machine-learning-based stockpicking and stockpicking based on, say, ranking systems or multiple regression. The authors could have easily optimized a ranking system and/or a multiple regression model on exactly the same data and then run it on exactly the same period. I really wonder what the results might have been.

It’s also quite disappointing that these models require an average transaction cost of less than 5 or 10 basis points to break even.

I remain bullish about the potential of ML models, but given the extremely low transaction costs required, this paper lessened rather than increased my confidence in them.



A quote from the paper: “A variety of machine learning models are trained on the binary classification task to predict whether a specific stock outperforms or underperforms the cross-sectional median return over the subsequent week.”

In other words they used classification and only classification with no regression in this paper. This is the most salient feature of the article, IMHO.
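
For readers who want to see what that target looks like in practice, here is a minimal pandas sketch. It is not the authors' code: the column names `date` and `fwd_ret_1w` and the function name are hypothetical, and labeling a stock that lands exactly on the median as 0 is an assumption.

```python
import pandas as pd


def median_outperformance_label(df, date_col="date", ret_col="fwd_ret_1w"):
    """1 if a stock's forward return beats that date's cross-sectional median, else 0."""
    cross_sectional_median = df.groupby(date_col)[ret_col].transform("median")
    return (df[ret_col] > cross_sectional_median).astype(int)


# Hypothetical data: three stocks on one date, two on the next.
data = pd.DataFrame(
    {
        "date": ["2021-01-08"] * 3 + ["2021-01-15"] * 2,
        "ticker": ["AAA", "BBB", "CCC", "AAA", "BBB"],
        "fwd_ret_1w": [0.01, 0.02, 0.03, -0.05, 0.10],
    }
)
data["label"] = median_outperformance_label(data)
print(data)
```

A regression setup would instead use `fwd_ret_1w` itself (or its rank) as the target, which is the distinction discussed in this post.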

Everyone seems to think this is a wonderful paper to be emulated. Because of this almost unbounded enthusiasm, I wonder whether P123 is pursuing classification, regression, or both in the beta release. Not only is it the most salient feature of the article, but I think it is probably the most important decision P123 will make (or will have made) if P123 is going to choose just one method (regression versus classification). If both, then I guess they will be including the better method (whichever one it happens to be).

I had assumed P123 would be going with regression but the enthusiasm for this paper that, again, only uses classification makes me wonder.

This is just a question that without a doubt would have been considered and answered by a trained AI/ML expert by now. Especially considering how close we are to the release. I am agnostic as to which way P123 should go on this. I don’t have an opinion and use cross-validation to guide me with each model I use in what is hopefully a relatively unbiased manner. Other people may have different results–including the author of this paper apparently.

BTW, if P123 goes with the classification method, I would like to see the Brier Skill Score as a metric in the rank performance test. I don’t see how you could use Pearson’s correlation for classification models.
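
For reference, a minimal sketch of the Brier Skill Score in plain Python. The function names are hypothetical and this is not a P123 feature. It assumes binary outcomes (0 or 1) and predicted probabilities, and by default it uses the observed base rate as the reference forecast; for a median-split target that base rate would be close to 0.5.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def brier_skill_score(probs, outcomes, ref_prob=None):
    """1 - BS / BS_ref: 1 is perfect, 0 matches the reference, negative is worse."""
    if ref_prob is None:
        ref_prob = sum(outcomes) / len(outcomes)  # assumption: base-rate reference
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([ref_prob] * len(outcomes), outcomes)
    if bs_ref == 0:
        raise ValueError("reference forecast is already perfect; score undefined")
    return 1.0 - bs / bs_ref


# Hypothetical predictions for four stocks (1 = beat the median).
outcomes = [1, 0, 1, 0]
probs = [0.8, 0.2, 0.6, 0.4]
print(brier_skill_score(probs, outcomes))
```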



Regression at first. We did discuss classification. Checking with developers again to make sure adding it won’t require major work (it did not seem like it last time it was discussed).


Marco, wise choice I think. I would have done regression first too: see my post above… Thanks for the reply.

My take on the paper: might have been okay if the authors had included regression models as a comparison. But not pertinent to the upcoming P123 beta based on Marco’s reply above.


How was this Bayesian test done? Is there a program or website?


Yes: JASP (Jeffreys’s Amazing Statistics Program). The download is free and can be used with Windows or Mac.

What is your stock universe?

Here is some extra reading. Not all of the papers are as positive as the one I posted in the first post. If you have any thoughts on some of the research, please share them.

Researchers and practitioners hope that machine learning strategies will deliver better performance than traditional methods. But do they? This study documents that stock return predictability with machine learning depends critically on three dimensions: forecast horizon, firm size, and time. It works well for short-term returns, small firms, and early historical data; however, it disappoints in opposite cases. Consequently, annual return forecasts have failed to produce substantial economic gains within most of the U.S. market in the last two decades. These findings challenge the practical utility of predicting returns with machine learning models.

From the paper:
… Third, machine learning profits decline over time. As is the case for many anomalies, the abnormal returns over the past 20 years have been nowhere close to what they have been from three or four decades before. … Combining all three dimensions does not build an optimistic picture of machine learning strategies. The interactions between return horizon, firm size, and time further undermined the implementability of machine learning strategies. For example, when yearly return forecasts are considered, no significant abnormal returns have been recorded in the big-firm segment during the period of 2001 to 2020. To put it simply, our machine learning strategies failed to produce any alpha over the past 20 years in the stocks representing 90% of the U.S. market.

We propose a statistical model of heterogeneous beliefs where investors are represented as different machine learning model specifications. Investors form return forecasts from their individual models using common data inputs. We measure disagreement as forecast dispersion across investor-models. Our measure aligns with analyst forecast disagreement but more powerfully predicts returns. We document a large and robust association between belief disagreement and future returns. A decile spread portfolio that sells stocks with high disagreement and buys stocks with low disagreement earns a value-weighted alpha of 14% per year. Further analyses suggest the alpha is mispricing induced by short-sale costs and limits-to-arbitrage.

This article reviews ten notable financial applications where ML has moved beyond hype and proven its usefulness. This success does not mean that the use of ML in finance does not face important challenges. The main conclusion is that there is a strong case for applying ML to current financial problems, and that financial ML has a promising future ahead.

We reconsider the idea of trend-based predictability using methods that flexibly learn price patterns that are most predictive of future returns, rather than testing hypothesized or pre-specified patterns (e.g., momentum and reversal). Our raw predictor data are images—stock-level price charts—from which we elicit the price patterns that best predict returns using machine learning image analysis methods. The predictive patterns we identify are largely distinct from trend signals commonly analyzed in the literature, give more accurate return predictions, translate into more profitable investment strategies, and are robust to a battery of specification variations. They also appear context-independent: Predictive patterns estimated at short time scales (e.g., daily data) give similarly strong predictions when applied at longer time scales (e.g., monthly), and patterns learned from US stocks predict equally well in international markets.

The emerging literature suggests that machine learning (ML) is beneficial in many asset pricing applications because of its ability to detect and exploit nonlinearities and interaction effects that tend to go unnoticed with simpler modelling approaches. In this paper, we discuss the promises and pitfalls of applying machine learning to asset management, by reviewing the existing ML literature from the perspective of a prudent practitioner. The focus is on the methodological design choices that can critically affect predictive outcomes and on an evaluation of the frequent claim that ML gives spectacular performance improvements. In light of the practical considerations, the apparent advantage of ML is reduced, but still likely to make a difference for investors who adhere to a sound research protocol to navigate the intrinsic pitfalls of ML.

From the paper:

Many studies that report strong results for ML models focus on predicting next 1-month returns based on a large number of traditional factor characteristics as input features. Although the models load on traditional short-term return predictors they are able to exploit additional nonlinear alpha opportunities. The challenge with these models is to turn the resulting fast alpha signals into a profitable investment strategy after costs and other real-life implementation frictions. The corresponding literature is scarce, and the few works naturally suggest that the opportunity set for ML models to outperform traditional ones is often reduced given the reliance of ML models on high-turnover signals.

Machine learning for asset management faces a unique set of challenges that differ markedly from other domains where machine learning has excelled. Understanding these differences is critical for developing impactful approaches and realistic expectations for machine learning in asset management. We discuss a variety of beneficial use cases and potential pitfalls, and emphasize the importance of economic theory and human expertise for achieving success through financial machine learning.

Machine learning (ML) models for predicting stock returns are typically trained on one-month forward returns. While these models show impressive full-sample gross alphas, their performance net of transaction costs post 2004 is close to zero. By training on longer prediction horizons and using efficient portfolio construction rules, we demonstrate that ML-based investment strategies can still yield significant positive net returns. Longer-horizon strategies select slower signals and load more on traditional asset pricing factors but still unlock unique alpha. We conclude that design choices are critical for the success of ML models in real-life applications.

From the paper:

… but that their performance has weakened substantially in the second half of our sample period. However, we show that by incorporating efficient portfolio construction rules, … and using longer prediction horizons to train the machine learning models, significant after-cost returns can still be achieved by machine learning-based investment strategies.


We theoretically characterize the behavior of machine learning asset pricing models. We prove that expected out-of-sample model performance—in terms of SDF Sharpe ratio and test asset pricing errors—is improving in model parameterization (or “complexity”). Our empirical findings verify the theoretically predicted “virtue of complexity” in the cross-section of stock returns. Models with an extremely large number of factors (more than the number of training observations or base assets) outperform simpler alternatives by a large margin.

We analyze machine learning algorithms for stock selection. Our study builds on weekly data for the historical constituents of the S&P 500 over the period from January 1999 to March 2021 and builds on typical equity factors, additional firm fundamentals and technical indicators. A variety of machine learning models are trained on the binary classification task to predict whether a specific stock out- or underperforms the cross sectional median return over the subsequent week. We analyze weekly trading strategies that invest in stocks with the highest predicted outperformance probability. Our empirical results show substantial and significant outperformance of machine learning based stock selection models compared to an equally weighted benchmark. Interestingly, we find more simplistic regularized logistic regression models to perform similarly well compared to more complex machine learning models. The results are robust when applied to the STOXX Europe 600 as alternative asset universe.

Would this ML module, which may be released as early as the middle of March, require the use of API credits or cost anything extra?

It will not use API credits. Those are for traffic to/from our network; AI is all internal for now. There will be additional costs, which will initially be somewhat high and gradually decrease. The cost breakdown is (tentatively): 1) a fixed monthly cost, 2) a variable cost based on resource utilization, and 3) a storage cost for datasets.

This means that doing research for a few months should not be that costly. And just doing predictions for live models, or model maintenance, will be affordable. As always, our mission is to make it affordable for retail users as well as non-retail. But our AI backend is not very big right now (we have our own servers), so we need to limit the usage at first somehow. If demand materializes, we may also leverage the cloud for large jobs like CV of hundreds of models, or just buy more hardware from the likes of SuperMicro (how did I miss this stock!?)
