Seminal paper cementing AI's place at the top for developing quantitative strategies

Marco,

Recently I have been using a 203 x 16 array, mostly of technical data, to answer your question specifically. This was with sklearn's boosting implementation, which is different from XGBoost.

But XGBoost is specifically designed to be fast and to use parallel processors, taking advantage of special programming for boosting.

I have run over a million rows (using the returns of all the stocks in the universe each week) on a MacBook Pro with no GPU. I never really timed it, but I think it took perhaps 20 minutes.

Random forests are particularly good at using parallel processors. How fast they run depends largely on how many parallel processors you can bring to the task. I have 4 threads available and it is actually pretty fast.
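For what it's worth, the parallelism described here maps directly onto scikit-learn's `n_jobs` parameter. A minimal sketch on synthetic data (the array shape mirrors the 203 x 16 example above; the data itself is made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(203, 16))   # e.g. 203 observations of 16 technical factors
y = rng.normal(size=203)         # target returns (synthetic noise here)

# n_jobs=-1 grows the trees on all available cores/threads in parallel
model = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=0)
model.fit(X, y)
preds = model.predict(X)
print(preds.shape)
```

The same `n_jobs=-1` idiom works for sklearn's grid searches and cross-validation helpers, so the speedup carries over to tuning as well.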

Neural nets are sped up through matrix multiplication, which is ideal for GPUs, but that too is doable on an old MacBook Pro (with no GPU)…

I wish I had a better answer (and better computer).

Jim

Where can I find information about P123’s beta AI feature? Using the P123 API?

Gs3

Marco,

So, I get that one could read the paper and think one should move to boosting and neural-nets. I used to be that way myself. Also, I would recommend that you continue to incorporate those methods to the extent that your hardware (computing power) allows.

But please take a moment to look at the advantages of partial least squares (PLS) with me. I think you will find that it is worth your time.

  1. When you step back and look (maybe squint a little bit :thinking:) at PLS and what Yuval does, you will find them pretty similar. I argue that Yuval is essentially doing a manual version of PLS. Of course, it is possible that Yuval self-identifies as a discretionary trader or whatever. But I do mean this as the highest of compliments with regard to his final automated quantitative-investing product(s) and the natural core understanding of math and statistics that led him to his method independently.

  2. PLS DOES BETTER THAN ANY OTHER METHOD for small-cap value with regard to out-of-sample predictive R^2. There are a lot of people investing in small-cap value at P123. Small-cap value is at the bottom of this table from the paper:

  3. I think you will find PLS is the method with the least computational requirements. P123 is known for its backtest speeds, and those speeds may not need to be any slower or require much greater computational resources. THIS IS FOR A COMPLETE STATE-OF-THE-ART AI IMPLEMENTATION WITH NO COMPROMISES. Why:

a) Linear models are less prone to overfitting. Some of the train/validate/test machinery becomes less necessary.

b) Linear models are very quick, with no need for GPUs, etc.

c) Because of (a) and (b) you could probably go to a train/test method (skipping validation) in a true walk-forward backtest. And it will be fast, or at least fast enough to implement.

d) P123 could continue to implement this with an online algorithm for the live port, meaning the machine learning method is retrained before each rebalance.

YOU MAY NOT KNOW IT YET BUT THE COMBINATION OF (c) and (d) IS HUGE!!!

And it bears repeating: PLS was the best method for small-cap value in the paper you referenced.

TL;DR: If you explore this further you may find PLS is the method that provides the best returns for small-cap value stocks, and that it is less computationally intensive, allowing for the best implementation (walk-forward backtesting with online learning). ANYTHING THAT I FORGOT TO CONSIDER?

Jim

I’m confirming that we will have both PLS and OLS with penalization. Not sure about automated brute force tuning initially; you will need to create permutations by hand. The k-fold is very interesting to “multiply” whatever data one has.

Not sure what you are saying re. block forward. In your example, do you end up with three trained models? And then are you saying that a 3 year Portfolio backtest uses a different model each year?

Another major goal of this “AI” project is that it’s usable by non data scientists. Some features will be part of the AI factor subsystem; others, like the block forward simulation, can be constructed (if I understood correctly).

PS: perhaps we should rename the project “ML Factor”? Can’t really call PLS “AI”.

Marco,

Thank you. Interesting how caught up in definitions we are. I like to bypass all of that and say everything we do with a remote server in Chicago falls under the umbrella of reinforcement learning (perhaps some of the learning is done manually and by a human). But a serious question since you brought it up: what parts of P123’s project are AI (as opposed to machine learning)? Not PLS, I guess, but boosting is AI? Maybe just neural nets meet the definition of AI? I am okay with any definition. It’s just a definition, and I want to use anything that works, whatever it is called. I promise not to debate this, but I truly find it interesting how much people like to parse the definitions and even find it important sometimes (not that you do necessarily).

Definitions aside, very cool that you have PLS and cannot wait to try it!!!

I agree that K-fold is interesting, and de Prado’s modification of it with an “embargo” is not really a difficult improvement to make programming-wise. Definitely worth looking into, I think.

I think some sort of cross-validation is absolutely necessary. Walk-forward is just one way to do it. K-fold (with or without embargo) is another.

The way the paper does it is a third way. They did say it was important. You called it a “seminal” paper. I think they have a point regarding the need for cross-validation and some of their comments about the advantage of the way they did it might be accurate.

Walk-forward may be worth discussing with your AI (or machine learning) specialist. You might continue to look at walk-forward even if my explanation is not so good. It is one way to make 20 years of data more than enough, and it has other advantages. Better than, say, K-fold or the way the paper you referenced did it? Sometimes, IMHO.

Simplest example, with a fast computer and monthly rebalance: train on 5 years of data and predict the next month. Then train on 5 years and one month of data and predict the following month. Then train on 5 years and two months and predict the next month… until now, where you are training on all 20 years of data and predicting the next month that has not happened yet in a live port.

So you are backtesting what you would have done each month with the data you had at the time. It gives 15 years of predictions (pretty good). There is absolutely no possibility of look-ahead bias (unless there is a problem with the data provider in that regard).
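The expanding-window walk-forward described above can be sketched in a few lines (synthetic data; the 5-year starting window and monthly cadence follow the example in the text):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_months = 240                       # 20 years of monthly data
X = rng.normal(size=(n_months, 5))   # 5 synthetic factors
y = rng.normal(size=n_months)        # next-month returns (synthetic)

train_window = 60                    # start with 5 years of history
preds = []
for t in range(train_window, n_months):
    # expanding window: refit on everything available up to month t,
    # then predict month t only (future data is never touched)
    model = LinearRegression().fit(X[:t], y[:t])
    preds.append(model.predict(X[t:t + 1])[0])

print(len(preds))  # 180 out-of-sample monthly predictions, i.e. 15 years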

Presently at P123 we train on all of the data, then run a backtest that includes the data we just used to train the model, along with future data that never would have been available at the time.

We can do better than that, I hope. Walk-forward is just one way we could do better.

Jim

Well, I was quoting our Stanford professor friend who forwarded it to me. But a quick Google search shows a bona fide explosion in research using AI in trading strategies. Incredible, actually. 916 citations in just two years. Piotroski’s FSCORE has 1,545 citations in 20 years.

3 small remarks:

  1. Providing users with modelling methods like this, in addition to rank methods, is incredibly helpful to source alpha. P123 has been late to the game. We should allow models to be used for both forecasting stock returns AND volatility. If P123 is willing, any explanatory variable should be able to be modeled (macro, factors, factor spreads, etc.).

  2. Least squares falls under “AI”. There are many OLS learners available. Most of them are focused on selecting the factors that minimize squared deviations. Neural nets provide more permutations to maximize fit. This maximizes the chances of overfitting. LS by design makes it harder to overfit. Jim is 100% correct with his remarks re its power. Deep learning in fact does just this: combines models for maximum effectiveness. This was the research breakthrough that made AI a household name.

  3. Boosting, bagging etc. are all proven in their effectiveness. But again, these are double edged swords. Great if you are right in your research. God awful if you are wrong. Optimization techniques maximize research errors, by definition.

No idea how you are implementing this, Marco et al., but be sure to allow users to conduct a lot of out-of-sample testing (training vs. testing sampling). Linear, stratified, etc.

Thanks Jim!

Really excellent paper. I’ve read the paper carefully and looked to apply it to some of the p123 solutions as well as my own applied machine learning knowledge and tools. Specifically, I thought about how I could create a random forest model using Dataminer and my Excel random forest plugin to predict excess forward returns. The initial results were conducted only over a limited (and admittedly unusual) sample period: 2020. My next step will be to test this model over a rolling/walk-forward period.
I pulled the simple rank (1-100 scale) of each of the Portfolio123 core ranking systems (combo, value, growth, etc.) for each stock in the Russell 1000, along with trailing 12 month return. I used forward 12 month return as the predictor variable (the value we are trying to predict with the random forest).
Of course, we want to split the data into training and testing segments. To do this, I split the data by market cap, training the model on the bottom 500 stocks in the Russell 1000 and testing on the top half (i.e., the S&P 500). I hope this gives a robust out-of-sample test, but I am open to any feedback on increasing robustness or reducing biases – am I missing any here?
The random forest then uses its sorting technique to use the factor scores to predict forward returns. Through tuning, I found that a 3-layer tree works best, consistent with the results of the Kelly, Gu, Xiu paper.
In practice, you would then rank the full list of stocks from high to low by expected return and form a portfolio of the top 10 or 20%.
Backtesting this through 2020 gives a total return of 124%. All in all, more fantastic data from P123 to explore, and the initial results have been positive. Feedback is welcome and appreciated in advance.


[charts: RFchart2, RFchart3, RFchart4]


Chambers,

Nice!

I would only note your planned use of a “rolling/walkforward” method. As I noted above, I think P123 might be able to implement this for some machine learning methods.

That would be impressive to say the least. But whether P123 can implement this or not: very nice!!!

As you know, walkforward is the ultimate test of what would have happened had you been using this strategy in the past with no lookahead bias whatsoever.

For those who like to use the full train, validate and test method, one can train and validate up until a certain date to find the best model. After selecting the best model with the validation subsample, one can test on the final holdout sample. A true out-of-sample test of what you will be doing going forward.

I think P123 has enough data to do that, especially if one uses a relatively short rolling window or is willing to start the training with a short window.

Like you, I think walkforward is a good way to go. And I also find a rolling window often works best.

Jim

Just a quick update here as I continue the research. Unfortunately, the strong results I saw initially were at least partially due to look-ahead bias. This is because I originally set up the model to make predictions on Jan 1, 2020, trained the model on the full-year returns, then used those forward returns to forecast the rest of the sample for 2020. I would not have had access to that information if I had been using this model in real time.

I went ahead and re-trained the model on the full 2020 data set and used it to forecast 2021 returns. Unfortunately, at this point the model broke down and OOS R^2 was insignificant. However, this only used one set of data points (as of Jan 2020), so I think that building a rolling/walk-forward test that incorporates more data points for each stock will increase the robustness.

All in all, I think the first test shows that returns are based significantly on factor exposures, but that forecasting which factors will work best in any particular time period is extremely challenging.

I’ve also noticed in the literature that RF models tend to work better on shorter forecast periods, so I will investigate that; and also that predictive power is better when forecasting positive/negative returns rather than the absolute value, so I will look into that as well.

Finally, I want to use a random forest model to sort and select factors for inclusion in a ranking system. This will take a bit more time for me to work through since it requires structuring the data in a different way.

Chambers,

You probably saw this but the paper limits the splits for random forests more than you are used to, I suspect.

I have not done a lot with random forests on P123 stock data recently, but when I did, I used something similar to the authors in this regard. Specifically, I had a pretty large minimum leaf size (around 500 or 600). Obviously, I determined the leaf size “hyperparameter” using cross-validation.

I suspect this sounds crazy to you, as generally we do not set a minimum leaf size for random forests, in part because they are supposed to be so good at preventing overfitting. Usually we use one or maybe five as a minimum leaf size. One is the default for sklearn.

I developed the concept of “resolution.” By that I mean: what kind of detail (or resolution) do you want to resolve in your data, or are you even capable of resolving with your algorithm?

If you have weekly data and are just hoping you can find the top 5 stocks each week, that turns out to be a “resolution” of 5 x 52 x 20, or 5,200 stocks!!! That is all the resolution you need to become a bazillionaire. And in practice your resolution will not even be that good. That is all of the detail you need to see, and you probably are not capable of “resolving” any further detail without a quantum computer or something.

Sure, it would be great if your program were so prescient it could predict which stock would be the best one over the next 20 years, but that is not necessary. Nor is it possible in the real world.

TL;DR. The authors may have a point about limiting the splits or leaf size with a random forest or with boosting. At least put a few “crazy high” leaf sizes (or small split numbers) into your GridSearch to start with.
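A hedged sketch of this suggestion, assuming scikit-learn: put a few unusually large `min_samples_leaf` values into a `GridSearchCV` alongside the usual defaults and let cross-validation decide (the data here is synthetic noise, for illustration only):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))   # 10 synthetic factors
y = rng.normal(size=2000)         # synthetic target returns

# include a few "crazy high" leaf sizes alongside the usual 1 and 5
param_grid = {"min_samples_leaf": [1, 5, 100, 500]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0, n_jobs=-1),
    param_grid,
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

On noisy financial data the larger leaf sizes act as strong regularization, which is the point being made about the paper's settings.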

Jim

Jrinne, thanks for your thoughtful response. By limiting the split size, are you referring to the number of trees the model trains or the depth of each tree? My model has up to 8 layers of tree depth and runs 100 trees in each model.
I do see your point re resolution – my current configuration has it picking the top 50 stocks monthly from the S&P 500 universe. However, this seems to line up with how the paper is structured (i.e., it makes many predictions).
I suppose I was a bit surprised that the walkforward did not produce better results, at least something in the range of what the paper put out. The divergence in performance could be due to 1) the long forecast period / predictor variable (12m instead of 1), and 2) fewer parameters for the model to train.
On point 2, currently the model uses the 7-factor model (including combo ranking), trailing returns (12m, 6m, 3m and 1m), market cap, and various forms of momentum, volatility and liquidity factors – about 25 in all (vs. 100+ in the paper). That said, the momentum and vol/liquidity factors perform best in the paper, so I was expecting better results based on their inclusion in my model.

Chambers,

I am not doing a lot of what you are doing at this time so just some general thoughts.

First, I have looked at random forests, logistic regression, support vector machines, boosting, Naive Bayes and other things for technical data on ETFs fairly recently. Keep in mind that paper used a lot of technical variables so maybe my findings are pertinent. In any case a 2 year window seemed optimal for that. Using all of the historical data (without a window) tended to give the results I already knew (and did not need machine learning to figure it out). An example of what I mean by that is XLY might give better returns long-term than XLU—just look at the equity curve for the last 20 years. Machine learning was able to tell me that when all of the historical data was used (yay machines). A 2 year window could, sometimes, tell me when XLU would outperform (a sincere yay for machines).

BTW, this was a classification model, but logistic regression and support vector classifiers (linear and radial basis kernels) were the best performers for my look at ETF classifications. Classified by whether the ETF beat the median return of the ETFs I was looking at that month.

K-fold gave results that just did not hold up when doing walk-forward (suggesting a problem of look-ahead bias or data leakage with K-fold). So stick with walk-forward if you can. Maybe de Prado’s “embargo” method would make K-fold work better, but I have not tried that yet. An embargo can be done with not that much Python programming.
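As a rough illustration of the embargo idea (a simplified version of de Prado's method, not his full purged k-fold), one can drop training samples that immediately follow each test block:

```python
import numpy as np

def embargoed_kfold(n_samples, n_splits=5, embargo=10):
    """Yield (train_idx, test_idx) pairs where training samples within
    `embargo` observations after the test block are dropped. This is a
    simplified sketch of de Prado's embargo, meant to limit leakage
    from overlapping labels near the fold boundary."""
    fold_size = n_samples // n_splits
    indices = np.arange(n_samples)
    for k in range(n_splits):
        start, stop = k * fold_size, (k + 1) * fold_size
        test_idx = indices[start:stop]
        embargo_stop = min(stop + embargo, n_samples)
        train_mask = np.ones(n_samples, dtype=bool)
        train_mask[start:embargo_stop] = False   # drop test block + embargo
        yield indices[train_mask], test_idx

# example: 100 weekly observations, 5 folds, 10-observation embargo
splits = list(embargoed_kfold(100, n_splits=5, embargo=10))
print(len(splits))
```

The generator can be passed to sklearn's `cross_val_score` via its `cv` argument, so it drops into an existing workflow with little extra code.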

I mean “depth of the trees” was limited in the paper. Limited to 6, I think, and the author made some points about that. I have not looked at depth as a hyperparameter, but setting a minimum leaf size has a similar effect, as you know. I spent a lot of time optimizing minimum leaf size when looking at stocks. For ETFs I usually do not set a minimum leaf size (I usually set it at 1).

Maybe it makes sense that ETFs would be different. ETFs already have a lot of stocks in them (averaging the results and reducing the influence of outliers) and some of the market noise is reduced by the fact that the holdings have some similarities (eg utility stocks in XLU might not be influenced by some headline news or at least be influenced in a similar manner).

THE MORE TREES THE BETTER as far as having the best model with the sacrifice of more computer time and/or need for parallel processors. The improvement does seem to level off at 500 trees for what I have done recently. 200 may be adequate and possibly better than 100. You will want to test this and avoid long run times by setting it as low as possible for what you are looking at. It will probably be different for you.

I remain skeptical about beating the institutions (including the ones that may ultimately hire this author) on the S&P 500. But regardless, wouldn’t it be surprising if you found something that produced a lot of alpha for the S&P 500 on your first run? Do not be discouraged by your first attempts.

I hope that helps a little.

Best,

Jim

These machine learning papers are definitely not the easiest ones to understand. I’m all good with extensions to linear models (OLS) and dimension-reduction techniques like PCA. Parts of the generalized linear models described in section 1.5 of the paper also make sense to me.

But when they start talking about fixing splines with knots and lassos, I’m starting to feel a bit like a cowboy lost in the woods (to be precise: lost between regression trees within random forests, so I understand).

What parts of the techniques described in the paper will P123 touch upon with the implementation of their machine learning tools?

I may be missing one and I could not find the link but I am sure Riccardo said regression, support vector machines (SVM), random forests, XGBoost and neural-nets which is a great start in my book! I am sure P123 will respond if I made an error.

LASSO regression is a linear regression method that uses regularization to remove noise variables and prevent overfitting. It is a method mentioned by @pitmaster – who has at least a master’s degree in machine learning, has a PhD, and does machine learning at a bank (he has a link to his LinkedIn page in his bio). He mentions LASSO regression for good reasons, I believe.

LASSO regression does indeed prevent overfitting. It runs extremely fast without being resource-intensive and has only one hyperparameter. Personally, I think it would be a nice addition to regression for P123 at some point.
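A minimal sketch of LASSO's variable selection, assuming scikit-learn (synthetic data in which only two of 30 candidate factors carry signal):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))                             # 30 candidate factors
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=500)   # only 2 matter

# alpha is the single regularization hyperparameter: larger alpha
# shrinks more coefficients exactly to zero, dropping noise variables
lasso = Lasso(alpha=0.1).fit(X, y)
n_kept = int(np.sum(lasso.coef_ != 0))
print(n_kept)  # far fewer than 30 factors survive
```

Tuning that one `alpha` with cross-validation (e.g. `LassoCV`) is cheap, which is the "fast with one hyperparameter" point above.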

Hi Victor,

I actually used regression splines. I’m no expert, but I understand the concept. It is quite useful for testing non-linearity between a factor and the response (and risky due to overfitting).

In short, you want to use regression splines if you assume there is a non-linear (non-monotonic) relationship between a factor value, e.g., FRank('SalesGr%PYQ), and future return.

Then you can divide the factor values into regions, e.g., FRank output: [0-50, 50-80, 80-100], and fit a function for each region. The famous knots are simply the region boundaries [50 and 80].

Then you would possibly run a linear spline regression to estimate three coefficients, one for each region. From a practical perspective this involves sorting the dataframe by the factor and running a more or less sophisticated regression.

Some may be familiar with the transformation abs(70 - FRank('SalesGr%PYQ)). This is a simplification of a linear spline with one knot. (Credits to Yuval.)
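To make the one-knot idea concrete, here is a sketch of a linear spline fit with a knot at rank 70, on synthetic data shaped like the abs(70 - rank) transformation (the relationship and numbers are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical setup: factor rank (0-100) vs. future return, where the
# relationship peaks near rank 70 (non-monotonic by construction)
rng = np.random.default_rng(0)
rank = rng.uniform(0, 100, size=1000)
ret = -np.abs(70 - rank) / 100 + rng.normal(scale=0.1, size=1000)

# linear spline with one knot at 70: basis = [rank, max(rank - 70, 0)]
X = np.column_stack([rank, np.maximum(rank - 70, 0)])
model = LinearRegression().fit(X, ret)
slope_left = model.coef_[0]                     # slope below the knot
slope_right = model.coef_[0] + model.coef_[1]   # slope above the knot
print(round(slope_left, 3), round(slope_right, 3))
```

The fitted slope is positive below the knot and negative above it, recovering the tent shape that the abs() transformation encodes.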

The image below shows cubic and linear spline regressions (the knot is at age 50).


source: Trevor Hastie - An Introduction to Statistical Learning

Thank you, Jim and pitmaster.

I’m still in the process of learning and fully digesting all the new terms that are coming at us. Besides your post and that of pitmaster, I found this older post from Marco that describes (most of) the tools that will be available:

Source: Machine learning, Portfolio123, historical data and ratios - #5 by marco.

I will focus my understanding of these subjects by reading the papers that have been shared by Marco and some posts that I have seen by, amongst others, you and pitmaster.

Edit: I really like your example of a spline with a knot, abs(70 - FRank('SalesGr%PYQ)), pitmaster. It helps connect some of the theoretical concepts to what is already being done by some users within P123. Slowly connecting the dots!

Thanks again.

@Marco,

That is a lot of methods. Impressive and exciting.

I would be interested in what @pitmaster and others have to say.

But I am absolutely sure I am not the only one who will want to consider “model averaging”. At the extreme running all of those models and averaging or even stacking the results (probably sorted predicted returns to make a rank).

“Stacking” would be a little harder, but model averaging would be trivial. We could probably do it without P123’s help by making each model a node and having the ranking system average the ranks. It could probably be done without being an explicit feature, although it could potentially be marketed as one.

But with or without P123 needing to address it, looking at model averaging (or even stacking) would be an obvious thing to do with such a rich supply of models.

Addendum: I think a TL;DR is enough here, and I'll leave it to members to ask ChatGPT if they want more information about stacking. Probably for Jupyter notebooks at home to start.

TL;DR: When stacking, the predictions of individual models (e.g., regression, random forest, XGBoost, etc.) are used as features for the final model.
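A minimal stacking sketch, assuming scikit-learn's `StackingRegressor` (the base models and meta-model here are arbitrary choices for illustration, not a recommendation):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))   # 8 synthetic factors
y = rng.normal(size=300)        # synthetic target returns

# each base model's cross-validated predictions become the features
# that the final (meta) estimator is trained on
stack = StackingRegressor(
    estimators=[
        ("ols", LinearRegression()),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=Ridge(),
)
stack.fit(X, y)
pred = stack.predict(X)
print(pred.shape)
```

Plain model averaging is even simpler: average the predicted ranks from each model, which is exactly the "each model as a node" idea above.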

Jim aka “The grocer” because I stack and bag (Bootstrap Aggregate) so much.

I would be happy to see these models in initial stages:

  • Linear Regression, Ridge, Lasso, Elastic Net
  • Decision Trees and Random Forest

But models are least important for me, more important is framework. I would start with simple/standard framework and extend it based on user feedback.

The exciting feature for me is saving model predictions as a custom formula. Then you could do averaging by yourself in a ranking system or even using optimiser.


Thank you Pitmaster. I totally agree.

FWIW, here is a k-fold cross-validation (with a 3-year embargo, using the easy-to-trade universe):

[screenshot: k-fold cross-validation results]

Addendum: Here is what someone new to this ACTUALLY NEEDS TO KNOW: it runs fast even on a laptop and has only two hyperparameters to adjust, which could be tuned with a grid search that might take 15 minutes to cover every reasonable combination. Then you are done. Full stop. Do something else. Machine learning can make things easier.

Disclaimer: just one model to look at with cross-validation. Maybe something else does better than elastic net for you. I would only suggest that Pitmaster has a good idea in suggesting the possible addition of an elastic net regression to a cross-validation study.
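For anyone wanting to reproduce the two-hyperparameter grid search described above, a minimal elastic net sketch with scikit-learn (synthetic data; the grid values are just examples):

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))           # 20 synthetic factors
y = X[:, 0] + rng.normal(size=500)       # one real signal plus noise

# the two hyperparameters: alpha (overall penalty strength) and
# l1_ratio (mix between lasso-style and ridge-style penalties)
grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.1, 0.5, 0.9]}
search = GridSearchCV(ElasticNet(max_iter=5000), grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

On a laptop this whole search takes seconds, which is the "15 minutes and you are done" point made above, with room to spare for a finer grid.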

Jim