ML workflow

So you could upload these rankings as a custom stock factor, and then seamlessly integrate them into a strategy by creating a ranking system consisting of a single node of that custom factor.

Current row limits for custom stock factors are relatively tight, so you may want to run a proof of concept to make sure you don’t hit a wall.

If you do, you could import those ranking CSVs into some other platform for backtesting and/or live trading.

Thanks – I suppose you wouldn’t need to upload every stock’s ranking, just the top N you want for your portfolio. For 10 years of quarterly ranks with a “top 50” portfolio strategy, that would only require 10 * 4 * 50 = 2,000 rows.


I have two suggestions you may want to keep in mind.

  1. There’s currently no provision for automatically NaN-ing out an imported factor if it hasn’t been updated within a certain time period. So you’ll have to update ranks for the previous “top 50” at the next update, which will at most double your row estimate, depending on overlap.

  2. Think about whether you really want just quarterly rank updates. Even if your factors come from quarterly statements, they will update throughout the quarter depending on when each company announces earnings and how long it takes FactSet to process that update. You may also have some value factors that compare something from the income/cash-flow statement or balance sheet to some measure of price; those will update more frequently than quarterly. In other words, even if you only plan to re-estimate model weights each quarter, you may want to update your rank forecasts more often, depending on the update frequency of your features and how often you plan to make trading decisions.

Thanks for the replies. I think I’m sufficiently informed to put my CC in and give this a shot! I’ll try to post a status report here later.

Great point. If there’s no automatic zeroing out, is there a manual one?

EDIT: The documentation says there is an option to ‘delete all existing data’ before uploading an updated series. I assume that takes care of this issue: Imported Data Series & Stock Factors - Help Center

Regarding what frequency is best for AI…

Here’s an excerpt from a recent AI paper mentioned in this post Stock picking with machine learning

Almost all of the abovementioned studies use ML models for monthly predictions based on monthly data. In contrast, we analyze shorter term predictability and focus on weekly data and weekly predictions. Analyzing weekly predictions provides two major advantages: First, the larger number of predictions and trades in an associated trading strategy provides higher statistical evidence due to the larger sample size. Second, ML models require large training sets. Therefore, studies analyzing monthly predictions require very long training sets of at least 10 years. Given the dynamics of financial markets and the changing correlations in financial data over time, it could be suboptimal to train ML models on very old data, which is no longer relevant for today’s financial world due to changing market conditions. Because our study builds on weekly data, we are able to reduce the length of the training set to only 3 years while still having enough observations for training complex ML models.


Weekly updates make sense, especially if you’re including technical indicators like the authors do. Really cool paper.


I think the paper is probably correct on this, noting that the referenced study is about classification models and that it can probably be generalized to regression models. We generally use regression models at P123, and the first iteration of AI/ML will be using regression models exclusively, if I understand correctly. BTW, I think it was a wise choice to start with regression models, and I am not sure that classification models should be a priority. I am not questioning the decision, as I think it was a good one.

I would add to this paper the evidence from classic P123 users, who commonly rebalance weekly. And P123 is kind of like a non-parametric (i.e., it uses ranks, which are ordinal) multivariate regression with manual optimization: manual rather than something like gradient descent (internally at P123, anyway). If P123 classic used automated optimization, it would clearly be a machine learning model.

Actually, not “kind of.” It is, in fact, a non-parametric regression model that is generally optimized manually by members. But a P123 model can be optimized with machine learning; for example, one could determine the weights in a ranking system using a regression model. Correct me if this is not factually correct.

But whatever people want to call what they do with classic P123, whether they self-identify as using statistical learning, fundamental analysis, or something else, P123 members’ manual optimizations are not likely to differ much from those resulting from the gradient-descent optimizations we will be using with machine learning at P123.

And P123 is moving toward improved automation of P123 classic, considering, for example, allowing members to run some of the optimizations in parallel. I am not sure at which point everyone agrees it is at least a little bit like machine learning, or when P123 finally automates the entire process, making it machine learning by definition.

This is just to say P123 classic has been (and is) valuable. Personally, I don’t mind having some of that automated with gradient-descent algorithms. I am only making the point that I believe much of what has been found to be true with P123 classic will continue to be true with automation.

TL;DR: Most P123 classic members have been using weekly rebalance for a while now and I don’t expect optimization with fully automated machine learning (i.e., gradient descent) to change that in any way. It would be surprising if it did.



I agree with @benhorvath that for weekly predictions some considerations need to be made in terms of the selected features and also in terms of the label.

One important property of the dataset processed by the ML algorithm should be consistency of persistence between features and labels. Intuitively, the autocorrelation of the label y (future return) and that of the features X should not be too far apart. If you sample accounting-based features and labels weekly, the label is very weakly autocorrelated, while the features are often highly autocorrelated.

An extreme example: a model with one feature, SalesTTM / SalesPTM, whose label is the 1-week-ahead return, with data sampled weekly over 6 months. The feature changes only once during the 6-month training period, while the label changes every week. Autocorrelation in feature space will be close to 1, while the label’s autocorrelation is usually close to 0. In this case your model will capture a lot of noise.
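
This mismatch is easy to see with a quick synthetic sketch (pandas; the numbers are made up for illustration, not real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 26 weekly observations, roughly 6 months, for one stock
n_weeks = 26
dates = pd.date_range("2024-01-05", periods=n_weeks, freq="W-FRI")

# Accounting-based feature: changes only once, at the mid-period earnings update
feature = pd.Series(np.where(np.arange(n_weeks) < 13, 1.04, 1.09), index=dates)

# Label: 1-week-ahead return, essentially noise from week to week
label = pd.Series(rng.normal(0.0, 0.03, n_weeks), index=dates)

print("feature autocorr:", round(feature.autocorr(lag=1), 3))  # close to 1
print("label autocorr:  ", round(label.autocorr(lag=1), 3))    # typically near 0
```

The step-function feature repeats its own value almost every week, so its lag-1 autocorrelation is near 1, while the noisy label’s is near 0.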

There are some solutions to this problem, but that is probably a topic for another thread.


This seems very important, since most datasets will be a mixed bag of short-term and long-term factors, with the target maybe somewhere in between.

What are some of the solutions?



@Pitmaster, thank you. I had not thought of feature autocorrelation until your post, and I don’t know much about its problems. I do know an embargo can be used for autocorrelation when doing k-fold validation, but I had not considered whether it helps with both feature and target autocorrelation.

When you answer Marco, you might keep in mind that he has mentioned k-fold validation in the past, and discuss embargoes with him if you think they can be a partial solution for feature autocorrelation in k-fold cross-validation, especially if P123 will be using it. P123’s AI/ML is still a bit of a black box, and I don’t know whether it will include k-fold cross-validation, so my question may not be pertinent, but I think P123 would benefit from discussing this with you. An embargo may not be that difficult a programming challenge for P123 if you think it would be helpful; the possibility that P123 is considering k-fold cross-validation is the only reason I mention it.

BTW, I assume something like EBITDA/EV also has autocorrelation, even if price changes this metric from day to day. I assume you were giving an extreme example above.



Am I missing something with my logic? These features rightly are not predictive of weekly returns for the reason you mention: the label changes but the feature doesn’t. There is nothing to solve here because it’s not a problem; it’s a discovery of the feature’s usefulness for the task (i.e., predicting one-week returns).

I think the most widely accepted transposition for accounting metrics, in academia and practice, is to normalize these features with price: (SalesTTM / SalesPTM) / close(0) = SalesTTM / (SalesPTM * close(0))?

Possible solutions are as follows:

  • increase the label’s return period: for example, consider annual or biennial future returns when working with 4-week data; this should increase the autocorrelation of the label.
  • use differences in factor levels: for example, instead of SalesTTM/EV, use SalesTTM/EV - FHist(SalesTTM/EV, 1); this should decrease the autocorrelation of the features.
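
A minimal sketch of the second bullet, with a synthetic pandas series (the FHist-style lag is emulated with `diff()`, and the values are invented):

```python
import pandas as pd

# Quarterly factor sampled weekly: the raw level barely changes week to week,
# so its lag-1 autocorrelation is near 1.
sales_to_ev = pd.Series(
    [0.30] * 13 + [0.33] * 13,  # one earnings update over ~6 months
)

# First difference, the analogue of SalesTTM/EV - FHist(SalesTTM/EV, 1):
# zero except in the week the factor actually updates.
delta = sales_to_ev.diff()

print(sales_to_ev.autocorr(lag=1))  # near 1
print(delta.autocorr(lag=1))        # much closer to 0
```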

Personally, I never use weekly returns as labels; they are too noisy, especially for small stocks. It is also worth considering path-dependent labelling techniques, like the triple-barrier or trend-scanning methods developed by Lopez de Prado.
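
For reference, here is a stripped-down sketch of the triple-barrier idea (my own simplified version with made-up parameter names; Lopez de Prado’s full method also sizes the horizontal barriers by estimated volatility):

```python
import numpy as np

def triple_barrier_label(prices, entry, upper=0.05, lower=0.05, max_hold=20):
    """Label one trade path: +1 if the upper barrier (profit-take) is hit
    first, -1 if the lower barrier (stop-loss) is hit first, and 0 if the
    vertical (time) barrier expires before either."""
    p0 = prices[entry]
    for t in range(entry + 1, min(entry + 1 + max_hold, len(prices))):
        ret = prices[t] / p0 - 1.0
        if ret >= upper:
            return 1
        if ret <= -lower:
            return -1
    return 0

# Toy example: the price rallies 6.5% within the holding window
path = np.array([100, 101, 103, 106.5, 104, 102])
print(triple_barrier_label(path, entry=0))  # -> 1
```

Unlike a fixed 1-week return, the label depends on the path the price takes, which is what makes it less sensitive to single-week noise.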


Isn’t it better to do an ensemble? Train three AI factors for short-, medium-, and long-term features. The target could be the same or different, I suppose. Then combine for a single prediction.

For the medium- and long-term factors you can still run weekly, but include in the dataset only stocks that have changed (i.e., have reported).
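
A toy sketch of that kind of horizon ensemble (synthetic data; Ridge stands in for whatever model each AI factor would use, and the feature-to-horizon split is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic panel: 500 stock-weeks, 6 features split across three horizons
X = rng.normal(size=(500, 6))
y = X[:, 0] * 0.5 + X[:, 3] * 0.3 + rng.normal(scale=0.5, size=500)

horizon_slices = {
    "short":  slice(0, 2),   # e.g. technical features
    "medium": slice(2, 4),   # e.g. estimate revisions
    "long":   slice(4, 6),   # e.g. accounting factors
}

# Train one model per horizon on its own feature subset
models = {
    name: Ridge(alpha=1.0).fit(X[:, cols], y)
    for name, cols in horizon_slices.items()
}

# Combine into a single prediction by simple averaging
preds = np.mean(
    [m.predict(X[:, horizon_slices[name]]) for name, m in models.items()],
    axis=0,
)
print(preds.shape)  # (500,)
```

Weighted averaging, or stacking the three predictions with a meta-model, would be natural next steps.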

Let’s sum it up:

  • to reduce noise in the label we need longer time periods
  • now that we have low noise in the label, we need longer periods in the features
  • as we have longer feature periods (which improves the signal/noise ratio as well), we need longer training periods
  • as we now have longer training periods, like 10-plus years, we will tend to have less flexible weights
  • as the weights are less flexible, we can go back to classic Portfolio123 with fixed weights

Sorry, I’m exaggerating a bit :wink:

OK, basically we need to increase the signal-to-noise ratio.

What options do we have?

  • excess return (a cross-sectional measure)
  • daily ranking (also a cross-sectional measure)
  • are there other cross-sectional measures?
  • filters, like the Kalman filter; has somebody tried them?

If one goes in the time-series direction, one might pick up some noise, as market conditions might change.
I would prefer to be able to follow changes in market conditions on a 3-month basis; this should also fit with changes in sector performance.
Or run only with price information: way more samples, and shorter than changes in market conditions.

This stuff is not easy…

My most frustrating experience was the first time I ran an autocorrelation on a price series… nothing, zero. Then I understood what random movement means.

The problem of autocorrelation is a subproblem of the larger issue of forecasting/predicting when samples are not independent. Most of the ‘easy’ or ‘classical’ algorithms assume that independence.

It’s certainly possible to get around this issue – though I don’t think ensembling by itself would get you there. For classical forecasting like ARIMA, differencing is the usual solution. For cross-sectional regression, you’d look into adding a random effect. If you were trying to use XGBoost, you may have to investigate other solutions as well.


I think this is a serious concern that could extend to XGBoost, but XGBoost would not be the worst model. First, it performs feature selection during training, which mitigates the problem to some extent by focusing on the features with the most predictive power.

Regularization also helps mitigate the problem. XGBoost has both L1 (LASSO-style) and L2 (ridge-style) penalties. But I don’t have enough experience to say there is not still a problem or that this solves everything.

@Marco there is concern about independence in the training process and not just in the predictions.

At the end of the day, cross-validation can provide members with individual answers regarding their factor choices and the models they use.

WITH MONOTONIC CONSTRAINTS and cross-sectional data, sklearn’s boosting model does fine (k-fold cross-validation with an embargo, on an easy-to-trade universe). My cross-validation would suggest boosting could be used with some cross-sectional features. No regularization was used here (other than early stopping).

And again, I am not sure an embargo solves all of the autocorrelation problem, but it would be worse without it, I think (embargoed k-fold cross-validation):

[Screenshot: embargoed k-fold cross-validation results, 2024-03-01]



I would be very hesitant to use XGBoost or any other model on time-based data without some allowance for the violation of sample independence. As you suggest, cross-validation will be key here, and probably a CV strategy more robust than simply holding out the same 20% of the data and testing against it 100 times.

I suspect you’d find that your in-sample metrics look great, and the out-of-sample looks awful.


Whycliffes kindly posts a good selection of papers – thanks!

In a lot of them one can find that Random Forest comes out on top.

Is there experience here that other techniques (boosting, etc.) perform better?