Ranking vs machine-learning algorithms

By the way, this is what I understood from Andreas's Substack post:

Architecture
  • Large & Mid-Cap Stocks: ZScore-based only. (1) ZScore + Date (Global) with Skip and Date at the feature level; (2) ZScore + Dataset (Global) with Skip and Dataset at the feature level (considered best for long-term robustness). Rationale: builds relative, dynamic rules that adapt to rapidly changing market structures. "Rank + Date" is too static and loses its robustness in these efficient markets.
  • Small-Cap Stocks: A combination of both. (1) ZScore + Date (Global) with Skip and Date at the feature level; (2) Rank + Date (Global) with Skip and Date at the feature level. Rationale: the two systems are complementary; they select different stocks, increasing the portfolio's capacity and diversification. "Rank + Date" works here because inefficiencies are more stable and can be captured with absolute rules.

Target Variable
  • Large & Mid-Cap Stocks: 9- to 12-month total return / relative return. (Note: 6 months might be fine for mid-caps, but was not tested by the author.) Rationale: longer horizons smooth out short-term noise and align with how institutional capital rotates in more efficient markets.
  • Small-Cap Stocks: 3-month total return / relative return. Rationale: signals decay faster in noisy small-cap markets; shorter horizons are ideal for capturing strong, short-term mispricings.

ML Algorithm
  • Large & Mid-Cap Stocks: ExtraTrees. Rationale: stable, low-variance, and performs well with the "cleaner" data typical of large-cap stocks.
  • Small-Cap Stocks: LightGBM + ExtraTrees (used to provide complementary signals). Rationale: they serve different purposes: LightGBM is an "alpha extractor" (best for concentrated portfolios), while ExtraTrees is a "rank stabilizer" (best for broader portfolios of 30–100+ stocks).

Number of Features
  • Both universes: sweet spot of 87–180.
  • Large & Mid-Cap rationale: a carefully selected set of features is crucial; the author emphasizes building on top of well-tested factors rather than reinventing the wheel.
  • Small-Cap rationale: a broad, curated set of features is necessary to capture the various drivers in the more complex and noisy small-cap universe.

Outlier Limit
  • Large & Mid-Cap Stocks: always 5.
  • Small-Cap Stocks: always 5 (for the ZScore architecture).
  • Rationale (both): the author emphasizes that the default value of 2.5 is too low and will remove too much valuable information.

Retraining
  • Large & Mid-Cap Stocks: less sensitive to retraining. Rationale: the ZScore approach is relative and has a "longer shelf life" across market regimes.
  • Small-Cap Stocks: Rank + Date requires regular retraining (every 12–18 months). Rationale: the absolute, static rules in "Rank + Date" must be refreshed to remain relevant; the ZScore system is more robust over time.
4 Likes

Yuval,

After thinking about this and posting some answers that were probably in the weeds I have a short answer to your question.

For context, some of the things I used to do to weight features in P123 classic's ranks did not really require my domain knowledge, or any knowledge at all, really.

This would include randomizing weights in a spreadsheet and putting those weights into the optimizer. I am not sure how people are doing that nowadays, but this used to be a popular method with P123.

Personally, I am happy to let Python do some or all of that. And sometimes, at least, I carry that to the point of turning P123 classic into a machine learning program. For me, P123 classic is machine learning, particularly when I use a Python program to optimize the weights.

For me the line between P123 classic and machine learning became pretty blurry about 10 years ago when InspectorSector uploaded a spreadsheet that would do some of the randomization of feature weights. His algorithm then copied the features and pasted them into the optimizer. The rank performance test then essentially became a selection function in an evolutionary algorithm. Much of this was done manually at the time. I guess it still is by some.

Maybe it was not yet machine learning at that point in time, due to the need for manual operation and the need for the user to look at the rank weights that performed best and re-enter them into InspectorSector's spreadsheet. But I was happy to turn that over to Python and let Python do it while I slept.
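The loop described above (randomize weights, score them with the rank performance test, keep the winners, mutate, repeat) can be sketched in a few lines. This is my own minimal illustration of an evolutionary weight search, not InspectorSector's actual spreadsheet logic; the `backtest_score` function is a stand-in for P123's rank performance test:

```python
import numpy as np

rng = np.random.default_rng(42)
N_FEATURES = 5

def backtest_score(weights: np.ndarray) -> float:
    # Stand-in for P123's rank performance test. In practice this would run
    # the ranking system and return, e.g., the top bucket's annualized return.
    ideal = np.array([0.4, 0.3, 0.15, 0.1, 0.05])  # hypothetical "true" weights
    return -np.sum((weights - ideal) ** 2)

def normalize(w: np.ndarray) -> np.ndarray:
    # Keep weights non-negative and summing to 1, like rank weights.
    w = np.clip(w, 0, None)
    return w / w.sum()

# 1. Start from a population of random weight vectors (the spreadsheet step).
population = [normalize(rng.random(N_FEATURES)) for _ in range(20)]

for generation in range(50):
    # 2. Score every candidate with the backtest (the selection function).
    scored = sorted(population, key=backtest_score, reverse=True)
    elites = scored[:5]
    # 3. Mutate the best performers to form the next generation.
    population = elites + [
        normalize(e + rng.normal(scale=0.05, size=N_FEATURES))
        for e in elites for _ in range(3)
    ]

best = max(population, key=backtest_score)
print(np.round(best, 3))
```

With the real rank performance test substituted for `backtest_score`, this is the fully automated version of the manual randomize-paste-inspect cycle described above.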

When I do it that way it is machine learning by any definition.

But I am still using P123 classic in the process. And I would never criticize P123 classic, or criticize doing by hand some of the same things I do with Python.

@Pitmaster further automated something similar to InspectorSector’s method here: Genetic algorithm to replace manual optimizaton in P123 classic - #41 by pitmaster . And the original post in that thread was about a Genetic algorithm that also fully automates the selection of weights in the ranking system.

Marco has proposed an entirely different way to automate weighting of the features in ranking system here: AI Factor - Designer Model? - #15 by marco

This is now becoming common-place and common knowledge among many members including Marco: P123 can be made into a machine learning model. The origins of this extending back a decade or so to when InspectorSector shared his spreadsheet.

So, to directly answer your invitation to give arguments for machine learning, I have three:

  • it works
  • it saves me time
  • I can’t see any degradation in performance compared to when I used a spreadsheet or manual trial and error to optimize a ranking system nearly a decade ago.

I think a lot of people are doing something like that and characterize it according to their affinity for the term "machine learning." I don’t care what people call it, as long as I have the tools to optimize the rank weights using Python myself.

And that can be done in a nearly endless number of ways, with or without manual steps in the pipeline. In fact, I have a program doing it for me now, using a method that is new to me but is part of a Python library.

1 Like

Thanks for your thoughts, they're well expressed, and I have no argument with them. For the purposes of this discussion I was curious about the advantages that the Machine Learning models that P123 introduced about a year ago might have over ranking systems in terms of a) the way they choose stocks to buy and sell; and b) how well they can fit into my workflow. I wanted to present the advantages ranking systems had and was hoping to read some advantages that the ML models had as a counterpoint. It wasn't meant to be a discussion about the merits and demerits of machine learning in general, and I'm sorry if it turned into one.

1 Like

Thanks, Yuval. To be completely honest (my interpretation alone), I see many more similarities than differences in our methods.

Admittedly, I have a tendency to see the same math in different methods and I think the math is pretty universal no matter the details of the methods.

I am such a math geek!

Thank you for getting me to think about this and finally articulate it in a somewhat understandable way, and for the great ideas you have presented in the forum over the years. I use a lot of your features in my models, for example.

And perhaps there is still a question to be asked: do we need tabular neural nets, or should we focus more on expanding P123 classic? I think that may be part of your question. If it is, I don’t have a clear answer.

Thank you Yuval.

About five months ago I created a ranking system for buying puts and at the same time I created a machine-learning model for buying puts. I worked hard on both and instituted them as live strategies so I could see what stocks they chose and how they performed.

Now that so much time has elapsed, I can give my out-of-sample performance. According to my calculations, puts on the stocks the ranking system chose have an average gain of 26% and a median gain of 3%, while puts on the stocks the AI system chose have an average loss of 8% and a median loss of 30%. If I had been using these systems to go short, I would have gained 1% on the ranking system and lost 20% on the AI system.

I realize this is entirely anecdotal and not necessarily representative of ranking and AI systems. But I did want to share my experience, which I hope is not similar to yours.
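The gap between the average gain (+26%) and the median gain (+3%) is the usual signature of a skewed payoff: a few large winners pull the mean well above the typical outcome. A toy illustration with entirely hypothetical put returns (chosen only to reproduce that mean/median gap):

```python
import statistics

# Hypothetical put returns: most roughly flat, one large winner.
returns = [0.03, -0.05, 0.02, 0.04, 1.26]

mean = sum(returns) / len(returns)
median = statistics.median(returns)
print(f"mean: {mean:.0%}, median: {median:.0%}")  # mean: 26%, median: 3%
```

This is why reporting both statistics, as above, gives a much better picture of a strategy than either one alone.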

Does anyone else have out-of-sample comparisons between ranking systems and machine-learning systems as live strategies?

4 Likes

Yuval,

To be honest, I am still waiting for some strong out-of-sample evidence from others who have taken the lead with the AI factor before diving in myself.

I am following the judge on X, and he seems to be having a lot of fun with the AI factor; he has posted a lot of favourable out-of-sample and pseudo-out-of-sample performance based on his work. There are some others who also seem to have positive results, judging from their posts on X.

I hope he sees this message and can provide us with some info and answer your question.

Regards

James

1 Like

Hi Yuval, I have two similar strategies (same universes, same goal, not exactly the same factors, to avoid data leakage, but close enough to get an idea). For now the AI is winning, but with very close results, and there is occasionally some overlap of names, though not much. Anyway, it's too early to draw conclusions in my case: just one or two months of real out-of-sample data.

1 Like

@YuvalTaylor, what universe do you use for your short strategies? I would like to try to make an AI short strategy on a tradable universe. Shorting is not really an option for me through my broker, so I have never paid much attention to it.

I deployed two live (real money) AI strategies a little bit more than a year ago, one non-linear and one linear. The non-linear one is rocking, almost 100% up now. The linear model is doing slightly worse than my average classic strategies (still really good). All my paper trading AI strategies are doing okay, on average similar to my classic strategies. However, I have a very high conviction in my AI strategies now, and put a bunch of them live over the past months.

3 Likes

This level of performance is very impressive for stocks. It also confirms the findings of an academic paper, "All That Glitters Is Not Gold," published about a decade ago. Here are the paragraphs that mention this:

As we will show below, linear methods did not show high predictability of individual backtest performance measures on OOS profitability. We next asked if non-linear regression methods trained on the full feature set could do better at predicting OOS Sharpe ratio. Towards this goal, we explored a number of machine learning techniques, including Random Forest and Gradient Boosting, to predict OOS Sharpe. To avoid overfitting, all experiments used a 5-fold cross-validation during hyperparameter optimization and a 20% hold-out set to evaluate performance. In addition to training our own classifiers, we also utilized the DataRobot platform (https://www.datarobot.com) to test a large number of preprocessing, imputation and classifier combinations.

While the classical machine learning algorithm evaluation metrics described above suggest predictive significance for our non-linear classifier, the practical value of our machine learning methodology can only be evaluated by testing its profitability as a portfolio selection instrument. Towards this goal, we formed an equal-weighted portfolio out of 10 strategies with the highest Sharpe ratios as predicted by the Random Forest regressor on the hold-out set and computed their cumulative return (figure 6a) and the resulting Sharpe ratio. In addition, we compare this to 1000 random portfolios (Burns [2006]) of hold-out strategies and a portfolio formed by selecting strategies with the 10 highest IS Sharpe ratios. We find that our ranking by predicted Sharpe portfolio performs better than 99% of randomly selected portfolios with a Sharpe ratio of 1.8 compared to the IS Sharpe ratio selection which proved better than 92.16% of random portfolios with a Sharpe ratio of 0.7. Given the above result of weak predictability of IS Sharpe ratio it is surprising that it still performs reasonably well in this setting, albeit not at a statistically significant threshold compared to the random portfolios. These results do however show significant practical value for non-linear classification techniques compared to more traditional, univariate selection mechanisms when constructing portfolios of trading algorithms.

While the results described above are relevant by themselves, overall, predictability of OOS performance was low (R² < 0.025) suggesting that it is simply not possible to forecast profitability of a trading strategy based on its backtest data. However, we show that machine learning together with careful feature engineering can predict OOS performance far better than any of the individual measures alone. Using these predictions to construct a portfolio of strategies resulted in competitive cumulative OOS returns with a Sharpe ratio of 1.2 that is better than most portfolios constructed by randomly selecting strategies. While it is difficult to extract an intuition about how the Random Forest is deriving predictions, we have provided some indication of which features it deems important. It is interesting to note that among the most important features are those that quantify higher-order moments including skew and tail-behavior of returns (tail-ratio and kurtosis). Together, these results suggest that predictive information can indeed be extracted from a backtest, just not in a linear and univariate way.

Finally, we show that by training non-linear machine learning classifiers on a variety of features that describe backtest behavior, out-of-sample performance can be predicted at a much higher accuracy (R² = 0.17) on hold-out data compared to using linear, univariate features. A portfolio constructed on predictions on hold-out data performed significantly better out-of-sample than one constructed from algorithms with the highest backtest Sharpe ratios.
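The paper's protocol (describe each backtest with features, train a Random Forest to predict OOS Sharpe on a hold-out set, then equal-weight the 10 strategies with the highest predictions) can be sketched with scikit-learn. The data below is synthetic, with a deliberately strong signal so the effect is visible; this illustrates the method only and does not reproduce the paper's numbers:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000  # synthetic "strategies"

# Hypothetical backtest features per strategy, e.g. IS Sharpe, skew,
# kurtosis, tail ratio (the paper highlights the higher-moment features).
X = rng.normal(size=(n, 4))
# Synthetic OOS Sharpe: partly driven by the features, partly noise.
y = 0.5 * X[:, 1] + 0.3 * X[:, 3] + rng.normal(scale=0.5, size=n)

# 20% hold-out, mirroring the paper's evaluation protocol.
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_ho)

# "Portfolio": equal-weight the 10 strategies with the highest predicted Sharpe.
top10 = np.argsort(pred)[-10:]
print("mean OOS Sharpe of selected:", y_ho[top10].mean())
print("mean OOS Sharpe of all hold-out:", y_ho.mean())
```

The comparison to the hold-out average plays the role of the paper's random-portfolio benchmark: the selection only has practical value if it beats picking strategies at random.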

I haven't implemented a short strategy yet so I really don't know what stocks are going to be available. From what I've seen so far, I'm using the following universe rule for backtesting: MedianDailyTot(126) > 20000000 or Inst%Own > 65 or MktCap > 2500. I'll apply that to the North Atlantic universe since my broker allows me to short certain European and Canadian stocks. But that's a very conservative universe, only for backtesting. When I'm actually ready to place orders I'm planning to use a much wider universe, probably MedianDailyTot > 5000000, and just see case by case what stocks have availability for borrowing. If anyone has experience with shorts and sees a flaw in this approach, I'd love to be corrected.
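The universe rule above is a plain boolean filter, so it translates directly to pandas. A small sketch with hypothetical data (column names are my own; units follow P123 conventions, with MedianDailyTot in dollars, Inst%Own in percent, and MktCap in $ millions):

```python
import pandas as pd

# Hypothetical snapshot of candidate stocks.
stocks = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "median_daily_tot": [25_000_000, 1_000_000, 6_000_000, 2_000_000],
    "inst_pct_own": [40, 70, 30, 20],
    "mkt_cap": [500, 300, 200, 3_000],
})

# Conservative backtest universe:
# MedianDailyTot(126) > 20000000 or Inst%Own > 65 or MktCap > 2500.
backtest_univ = stocks[
    (stocks["median_daily_tot"] > 20_000_000)
    | (stocks["inst_pct_own"] > 65)
    | (stocks["mkt_cap"] > 2500)
]

# Wider live universe: MedianDailyTot > 5000000, with borrow availability
# then checked case by case at the broker.
live_univ = stocks[stocks["median_daily_tot"] > 5_000_000]

print(sorted(backtest_univ["ticker"]))  # ['AAA', 'BBB', 'DDD']
print(sorted(live_univ["ticker"]))      # ['AAA', 'CCC']
```

Note that the conservative rule is a union of conditions, so a stock needs to clear only one of the three bars to stay in the backtest universe.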

1 Like

I have even seen penny stocks and small Chinese companies be shortable, so most decent-volume stocks should be shortable, at least on IB. It's hit or miss with each broker, though. I will say there are some strategies out there purposely blowing up crowded short positions 100%+ these days, which complicates things in the short term for low-liquidity issuers.