Hi everyone,
I’m currently using a Python script (with LightGBM and CMA-ES) to optimize factor weights for a ranking system. I have a total budget of roughly 3,000 simulations for each test.
I am splitting my data into 3 distinct sub-universes, and I am debating between two different architectural approaches. I would love to hear any thoughts on which approach you think would yield better out-of-sample (OOS) performance.
Here are the two methods I am considering:
Method A: "Specialized Ensembles" (Independent Optimization)
- Process: I run 3 separate optimization jobs, one for each sub-universe.
- Budget: Each job gets 1,000 simulations to find the best fit for that specific universe.
- Result: I end up with 3 distinct ranking systems, each highly tuned to its specific sub-universe.
- Execution: I then combine these into a final strategy (e.g., an ensemble approach where I average the ranks).
- Hypothesis: This maximizes the "fit" for each specific sub-universe.
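As a concrete sketch of the ensemble step in Method A: once each of the three specialized systems produces a rank for every candidate stock, the combined signal can simply be the mean rank per stock. The rank values below are made up for illustration; this is not P123 output.

```python
import numpy as np

# Each column holds the percentile ranks that one specialized system
# assigns to the same set of candidate stocks (made-up numbers).
rank_matrix = np.array([
    [90.0, 70.0, 80.0],   # stock 1, ranked by systems A, B, C
    [50.0, 95.0, 60.0],   # stock 2
])

def ensemble_ranks(rank_matrix):
    """Average the systems' ranks row-wise to get one combined rank per stock."""
    return rank_matrix.mean(axis=1)

combined = ensemble_ranks(rank_matrix)  # stock 1 -> 80.0, stock 2 -> ~68.3
```

Equal weighting is the simplest choice; unequal weights per system would just be `np.average(rank_matrix, axis=1, weights=...)`.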
Method B: "Joint Robustness" (Combined Optimization)
- Process: I run a single optimization job that targets all 3 sub-universes simultaneously.
- Budget: The optimizer runs ~1,000 iterations, but each iteration simulates the current factor weights against all three sub-universes (so the same ~3,000-simulation budget).
- Objective Function: The fitness score is calculated as the average performance (CAGR) across the three universes in each simulation run.
- Result: I end up with one single ranking system that works "moderately well" across all three universes.
- Hypothesis: This acts as a form of cross-validation during training. Factors that work in sub-universe A but fail in sub-universe B are penalized and discarded. The backtest stats are lower, but the logic should theoretically be more robust and fundamental.
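Method B's objective can be sketched as a single fitness function handed to CMA-ES. `simulate_cagr` below is a toy stand-in (a quadratic with a different optimum per universe) since the real backtester isn't shown; note that CMA-ES minimizes, so the average CAGR is negated.

```python
import numpy as np

# Toy stand-in for the real backtest: pretend each sub-universe rewards
# factor weights near a different target value. Purely illustrative.
TARGETS = {"A": 0.2, "B": 0.5, "C": 0.8}

def simulate_cagr(weights, universe):
    return -float(np.sum((weights - TARGETS[universe]) ** 2))

def joint_fitness(weights):
    # One "iteration" = three simulations, one per sub-universe.
    # CMA-ES minimizes, so return the negated average CAGR.
    cagrs = [simulate_cagr(weights, u) for u in ("A", "B", "C")]
    return -float(np.mean(cagrs))

# A balanced weight vector scores better (lower) than an extreme one here.
fit_balanced = joint_fitness(np.full(4, 0.5))
fit_extreme = joint_fitness(np.full(4, 5.0))
```

In practice `joint_fitness` would be passed to the optimizer, e.g. `cma.fmin2(joint_fitness, x0, sigma0)` from the pycma package, with the iteration count capped so the three-simulations-per-iteration budget stays near 3,000 total.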
My Question:
Given a fixed simulation budget, which approach do you think would be more reliable for live trading? A or B?
Thanks in advance for any insights!
I think that approach A would give you the best estimate of out-of-sample performance, as it closely matches traditional ML cross-validation approaches. Whatever optimization method you use should include some kind of regularization to avoid overfitting. Once you use approach A to validate the performance, you could then use the same training method on the entire universe, and that ends up being your final ranking system.
In fact, as you mention, approach B could essentially serve as a regularization technique, so you could nest approach B inside of approach A, although I don’t know how you would achieve this in P123. In this way you would split each of your 3 sub-universes into another 3 sub-universes, and use approach B on those to optimize the ranking system. This mimics nested K-fold cross-validation, where each outer training fold is split into a further inner set of folds for hyperparameter optimization, etc.
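In skeleton form, nesting approach B inside approach A might look like the loop below. `split_into_three` and `optimize_joint` are placeholders for the real P123/CMA-ES machinery, not actual APIs:

```python
# Hypothetical skeleton of nesting approach B inside approach A.
def split_into_three(universe):
    """Placeholder: carve one sub-universe into three inner splits."""
    return [f"{universe}-{i}" for i in (1, 2, 3)]

def optimize_joint(splits):
    """Placeholder for approach B: one joint optimization across the splits."""
    return {"trained_on": splits}

sub_universes = ["U1", "U2", "U3"]
models = {}
for u in sub_universes:                # outer loop: approach A, one model per sub-universe
    inner = split_into_three(u)        # inner splits, as in nested K-fold CV
    models[u] = optimize_joint(inner)  # inner loop: approach B acts as the regularizer
```

Each outer model is then validated on data the inner optimization never saw, which is what makes the outer score an honest OOS estimate.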
1 Like
Do option A. But add a little extra something. Find the ranking systems with the best median results across all three universes and average those in too, giving them equal weight to, or a little more than, the winners of the three universes. Keep the universes separate, as option A describes, but if you keep track of all the scores, you can find a ranking system or two that works pretty well on all three. Also remember that statistical ties will be common with this many iterations, so don't discard a ranking system just because it's close. Lastly, don't overoptimize.
In practice, small differences between ranking systems are going to be irrelevant to the out-of-sample outcome.
2 Likes
Thank you for your replies, eadains and yuvaltaylor.
What would be the simplest way to merge these three different ranking systems?
- Create three composite folders and give them 33% weight each? This is very simple, but it would likely put several identical criteria in each of the three composite folders, and with over 100 stock criteria per folder (over 300 in total) the simulation often gets a bit slow.
- I have sometimes asked an AI (Gemini or Claude) to go through the three systems: normalize the weights in each of them first, then sum identical criteria together, and keep all other criteria as they originally were in ranking system A, B, or C. You end up with one system, but I notice that the AI sometimes takes liberties and forgets criteria, so it takes several rounds back and forth before I am sure it has done the task I asked for.
How would you solve this?
1 Like
Simply take the node weights and average them. If some have 0 weights in one system and substantial weights in another, include the 0 in the average.
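For example (node names and weights below are made up), averaging with absent nodes counted as 0:

```python
# Illustrative merge: average each node's weight across three systems,
# treating a node that is absent from a system as having weight 0.
systems = [
    {"ROE": 40.0, "P/E": 60.0},
    {"ROE": 20.0, "Momentum": 80.0},
    {"P/E": 50.0, "Momentum": 50.0},
]

all_nodes = {node for s in systems for node in s}
merged = {node: sum(s.get(node, 0.0) for s in systems) / len(systems)
          for node in all_nodes}
# Since every input system summed to 100%, the merged weights also sum to 100%.
```

This avoids the triple-size composite-folder problem entirely: the merged system has one entry per unique criterion.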
Why do you use the average instead of summing them together? I would think that when two identical criteria appear in different ranking systems, it signals that this is a strong criterion, so summing the weights of the two criteria would be better. Or?
It's the same. Let's say you only have 5 nodes in your ranking systems. In one universe the optimal combo is 0%, 20%, 40%, 20%, 20%, and in the other it's 10%, 10%, 20%, 50%, 10%. Whether you add them together or average them, you'll still get 5%, 15%, 30%, 35%, 15%, because the weights have to be rescaled to sum to 100% at the end.
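A quick check of those numbers in plain Python, using the same two weight vectors:

```python
# Summing vs. averaging node weights gives identical ranking systems
# once the weights are rescaled to total 100%.
def normalize(weights):
    total = sum(weights)
    return [w * 100 / total for w in weights]

a = [0, 20, 40, 20, 20]    # optimal weights in universe 1
b = [10, 10, 20, 50, 10]   # optimal weights in universe 2

averaged = normalize([(x + y) / 2 for x, y in zip(a, b)])
summed = normalize([x + y for x, y in zip(a, b)])
# both come out to [5.0, 15.0, 30.0, 35.0, 15.0]
```

The sum is just the average multiplied by the number of systems, and that constant factor disappears in the normalization.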
1 Like