Overhaul of AI Factor Discussion

Dear All,

We're excited to announce our plans for AI Factor 2.0, an overhaul that will transform our platform into a true MLOps powerhouse for stock-picking ML models. Our goal is to address the current version's inflexibility and streamline the process. We're seeking your feedback on the high-level changes outlined below.

Current version: The Monolithic Problem

The current version of AI Factor, while functional, has proven to be inflexible and cumbersome. Its monolithic structure—where a single "AI Factor" component contains the dataset, normalizations, model experiments (validations), and predictors—results in a complex workflow with too many steps and missing key functionality.

AI Factor 2.0 Solution

We are breaking this monolith into independent, modular components. The changes are below, organized by topic.

Dataset

We will be creating a standalone “Dataset” component. It will have the following new properties:

  • Datasets will contain raw values, so you will be able to run different experiments without having to continually reload.
  • Datasets will be able to update automatically with the latest data, so you don't have to reload a dataset to re-train your production model.
  • You will be able to choose which features of your dataset are used for a particular experiment or operation.
  • You will be able to specify default normalizations for your features (stored as defaults, not precomputed).
  • Datasets will be usable by different components, like:
    • Experiments (a.k.a. validations for model tuning)
    • Feature engineering
    • Inference
  • We will support easy imports of external datasets, like a Snowflake Dataset.
  • You will be able to merge internal and external datasets.
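
As a rough sketch of how such a modular Dataset might be used (the class and method names below are purely illustrative, not the actual AI Factor 2.0 API):

```python
# Hypothetical sketch of the modular Dataset workflow described above.
# Names (Dataset, set_default_normalization, select_features) are made up.

class Dataset:
    """Holds raw feature values; normalizations are declared, not precomputed."""

    def __init__(self, features: dict[str, list[float]]):
        self.features = features                  # raw values, loaded once
        self.default_norms: dict[str, str] = {}   # metadata only

    def set_default_normalization(self, feature: str, method: str) -> None:
        # Stored as a default -- applied later by an Experiment, not here.
        self.default_norms[feature] = method

    def select_features(self, names: list[str]) -> "Dataset":
        # Reuse the same raw data for a different experiment, no reload needed.
        return Dataset({k: v for k, v in self.features.items() if k in names})


ds = Dataset({"pe": [12.0, 8.5], "momentum": [0.1, -0.2], "beta": [1.1, 0.9]})
ds.set_default_normalization("pe", "zscore")
subset = ds.select_features(["pe", "momentum"])
print(sorted(subset.features))  # ['momentum', 'pe']
```

The key design point is that feature selection returns a cheap view over the already-loaded raw values, so experiments never trigger a reload.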

Feature Engineering

The main approach to feature engineering will be purpose-built Python apps. We will also work with app developers to create an app marketplace. This will become clearer once we launch our first app, which is currently being tested.

Experiments (Model Tuning & Validation)

This component, which incorporates the former Validation and Result sections, will reference an existing Dataset and utilize custom normalizations.

Some additional features will also be added like:

  • A new "fold-level" normalization, which preserves data integrity by using only past data windows, preventing the future-data leakage that is possible with the existing "dataset-level" normalizations.
  • WandB integration. If you supply your WandB API key, we will log interim results.
  • Additional statistics, like accuracy and accuracy trend.
  • Advanced model introspection and explainability features (e.g., SHAP values) will be integrated.
  • More efficient hyperparameter search, similar to WandB "Sweeps", which support random and Bayesian search. Here's a quote I found in a post: "I've replaced entire weeks of manual hyperparameter tuning with W&B sweeps. It's legitimately one of the best features."
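
To make the fold-level idea concrete, here is a minimal NumPy sketch of normalizing each validation fold using only statistics from the data that precedes it (toy data; the actual AI Factor implementation may differ):

```python
import numpy as np

x = np.arange(20, dtype=float)   # toy time-ordered feature
n_folds, fold_size = 4, 5

scaled_folds = []
for i in range(1, n_folds):
    past = x[: i * fold_size]                     # only data before this fold
    test = x[i * fold_size : (i + 1) * fold_size]
    mu, sd = past.mean(), past.std()              # fold-level stats, no lookahead
    scaled_folds.append((test - mu) / sd)         # future data never leaks in

print(len(scaled_folds))  # 3 folds, each scaled by its own past window
```

A dataset-level normalization would instead compute `mu` and `sd` over all of `x`, silently letting each fold see statistics from its own future.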

Inference (Predictor Models)

Predictors (Models) will be decoupled from the core AI Factor, allowing for independent management, versioning, and deployment. In addition, Predictors will support automatic retraining upon Dataset updates.

Deployment

We will strive to make model deployment and tracking as easy as 1-2-3. For example:

  • You will be able to choose either a Ranking System or a Predictor. This way you don’t have to create Ranking System wrappers for each Predictor.
  • You will be able to launch live strategies with auto-retrained Predictors. Tracking many models out-of-sample will therefore require zero effort.

Conclusion

In hindsight, some of the decisions we made were truly strange. Fixing them will be a lot of work, of course, but we think it's worthwhile. Let us know your pain points.

Thank you.

Cheers

14 Likes

This all sounds very exciting and it will be interesting to see it all in action down the line. Sounds like a good plan.

Just wondering though, will our existing AI factors break? Mainly wondering if it makes sense to keep building and testing AI factors under the current system, or if the overhaul will change things so much that it’s better to hold off for now.

3 Likes

A big chunk of AUM is already invested / trading via AI Factor, so of course not :blush:

One of my issues with the current way is how different one run can be from another, even with the exact same settings, if the dataset is reloaded.

  1. Having the dataset be independent from the runs should help reduce some of that randomness too, I believe, which should let users better tell whether an improved run is (purely) chance or not. Only one way to find out, though. From the post description, it seems we should be able to remove features and retest without reloading a dataset. Would adding new features require an entirely new dataset, or could we add the new features' data without modifying the existing dataset? If so, it should be an option!
  2. One item on my wishlist would be the ability to set a number of runs: for example, tell the UI to run 10 trainings with the same dataset, features, and parameters to see, all at once, what role randomness plays. This would be a good way to get extra revenue for P123 too. A tab would then be generated with average and median statistics along with each result, and the number of runs could be set by the user. This tab/section could also be repurposed to show side-by-side comparisons of return (and downside risk) when features are changed.
  3. Rolling dataset normalization: relative to recent history rather than all history in the set. For example, a 12-month lookback.
  4. The returns-order (instead of amount) LightGBM objective function would be great too. I think it is called the lambdarank / rank_xendcg objective. If I had to choose one of these points, it would be this one, as I am very curious about what it can do.
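
On point 4, here is a small sketch of the prerequisite step: turning weekly returns into the integer relevance labels that LightGBM's ranking objectives expect. The params dict uses real LightGBM option names; whether and how P123 would expose them is an open question.

```python
import numpy as np

weekly_returns = np.array([0.03, -0.01, 0.08, 0.00, -0.04])

# Rank stocks within the week and use the rank as an integer relevance
# grade (higher = better return) -- lambdarank cares about order only,
# not the magnitude of the return.
labels = weekly_returns.argsort().argsort().astype(int)

# Real LightGBM option names for a ranking objective (illustrative config;
# training would also require per-week "group" sizes).
params = {
    "objective": "lambdarank",   # or "rank_xendcg"
    "metric": "ndcg",
}
print(labels.tolist())  # [3, 1, 4, 2, 0]
```

The double `argsort` converts each return into its within-week rank, which is exactly the kind of order-only target the post is asking about.
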
1 Like

sounds awesome… thanks for the continued effort

Sounds good. I appreciate the continued work on AI. Although I'm new to it (6 months), I can appreciate the benefits of dataset modularity. It's been clunky, IMO.
While you're refactoring, would you be open to modifying the simulation test period to beyond 5 years in duration, and beyond the current 5 years in arrears? For example, I'd love to test a Predictor in different market regimes (2007-2010 GFC; 2019-2021 Covid downturn and subsequent turnaround).
One other wishlist item - if it makes sense - any chance you could create Targets that focus on results correlations? For example, Spearman Target of .2500 or greater? This would help identify important Features from a completely different perspective than the current set of Targets.

I added a section in first post about Feature Engineering which was not mentioned (it's probably the most important step :face_with_peeking_eye: )

2 Likes

It is

1 Like

Will any of this cross over to non-AI users? Particularly the "dataset" part? There's some stuff there that would be great for good old-fashioned ranking systems.

1 Like

I want to second points made by @SZ and @RJSchierm

  1. What @SZ mentioned: I am also comparatively new to the whole topic, but it took me a while to understand how much randomness and variance is still left in most models, even after fine-tuning. I underestimated it. My first naive conclusion was that maybe we need a way to use seeds to "fix" our best models and use them as predictors OOS. I now understand that this is not how ML is supposed to work. If retraining the same model leads to extreme variance in validation results, I should probably go back to feature and target engineering and hyperparameter tuning. And even when you reach a satisfactory point, there will still be a lot of randomness left, and the best practice is to work with multiple clones (and maybe even slight variations or diversification) of models. However, we currently have no "nice" way to judge the stability of models. It would be nice if the results table had a toggle to aggregate stats across clones of the same model to judge stability, or similar.

  2. What @RJSchierm said. I am not sure I totally understand it yet, but as far as I know there are ways to tweak the applied loss function toward certain goals. I think we currently only optimize RMSE, which is a nice general default but maybe suboptimal for some tasks. If I want a really tight long-only strategy, or want to specifically optimize for turnover or simple ordering (Spearman correlation), the default might be the wrong fit. Way too often in my case, validations look great but don't survive simulation with real-world trading implications (e.g., slippage). IMO we need other loss-function options. You can't tweak this only via targets, features, and hyperparameters if the training goal is not aligned.
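
For point 2, a Spearman-style rank metric is easy to sketch in NumPy; wiring it into a trainer (e.g., as a custom LightGBM `feval`) would be an extra step, and nothing here reflects an actual P123 feature:

```python
import numpy as np

def spearman(preds: np.ndarray, target: np.ndarray) -> float:
    """Spearman rho = Pearson correlation of the ranks (no tie handling)."""
    r1 = preds.argsort().argsort().astype(float)
    r2 = target.argsort().argsort().astype(float)
    return float(np.corrcoef(r1, r2)[0, 1])

y_true = np.array([0.05, 0.02, -0.01, 0.08])
y_pred = np.array([0.60, 0.40, 0.10, 0.90])  # same ordering as y_true

rho = spearman(y_pred, y_true)  # perfectly rank-aligned, so rho is 1
```

An RMSE-optimized model could score poorly on this even while predicting magnitudes well, which is exactly why an order-based evaluation (or objective) changes what "good" means.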

2 Likes

On LambdaRank: not the best of papers, but some promising info nonetheless:

2 Likes

I think this discussion is very pertinent, and I’ve run into a closely related issue myself.

I too have often seen cases where validations look great but the simulation looks worse, and I immediately attribute the difference to "transaction costs." That certainly can be a real contributor, no argument there. But I think there's an important missing piece in how we interpret this gap.

As you’ve pointed out before (paraphrasing): even under the best conditions, there is still a lot of randomness left. I think this applies not only to validations, but to simulations as well.

The simulation feels exact because it is deterministic — especially when the ranking system is cached — while cross-validation is visibly stochastic. That creates a false sense of precision. In reality, the “true” sim outcome should probably be thought of as lying within a confidence interval (in frequentist terms), not as a single exact number.

So when a sim underperforms validation, the difference we’re seeing may be:

  • transaction costs,

  • randomness / variance,

  • or some combination of both.

Without estimating the variance of the simulation itself, we can’t really disentangle those effects.

This ties back to another point you made that I think is crucial: best practice is to work with multiple clones (and perhaps small variations) of a model, because individual runs can be misleading. I agree that there should be a systematic way to do this for both validations, and importantly, Sims.

Before attributing differences to slippage or turnover, we should first ask how much of what we’re seeing is simply noise.
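
One way to ask that question is a simple bootstrap over the sim's weekly returns. The sketch below uses toy data and makes no claim about how P123 computes sim statistics; it just shows how a confidence interval, rather than a single number, could frame the result:

```python
import numpy as np

rng = np.random.default_rng(0)
weekly = rng.normal(0.002, 0.02, size=260)   # ~5y of toy weekly sim returns

# Resample the weekly returns with replacement to estimate how much the
# mean return could vary from sampling noise alone.
boot_means = np.array([
    rng.choice(weekly, size=weekly.size, replace=True).mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])

# The sim's deterministic point estimate sits somewhere inside this band;
# only differences larger than the band's width need a causal explanation.
print(lo < weekly.mean() < hi)
```

If the validation-vs-sim gap falls inside an interval like this, "noise" is a sufficient explanation before slippage or turnover ever enter the picture.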

2 Likes

And also, maybe a setting to add the median bid/ask to the buy price and subtract it from the sale price for training returns? I know some have just created custom targets, but I'm talking about an easy setting for new users here.

Did people manage to do this within target customization? I failed to do so because the target "does not know and does not care if last week's stocks are in the same bucket as this week's." So any target customization in the direction of slippage adjustments would treat it like a complete weekly sell and re-buy. That would just overly punish illiquid names and not help your final judgment of simulations.
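
A toy illustration of why that naive per-week adjustment overpunishes: it charges the spread on every position every week, while a holdings-aware adjustment would charge only on actual turnover (numbers and tickers below are made up):

```python
half_spread = 0.002                      # toy half bid/ask spread per trade
held_last_week = {"AAA", "BBB", "CCC"}
held_this_week = {"AAA", "BBB", "DDD"}

# Naive target adjustment: treat every week as a full sell-and-rebuy,
# so every held name pays the spread twice.
naive_cost = half_spread * 2 * len(held_this_week)

# Holdings-aware: only CCC is sold and DDD bought; AAA and BBB just stay.
trades = held_last_week ^ held_this_week             # symmetric difference
aware_cost = half_spread * len(trades)

print(naive_cost > aware_cost)  # the naive version always overcharges
```

With low-turnover strategies and illiquid names the gap between the two grows, which is exactly the distortion described above.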