Overhaul of AI Factor Discussion

Dear All,

We're excited to announce our plans for AI Factor 2.0, an overhaul that will transform our platform into a true MLOps powerhouse for stock-picking ML models. Our goal is to address the current version's inflexibility and streamline the process. We're seeking your feedback on the high-level changes outlined below.

Current version: The Monolithic Problem

The current version of AI Factor, while functional, has proven to be inflexible and cumbersome. Its monolithic structure—where a single "AI Factor" component contains the dataset, normalizations, model experiments (validations), and predictors—results in a complex workflow with too many steps and missing key functionality.

AI Factor 2.0 Solution

We are breaking this monolith into independent, modular components. The changes are below, organized by topic.

Dataset

We will be creating a standalone “Dataset” component. It will have the following new properties:

  • Datasets will contain raw values, so you will be able to run different experiments without having to continually reload.
  • Datasets will be able to update automatically with the latest data, so you don't have to reload a dataset to re-train your production model.
  • You will be able to choose which features of your dataset are used for a particular experiment or operation.
  • You will be able to specify default normalizations for your features (stored as defaults, not precomputed).
  • Datasets will be usable by different components, like:
    • Experiments (a.k.a. validations for model tuning)
    • Feature engineering
    • Inference
  • We will support easy imports of external datasets, like a Snowflake Dataset.
  • You will be able to merge internal and external datasets.
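
As a rough sketch of how such a modular Dataset might be used (the class and method names below are purely illustrative, not the actual AI Factor 2.0 API):

```python
# Hypothetical sketch of the modular Dataset workflow described above.
# Names (Dataset, set_default_normalization, select_features) are made up.

class Dataset:
    """Holds raw feature values; normalizations are declared, not precomputed."""

    def __init__(self, features: dict[str, list[float]]):
        self.features = features                  # raw values, loaded once
        self.default_norms: dict[str, str] = {}   # metadata only

    def set_default_normalization(self, feature: str, method: str) -> None:
        # Stored as a default -- applied later by an Experiment, not here.
        self.default_norms[feature] = method

    def select_features(self, names: list[str]) -> "Dataset":
        # Reuse the same raw data for a different experiment, no reload needed.
        return Dataset({k: v for k, v in self.features.items() if k in names})


ds = Dataset({"pe": [12.0, 8.5], "momentum": [0.1, -0.2], "beta": [1.1, 0.9]})
ds.set_default_normalization("pe", "zscore")
subset = ds.select_features(["pe", "momentum"])
print(sorted(subset.features))  # ['momentum', 'pe']
```

The key design point is that feature selection returns a cheap view over the already-loaded raw values, so experiments never trigger a reload.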

Feature Engineering

The main approach to feature engineering will be purpose-built Python apps. We will also work with app developers to create an app marketplace. This will become clearer once we launch our first app, which is currently being tested.

Experiments (Model Tuning & Validation)

This component, which incorporates the former Validation and Result sections, will reference an existing Dataset and utilize custom normalizations.

Some additional features will also be added like:

  • A new "fold-level" normalization, which preserves data integrity by using only past data windows, preventing the future-data leakage that is possible with the existing "dataset-level" normalizations.
  • WandB integration. If you supply your WandB API key, we will log interim results.
  • Additional statistics, like accuracy and accuracy trend.
  • Advanced model introspection and explainability features (e.g., SHAP values) will be integrated.
  • More efficient hyperparameter search, similar to WandB "Sweeps", which support random and Bayesian search. Here's a quote I found in a post: "I've replaced entire weeks of manual hyperparameter tuning with W&B sweeps. It's legitimately one of the best features."
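
To make the fold-level idea concrete, here is a minimal NumPy sketch of normalizing each validation fold using only statistics from the data that precedes it (toy data; the actual AI Factor implementation may differ):

```python
import numpy as np

x = np.arange(20, dtype=float)   # toy time-ordered feature
n_folds, fold_size = 4, 5

scaled_folds = []
for i in range(1, n_folds):
    past = x[: i * fold_size]                     # only data before this fold
    test = x[i * fold_size : (i + 1) * fold_size]
    mu, sd = past.mean(), past.std()              # fold-level stats, no lookahead
    scaled_folds.append((test - mu) / sd)         # future data never leaks in

print(len(scaled_folds))  # 3 folds, each scaled by its own past window
```

A dataset-level normalization would instead compute `mu` and `sd` over all of `x`, silently letting each fold see statistics from its own future.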

Inference (Predictor Models)

Predictors (Models) will be decoupled from the core AI Factor, allowing for independent management, versioning, and deployment. In addition, Predictors will support automatic retraining upon Dataset updates.

Deployment

We will strive to make model deployment and tracking as easy as 1-2-3. For example:

  • You will be able to choose either a Ranking System or a Predictor. This way you don’t have to create Ranking System wrappers for each Predictor.
  • You will be able to launch live strategies with auto-retrained Predictors. Tracking many models out-of-sample will therefore require zero effort.

Conclusion

In hindsight, some of the decisions we made were truly strange. Fixing them will be a lot of work, of course, but we think it's worthwhile. Let us know your pain points.

Thank you.

Cheers

14 Likes

This all sounds very exciting and it will be interesting to see it all in action down the line. Sounds like a good plan.

Just wondering though, will our existing AI factors break? Mainly wondering if it makes sense to keep building and testing AI factors under the current system, or if the overhaul will change things so much that it’s better to hold off for now.

3 Likes

A big chunk of AUM is already invested / trading via AI Factor, so of course not :blush:

One of my issues with the current way is how different one run can be from another, even with the exact same settings, if the dataset is reloaded.

  1. Having the dataset be independent from the runs should help reduce some of that randomness too, I believe, which should let users better tell whether an improved run is (purely) chance or not. Only one way to find out, though. From the post description, it seems we should be able to remove features and retest without reloading a dataset. Would adding new features require an entirely new dataset, or could we add the new features' data without modifying the existing dataset? If so, it should be an option!
  2. One item on my wishlist would be the ability to set a number of runs: for example, tell the UI to run 10 trainings with the same dataset, features, and parameters to see, all at once, what role randomness plays. This would be a good way to get extra revenue for P123 too. A tab would then be generated with average and median statistics along with each result, and the number of runs could be set by the user. This tab/section could also be repurposed to show side-by-side comparisons of return (and downside risk) when features are changed.
  3. Rolling dataset normalization: relative to recent history rather than all history in the set. For example, a 12-month lookback.
  4. The returns-order (instead of amount) LightGBM objective function would be great too. I think it is called the lambdarank / rank_xendcg objective. If I had to choose one of these points, it would be this one, as I am very curious about what it can do.
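
On point 4, here is a small sketch of the prerequisite step: turning weekly returns into the integer relevance labels that LightGBM's ranking objectives expect. The params dict uses real LightGBM option names; whether and how P123 would expose them is an open question.

```python
import numpy as np

weekly_returns = np.array([0.03, -0.01, 0.08, 0.00, -0.04])

# Rank stocks within the week and use the rank as an integer relevance
# grade (higher = better return) -- lambdarank cares about order only,
# not the magnitude of the return.
labels = weekly_returns.argsort().argsort().astype(int)

# Real LightGBM option names for a ranking objective (illustrative config;
# training would also require per-week "group" sizes).
params = {
    "objective": "lambdarank",   # or "rank_xendcg"
    "metric": "ndcg",
}
print(labels.tolist())  # [3, 1, 4, 2, 0]
```

The double `argsort` converts each return into its within-week rank, which is exactly the kind of order-only target the post is asking about.
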
1 Like

sounds awesome… thanks for the continued effort

Sounds good. I appreciate the continued work on AI. Although I'm new to it (6 months), I can appreciate the benefits of dataset modularity. It's been clunky, IMO.
While you're refactoring, would you be open to modifying the simulation test period to beyond 5 years in duration, and beyond the current 5 years in arrears? For example, I'd love to test a Predictor in different market regimes (2007-2010 GFC; 2019-2021 Covid downturn and subsequent turnaround).
One other wishlist item - if it makes sense - any chance you could create Targets that focus on results correlations? For example, Spearman Target of .2500 or greater? This would help identify important Features from a completely different perspective than the current set of Targets.

I added a section in first post about Feature Engineering which was not mentioned (it's probably the most important step :face_with_peeking_eye: )

2 Likes

It is

1 Like

Will any of this cross over to non-AI users? Particularly the "dataset" part? There's some stuff there that would be great for good old-fashioned ranking systems.

1 Like

I want to second points made by @SZ and @RJSchierm

  1. What @SZ mentioned: I am also comparatively new to the whole topic, but it took me a while to understand how much randomness and variance is still left in most models, even after fine-tuning. I underestimated it. My first naive conclusion was that maybe we need a way to use seeds to "fix" our best models and use them as predictors OOS. I now understand that this is not how ML is supposed to work. If retraining the same model leads to extreme variance in validation results, I should probably go back to feature and target engineering and hyperparameter tuning. And even when you reach a satisfactory point, there will still be a lot of randomness left, and the best practice is to work with multiple clones (and maybe even slight variations or diversification) of models. However, we currently have no "nice" way to judge the stability of models. It would be nice if the results table had a toggle to aggregate stats across clones of the same model to judge stability, or similar.

  2. What @RJSchierm said. I am not sure I totally understand it yet, but as far as I know there are ways to tweak the applied loss function toward certain goals. I think we currently only optimize RMSE, which is a nice general default but maybe suboptimal for some tasks. If I want a really tight long-only strategy, or want to specifically optimize for turnover or simple ordering (Spearman correlation), the default might be the wrong fit. Way too often in my case, validations look great but don't survive simulation with real-world trading implications (e.g., slippage). IMO we need other loss-function options. You can't tweak this only via targets, features, and hyperparameters if the training goal is not aligned.
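
For point 2, a Spearman-style rank metric is easy to sketch in NumPy; wiring it into a trainer (e.g., as a custom LightGBM `feval`) would be an extra step, and nothing here reflects an actual P123 feature:

```python
import numpy as np

def spearman(preds: np.ndarray, target: np.ndarray) -> float:
    """Spearman rho = Pearson correlation of the ranks (no tie handling)."""
    r1 = preds.argsort().argsort().astype(float)
    r2 = target.argsort().argsort().astype(float)
    return float(np.corrcoef(r1, r2)[0, 1])

y_true = np.array([0.05, 0.02, -0.01, 0.08])
y_pred = np.array([0.60, 0.40, 0.10, 0.90])  # same ordering as y_true

rho = spearman(y_pred, y_true)  # perfectly rank-aligned, so rho is 1
```

An RMSE-optimized model could score poorly on this even while predicting magnitudes well, which is exactly why an order-based evaluation (or objective) changes what "good" means.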

2 Likes

On LambdaRank: not the best of papers, but some promising info nonetheless:

2 Likes

I think this discussion is very pertinent, and I’ve run into a closely related issue myself.

I too have often seen cases where validations look great but the simulation looks worse, and I immediately attribute the difference to "transaction costs." That certainly can be a real contributor, no argument there. But I think there's an important missing piece in how we interpret this gap.

As you’ve pointed out before (paraphrasing): even under the best conditions, there is still a lot of randomness left. I think this applies not only to validations, but to simulations as well.

The simulation feels exact because it is deterministic — especially when the ranking system is cached — while cross-validation is visibly stochastic. That creates a false sense of precision. In reality, the “true” sim outcome should probably be thought of as lying within a confidence interval (in frequentist terms), not as a single exact number.

So when a sim underperforms validation, the difference we’re seeing may be:

  • transaction costs,

  • randomness / variance,

  • or some combination of both.

Without estimating the variance of the simulation itself, we can’t really disentangle those effects.

This ties back to another point you made that I think is crucial: best practice is to work with multiple clones (and perhaps small variations) of a model, because individual runs can be misleading. I agree that there should be a systematic way to do this for both validations, and importantly, Sims.

Before attributing differences to slippage or turnover, we should first ask how much of what we’re seeing is simply noise.
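
One way to ask that question is a simple bootstrap over the sim's weekly returns. The sketch below uses toy data and makes no claim about how P123 computes sim statistics; it just shows how a confidence interval, rather than a single number, could frame the result:

```python
import numpy as np

rng = np.random.default_rng(0)
weekly = rng.normal(0.002, 0.02, size=260)   # ~5y of toy weekly sim returns

# Resample the weekly returns with replacement to estimate how much the
# mean return could vary from sampling noise alone.
boot_means = np.array([
    rng.choice(weekly, size=weekly.size, replace=True).mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])

# The sim's deterministic point estimate sits somewhere inside this band;
# only differences larger than the band's width need a causal explanation.
print(lo < weekly.mean() < hi)
```

If the validation-vs-sim gap falls inside an interval like this, "noise" is a sufficient explanation before slippage or turnover ever enter the picture.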

2 Likes

And also, maybe a setting to add the median bid/ask to the buy price and subtract it from the sale price for training returns? I know some have just created custom targets, but I'm talking about an easy setting for new users here.

Did people manage to do this within target customization? I failed to do so because the target "does not know and does not care if last week's stocks are in the same bucket as this week's." So any target customization in the direction of slippage adjustments would treat it like a complete weekly sell and re-buy. That would just overly punish illiquid names and not help your final judgment of simulations.
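
A toy illustration of why that naive per-week adjustment overpunishes: it charges the spread on every position every week, while a holdings-aware adjustment would charge only on actual turnover (numbers and tickers below are made up):

```python
half_spread = 0.002                      # toy half bid/ask spread per trade
held_last_week = {"AAA", "BBB", "CCC"}
held_this_week = {"AAA", "BBB", "DDD"}

# Naive target adjustment: treat every week as a full sell-and-rebuy,
# so every held name pays the spread twice.
naive_cost = half_spread * 2 * len(held_this_week)

# Holdings-aware: only CCC is sold and DDD bought; AAA and BBB just stay.
trades = held_last_week ^ held_this_week             # symmetric difference
aware_cost = half_spread * len(trades)

print(naive_cost > aware_cost)  # the naive version always overcharges
```

With low-turnover strategies and illiquid names the gap between the two grows, which is exactly the distortion described above.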