AI Factors Initial Tests-Randomness

I have run a number of AI Factor trainings/validations to see how some of my choices affect an AI Factor's results, and I have used a setting of "seed": 42 in order to limit the role of randomness and keep my tests more scientific ("ceteris paribus").

Randomness appears to play a huge role in the technical analysis system I am building to learn AI Factors' workings, and I wanted to confirm a few things and get feedback. If I copy a model within the same AI Factor I do get similar results, but if I copy all the features and replicate the settings in a new AI Factor, the results change quite a bit even though the settings are identical. Am I correct in assuming that the "control" on randomness that "seed": 42 exerts only applies within a specific AI Factor? It seems like the answer is yes, but I just wanted to confirm.

Secondly, if it only applies to that AI Factor, then each time we add a feature by creating a new AI Factor, the randomness will reset (be independent). Is that the correct understanding?

The screens below are from the same model, validation, normalization, features, etc., yet there was a large difference between the results.

If you use the same data (features and target) and the same model settings (hyperparameters), the predictions should be identical (except for some very minor differences due to multithreading) across all runs. I confirmed this empirically on my local machine.

In addition, LightGBM has less inherent randomness than Extra Trees (ET) or Random Forest (RF). You can therefore expect larger variations in predictions from the ET model. For this reason, LightGBM may be preferable when you need to retrain your model frequently on new data.
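To illustrate the reproducibility claim with a toy sketch (NumPy standing in for the actual training pipeline; this is an assumption about the mechanism, not P123's code): when the data and the seed are both identical, a seeded sampling step reproduces bit-for-bit across runs.

```python
import numpy as np

def seeded_bootstrap_mean(X, seed):
    """Train-like step: draw a seeded bootstrap sample and return its mean.
    Stands in for a model whose only randomness is row sampling."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap row indices
    return X[idx].mean()

X = np.arange(1_000, dtype=float)

# Same data + same seed -> identical result across repeated runs.
run1 = seeded_bootstrap_mean(X, seed=42)
run2 = seeded_bootstrap_mean(X, seed=42)
assert run1 == run2
```

A different seed (or, as discussed below, the same seed on shifted data) breaks this equality; the seed only pins down the randomness, not the data it acts on.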

1 Like

Thanks for your reply, Piotr. When I am in an AI Factor run and click "Save As" and simply rename it, the settings are carried over to the new AI Factor, but the results are very different. What could be happening? For example, I tried it again after reading your reply and got yet another different result.

Not sure, and just troubleshooting with you, but it has to be exactly the same data (with the exact same index) for random_seed to accomplish anything if you are using subsampling, which I know you have used in the past with LightGBM. Bootstrapping and subsampling will give different samples if the index is shifted in even one place. random_seed cannot help you in that case.
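A minimal sketch of that failure mode (pure NumPy, with `np.roll` standing in for a one-row index shift; none of this is P123's actual code): the seed reproduces the sampled *positions* perfectly, but a one-row shift means those positions now hold different *values*.

```python
import numpy as np

rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

data = np.arange(1_000_000, dtype=float)
shifted = np.roll(data, 1)   # same values, but every row index moved by one

# Identical seed -> identical sampled positions...
idx_a = rng_a.choice(len(data), size=10_000, replace=False)
idx_b = rng_b.choice(len(shifted), size=10_000, replace=False)
assert (idx_a == idx_b).all()

# ...but on the shifted array, every sampled value is different.
overlap = np.mean(data[idx_a] == shifted[idx_b])
print(f"fraction of identical sampled values: {overlap:.4f}")  # prints 0.0000
```

So the seed is doing its job; it just guarantees which rows get picked, not what those rows contain after the data shifts.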

Does P123 end up using the same array from one day to the next? Can it? For P123 Classic the arrays are cached for about 20 minutes but are different from day to day. My guess is that P123 has to do something similar with the AI. I don't see how it would be possible for P123 to keep all of the arrays in memory, but I could be missing something.

It's not just that a small change in the indexing can cause a problem: with ranks, ties are broken differently on each run. That is the case with P123 Classic, at least.
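A toy illustration of the tie-breaking point (the positional tie-break rule here is an assumption for illustration, not P123's documented behavior): if ties are resolved by row position, the same factor values loaded in a different row order assign each stock a different rank.

```python
import numpy as np

def ordinal_ranks(values):
    """Rank values 0..n-1; ties are broken by row position (earlier row wins)."""
    ranks = np.empty(len(values), dtype=int)
    ranks[np.argsort(values, kind="stable")] = np.arange(len(values))
    return ranks

# Stocks A and B tie on the factor value; C is lower.
loaded_ab = np.array([2.0, 2.0, 1.0])   # rows loaded as: A, B, C
loaded_ba = np.array([2.0, 2.0, 1.0])   # rows loaded as: B, A, C (same values)

# The rank *arrays* are identical, but row 0 is stock A in the first load
# and stock B in the second -- so A and B swap ranks between the two loads.
print(ordinal_ranks(loaded_ab))
print(ordinal_ranks(loaded_ba))
```

The per-position ranks are deterministic, yet which stock sits at which position depends on load order, so the stock-level ranks flip whenever tied rows arrive in a different order.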

But there may be a partial fix, no matter what the reason for the run-to-run changes is. With LightGBM, try turning off subsampling. With ExtraTreesRegressor, do not use bootstrapping, and do not subsample the columns with either model. Your results are unlikely to be any better or worse. It remains to be seen whether they still differ from run to run with this change.

See if that helps. If P123 or someone else says a saved array is used, or if turning off subsampling does not help when you try it, then it is probably something else.

But for sure, even with P123 Classic, which uses a deterministic method, the result will change when using ranks, because the data can change from day to day, especially given the known issue of ties being broken differently each time.

Also, LightGBM has a few more tweaks to make things deterministic, though even the LightGBM docs call them experimental.

1 Like

Thanks. Yes, I am using subsampling (0.75, 0.80), but I guess my question is: why are there large differences across AI Factors but not within one? It is almost as if there is a different "randomness" instance for each named AI Factor, but I was hoping to clarify that. It is also quite a large difference. I will try setting those to 100% sampling and report back.

1 Like

These libraries' subsample methods should also use the passed random_state for deterministic, reproducible subsampling. Given that you get consistent results with subsampling within an AI Factor, I think the issue has to be further upstream.

I think the more likely culprit is minor revisions to the underlying FactSet data between the creation of the AI Factors, since each AI Factor has its own factor dataset that is loaded/normalized once. This is consistent with the fact that you can see minor variations with classic P123 tools like rank performance (or sim performance) on historical dates across re-runs over time.

1 Like

I think, for sure, changes in the data are the issue — as Feldy points out. That means there’s probably no perfect fix. If the data change (whatever the cause), it’s a different model, and nothing can fix that.

That said, even a small change in the row index — offsetting the index by just one row out of more than four million — caused by that data change will result in different subsamples, even when using random_seed. So random_seed does nothing when there’s a shift in row ordering.

This can magnify what seems like a tiny change in ordering.

Without subsampling, it’s just a minor difference in one row. That’s the theory, anyway.

But at the end of the day, I think Feldy is right: with different data (whatever causes that change), it’s a different model. Maybe that can be mitigated, but there’s no guarantee.

As Feldy points out, it’s most likely a change in the data causing the problem. random_seed is behaving as intended, but it can magnify the effect of even a small change in the data.

I’ll be interested to see what you find.

So I performed a few more tests to see if I could gain any new insights, but I am seeing the same results as before. Basically, similar hyperparameters on the same dataset yield (generally) similar results AS LONG AS IT IS WITHIN THE SAME AI FACTOR. If I copy the exact same settings for that run over to a new factor/run, then the differences are no longer small. It does not seem to be due to data updates (that seems unlikely, at least at this point) but something perhaps more fundamental, such as the sorting of the data, even if it's the same data.

Here are some different results, all with the same parameters, dataset, and testing period. Adding a model with no subsampling displayed the same behavior as the subsampling ones (similar results within the same factor name, but very different in the copies with different names). They are all exact copies.

What did I learn so far? It is easier to test hyperparameter changes than feature changes while controlling randomness this way in the UI. I still need to find out how to properly control for randomness.

Hope to gain some clarity on this to make my tests more fruitful, as it can be hard to know whether a change improved things or it was just randomness. If anyone has some helpful color I would be quite thankful.

Are there more hyperparameters I could add, other than removing subsampling or specifying a seed? I was already testing "seed": 42 for these runs. This is for Microsoft's LightGBM.

1 Like

Just thinking out loud and troubleshooting with you. I have not actually tried much of this.

I think it helps to separate the different sources of randomness in tree-based models. In practice there are three:

  1. Row sampling (bootstrapping / subsampling)

  2. Feature sampling (column subsampling)

  3. Split randomness (specific to ExtraTrees)

For Random Forests, randomness comes mainly from (1) and (2): deterministic splits, but random rows and features.
For ExtraTrees, the key additional source is (3): randomized split thresholds, even when using the full dataset (no randomization of rows or features).

That split randomness is usually sufficient for ExtraTrees to control variance, which means bootstrapping and column subsampling are usually unnecessary for ExtraTreesRegressor. Turning those off typically has little impact on performance.

So in effect, with ExtraTreesRegressor you can remove two sources of randomness (row and feature sampling) while relying on a third (randomized splits). The advantage is that the remaining randomness operates on a fixed dataset, rather than being amplified by small changes in row ordering or indexing, as would be the case with a random forest that uses bootstrapping and/or feature subsampling.

Maybe you have already tried this. Whether it fully resolves the behavior you're seeing, I can't say for sure — but in principle it should reduce run-to-run variability while preserving the core behavior of ExtraTrees.

I'm not suggesting ExtraTrees as a replacement for LightGBM here — but it might be useful as a diagnostic test. It is effective because the whole dataset is always used with ExtraTreesRegressor (with bootstrapping/subsampling and feature subsampling turned off). If removing row and feature sampling collapses the variation across copied AI Factors, that would strongly suggest those mechanisms are amplifying small upstream differences. If it doesn't, then the randomness is likely entering earlier in the pipeline.
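A sketch of that diagnostic using scikit-learn's ExtraTreesRegressor as a stand-in for the P123 model (synthetic data; whether P123 exposes exactly these switches is an assumption to verify): with bootstrapping and column subsampling off, the full dataset is always used, and a fixed seed reproduces predictions exactly.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in for the factor dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 0.1 * rng.normal(size=500)

def fit_predict(seed):
    # bootstrap=False and max_features=1.0 remove row and column sampling;
    # the only remaining randomness is the randomized split thresholds,
    # which the fixed random_state pins down.
    model = ExtraTreesRegressor(
        n_estimators=50, bootstrap=False, max_features=1.0, random_state=seed
    )
    return model.fit(X, y).predict(X)

# Full dataset + fixed seed -> identical predictions across repeated fits.
assert np.array_equal(fit_predict(42), fit_predict(42))
```

If the copied-AI-Factor runs still diverge under this configuration, the variation cannot be coming from row or column sampling and must enter upstream in the data load.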

Again, just troubleshooting: I asked ChatGPT and it focused on this too (part of its answer):

The most important insight (this is the punchline)

seed only guarantees reproducibility for repeated runs on the same realized dataset.
Creating a new AI Factor implicitly creates a new dataset realization.

Or, in my words: it's a different array, and maybe using the entire dataset, rather than slicing and dicing it with subsampling, will minimize the effect of the changes.

Yes, for some reason the act of loading the same dataset (same features, dates, universe) leads to very different results even with the seed set and the same hyperparameters. Even no subsampling led to very different results, so would that not mean we know it's not the subsampling? I am still not sure how best to proceed, since, as @pitmaster stated, ExtraTrees will have even more randomness. Maybe reducing frequency will even things out a bit; I'm not sure at this point. What parameters would you use to run the tests you describe? And even if you identify the cause of the randomness, will that not still leave you with the same result, where it is hard to ascertain the cause of observed differences in performance?

Ok so I asked GrokPro and got this which makes sense as it seems to confirm the experimental observations:

Why do results differ between a copied model in the same AI Factor vs. a new AI Factor? The seed (like random_seed or seed in LightGBM) primarily controls randomness within the training process for a given dataset realization—things like initial splits, subsampling, or tree growth. It ensures reproducibility if the input data is identical (same rows, order, and values). However, Portfolio123's AI Factors appear to treat each new factor as a fresh "instance," which involves reloading and renormalizing the dataset from scratch. This can introduce subtle differences due to:

  • Data revisions: FactSet data (underlying P123's universe) gets periodic updates or corrections. Even minor changes (e.g., a single adjusted value in millions of rows) can shift row indices or tie-breaking in ranks, cascading into different subsamples or splits. Public threads note that P123 data isn't statically cached long-term—it's dynamic, similar to how ranks or sims in P123 Classic vary slightly day-to-day.

  • Dataset realization: When you create a new AI Factor, it builds a new array or dataframe. If there's any non-deterministic step in data loading (e.g., sorting, handling NaNs, or universe filtering), the "realized" dataset differs slightly. LightGBM docs emphasize that seeds make runs deterministic only on the exact same data; a new load isn't guaranteed to be identical.

  • Confirmation from your tests: Your observation that results are stable within one AI Factor but diverge in a new one aligns with this: the existing factor likely reuses its loaded/normalized dataset, while a new one does not.

Maximize Determinism in LightGBM Settings:

  • Set seed: 42 (as you're doing).

  • Disable subsampling: bagging_fraction=1.0, bagging_freq=0 (no bootstrapping), feature_fraction=1.0. This forces full-dataset usage, minimizing amplification from data shifts. You've seen partial success here, but combine with...

  • If available in P123, enable deterministic=true (LightGBM param for consistent parallel behavior) or force_col_wise=true for histogram consistency.

  • Avoid Extra Trees/RF for now—stick to LightGBM for lower baseline randomness.
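Pulling those settings together as a parameter dict (the parameter names are real LightGBM options; whether the P123 UI accepts each one is an assumption to verify):

```python
# Sketch of a maximally deterministic LightGBM configuration.
lgbm_params = {
    "seed": 42,                # master seed for LightGBM's internal RNGs
    "bagging_fraction": 1.0,   # no row subsampling
    "bagging_freq": 0,         # never perform bagging
    "feature_fraction": 1.0,   # no column subsampling
    "deterministic": True,     # LightGBM's (experimental) determinism flag
    "force_col_wise": True,    # fix histogram build order for consistency
}
```

Even with all of these set, determinism only holds for the exact same realized dataset; a re-loaded dataset with any revision or reordering will still produce different trees.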

What I can say for sure: the random seed is not helping you here. Change a single row and the process will still be deterministic, but with a completely different result.

And a different way of saying it is that the dataset will get sliced and diced in a totally different (albeit deterministic) way.

However, I may be overemphasizing the importance of this. There may be other, more important, causes.

1 Like

I think I will try to force the ordering and see how it goes. Let's see. Thanks for your suggestions.

1 Like