Hi, I wanted to ask how AI Factor currently handles sources of randomness and reproducibility in model training, especially for LightGBM. Are seeds fixed for predictor retraining? I read that there is also a deterministic=True parameter...
Is it even relevant? Maybe seed effects on splits and trees are negligible? I'd love to hear opinions...
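One way to get a feel for whether seed effects are negligible is to train the same configuration locally with a handful of seeds and look at the spread of the validation metric. A minimal sketch with the open-source LightGBM Python package on synthetic data (not P123's pipeline, so only a rough proxy; the subsampling parameters are there so the seed actually has something to randomize):

```python
# Rough local check of seed-to-seed variance for one LightGBM configuration.
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=5000, n_features=50, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

scores = []
for seed in range(10):
    model = lgb.LGBMRegressor(
        n_estimators=300,
        num_leaves=31,
        subsample=0.8,          # row bagging, seed-dependent
        subsample_freq=1,
        colsample_bytree=0.8,   # feature subsampling, also seed-dependent
        random_state=seed,      # only the seed changes between runs
    )
    model.fit(X_tr, y_tr)
    scores.append(mean_squared_error(y_va, model.predict(X_va)))

print(f"validation MSE across seeds: mean={np.mean(scores):.2f}, std={np.std(scores):.2f}")
```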
Edit:
Okay, I just noticed that duplicates of a given LightGBM model with the same hyperparameter set XYZ, trained on the same dataset, can vary strongly in validation results and in live prediction ticker sorts, while overall feature importance stays rather stable.
As far as I understand it, the current "solution" would be to build an ensemble of many duplicates, and maybe also combine multiple models with different HP sets?
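For illustration, a minimal sketch of that "ensemble of duplicates" idea with the open-source LightGBM package (the hyperparameters and subsampling settings are made up; this says nothing about how P123 builds ensembles internally):

```python
# Ensemble of duplicates: same hyperparameters, different seeds, predictions averaged.
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_regression

def duplicate_ensemble_predict(X_train, y_train, X_new, hp, n_duplicates=5):
    """Train n_duplicates copies of one config (only the seed differs) and average predictions."""
    preds = []
    for seed in range(n_duplicates):
        model = lgb.LGBMRegressor(**hp, random_state=seed)
        model.fit(X_train, y_train)
        preds.append(model.predict(X_new))
    return np.mean(preds, axis=0)

# Toy usage with synthetic data and a made-up hyperparameter set.
X, y = make_regression(n_samples=2000, n_features=20, noise=5.0, random_state=0)
hp = {"n_estimators": 200, "num_leaves": 31, "subsample": 0.8, "subsample_freq": 1}
avg_pred = duplicate_ensemble_predict(X, y, X[:100], hp)
```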
I don’t want to spend the resource units to test this, but my guess is that deterministic=True won’t work directly in P123 since they use JSON for model parameters.
These mirror LightGBM's own deterministic settings (in JSON).
That said, I don’t know which (if any) of these parameters are actually exposed through P123.
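For reference, these are the determinism-related keys that open-source LightGBM itself understands, written here as a Python params dict (JSON would carry the same key/value pairs). Whether P123 forwards any of them is exactly the open question above:

```python
# LightGBM's own determinism knobs, assuming the params are passed through unchanged.
import lightgbm as lgb
from sklearn.datasets import make_regression

params = {
    "objective": "regression",
    "deterministic": True,   # LightGBM docs recommend pairing this with force_row_wise/force_col_wise
    "force_row_wise": True,
    "seed": 123,             # master seed; aliases: random_seed, random_state
    "num_threads": 1,        # single-threaded training avoids summation-order differences
}

X, y = make_regression(n_samples=2000, n_features=20, random_state=0)
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
```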
If you continue to have questions, I'd suggest asking support.
If you do ask support, would you please post what you learn here? How to get determinism in models is a recurring question in the forum, and I think many would be interested.
For scikit-learn models, Dan has suggested that this may be exposed (emphasis mine):
In JSON, this would probably look like: "random_state": 123 (any integer can be used, with different behavior for each integer value).
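If that JSON is simply forwarded to the estimator's constructor (an assumption, not confirmed for P123), the effect would be the usual scikit-learn behavior, e.g.:

```python
# Same random_state + same data => identical fit; a different integer => a different
# (but again repeatable) model. ExtraTreesRegressor is just a stand-in example.
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)

hp_json = {"n_estimators": 200, "max_depth": 6, "random_state": 123}
model_a = ExtraTreesRegressor(**hp_json).fit(X, y)
model_b = ExtraTreesRegressor(**hp_json).fit(X, y)

assert (model_a.predict(X[:10]) == model_b.predict(X[:10])).all()
```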
I will just work with ensembles. They take forever in 5-year sims, but I guess they will be more robust OOS. Deterministic training on a handpicked "best" model might be too dangerous anyway.
Still, if somebody has info on how P123 handles ML randomness on the backend, I'd be interested to know. Parameters outside the given set are usually ignored, so I assume that also goes for seeds if I put one in the HP JSON.
1. Provide fixed values in data_random_seed and feature_fraction_seed. LightGBM gives precedence to directly specified seeds.
2. Provide a fixed value in random_state, which LightGBM will use internally to populate the other seeds, including data_random_seed and feature_fraction_seed.
3. Omit seed/state parameters from the hyperparameters. This is the default behavior of LightGBM on Portfolio123 and yields different output per training.
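To make options 1 and 2 concrete, here is what they might look like as LightGBM params dicts (the HP JSON would carry the same keys; bagging_seed is my own addition and, like everything here, untested against P123):

```python
# Option 1: pin the individual seeds directly; these take precedence over the master seed.
params_option_1 = {
    "objective": "regression",
    "data_random_seed": 123,
    "feature_fraction_seed": 123,
    "bagging_seed": 123,   # not in the list above, but relevant if bagging is enabled
}

# Option 2: pin only the master seed; LightGBM derives the other seeds from it.
params_option_2 = {
    "objective": "regression",
    "random_state": 123,   # alias of LightGBM's "seed" parameter
}
```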
The default randomness is intentional and may be useful for exploring the robustness of an algorithm against an AI Factor's features. The system allows the same model to be added and trained multiple times on the same AI Factor to support this use case. (Perhaps fixed seeds should be used in grids, since differing seeds can obscure the results when the goal is to tune hyperparameters.)
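On the grid point: holding the seed fixed across every grid cell makes score differences attributable to the hyperparameters rather than to seed noise. A local illustration with the open-source LightGBM package (synthetic data, made-up grid):

```python
# Tune hyperparameters with one fixed seed so the comparison is not polluted by seed noise.
import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=3000, n_features=30, noise=5.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

FIXED_SEED = 123
for num_leaves in (15, 31, 63):
    for learning_rate in (0.05, 0.1):
        model = lgb.LGBMRegressor(
            n_estimators=200,
            num_leaves=num_leaves,
            learning_rate=learning_rate,
            subsample=0.8,
            subsample_freq=1,
            random_state=FIXED_SEED,   # identical seed for every grid cell
        )
        model.fit(X_tr, y_tr)
        mse = mean_squared_error(y_va, model.predict(X_va))
        print(f"num_leaves={num_leaves:3d} lr={learning_rate:.2f} MSE={mse:.2f}")
```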
Yeah, the main problem right now is judging a model with a distinct set of hyperparameters from one single (possibly randomly superior) training, risking that on retraining it will behave worse.
I currently combine 3 duplicates of the "top 3" models in a 3x3 ensemble for my trading. But properly tuning and judging the top 3 from my models or from a grid would require retraining enough duplicates of all of them and manually averaging the stats to account for randomness (maybe I'll do that at some point to be sure).
Maybe the team will find a smart way to deal with this, e.g. a tick option to aggregate the validation stats of duplicates into one row, including variance. Having the option to mess around with seeds would also be great, but again maybe too dangerous...
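For anyone who wants to do the "retrain duplicates and average the stats" step by hand locally, a rough sketch (synthetic data and made-up candidate configs; the point is only the mean-and-spread bookkeeping, not the numbers):

```python
# Judge each candidate config by the mean and spread of its validation score
# over several seeds, instead of by one single (possibly lucky) training.
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=3000, n_features=30, noise=5.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "model_A": {"n_estimators": 200, "num_leaves": 31, "learning_rate": 0.05},
    "model_B": {"n_estimators": 400, "num_leaves": 63, "learning_rate": 0.02},
}

for name, hp in candidates.items():
    scores = []
    for seed in range(5):  # 5 "duplicates" per candidate
        model = lgb.LGBMRegressor(**hp, subsample=0.8, subsample_freq=1, random_state=seed)
        model.fit(X_tr, y_tr)
        scores.append(mean_squared_error(y_va, model.predict(X_va)))
    print(f"{name}: MSE {np.mean(scores):.2f} +/- {np.std(scores):.2f}")
```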