Yeah, the main problem right now is judging a model with a distinct set of hyperparameters based on one single (possibly randomly superior) training run, with the risk that a retrain behaves worse.
I currently combine 3 duplicates of each of the "top3" models into a 3x3 ensemble for my trading. But properly tuning and picking a top3 from my models or from a grid would require retraining enough duplicates of all of them and manually averaging the stats to judge the randomness (maybe I'll do that at some point to be sure).
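To illustrate what I mean by averaging the stats to judge randomness, here is a rough pandas sketch (the column names, metrics, and numbers are made up for illustration, not from any real tool): one row per config/seed duplicate, collapsed into mean and variance per config.

```python
import pandas as pd

# Hypothetical per-run validation results: one row per (config, seed) duplicate.
# Column names ("config", "seed", "sharpe", "max_drawdown") are illustrative only.
runs = pd.DataFrame({
    "config":       ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "seed":         [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "sharpe":       [1.45, 1.10, 1.32, 1.60, 0.85, 1.05, 1.20, 1.18, 1.25],
    "max_drawdown": [0.12, 0.18, 0.15, 0.10, 0.25, 0.20, 0.14, 0.15, 0.13],
})

# Collapse the duplicates into one row per config with mean and variance,
# so configs are compared on average behaviour instead of one lucky run.
summary = runs.groupby("config")[["sharpe", "max_drawdown"]].agg(["mean", "var"])
print(summary)

# Simple heuristic ranking: mean performance penalised by seed-to-seed spread.
score = summary[("sharpe", "mean")] - summary[("sharpe", "var")] ** 0.5
print(score.sort_values(ascending=False))
```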
Maybe the team can come up with a smart option to deal with this, e.g. a tick box that aggregates the validation stats of duplicates into one row, including variance. Having the option to mess around with seeds would also be great, but then again that's maybe also too dangerous...