Large performance gap between backtests using AIFactor & AIFactorValidation

I ran a validation test on about 17 years of data with a holdout/unseen period of 4 years. Using AIFactorValidation with the chosen models in a ranking system, I backtested those 4 years of holdout/unseen data with a 52-week gap from the training data and got a 39%/yr return, a -25% drawdown, and a Sharpe of 1.09. I then used the same training data set over the same time frame to create predictors for these models, put AIFactor in a ranking system, and backtested it over the same 4 years of holdout/unseen data, but my result was a 23%/yr return, a -35% drawdown, and a Sharpe of 0.75. If both methods are evaluated on the same unseen data set, why is there such a large gap in performance?

I would like to trust/use this tool, so I would greatly appreciate it if someone could explain why this large performance gap between backtests using AIFactor vs. AIFactorValidation on the same unseen data is occurring, or what I'm doing wrong.

Both algorithms you've trained use RNG (random number generation) in the training process. Predefined models that use RNG are tagged as #random to disclose this fact.

Adding "random_state": someinteger to each model's hyperparameters will ensure deterministic RNG during training, leading to exactly the same output for the same input. Without this, models and especially ensembles may yield drastically different results. (Currently, making copies of model definitions ('models') is the only way to manage this random_state argument.)

Additionally, the effect of RNG on results can be studied by adding the same validation model, with non-deterministic RNG (i.e., random_state not specified in the hyperparameters), to a factor multiple times and training each copy.
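
If you want to do that kind of repeatability check yourself, a rough sketch of the idea is below, again with synthetic data and a scikit-learn-style model where subsampling makes the RNG matter; the spread of the score across runs gives a sense of how much of the backtest gap could be RNG alone.

```python
# Rough RNG-sensitivity check: refit the same model several times with no
# random_state and look at the spread of an out-of-sample score.
# Synthetic data; assumes a scikit-learn-style estimator.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
y = X[:, 0] * 0.1 + rng.normal(size=1000)

# Fix the split so only the model's own RNG varies between runs.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = []
for _ in range(5):
    model = GradientBoostingRegressor(subsample=0.5)  # subsampling -> RNG matters
    model.fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print("R^2 across runs:", np.round(scores, 4))
print("spread:", round(max(scores) - min(scores), 4))
```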

Also, if the number of holdings is low and the turnover is low, it is very easy for the simulations to produce very different results when you change even small settings.

I understand. Thank you for the explanation.

Good point