Large performance gap between backtests using AIFactor & AIFactorValidation

I ran a validation test on about 17 years of data with a holdout/unseen period of 4 years. Using AIFactorValidation with the chosen models in a ranking system, I backtested those 4 years of holdout/unseen data with a 52-week gap from the training data and got a 39%/yr return, a -25% drawdown, and a Sharpe of 1.09. I then used the same training data set over the same time frame to create predictors for these models, put AIFactor in a ranking system, and backtested it over the same 4 years of holdout/unseen data, but my result was a 23%/yr return, a -35% drawdown, and a Sharpe of 0.75. If both methods are evaluated on the same unseen data set, why is there such a large gap in performance?

I would like to trust/use this tool, so I would greatly appreciate it if someone could explain why this large performance gap between backtests using AIFactor vs. AIFactorValidation on the same unseen data is occurring, or what I'm doing wrong.

Both algorithms you've trained use RNG (random number generation) in the training process. Predefined models that use RNG are tagged as #random to disclose this fact.

Adding "random_state": someinteger to each model's hyperparameters will ensure deterministic RNG during training, leading to exactly the same output for the same input. Without this, models and especially ensembles may yield drastically different results. (Currently, making copies of model definitions ('models') is the only way to manage this random_state argument.)

Additionally, the effect of RNG on results can be studied by adding the same validation model, with non-deterministic RNG (i.e., random_state not specified in the hyperparameters), to a factor multiple times and training each copy.
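
If you want to do that kind of repeatability check yourself, a rough sketch of the idea is below, again with synthetic data and a scikit-learn-style model where subsampling makes the RNG matter; the spread of the score across runs gives a sense of how much of the backtest gap could be RNG alone.

```python
# Rough RNG-sensitivity check: refit the same model several times with no
# random_state and look at the spread of an out-of-sample score.
# Synthetic data; assumes a scikit-learn-style estimator.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
y = X[:, 0] * 0.1 + rng.normal(size=1000)

# Fix the split so only the model's own RNG varies between runs.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = []
for _ in range(5):
    model = GradientBoostingRegressor(subsample=0.5)  # subsampling -> RNG matters
    model.fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print("R^2 across runs:", np.round(scores, 4))
print("spread:", round(max(scores) - min(scores), 4))
```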

Also, if the number of holdings is low and the turnover is low, it is very easy for the simulations to produce very different results when you change even small settings.

I understand. Thank you for the explanation.

Good point