Stupid AI Factor Questions on Validation Models, Prediction Models, and Time Periods

Just found the time to start using the AI Factor functionality in earnest, and I have developed a model I am reasonably happy with. Despite that, I still have a few questions about how the "guts" of the AI Factor methodology work that I was hoping P123 staff could help answer for me.

I fit my model over the time period (2000-07-23 to 2017-12-31) using the K-Fold CV method with 5 folds. Given this Validation model (I have so far found variations of the linear model to perform the best lol) I then created a Prediction model and ran a Simulation over the trailing 5 year time period (2020-07-05 to 2025-07-05). All good so far, but there are a few things I have questions about.

1.) What happens internally when I "Train a Predictor"? Does it create a model with different weights than what was developed when I created the Validation Model? I wouldn't expect so, but if the weights do differ, what exactly is going on behind the scenes when a Predictor is generated?

2.) It is pretty important for my overall workflow (i.e. determining the amount of capital that is allocated to a P123 strategy, etc.) that I am able to generate a full Simulation backtest (1999 to 2025) for a given strategy to see how it performs in different economic regimes. Is there a way to do this, or at least approximate this, using a combination of AIFactor and AIFactorValidation? I don't mind stitching a few simulations together if necessary.

3.) This question is related to Question 2 - I fit my model over the time period (2000-07-23 to 2017-12-31) and the maximum amount of time I can simulate using a Predictor is the Trailing 5 Years beginning 2020-07-05. That leaves a greater than two year gap between the end of my In-Sample data and the period accessible to the Predictor. Is there any way to see how the Model would have performed over that period? I don't want to train my model over the data in that time period, because that time period is out of sample for all my active P123 strategies and I want to be able to do an apples to apples comparison between the new AI Factor strategies and my old strategies.

Thanks for the help in advance. Unfortunately, I am sure I will have more questions as I go.

Thanks,

Daniel

From https://portfolio123.customerly.help/en/ai-factor/ai-workflow:


Train Predictor

The best model is chosen and a Predictor is trained that can be used to generate current predictions. These predictions can be used in Ranking Systems, Buy/Sell rules and for screening. It's also possible to use multiple different models as an ensemble.


Not too many details there... But based on my understanding and intuition, your predictor is retrained on the entire sample (2000-07-23 to 2017-12-31) using the selected (best) hyperparameters. This approach has both pros and cons, but in general retraining on the full dataset is much safer and more theoretically grounded for linear models than for models like Random Forests or Gradient Boosting.
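To make that concrete, here is a minimal sketch of what "validate, then retrain on everything" usually looks like. This is my guess at the pattern, not P123's actual code, and the data, alphas, and fold setup are purely illustrative:

```python
# Sketch of the presumed "Train Predictor" step: pick hyperparameters
# during validation, then refit the winning model on the full sample.
# All names and data here are illustrative stand-ins, not P123 internals.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                                  # stand-in factor exposures
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=500)   # stand-in forward returns

# 1) Validation: choose the regularization strength with 5-fold CV
alphas = [0.01, 0.1, 1.0, 10.0]
cv = KFold(n_splits=5, shuffle=False)    # no shuffling, mimicking time-ordered data
scores = {a: cross_val_score(Ridge(alpha=a), X, y, cv=cv).mean() for a in alphas}
best_alpha = max(scores, key=scores.get)

# 2) "Train Predictor": refit on the entire sample with the chosen alpha.
# The weights will generally differ from any single fold's weights,
# because this fit sees all the data at once.
predictor = Ridge(alpha=best_alpha).fit(X, y)
```

If this is roughly what happens, it also answers question 1: the Predictor's weights are not any one fold's weights, but a fresh fit on the whole training window.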


That’s my understanding as well—and I think it’s a reasonable approach as long as the test period begins after a sufficient gap from the end of the training period. In this case, Trailing 5 Years starting 2020-07-05 should satisfy that, assuming the predictor was trained up to 2017-12-31.

It follows a fairly standard—and actually quite conservative—workflow: the classic train, validate, test split.

I also agree with Pitmaster that linear models generally carry less risk of overfitting, regardless of the cross-validation method used. If you’re tuning hyperparameters—say, in Ridge regression—it still makes sense to adopt the train/validate/test method, perhaps using grid search to select regularization strength during validation.

Just bumping this up the queue.... any P123 staff have answers?

any help on these issues would be appreciated

I am really not clear on why we need to include a gap. There should be no data leakage if the factors are being created correctly. I typically use the Rolling Time Series CV method and do not include a substantial gap between the training and test sets. My understanding is that the ML models run cross-validation on the training data only, so there should be no peeking into the future for the validation or test data. If I am mistaken, I would like to understand why most users in this forum seem to think a substantial gap is important. Thanks!

Hi tiltonhouse,

You’re right to question the need for a large gap between the training and validation sets when you’re using proper time-series cross-validation (CV) that avoids data leakage. The same principle applies to the test set as well.

In general, the gap only needs to match the length of your prediction target period. So if you’re forecasting one-week returns, a one-week gap is sufficient to prevent contamination from future information. If you’re targeting one-year forward returns, then a one-year gap makes sense.
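This point can be demonstrated directly: with an overlapping forward-return target, the last few training labels are computed from prices that fall inside the validation window unless you leave a gap at least as long as the target horizon. scikit-learn's `TimeSeriesSplit` exposes this via its `gap` parameter (the horizon of 4 periods below is just an example):

```python
# Why the gap should equal the prediction horizon: with a 4-period
# forward-return target, the label at training index t uses data up to
# t + 4. A gap of 4 keeps that strictly before the validation window.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n, horizon = 100, 4                      # e.g. 4-week forward returns
tscv = TimeSeriesSplit(n_splits=3, gap=horizon)

for train_idx, val_idx in tscv.split(np.arange(n)):
    # last training label draws on data up to train_idx[-1] + horizon,
    # which the gap keeps before the first validation index
    assert train_idx[-1] + horizon < val_idx[0]
```

With `gap=0` and the same target, the assertion would fail: the final training labels would overlap the validation period, which is exactly the leakage the gap is there to prevent.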

As you’re suggesting, the gap doesn’t need to be any longer than that. That said, some users might choose a longer gap for specific reasons—for example, to exclude the unusual 2018 period when value factors were inversely correlated with returns.

Jim

Jim, thank you for the reply. I've thought about this some more and am still not convinced that there is any data leakage that would require including a gap between the training data and the test data. Since the actual code is not visible, I cannot confirm that P123 is using a CV method that only utilizes the training data. I am going to run a few tests to see if there is a substantial difference in performance based on lag. I'll post when I know more.
Thanks! -Bruce

Jim, OK, I get it now. After rereading your post, it makes sense that the training data has to be lagged by the amount equal to the prediction target period. Thanks for the explanation.
