Should we be concerned about this: "This is also a reason not to use the "dataset" option in AI models."

Jrinne · October 5, 2025, 11:14am

Ranking vs machine-learning algorithms

Something I would like to see is to use Z-score as ranking option in the classic ranking systems. Like creating a composite that sum all the underlying factor z-score, preferably z-score by dataset as well as by date. That would give traditional ranking a completely new dynamic.

As an example, when using ranking (by date) for volatility factors, in calm markets (tight z-score range) the volatility factor shouldn’t matter as much as in a bear market (wide z-score range), but when always ranking between 0-100, volatility carries the same weight regardless of the volatility range of the underlying factor. If we could use z-score by dataset, the scoring would be a dynamic range and not a static range as now.

But that would introduce look-forward bias. The only way to properly do this would be to limit the dataset to past dates. In that case, the dataset would consist of one date at the beginning of the period and a huge number at the end. And that would, in turn, introduce wild inconsistencies in performance.

This is also a reason not to use the "dataset" option in AI models.

A P123 member has said he investigated this and it might not be a practical problem, but now, Yuval says we should not use this. Should we worry about this?

trendyist · October 5, 2025, 11:21am

I’ve noticed pretty unrealistically high validation returns when using normalise by dataset with certain features - especially using macro features with dataset min/max normalisation.

FWIW, I’ve let them run live in a strategy to watch and thus far results have been underwhelming, and have underperformed my strategies that don’t have dataset normalised features. These alternative strategies also have ‘worse’ validation results (though still good by any objective measure).

Obviously I’m just n=1 with an anecdotal case here and there are probably ways to use these sorts of features more robustly, but personally I stay away.

I think it’s not tough to see why though - these features could be inviting overfitting in a lot of ways.

AlgoMan · October 5, 2025, 11:38am

I would not be more worried about the lookahead bias with dataset more than my own (collected experience) bias towards factors in different market regimes. As I mentioned before, I have done tests to removed the z-score data set look ahead bias and I can’t see any meaningful difference in the out of sample results.

When I add low volatility factors to a ranking system, I know it will benefit me in bear markets and punish me in bull markets.

Should I ignore my lifetime acquired experience of the financial markets before making a ranking system? Am I cheating the process not ignoring knowing what works in the different market regimes?

Training on the whole data sets just tells the system that during a span of 20 years, there are periods of highly volatile markets and very low volatil markets.

Jrinne · October 5, 2025, 12:15pm

Thank you guys!!! It is nice to have members who are extremely well-informed with ML/statistics contributing to the forum.

So just my quick take on this. Look-ahead bias or survivorship bias of features (or whatever the correct term is) is a big concern of mine too. I think it is the greatest source of look-ahead bias too–if I understand your post.

In my mind this is ideal solution, theoretically at least. Start with a large number of features making sure to include some really bad features and ones you are not sure of.

The perfect way to do this would be a random selection of thousands of features using a random feature generator that you think is likely to produce a reasonable number of good feature along with all the junk feature it creates.

If you have a reasonably fast feature selection tool and and a reasonably fast machine learning model (a fast computer does not hurts), you can walk forward your feature selection method (using a faster ML on each training/test set). I.e., Simply use time-series validation on your feature selection method and ML model. Nothing really new here.

But you are actually validating your feature selection method and not the features themselves when you do this. And doing it in a way that does not use knowledge of how the features perform in the future–which is key.

I have done that with a modest set of features. Depending how rigorous I was with not using features that were not know at that time (when trained upon), with using a larger numbers of random features etc, I now have some idea of whether my feature selection process will work out-of-sample. I can also get some idea of how stable some of the features are over time.

So I am actually validating a feature select method and not the features themselves when I do this.

I understand that this only partially answers your concerns (if it addresses them at all) because I make no attempt to time the market or adapt to market regimes. I have given up on that myself.

BTW, I think the some problem exists with P123 classic and one could try this with P123 classic too. I honestly think it CAN be done with P123 classic.

And even if you use it elsewhere, maybe avoid any questions regarding look-ahead bias with feature selection (since this is the concern that is being address here), by not using “normalize over dataset” for this.

I don’t know if that helps but I very much appreciate you guys!!!

SZ · October 5, 2025, 1:01pm

If a feature using dataset normalization that gives useable look-ahead bias is used would it not actually likely hurt your validation results you see in AI factor? Is the validation period data being used in feature normalization? or just the training period data. Validation should come after training (in my opinion) and should not have look ahead info bias since training already happened. It could easily work against you in validation. I can tell you if you are building a technical signals system, date normalization might also not be the best idea as signals will get diluted by broad market events.

Jrinne · October 5, 2025, 1:58pm

Validation should come after training and should not have look ahead info bias since training already happened. You are right about this if you include the word should (as you have done).

So to make this simple lets say that is true and we only want to use time-series validation and avoid any real or theoretical problems that k-fold validation has and we are trying to use only data from the future using time-series validation. Limit the discussion to time-series validation.

The question is: are we somehow making a mistake in our method that makes that untrue. Untrue because we are inadvertently using information form the future.

And for simplify of this argument, lets just say any information for the future is bad. Full stop. Whatever this information may be we should not be using it.

So when we normalize over the entire data set we are using the mean and the standard deviations in the validation set. I.e., we are using future information (mean and standard deviation from the future).

That violated this assumption: Whatever the information may be we should not be using it.

I suggest you do not go through the mind-numbing mental effort of trying to future out the exact mechanism of how that influences the predictions. Just realize it does violate our assumption that we should not be using anything from the future.

So the solution is easy: *Don’t use the mean and stand deviant form the validator set when you normalize.
*
It is really as simple as that: **Don’t do it and you are fine!!! Perfectly fine.
**

So for completeness the validation and the training set should be normalized separately. The detail of that is that the test or validation set should be normalized using the mean and standard deviation of the training set (for z-score normalization).

It is not a new problem or one without a solution.

SZ · October 5, 2025, 2:03pm

So portfolio 123 normalizes the training data along with the validation data together during training?

Jrinne · October 5, 2025, 2:06pm

Yes. And it come from a staff member: Yuval. trendyist and AlgoMan are also 100% correct in what they have said, I believe. No doubt, I think.

I think there is an honest difference in option on how much this affects the model’s results. You can read some opinions about that above without my filter.

SZ · October 5, 2025, 2:07pm

Yeah I think you can normalize together for validation but maybe the option (or limitation) should be implemented to not use it for training

Jrinne · October 5, 2025, 2:15pm

Hmmm. Well normally the normalization is done with the training where you start in time–with the goal of no information about the future leaking..

The validation or test data are actually occurring at a point in time where we can know about the past (training data). So we could normalize the test data based on the training data with no future information leakage.

That’s how I learned now to do it anyway. Normalize only the training data and use that normalization for the test data. A realistic series of events for a time-series validation, in other words.

SZ · October 5, 2025, 2:17pm

Yes you normalize returns and features based on the training set but you do not include the validation dates. That will make validation more akin to oos

Jrinne · October 5, 2025, 2:18pm

I think that is perfect! The way I learned it anyway,

BTW, this arose out of discussion for a P123 classic feature suggestion. Maybe we could use z-scores instead of ranks for P123 classic was the suggestion.

I will leave it P123 to figure out whether to try to implement that suggestion or not, but this is an issue that affects both machine learning and P123 classic. I think some of the division between the 2 methods (machine learning and P123 classic) is artificial or over simplified at best.

For sure this is a potential problem for either method. And this is far from being the only example of something that could affect both.

In fact, I think it is generally the case that most discussions like this could potentially affect both machine learning and P123 classic in meaningful ways.

SZ · October 5, 2025, 2:27pm

Yes I think it is the standard best practice based on the books I read. Found an article that explains it in case anyone is curious: https://www.machinelearningmastery.com/data-preparation-without-data-leakage/

SZ · October 5, 2025, 2:30pm

In short:

AlgoMan · November 26, 2025, 3:00am

We have an option for “testing - holdout” when we do a validation. The way it is now, it’s just a second validation holdout, still using data we have used during normalization. In my opinion, the “testing - holdout” should use data we have not trained/normalized on.

What I’m doing now is to train on (x-5)-years of data, create a predictor trained on the (x-5)-years of data, make a ranking system and test the predictor on 5 years of data (that I never used during training). If the “testing - holdout” option in the validation did not used data we have trained (normalized) on, I would not need to do those extra steps.

Regardless, I still in the end need to create a new model that trains on all available data to train the predictor for a live strategy.

SZ · December 2, 2025, 4:15am

Yes makes sense to do this at least as a sanity check either way