Yuval, I think you’re making a strong point here.
Note: you would apply the exact same process when making predictions, not just during cross-validation, which is what I describe below. Yuval’s point is fundamentally about prediction rather than cross-validation, but the process is identical in both cases.
You’re absolutely right that normalizing an entire dataset at once can introduce look-ahead bias. But that’s not the only option—and it’s a well-known issue in machine learning with well-established solutions.
The correct approach is to normalize the training data using its own mean and standard deviation, and then apply that same normalization to the test data. This preserves out-of-sample integrity and avoids information leakage. So yes, “dataset” normalization applied naively is problematic, but it is neither a new nor an insurmountable problem; it just requires careful handling.
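To make the process concrete, here is a minimal numpy sketch (the arrays are placeholder data, not anything from P123). It does by hand what sklearn’s StandardScaler, shown at the end of this post, does for you:

import numpy as np

# Placeholder data for illustration only
X_train = np.random.randn(1000, 5)
X_test = np.random.randn(300, 5)

# Compute the normalization parameters from the training data only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply those same training-set parameters to both sets; no test-set statistics are used
X_train_z = (X_train - mu) / sigma
X_test_z = (X_test - mu) / sigma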
Your comment that “this is also a reason not to use the dataset option in AI models” sums it up nicely. It can be a real issue, and while some argue the practical effects may be small, normalizing over the entire dataset is not the theoretically correct way to proceed.
As for Algoman’s idea: I’d be a bit surprised if he isn’t already managing this concern in his P123 downloads and in his own training work. But regardless, it’s something P123 could handle in the theoretically correct manner.
Algoman, if I understand correctly, your suggestion could let us do a proper regression-style analysis within P123 Classic, rather than the rank-based (ordinal) analysis we have now.
The appeal is obvious. Today, a stock’s rank tells us its value relative to other stocks on that date. But in a frothy market where nearly all stocks are overvalued, ranking alone can obscure the fact that there are no true bargains. Z-scoring over the training set (the training set, not the entire dataset), and then applying that transformation to the test data, would let us detect this and maintain consistent factor scaling over time.
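To illustrate why this matters, here is a small, hedged sketch with made-up numbers (the column names and values are hypothetical). Per-date ranks always span the same range, so they look identical in a cheap year and a frothy year, while z-scores fitted on the training period preserve absolute levels:

import pandas as pd

# Hypothetical panel: one row per (date, ticker) with a valuation factor
df = pd.DataFrame({
    'date': ['2020-01-31'] * 3 + ['2021-01-31'] * 3,
    'ticker': ['A', 'B', 'C'] * 2,
    'pe': [10, 15, 20, 40, 45, 50],  # everything gets expensive in year two
})

# Per-date ranks: both years produce the same 1/3, 2/3, 1 pattern
df['rank'] = df.groupby('date')['pe'].rank(pct=True)

# Z-scores fitted on the training period only (here, the 2020 cross-section)
train = df[df['date'] == '2020-01-31']
mu, sigma = train['pe'].mean(), train['pe'].std()
df['z'] = (df['pe'] - mu) / sigma

print(df)  # ranks repeat, but the 2021 z-scores (5, 6, 7) make the froth explicit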
In short: I think you’re on to something powerful here. I agree that bringing Z-scores into P123 Classic (with the option to normalize by dataset or by date) could add a valuable new dimension, and it is possible to do it correctly.
BTW, I think Algoman already knows this, but for others reading along, here is the sklearn code using StandardScaler (which normalizes via z-scores) to handle Yuval’s concern:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

# Placeholder feature matrix for illustration; in practice this would be your factor data
X = np.random.randn(1000, 5)

# Split the data first; shuffle=False preserves time order, which matters for financial data
X_train, X_test = train_test_split(X, test_size=0.3, shuffle=False)

# Fit the scaler on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data using the training parameters (no information leakage)
X_test_scaled = scaler.transform(X_test)
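And, as noted at the top of this post, prediction works the same way: when new data arrives, you reuse the already-fitted scaler rather than refitting it. A minimal continuation of the code above (X_live is a hypothetical array of new, unseen factor data):

# New, unseen data at prediction time (hypothetical placeholder)
X_live = np.random.randn(50, 5)

# Reuse the training-set parameters; never refit on live data
X_live_scaled = scaler.transform(X_live)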