Optimizer optimization (or overfitting?) using train/test data

My comment: be careful even if you are using a form of cross-validation. You can easily turn your test data into training data…

And the question leading to the comment: if I have to tune my optimization system parameters to get a performance improvement in my test data, did I just fit to my test data?

I finally downloaded enough data to do a rolling time-series train/test validation with Bayesian optimization on a linear ranking system. At first my code was messed up and I was training and “testing” on the same data. I was able to confirm that the Bayesian optimization produced a system that performed better than an equal-weighted ranking system on the training data. It was only a 10% median (I think I did median and not mean) performance increase across 10 train/test sets, but it still seemed significant.
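For reference, here is roughly the kind of loop I mean for a single split (a minimal sketch rather than my actual code, using scikit-optimize’s gp_minimize; the factor and return column names are placeholders):

import numpy as np
from scipy.stats import spearmanr
from skopt import gp_minimize
from skopt.space import Real

factors = ["factor_a", "factor_b", "factor_c"]   # placeholder factor-rank columns

def neg_ic(weights, df):
    # Composite score = weighted sum of factor ranks; rank-correlate it with forward returns
    w = np.asarray(weights, dtype=float)
    w = w / (w.sum() + 1e-12)
    composite = df[factors].to_numpy() @ w
    return -spearmanr(composite, df["fwd_return"]).correlation   # negate: gp_minimize minimizes

def optimize_weights(train_df, n_calls=30):
    # Bayesian optimization of the linear ranking weights on the training window only
    space = [Real(0.0, 1.0, name=f) for f in factors]
    result = gp_minimize(lambda w: neg_ic(w, train_df), space, n_calls=n_calls, random_state=0)
    w = np.asarray(result.x)
    return w / (w.sum() + 1e-12)

# Then compare on the test window only:
# w_opt = optimize_weights(train_df)
# ic_optimized    = -neg_ic(w_opt, test_df)
# ic_equal_weight = -neg_ic(np.ones(len(factors)), test_df)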

However, once I fixed my error and tested on the test data (about 2 years per split), I found that I was not able to get any consistent outperformance from the optimizer. At most I saw a 2% median improvement, so sometimes it was positive and sometimes negative. I then started to tune my hyperparameters (training period, number of optimization rounds…) to get more consistent performance. However, I realized that I am now fitting my hyperparameters to my test data…

The challenge I think we face is that most of us have at most 20 years of data. This is not a lot if we want to train on 10 years and still have a significant number of train/test splits. As such, we cannot easily set aside a significant period of time as a final holdout to test performance on. Similarly, if we split our universe in two, we cannot easily trust that the data is truly “out of sample,” as there are likely trends in the same time period across companies.

Maybe a solution is to hold out a few choice years/months across the 20 years that we only test performance on after we are done tuning hyperparameters. And we have to promise ourselves not to revisit that system many times after doing the final holdout test.
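Mechanically the holdout itself is easy; a minimal sketch, assuming a pandas DataFrame df with a DatetimeIndex (the chosen years here are just an example):

holdout_years = {2008, 2015, 2020}            # chosen up front, before any tuning
is_holdout = df.index.year.isin(holdout_years)

tune_df = df[~is_holdout]    # used for all training, validation, and hyperparameter tuning
holdout_df = df[is_holdout]  # touched exactly once, after tuning is finished

The hard part is the promise, not the code.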

A simple visualization of the train/test split: note that each split trains on the full universe; I am just sliding the start and stop dates.

Note: I am doing this instead of k-fold to avoid any potential look-ahead bias from training before and after my test period.
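For what it’s worth, generating the sliding windows is only a few lines (a sketch, assuming yearly granularity, a 10-year training window, and a 2-year test window; the date range is an example):

# Each split trains on ten consecutive years and tests on the following two.
train_years, test_years = 10, 2
first_year, last_year = 2004, 2023

splits = []
for start in range(first_year, last_year - train_years - test_years + 2):
    train = (start, start + train_years - 1)
    test = (start + train_years, start + train_years + test_years - 1)
    splits.append((train, test))

# e.g. ((2004, 2013), (2014, 2015)), ((2005, 2014), (2015, 2016)), ...
# Unlike k-fold, the test window always comes after the training window,
# so the model is never trained on data from after the test period.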

Correct, I think. And that is why one, at times, might consider k-fold cross-validation, which validates over the entire period but has weaknesses compared to time-series validation. You are familiar with those and I will not repeat what you already know here.

Yes, people confuse validation and testing. Again, you are well aware of the difference.

But this is very hard to do for us. For example, I did a holdout test set on a strategy and got 28% CAGR if I remember correctly. But now I want to improve that strategy. Anything I do going forward should correctly be called validation, without a holdout test set being possible. I already used the holdout test set, and I don’t really think using the Indian market (or even Canadian data) really works.

One poster said we should look at 1827 data. Hard to test the tech sector, at least, and I am not sure how big a thing stock buybacks were then. For all practical purposes, further validations can give some overfitting and I should probably expect 28% CAGR at best. I do not think I can find another holdout set that would really work or be practical for most of what I do.

Coincidentally, ChatGPT was just now confirming that with me and basically made the same arguments you did.

TL;DR: I agree that there can be overfitting with multiple validation trials, which is different from overfitting a holdout test set.

I don’t think I actually added a thing to what you have already said. I just agree that 20 years of data seems like a lot, but I wish we had more some days :slightly_smiling_face:

Jim

How are you downloading the data for this? Are you downloading single-factor rankings sampled across your date range? Or do you have a FactSet or Compustat license and are downloading raw single-factor values?

I use the Python API wrapper and the rank endpoint. See the post below for some more info:
Python API/ML Data Download Tips

Feldy,

I think boosting can be done with just ranks, which do not require a license. If you include z-scores for regressions, a license is pretty unnecessary, I think. I cannot think of an example where I would want it for machine learning.

Obviously, Jonpaul is better placed to address the specific question of whether he has a license, but I think he just has an Ultimate membership and uses ranks and/or z-scores.

Jim


Jonpaul,

If you are willing to use k-fold validation and really know ahead of time all of the parameters and hyperparameters you want to use, you can do an “inner and outer loop” for cross-validation.

It seems a little complex until it doesn’t. ChatGPT can probably discuss this better than I can, but it does allow you to set up the program and train/validate/test on the entire dataset in one run while you finish your breakfast, or more realistically go to sleep on a Friday and hope the program finishes by the end of the weekend for a support vector machine (example here) and P123 data downloads.

Support vector machines are slow and scale between O(n^2) and O(n^3), as I am sure @marco understands; at best, roughly with the square of the amount of data. I was a little surprised that P123 plans to offer support vector machines for its AI/ML for that reason (clearly a good thing that he can). But back to the practical at home……
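You can see the superlinear scaling for yourself with a quick timing check (a sketch on synthetic data; exact times will vary by machine):

import time
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
for n in (2000, 4000, 8000):
    X = rng.normal(size=(n, 20))
    y = (rng.normal(size=n) > 0).astype(int)
    start = time.perf_counter()
    SVC(kernel="rbf").fit(X, y)
    print(f"n={n}: {time.perf_counter() - start:.2f} s")
# Doubling n should roughly quadruple (or worse) the fit time.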

This is realistic at home (or with Colab) with some algorithms that scale better. It has the weaknesses of k-fold validation; I do not mean to discount that as a potential problem.

But nothing says you cannot use more than one method along with a simple backtest—ideally, one supporting the other before you fund it.

Here is an example of the code:

from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
from sklearn.datasets import load_iris

# Load a toy dataset
data = load_iris()

# Set up the classifier and parameter grid
clf = SVC()
param_grid = {'C': [1, 10, 100], 'gamma': [0.01, 0.1, 1]}

# Inner loop: GridSearchCV picks C and gamma with 5-fold CV on each training split
grid_search = GridSearchCV(clf, param_grid, cv=5)

# Outer loop: 5-fold cross-validation scores the whole search procedure
scores = cross_val_score(grid_search, data.data, data.target, cv=5)
print(scores.mean(), scores.std())
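To be clear about what that gives you: the grid search only ever sees the inner folds of each outer training split, and each outer fold is scored on data the hyperparameter search never touched, so scores estimates the whole tuning procedure rather than any one tuned model.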

Jim

Per ChatGPT and my understanding of an inner and outer loop, you can still implement it with a rolling time-series cross-validation. But I will have to be more careful about how I split my data, since one of the hyperparameters I want to evaluate is the number of years I train/optimize the model with.
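Something like this is what I have in mind (a rough sketch, assuming a DataFrame df with an integer "year" column; fit_model, evaluate, and the candidate window lengths are placeholders):

# Outer walk-forward loop: choose the training-window length on past data only,
# then score that choice on the next (outer) test year.
candidate_lengths = [3, 5, 8]                    # training-window lengths in years
years = sorted(df["year"].unique())

outer_scores = []
for i in range(10, len(years) - 1):              # leave enough history for the longest window
    inner_val_year, outer_test_year = years[i], years[i + 1]

    def inner_score(length):
        # How well does this window length predict the inner validation year?
        train = df[df["year"].between(inner_val_year - length, inner_val_year - 1)]
        val = df[df["year"] == inner_val_year]
        return evaluate(fit_model(train), val)

    best_len = max(candidate_lengths, key=inner_score)

    # Refit with the chosen length on data before the outer test year, then test once
    train = df[df["year"].between(outer_test_year - best_len, outer_test_year - 1)]
    test = df[df["year"] == outer_test_year]
    outer_scores.append(evaluate(fit_model(train), test))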

K-fold for sure has advantages, but I really am hesitant to use it due to the inherent look-ahead it has for factors that come into and out of style over multiple years. Maybe holdout periods can reduce this issue, but it is still not representative of how we can actually invest.

Regardless of the specific cross-validation method, do you feel that automating both the optimization/training system and the hyperparameter selection helps reduce overfitting? Or do you think our (human) memory of what worked and did not inevitably “poisons” any true holdout test data we have?

I understand about k-fold. I use time-series validation when I can. I am more comfortable using k-fold when I have a holdout test sample at the end of the data set. Any bias caused by k-fold training and validation should not affect the holdout test set which is always the most recent data when I do it.

The memory certainly can give a heck of a lot of bias, but it does not have to. The bias or memory comes into play most for the factors, I think. Suppose, for example, that I just love FCF/P and it has been working well for me in my live ports for the last 5 years… I will include it in my training set. It would be silly not to. But the problem is that when I do a holdout test set for the last 5 years, there is information leakage pure and simple.

But it is possible to do it right, I believe. You could start with a set of factors such as Dan’s list, or even just the Core Ranking System plus any additional list of factors, and choose a set of objective criteria or an algorithm for selecting the factors.

You write down an algorithm for choosing the factors. It probably would not be just one criterion. Maybe you want it to have a certain Spearman rank correlation and a nice slope. Maybe you want the top bucket to have a 15% return. Maybe you use statistical significance with a t-test. Maybe you use feature importances from a Python program. That would be an individual decision; personally I do not use just one and probably do not hate any criteria people might suggest. I would not be averse to just adding slope if I were collaborating with someone, for example. Slope is not totally ignored by the Spearman’s rank correlation number, BTW.

But consider the criteria to be parameters (or hyperparameters, to keep the analogy). And let’s just keep it simple. Let’s suppose you want the top bucket to have a return of X%.

Then select all features whose top bucket returns at least X% (for this particular algorithm). Maybe use equal weight in a ranking system to go easy on our computer resources. Walk it forward, reselecting the features that pass X for each new training set and validating on the next year.

So train 2000 - 2010 and validate 2011. Then train again (or find the return for the top bucket) for 2001 - 2011 and validate 2012….

Have a holdout test set or not. But train/validate with a range of values for the cutoff to the top bucket. Say 15% to 25% in 1% increments and walk-forward each value.
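A minimal sketch of that grid-plus-walk-forward, assuming a long DataFrame df with a "year" column, a list all_factors, and two placeholder helpers (top_bucket_return and composite_return) that you would back with P123 rank-performance data:

import numpy as np

cutoffs = np.arange(0.15, 0.26, 0.01)            # 15% to 25% in 1% increments
results = {}

for x in cutoffs:
    yearly = []
    for val_year in range(2011, 2024):
        # Reselect factors on the trailing 10-year training window only
        train = df[(df["year"] >= val_year - 10) & (df["year"] < val_year)]
        selected = [f for f in all_factors if top_bucket_return(train, f) >= x]
        if selected:                             # skip years where nothing passes the cutoff
            val = df[df["year"] == val_year]
            yearly.append(composite_return(val, selected))   # equal-weight composite, top bucket
    results[round(x, 2)] = np.mean(yearly) if yearly else np.nan

# results maps each cutoff X to its average walk-forward validation return.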

The point is this: FCF/P may be in some training sets and not others. But you are no longer picking factors so much as selecting an algorithm and testing it with a pretty good method: time-series validation, which is guaranteed not to have any look-ahead bias.

Ideally, one would be approaching this anew with massive computing power. One would probably want to integrate feature selection and model selection (neural net, random forest, XGBoost, etc.) into a single train/validate/test run. Maybe do a Monte Carlo or bootstrap on the holdout test set for your retirement planning and set the computer to start trading. Look at your broker account now and again. Have the computer retrain your system yearly. Buy a house based on your Monte Carlo simulation, watch your kids graduate from Harvard after making a huge donation to get them in, buy that retirement beach home.

But my point here is that if you have the time and/or computing power and liked to use P123’s optimizer, say, you could choose the factors and optimize the weights at each training step, then test the next year. And that would be ideal.

In practice one might just do this to determine what features to use and come back to figure out if you want to do XGBoost or somehow use the features in a classic P123 ranking system. Use P123’s trade automation.

I think that is pretty good and infinitely doable—even with spreadsheets.

But a lot of this could be automated with Python and Pandas. Slicing and a for-loop could do the walk-forward and most of the feature selection (from slope to feature importance to Spearman’s rank correlation). Even if you wanted to use the output for each training period in P123 with a screener, sim, or rank-performance test (for each test year), it is doable. You could run that for each validation year in P123.
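The Spearman piece, for example, is only a few lines per training slice (a sketch; the column names and cutoff are placeholders):

from scipy.stats import spearmanr

def rank_ic_by_factor(train_df, factor_cols, return_col="fwd_return"):
    # Spearman rank correlation of each factor with next-period returns on a training slice
    return {f: spearmanr(train_df[f], train_df[return_col]).correlation
            for f in factor_cols}

# Inside the walk-forward for-loop you might keep the factors above some cutoff:
# ics = rank_ic_by_factor(train, all_factors)
# selected = [f for f, ic in ics.items() if ic > 0.02]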

With the P123 API one might be able to fully automate every step, but I think you would not have to.

There is no look-ahead bias with that, whether you like it or just don’t hate it; that is my only point.

Jim