My comment: be careful even if you are using a form of cross-validation. You can easily turn your test data into training data…
And the question leading to the comment: if I have to tune my optimization system parameters to get a performance improvement in my test data, did I just fit to my test data?
I finally downloaded enough data to do a rolling time-series train/test validation with Bayesian optimization on a linear ranking system. At first my code was messed up and I was training and “testing” on the same data. I was able to confirm that the Bayesian optimization produced a system that performed better than an equal-weighted ranking system on the training data. It was only a 10% median (I think I did median and not mean) performance increase across 10 train/test sets, but it still seemed significant.
However, once I fixed my error and tested on the test data (about 2 years per split), I found that I was not able to get any consistent outperformance from the optimizer. At most I saw a 2% median improvement, and it was sometimes positive and sometimes negative. I then started to tune my hyperparameters (training period, number of optimization rounds…) to get more consistent performance. However, I realized that I am now fitting my hyperparameters to my test data…
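To make the "median improvement across splits" measurement concrete, here is a minimal sketch. The per-split returns below are made-up placeholder numbers, not my actual results; the point is only how the median improvement and the sign-flipping across splits are computed:

```python
import statistics

# Hypothetical out-of-sample returns per rolling split (synthetic numbers,
# NOT real results): optimized ranking vs. equal-weighted baseline.
optimized = [0.08, 0.12, -0.03, 0.05, 0.09, 0.01, -0.02, 0.07, 0.04, 0.06]
equal_weight = [0.07, 0.13, -0.01, 0.05, 0.08, 0.02, -0.04, 0.06, 0.05, 0.05]

# Per-split improvement of the optimized system over the baseline.
improvements = [o - e for o, e in zip(optimized, equal_weight)]

# Median is more robust than the mean to one lucky or unlucky split.
print(f"median improvement: {statistics.median(improvements):+.3f}")
print(f"positive splits: {sum(i > 0 for i in improvements)} of {len(improvements)}")
```

A small median with roughly half the splits negative is exactly the "sometimes positive, sometimes negative" pattern that makes the outperformance hard to trust.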
The challenge I think we face is that most of us have at most 20 years of data. This is not a lot if we want to train on 10 years and still have a significant number of train/test splits. As such we cannot easily set aside a significant period of time as a final hold-out to test performance on. Similarly, if we split our universe in two, we cannot easily trust that the data is truly “out of sample,” as there are likely trends in the same time period across companies.
Maybe a solution is to hold out a few chosen years/months across the 20 years that we only test performance on after we are done tuning hyperparameters. And we have to promise ourselves not to revisit that system many times after doing the final hold-out test.
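A minimal sketch of that hold-out idea, assuming 20 years of data (2005–2024, an assumed range for illustration): pick the hold-out years once, up front, with a fixed seed so we can't keep re-rolling them, and exclude them from every tuning split:

```python
import random

# Fix the choice so the hold-out can't be quietly re-drawn later.
random.seed(42)

all_years = list(range(2005, 2025))          # assumed 20-year history
holdout_years = sorted(random.sample(all_years, 3))
tuning_years = [y for y in all_years if y not in holdout_years]

print("hold-out years (touched exactly once, after tuning):", holdout_years)
print("years available for train/test splits:", len(tuning_years))
```

The discipline matters more than the code: the hold-out only stays "out of sample" if it is evaluated once, after all hyperparameters are frozen.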
A simple visualization of the train/test splits: note that each split trains on the full universe. I am just sliding the start and stop dates.
Note: I am doing this instead of k-fold to avoid any potential look-ahead bias from training on data both before and after my test period.
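The sliding split can be sketched as below. The window lengths and step are placeholder assumptions (my actual training period is one of the hyperparameters being tuned); the key property is that every test period comes strictly after its training window:

```python
# Minimal sketch of a rolling (walk-forward) split: each split trains on a
# window of years and tests on the years immediately after it, so the model
# never sees data from after its test period (unlike k-fold).
def rolling_splits(years, train_len=10, test_len=2, step=2):
    """Yield (train_years, test_years) pairs, sliding forward by `step`."""
    start = 0
    while start + train_len + test_len <= len(years):
        train = years[start : start + train_len]
        test = years[start + train_len : start + train_len + test_len]
        yield train, test
        start += step

years = list(range(2005, 2025))  # assumed 20-year history
for train, test in rolling_splits(years):
    print(f"train {train[0]}-{train[-1]}  test {test[0]}-{test[-1]}")
```

scikit-learn's `TimeSeriesSplit` does something similar if you prefer not to roll your own.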