As shown in the post above, P123 is working on releasing grid search for the AI Factor. I am very excited, but as Jim and I briefly discussed, there is some concern about overfitting the data with grid search. I am starting this new post so I don't take over the preview announcement any more than I already have.
I have personally implemented a nested grid search with a k-fold inner loop and a walk-forward outer test loop to reduce some of this worry. I am using a custom scoring function to decide which "settings" are best. Unfortunately my custom score is a bit too volatile to mean a lot on its own, but here are the scores from the validation data vs. the test data. The validation data might be overfit, but the test data should not be.
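For anyone curious what this looks like in code, here is a minimal sketch of the nested setup: a walk-forward outer loop (sklearn's TimeSeriesSplit) with a k-fold grid search inside each outer training window. I'm using sklearn's GradientBoostingRegressor on synthetic data as a stand-in so the example is self-contained; you would swap in XGBRegressor, your real features/labels, and your own scoring function. The parameter grid and data here are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold, TimeSeriesSplit

# Synthetic stand-in data; replace with your real features and (clipped) return labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=600)

# Hypothetical grid; your real search would cover the settings you care about.
param_grid = {"max_depth": [2, 3], "n_estimators": [50, 100]}

outer = TimeSeriesSplit(n_splits=3)  # walk-forward outer test loop
val_scores, test_scores = [], []

for train_idx, test_idx in outer.split(X):
    # k-fold inner loop: grid search picks "settings" on the training window only.
    inner = KFold(n_splits=3, shuffle=False)
    search = GridSearchCV(
        GradientBoostingRegressor(random_state=0),
        param_grid,
        cv=inner,
        scoring="neg_root_mean_squared_error",  # stand-in for a custom scorer
    )
    search.fit(X[train_idx], y[train_idx])
    val_scores.append(search.best_score_)                 # inner-CV (validation) score
    test_scores.append(search.score(X[test_idx], y[test_idx]))  # held-out walk-forward score

print(f"Mean validation score: {np.mean(val_scores):.3f}")
print(f"Mean test score: {np.mean(test_scores):.3f}")
```

The key point is that the test windows are never seen by the grid search, so comparing the two means gives a read on whether the search is overfitting the validation folds.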
Train/validate/test run 1
XGBoost - Mean validation score: 0.355
XGBoost - Mean test score: 0.364
Train/validate/test run 2
XGBoost - Mean validation score: 0.377
XGBoost - Mean test score: 0.35
Train/validate/test run 3
XGBoost - Mean validation score: 0.376
XGBoost - Mean test score: 0.346
Overall it is not looking overfit, but there is a lot of variation between the three runs I did here to check how stable my results are.
My conclusion is that only a few hundred runs does not overfit (with clipped returns for the label), or my custom scoring function is in need of a rework. Probably both. That said, I will update when I have a chance to rerun with the custom scoring function being plain RMSE, as that should make for a more stable comparison.
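For the RMSE rerun, a plug-in scorer is only a few lines with sklearn's make_scorer. This is a sketch, not my exact setup; the clipping bounds are hypothetical placeholders for whatever you clip your return labels to.

```python
import numpy as np
from sklearn.metrics import make_scorer

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# greater_is_better=False because GridSearchCV maximizes the score,
# so sklearn will flip the sign internally (lower RMSE ranks better).
rmse_scorer = make_scorer(rmse, greater_is_better=False)

# Hypothetical label clipping, e.g. limiting returns to +/-20%:
returns = np.array([-0.45, 0.03, 0.12, 0.80])
clipped_labels = np.clip(returns, -0.20, 0.20)
```

Passing `rmse_scorer` as the `scoring` argument of the grid search (in place of the custom score) should give the more stable comparison I mentioned.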