XGBoost Python Tips

Hi all,

I have some tips for running XGBoost as well. They probably apply to some other ML methods, but I have only looked at XGBoost. Also, my code has not been reviewed beyond trying to match my simulation results to P123 simulation results, so unfortunately my findings may be inaccurate… That being said, the annualized returns and yearly returns are similar, so it is not completely made up.

In no particular order:

  • If you normalize your data, be very careful that the training, validation, and live prediction data are all normalized the same way. If you normalize 1 week of data versus, say, 10 years, some of the ranges will not be the same, even for ranks! This is doubly true if you train on data that is not a rank, as your target probably is. Normalizing is important if you want to look at feature importance, but it may not be 100% needed otherwise. Comments on a few normalization methods:
    – The MinMax scaler had the best results
    – Yeo-Johnson significantly reduced the out-of-training (OOT) annual return, and the RMSE is around 1, probably due to negative values
    – Yeo-Johnson + MinMax had the highest Spearman’s, but still a really bad annual OOS return
    – RobustScaler also pushes the RMSE to around 1, but the OOT results are almost as good as MinMax
    – RobustScaler + MinMax gives very similar results to RobustScaler alone, but the RMSE goes back to a reasonable value
  • As Jim kindly helped me discover, the leaf size (min_child_weight in XGBoost terms) really helps bring some of the in-sample training magic out of sample. It reduces the in-sample returns but increases the out-of-sample returns. I have done a little bit of grid searching, and for my data a min_child_weight around 1000 seems to be a good starting point if you are using XGBoost in Python.
  • If you do a hyperparameter optimization, it might be worth maximizing a Sortino ratio or the like on your validation data, granted I have not made that work yet… I am not currently convinced that minimizing RMSE maximizes returns out of training. I ran a quick Bayesian optimization on my hyperparameters, and the optimized hyperparameters were significantly worse out of training.
  • For the Core Combination ranking system (with some factors dropped), here is the order of target look-ahead goodness from what I tested: 3m > 1w > 1m. I am using 7 years to train/validate and the following 5 years to get annualized returns, with a simulation much like the simulated strategies in P123. I am not sure why 3m is better; it could be an error of some sort in my code, but it should not be from look-ahead in the simulation, as I have checked the linear rank performance against P123.
  • When using k-fold cross-validation there seems to be a sweet spot around 5 folds for maximizing out-of-sample performance: 4 → 19%, 5 → 24%, 6 → 23%, 7 → 19%. But I am not sure how well this would translate to a live portfolio… I assume this is an effect of how large or small the training set ends up being, and thus whether the model is over- or underfitted.
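On the first normalization point, the usual way to keep the ranges consistent is to fit the scaler once on the training window and reuse that same fitted scaler everywhere else. A minimal sketch with scikit-learn's MinMaxScaler, using random data as a stand-in for real factors:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 3))   # stand-in for years of training features
live = rng.normal(size=(10, 3))     # stand-in for this week's prediction data

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learn min/max from training data only
live_scaled = scaler.transform(live)        # reuse the SAME fitted min/max

# Refitting a fresh scaler on the small live slice gives different ranges,
# which is exactly the 1-week-vs-10-years mismatch described above:
live_refit = MinMaxScaler().fit_transform(live)
print(np.abs(live_scaled - live_refit).max())  # noticeably non-zero
```

Note this fits only on the training window rather than "all data at once"; fitting on everything avoids the mismatch too, but leaks future ranges into training.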

Once again, if folks have things to share or feedback, that would be very welcome!


Try monotonic constraints.

What depth do you use?

It might not be the best, but you will be surprised how well a stump does (a stump is a tree with a depth of one).

There will be no interactions with a depth of one, so perhaps not the best, but it does surprisingly well. You probably want 6 or less for depth.

A stump will not have interactions, but with more depth you will get a non-linear model with interactions, which is not always bad.

All guesses with your data of course. The grid search will tell (including with monotonic constraints).

BTW, feature importances should be done with depth = 1 to really tell how good a feature is! That prevents competition between similar (or correlated) features (that collinearity thing again).
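To illustrate the depth-1 idea, here is a minimal sketch using scikit-learn's GradientBoostingRegressor as a stand-in for XGBRegressor (with xgboost you would likewise pass max_depth=1, and monotone_constraints for the monotonic part); the data is made up:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
# target driven mostly by feature 0, a little by feature 1, not at all by feature 2
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)

# max_depth=1 grows stumps: no interactions, so each feature is scored on its own
model = GradientBoostingRegressor(max_depth=1, n_estimators=200, random_state=0)
model.fit(X, y)
print(model.feature_importances_)  # feature 0 dominates, feature 2 near zero
```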


I have been mostly using a depth of 2, but I have not tried a depth of 1 yet. I did not realize you could, haha.

I was spending a lot of time fighting with the Bayesian optimizer to try to get it to optimize validation performance instead of RMSE. I have not succeeded yet.

I’ll have to look into monotonic constraints. I have not heard of them before.

The other big thing I want to try is a model that performs really well in sample with a 1w look-ahead target, and then see how it performs one week out of sample. But it’s a lot of coding to set up, because I’ll have to train or update the model every week in my simulation.
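For what it's worth, the weekly retrain loop can be sketched in a few lines. This uses Ridge and random data purely as stand-ins, but the same loop shape works with an XGBRegressor:

```python
import numpy as np
from sklearn.linear_model import Ridge  # stand-in; same loop works with XGBRegressor

rng = np.random.default_rng(0)
n_weeks, n_stocks, n_feats = 104, 50, 4
X = rng.normal(size=(n_weeks, n_stocks, n_feats))   # weekly factor snapshots
y = 0.1 * X[..., 0] + rng.normal(scale=0.05, size=(n_weeks, n_stocks))  # next-week returns

window = 52          # retrain every week on the trailing year
preds = []
for t in range(window, n_weeks):
    X_tr = X[t - window:t].reshape(-1, n_feats)     # trailing training window
    y_tr = y[t - window:t].ravel()
    model = Ridge().fit(X_tr, y_tr)                 # full refit each week
    preds.append(model.predict(X[t]))               # one week strictly out of sample
preds = np.asarray(preds)
print(preds.shape)  # (52, 50): one prediction row per out-of-sample week
```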


Jonpaul and all,

So this is very interesting to me. If I understand, it uses the explore/exploit trade-off, which is an extremely interesting math problem for me. A problem we all address every single day. Do I try a new restaurant or a new trading strategy, or should I stick with what has been working? You explore to find something new, but at the cost of not doing something that has been working for you. Math can give you the best way to do this: to maximize the dining experience or trading experience over a lifetime.

That is no joke. Life gives you no guarantee that you will find the best restaurants or trading strategies and be able to use them most of the time but math will tell you how to give yourself the best chance of doing that.

I have posted about Thompson sampling which can help a person find the best trading strategy quickly: probably the optimal strategy for doing that.

Anyway, I have just started learning about this. I would certainly be interested in any comments.

Honestly, I think it is now possible at P123 to set up a cross-validation strategy (probably with Bayesian optimization)—and when you are done, know that you have the best model possible with the information you have available…

And if you have enough data for a hold-out test sample—truly know how your system will perform out-of-sample.

I understand why Marc Gerstein and others wanted to keep that from us. But if we don’t do it here, someone else will. We might as well work together here at P123 to keep ahead, before the advantage gets arbitraged away. Too dramatic? Not really, I think. ChatGPT clones and a bank of servers have been using this all night somewhere, I am sure. :thinking: :dizzy_face:

Anyway, thank you Jonpaul for introducing me to: XGBoost hyperparameter tuning with Bayesian optimization using Python

@Marco, I think you really need to understand at least GridSearchCV to be able to direct the development of the P123 platform. Yuval is not going to do this for you. Have you already discussed this with the AI/ML person? Is she doing it right?

Also, this is something you might try with some of the unused CPU cycles, for your own trading, when the new servers come online.

@jlittleton: which library do you use?

Some choices before Jonpaul tells us what he uses. Maybe the easiest here: scikit-optimize

But Bard also likes these:

Hyperopt: Distributed Hyperparameter Optimization


I am using this one: from bayes_opt import BayesianOptimization. I am not actually sure which library that is… it was what ChatGPT suggested to me, and I am too new to this area to pick my own. I also used gp_minimize from skopt, which I think is scikit-optimize. I used that one when I was trying to optimize a ranking system using the ranking_perf API.

That being said, I spent probably 5k credits last month trying to optimize Core Combination and Small Factor Focus and ended up not getting anything better than the original weights I started with… But I could not run very many iterations due to credit usage, and better ranking-system performance does not always result in better simulation performance. So my suggestion would be to instead download the ranking data and build your own ranking combiner and simulator, so you can let the Bayesian optimizer run hundreds or thousands of times without using API credits.

If you are doing ML you should have all of the data you need to build your own linear multi-ranking systems anyway! With pandas DataFrames you can just add the columns together, with a weight multiplied into each. I am fairly confident ChatGPT can write you the code if you cannot, especially if you give it an example of the DataFrame format and say what you want as output. But be careful to check that it worked correctly, as I had to try a few times to get it right…
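For example, a weighted rank combination really is just a few lines of pandas (toy ranks and weights, obviously):

```python
import pandas as pd

# toy frame: each column is a factor rank (0-100) for each stock
ranks = pd.DataFrame({
    "Value":    [90, 10, 50],
    "Momentum": [30, 80, 60],
    "Quality":  [70, 20, 40],
}, index=["AAA", "BBB", "CCC"])

weights = {"Value": 0.5, "Momentum": 0.3, "Quality": 0.2}

# weighted sum of the rank columns = a linear multi-factor ranking system
combined = sum(ranks[f] * w for f, w in weights.items())
print(combined.sort_values(ascending=False))  # AAA first, then CCC, then BBB
```

A Bayesian optimizer can then treat the weights dict as the parameter space and the simulated portfolio return as the objective.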


So you use this as your optimizer rather than all of those spreadsheet randomization methods.

Genius. With limited processing power and/or time, it will find the optimal solution quicker, or find a better solution. Pretty much guaranteed with proper programming.

Obviously, useful for XGBoost too as you have posted.

Nice! And Thank you.



I do actually use the P123 optimizer for removing factors one at a time, or for seeing how the factors in a ranking system look one at a time, but I am learning how to use Bayesian optimization, and I think I finally (maybe) figured out how to make it work…

I was getting junk out of it. The out of sample performance using the optimized values was horrible. But after chasing more normalization issues I think I am on the right track with the following:

  1. If you normalize, you must do it on all of the data at once! I thought that ranks and other data would not have significantly different min/max values over 5-year+ time periods, but they seem to: when I normalize my training and simulation data separately, I get wildly different predictions than when I don’t normalize, or when I normalize all at once and then split the data again.
  2. You need to optimize for the metric you care about, like alpha, annual returns, or Sharpe. A better RMSE does not mean better out-of-sample performance.
  3. Make sure the training and validation data are not from the same period of time. When I did not split the data this way, the validation results were way too good, and it messed up the optimization.
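On point 2, the key is scoring each hyperparameter set by the portfolio metric on the validation period, not by RMSE. A minimal sketch, with a plain grid search standing in for the Bayesian optimizer, GradientBoostingRegressor standing in for XGBoost, synthetic data, and a made-up Sharpe-like score on the top-20 predictions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for XGBoost

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = 0.2 * X[:, 0] + rng.normal(scale=0.3, size=600)   # fake next-period returns
X_tr, y_tr = X[:400], y[:400]       # earlier period: training
X_val, y_val = X[400:], y[400:]     # later period: validation

def val_score(params):
    m = GradientBoostingRegressor(random_state=0, **params).fit(X_tr, y_tr)
    picks = m.predict(X_val).argsort()[-20:]    # "buy" the top-20 predictions
    r = y_val[picks]                            # realized returns of the picks
    return r.mean() / (r.std() + 1e-9)          # Sharpe-like score, not RMSE

grid = [{"max_depth": d, "n_estimators": n} for d in (1, 2, 3) for n in (50, 100)]
best = max(grid, key=val_score)  # a Bayesian optimizer searches this space more cleverly
print(best)
```

With bayes_opt you would hand the same kind of score function to BayesianOptimization as the objective instead of looping over a grid.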

Now that I think about it, I realize k-fold is probably splitting the universe such that the same stock tickers mix across the folds. I should probably try to prevent this from happening and see if my out-of-sample results improve. I need someone to check my code, haha.
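If it helps, scikit-learn's GroupKFold does exactly that: pass the ticker as the group and no ticker can end up in both the training and validation folds. Sketch with random data:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_tickers, n_weeks = 20, 10
tickers = np.repeat(np.arange(n_tickers), n_weeks)   # one row per (ticker, week)
X = rng.normal(size=(n_tickers * n_weeks, 3))
y = rng.normal(size=n_tickers * n_weeks)

# each ticker's rows land in exactly one fold, so a stock never sits in
# train and validation at the same time
folds = list(GroupKFold(n_splits=5).split(X, y, groups=tickers))
for train_idx, val_idx in folds:
    assert set(tickers[train_idx]).isdisjoint(tickers[val_idx])
print(len(folds))  # 5
```

Grouping by date instead of ticker gives the time-separated split from point 3; grouping by both at once needs a custom splitter.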

I haven’t done one bit of ML work but I want to know more about normalization. I consider ranks as already normalized - they can only range from 0 to 100. Are you finding it necessary to normalize them again?


So I do not want to speak to what Jonpaul may be doing, especially with Bayesian optimization, as I have not done it yet (I will). He is doing some pretty advanced stuff, and more than one thing, some of which I have not done before (e.g., Bayesian optimization was new to me and is very cool). Maybe I can catch up with him on Bayesian optimization, but I think he will always be a better programmer.

I am sure that with XGBoost alone, normalization is not necessary or even helpful. That is one reason that using rank downloads from P123 can work so well. No raw factors necessary, or even helpful.

Boosting and random forests make splits according to the target (returns). The features (e.g., FCF/P) just need to keep the same order (which, by definition, a rank does).

But again, Jonpaul is doing some very advanced stuff. Also, it is my understanding that ranks are a type of normalization, but for some things a z-score can be a better way to normalize. And again, neither method of normalization is needed for XGBoost, I think.
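A quick way to see the order-only claim concretely: fit the same tree on raw values and on their ranks and compare predictions. Sketch with a single sklearn DecisionTreeRegressor on made-up skewed data (the same argument applies tree-by-tree to boosting):

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.lognormal(size=(300, 2))                 # skewed raw factor values
y = np.log(X[:, 0]) + rng.normal(scale=0.1, size=300)

# rank-transform each column: order is preserved, scale is thrown away
X_rank = np.column_stack([rankdata(col) for col in X.T])

t_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
t_rank = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_rank, y)

# tree splits depend only on the ordering of samples, so both trees carve
# out the same partitions and make the same in-sample predictions
print(np.allclose(t_raw.predict(X), t_rank.predict(X_rank)))
```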


Interesting idea. I would have been more concerned about mixing time periods together in the train/test split. Obviously, if you train value stocks during the 2008 recession and then test value features over the same period, the test results will be more correlated than they should be. More correlated than if the k-fold test period is 2003, for example.

But people not using ML at P123 also have to decide when to use even/odd ID or separate time-periods. So you have an important question for everyone.

Please keep us informed!


The poor man’s ranking system!

I just did another test, and the higher performance I was seeing was from normalizing the target (next-week or next-3-month return), not the ranks. That being said, if you look at your factor ranks, they will not always run from 0 to 100. Look at DaysToDivPay, for example, with negative NAs. I know it has to do with the treatment of the NAs, but I think with ML it is ideal not to mix the NAs into the middle. And the reason that normalizing the target is so tricky is that, depending on the time frame you normalize over, the bounds can be VERY different. Maybe this is where trimming, or adding fake min/max bounds, can help with consistency.

I am also not sure why normalizing the target makes a difference. Based on what I know about XGBoost, it should not matter, as Jim mentioned. But the difference I am seeing is over 50% higher annualized returns when I normalize the target. That being said, I have much higher variation in Spearman’s correlation across my train/test k-folds when I do normalize. I will try to remember to post about it if I figure it out…

I also just ran the Bayesian optimizer on my out-of-training time period to see how good it can get. The answer is: not that amazing, haha. But it is much better than the equal-weighted portfolio. Now my question is: did I just do the equivalent of over-optimizing a ranking system on in-sample data?

I also want to see whether the ML is better than a simple linear system optimized with Bayesian optimization over the same time period. Maybe I can check that in the next day or so.