XGBoost Python Tips

Hi all,

I have some tips for running XGBoost as well. They probably apply to some other ML methods, but I have only looked at XGBoost. Also, my code has not been reviewed beyond trying to match my simulation results to P123 simulation results, so unfortunately my findings may be inaccurate… That being said, the annualized returns and yearly returns are similar, so it is not completely made up.

In no particular order:

  • If you normalize your data, be very careful that the training, validation, and live prediction data are all normalized the same way. If you normalize 1 week of data versus, say, 10 years, some of the ranges will not be the same, even for ranks! This is doubly true if you train on data that is not a rank, as your target probably is. Normalizing is important if you want to look at feature importance, but it may not be 100% needed otherwise. Comments on a few normalization methods:
    – The MinMax scaler had the best results
    – Yeo-Johnson significantly reduced the out-of-training (OOT) annual return, and the RMSE is around 1, probably due to negative values
    – Yeo-Johnson + MinMax had the highest Spearman’s, but still a really bad annual OOS return
    – RobustScaler also pushes the RMSE to around 1, but the OOT results are almost as good as MinMax
    – RobustScaler + MinMax gives very similar results to RobustScaler alone, but the RMSE goes back to a reasonable value
  • As Jim kindly helped me discover, the leaf size (min_child_weight in XGBoost terms) really helps bring some of the in-sample training magic out of sample. It reduces the in-sample returns but increases the out-of-sample returns. I have done a little bit of grid searching, and for my data a min_child_weight around 1000 seems to be a good starting point if you are using XGBoost in Python.
  • If you do a hyperparameter optimization, it might be worth maximizing a Sortino ratio or the like on your validation data, granted I have not made that work yet… I am not currently convinced that minimizing RMSE maximizes returns out of training. I ran a quick Bayesian optimization on my hyperparameters, and the optimized hyperparameters were significantly worse out of training.
  • For the Core Combination ranking system (with some factors dropped), here is the order of target look-ahead goodness from what I tested: 3m > 1w > 1m. I am using 7 years to train/validate and the following 5 years to get annualized returns, with a simulation much like the simulated strategies in P123. I am not sure why 3m is better; it could be an error of some sort in my code, but it should not be from look-ahead in the simulation, as I have checked the linear rank performance against P123.
  • When using k-fold cross-validation there seems to be a sweet spot around 5 folds for maximizing out-of-sample performance: 4 → 19%, 5 → 24%, 6 → 23%, 7 → 19%. But I am not sure how well this would translate to a live portfolio… I assume this is an effect of how large or small the training set ends up being, and thus whether the model is over- or underfitted.
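On the first normalization point, the usual way to keep the ranges consistent is to fit the scaler once on the training window and reuse that same fitted scaler everywhere else. A minimal sketch with scikit-learn's MinMaxScaler, using random data as a stand-in for real factors:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 3))   # stand-in for years of training features
live = rng.normal(size=(10, 3))     # stand-in for this week's prediction data

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learn min/max from training data only
live_scaled = scaler.transform(live)        # reuse the SAME fitted min/max

# Refitting a fresh scaler on the small live slice gives different ranges,
# which is exactly the 1-week-vs-10-years mismatch described above:
live_refit = MinMaxScaler().fit_transform(live)
print(np.abs(live_scaled - live_refit).max())  # noticeably non-zero
```

Note this fits only on the training window rather than "all data at once"; fitting on everything avoids the mismatch too, but leaks future ranges into training.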

Once again, if folks have things to share or feedback, that would be very welcome!


Try monotonic constraints.

What depth do you use?

It might not be the best, but you will be surprised how well a stump does (a stump is a tree with a depth of one).

There will be no interactions with a depth of one, so perhaps not the best, but it does surprisingly well. You probably want 6 or less for depth.

A stump will not have interactions, but with more depth you will get a non-linear model with interactions, which is not always bad.

All guesses with your data of course. The grid search will tell (including with monotonic constraints).

BTW, feature importances should be done with depth = 1 to really tell how good a feature is! That prevents competition between similar (or correlated) features (that collinearity thing again).
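To illustrate the depth-1 idea, here is a minimal sketch using scikit-learn's GradientBoostingRegressor as a stand-in for XGBRegressor (with xgboost you would likewise pass max_depth=1, and monotone_constraints for the monotonic part); the data is made up:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
# target driven mostly by feature 0, a little by feature 1, not at all by feature 2
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)

# max_depth=1 grows stumps: no interactions, so each feature is scored on its own
model = GradientBoostingRegressor(max_depth=1, n_estimators=200, random_state=0)
model.fit(X, y)
print(model.feature_importances_)  # feature 0 dominates, feature 2 near zero
```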


I have been mostly using a depth of 2, but I have not tried a depth of 1 yet. I did not realize you could, haha.

I was spending a lot of time fighting with the Bayesian optimizer to try to get it to optimize validation performance instead of RMSE. I have not succeeded yet.

I’ll have to look into monotonic constraints. I have not heard of them before.

The other big thing I want to try is a model that performs really well in sample with a 1w look-ahead target, and then see how it performs one week out of sample. But it’s a lot of coding to set up, because I’ll have to train or update the model every week in my simulation.
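For what it's worth, the weekly retrain loop can be sketched in a few lines. This uses Ridge and random data purely as stand-ins, but the same loop shape works with an XGBRegressor:

```python
import numpy as np
from sklearn.linear_model import Ridge  # stand-in; same loop works with XGBRegressor

rng = np.random.default_rng(0)
n_weeks, n_stocks, n_feats = 104, 50, 4
X = rng.normal(size=(n_weeks, n_stocks, n_feats))   # weekly factor snapshots
y = 0.1 * X[..., 0] + rng.normal(scale=0.05, size=(n_weeks, n_stocks))  # next-week returns

window = 52          # retrain every week on the trailing year
preds = []
for t in range(window, n_weeks):
    X_tr = X[t - window:t].reshape(-1, n_feats)     # trailing training window
    y_tr = y[t - window:t].ravel()
    model = Ridge().fit(X_tr, y_tr)                 # full refit each week
    preds.append(model.predict(X[t]))               # one week strictly out of sample
preds = np.asarray(preds)
print(preds.shape)  # (52, 50): one prediction row per out-of-sample week
```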


Jonpaul and all,

So this is very interesting to me. If I understand, it uses the explore/exploit trade-off, which is an extremely interesting math problem for me. A problem we all address every single day. Do I try a new restaurant or a new trading strategy, or should I stick with what has been working? You explore to find something new, but at the cost of not doing something that has been working for you. Math can give you the best way to do this: to maximize the dining experience or trading experience over a lifetime.

That is no joke. Life gives you no guarantee that you will find the best restaurants or trading strategies and be able to use them most of the time but math will tell you how to give yourself the best chance of doing that.

I have posted about Thompson sampling which can help a person find the best trading strategy quickly: probably the optimal strategy for doing that.

Anyway, I have just started learning about this. I would certainly be interested in any comments.

Honestly, I think it is now possible at P123 to set up a cross-validation strategy (probably with Bayesian optimization)—and when you are done, know that you have the best model possible with the information you have available…

And if you have enough data for a hold-out test sample—truly know how your system will perform out-of-sample.

I understand why Marc Gerstein and others wanted to keep that from us. But if we don’t do it here, someone else will. We might as well work together here at P123 to keep ahead, before the advantage gets arbitraged away. Too dramatic? Not really, I think. ChatGPT clones and a bank of servers have been using this all night somewhere, I am sure. :thinking: :dizzy_face:

Anyway, thank you Jonpaul for introducing me to: XGBoost hyperparameter tuning with Bayesian optimization using Python

@Marco, I think you really need to understand at least GridSearchCV to be able to direct the development of the P123 platform. Yuval is not going to do this for you. Have you already discussed this with the AI/ML person? Is she doing it right?

Also, this is something you might try with some of the unused CPU cycles, for your own trading, when the new servers come online.

@jlittleton: which library do you use?

Some choices before Jonpaul tells us what he uses. Maybe the easiest here: scikit-optimize

But Bard also likes these:

Hyperopt: Distributed Hyperparameter Optimization


I am using this one: from bayes_opt import BayesianOptimization. I am not actually sure which library that is… it was what ChatGPT suggested to me, and I am too new to this area to pick my own. I also used gp_minimize from skopt, which I think is scikit-optimize. I used that one when I was trying to optimize a ranking system using the ranking_perf API.

That being said, I spent probably 5k credits last month trying to optimize Core Combination and Small Factor Focus and ended up not getting anything better than the original weights I started with… But I could not run very many iterations due to credit usage, and better ranking-system performance does not always result in better simulation performance. So my suggestion would be to instead download the ranking data and build your own ranking combiner and simulator, so you can let the Bayesian optimizer run hundreds or thousands of times without using API credits.

If you are doing ML you should have all of the data you need to build your own linear multi-ranking systems anyway! With pandas DataFrames you can just add the columns together, with a weight multiplied into each. I am fairly confident ChatGPT can write you the code if you cannot, especially if you give it an example of the DataFrame format and say what you want as output. But be careful to check that it worked correctly, as I had to try a few times to get it right…
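For example, a weighted rank combination really is just a few lines of pandas (toy ranks and weights, obviously):

```python
import pandas as pd

# toy frame: each column is a factor rank (0-100) for each stock
ranks = pd.DataFrame({
    "Value":    [90, 10, 50],
    "Momentum": [30, 80, 60],
    "Quality":  [70, 20, 40],
}, index=["AAA", "BBB", "CCC"])

weights = {"Value": 0.5, "Momentum": 0.3, "Quality": 0.2}

# weighted sum of the rank columns = a linear multi-factor ranking system
combined = sum(ranks[f] * w for f, w in weights.items())
print(combined.sort_values(ascending=False))  # AAA first, then CCC, then BBB
```

A Bayesian optimizer can then treat the weights dict as the parameter space and the simulated portfolio return as the objective.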


So you use this as your optimizer rather than all of those spreadsheet randomization methods.

Genius. With limited processing power and/or time, it will find the optimal solution quicker, or find a better solution. Pretty much guaranteed with proper programming.

Obviously, useful for XGBoost too as you have posted.

Nice! And Thank you.



I do actually use the P123 optimizer for removing factors one at a time, or for seeing how the factors in a ranking system look one at a time, but I am learning how to use Bayesian optimization, and I think I finally (maybe) figured out how to make it work…

I was getting junk out of it. The out of sample performance using the optimized values was horrible. But after chasing more normalization issues I think I am on the right track with the following:

  1. If you normalize, you must do it on all of the data at once! I thought that ranks and other data would not have significantly different min/max values over 5-year+ time periods, but they seem to: when I normalize my training and simulation data separately, I get wildly different predictions than when I don’t normalize, or when I normalize all at once and then split the data again.
  2. You need to optimize for the metric you care about, like alpha, annual returns, or Sharpe. A better RMSE does not mean better out-of-sample performance.
  3. Make sure the training and validation data are not from the same period of time. When I did not split the data this way, the validation results were way too good, and it messed up the optimization.
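On point 2, the key is scoring each hyperparameter set by the portfolio metric on the validation period, not by RMSE. A minimal sketch, with a plain grid search standing in for the Bayesian optimizer, GradientBoostingRegressor standing in for XGBoost, synthetic data, and a made-up Sharpe-like score on the top-20 predictions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for XGBoost

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = 0.2 * X[:, 0] + rng.normal(scale=0.3, size=600)   # fake next-period returns
X_tr, y_tr = X[:400], y[:400]       # earlier period: training
X_val, y_val = X[400:], y[400:]     # later period: validation

def val_score(params):
    m = GradientBoostingRegressor(random_state=0, **params).fit(X_tr, y_tr)
    picks = m.predict(X_val).argsort()[-20:]    # "buy" the top-20 predictions
    r = y_val[picks]                            # realized returns of the picks
    return r.mean() / (r.std() + 1e-9)          # Sharpe-like score, not RMSE

grid = [{"max_depth": d, "n_estimators": n} for d in (1, 2, 3) for n in (50, 100)]
best = max(grid, key=val_score)  # a Bayesian optimizer searches this space more cleverly
print(best)
```

With bayes_opt you would hand the same kind of score function to BayesianOptimization as the objective instead of looping over a grid.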

Now that I think about it, I realize k-fold is probably splitting the universe such that the same stock tickers mix across the folds. I should probably try to prevent this from happening and see if my out-of-sample results improve. I need someone to check my code, haha.
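If it helps, scikit-learn's GroupKFold does exactly that: pass the ticker as the group and no ticker can end up in both the training and validation folds. Sketch with random data:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_tickers, n_weeks = 20, 10
tickers = np.repeat(np.arange(n_tickers), n_weeks)   # one row per (ticker, week)
X = rng.normal(size=(n_tickers * n_weeks, 3))
y = rng.normal(size=n_tickers * n_weeks)

# each ticker's rows land in exactly one fold, so a stock never sits in
# train and validation at the same time
folds = list(GroupKFold(n_splits=5).split(X, y, groups=tickers))
for train_idx, val_idx in folds:
    assert set(tickers[train_idx]).isdisjoint(tickers[val_idx])
print(len(folds))  # 5
```

Grouping by date instead of ticker gives the time-separated split from point 3; grouping by both at once needs a custom splitter.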

I haven’t done one bit of ML work but I want to know more about normalization. I consider ranks as already normalized - they can only range from 0 to 100. Are you finding it necessary to normalize them again?


So I do not want to speak to what Jonpaul may be doing, especially with Bayesian optimization, as I have not done it yet (I will). He is doing some pretty advanced stuff, and more than one thing, some of which I have not done before (e.g., Bayesian optimization was new to me and is very cool). Maybe I can catch up with him on Bayesian optimization, but I think he will always be a better programmer.

I am sure that with XGBoost alone, normalization is not necessary or even helpful. That is one reason that using rank downloads from P123 can work so well. No raw factors necessary, or even helpful.

Boosting and random forests make splits according to the target (returns). The features (e.g., FCF/P) just need to keep the same order (which, by definition, a rank does).

But again, Jonpaul is doing some very advanced stuff. Also, it is my understanding that ranks are a type of normalization, but for some things a z-score can be a better way to normalize. And again, neither method of normalization is needed for XGBoost, I think.
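A quick way to see the order-only claim concretely: fit the same tree on raw values and on their ranks and compare predictions. Sketch with a single sklearn DecisionTreeRegressor on made-up skewed data (the same argument applies tree-by-tree to boosting):

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.lognormal(size=(300, 2))                 # skewed raw factor values
y = np.log(X[:, 0]) + rng.normal(scale=0.1, size=300)

# rank-transform each column: order is preserved, scale is thrown away
X_rank = np.column_stack([rankdata(col) for col in X.T])

t_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
t_rank = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_rank, y)

# tree splits depend only on the ordering of samples, so both trees carve
# out the same partitions and make the same in-sample predictions
print(np.allclose(t_raw.predict(X), t_rank.predict(X_rank)))
```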


Interesting idea. I would have been more concerned about mixing time periods together in the train/test split. Obviously, if you train value stocks during the 2008 recession and then test value features over the same period, the test results will be more correlated than they should be. More correlated than if the k-fold test period is 2003, for example.

But people not using ML at P123 also have to decide when to use even/odd ID or separate time-periods. So you have an important question for everyone.

Please keep us informed!


The poor man’s ranking system!

I just did another test, and the higher performance I was seeing was from normalizing the target (next-week or next-3-month return), not the ranks. That being said, if you look at your factor ranks, they will not always run from 0 to 100. Look at DaysToDivPay, for example, with negative NAs. I know it has to do with the treatment of the NAs, but I think with ML it is ideal not to mix the NAs into the middle. And the reason that normalizing the target is so tricky is that, depending on the time frame you normalize over, the bounds can be VERY different. Maybe this is where trimming, or adding fake min/max bounds, can help with consistency.

I am also not sure why normalizing the target makes a difference. Based on what I know about XGBoost, it should not matter, as Jim mentioned. But the difference I am seeing is over 50% higher annualized returns when I normalize the target. That being said, I have much higher variation in Spearman’s correlation across my train/test k-folds when I do normalize. I will try to remember to post about it if I figure it out…

I also just ran the Bayesian optimizer on my out-of-training time period to see how good it can get. The answer is: not that amazing, haha. But it is much better than the equal-weighted portfolio. Now my question is: did I just do the equivalent of over-optimizing a ranking system on in-sample data?

I also want to see whether the ML is better than a simple linear system optimized with Bayesian optimization over the same time period. Maybe I can check that in the next day or so.