Bayesian ranking optimization vs. XGBoost performance

Short version:
How much more out-of-sample performance can I expect from an XGBoost portfolio than from a Bayesian-optimized static ranking system using the same factors for both?

Long version:
I am fairly new to Portfolio123 and am procrastinating on making my own well-thought-out and researched ranking systems by learning more about optimization and machine learning, such as XGBoost.

Last week I looked into the Python API and, with the help of ChatGPT, coded up a weighting optimizer for a ranking system using Bayesian optimization and gp_minimize from the Python scikit-optimize package. From what I can see, it generates a bunch of random weightings for the factors, evaluates the performance of each, and uses that to “look” for the best weightings. I split my universe in two so I can optimize on one half and validate the performance on the other. I then keep the weights that performed best on the validation universe.

I ran out of credits before I could do more than one or two runs, so I am not sure to what degree I can squeeze more in-sample performance out of it. So far I am only seeing a few percent annualized. I plan on running it on the Core Combination ranking system with the composite nodes removed, and on the Small Factor Focus ranking system. However, it uses a lot of API credits! So I am wondering if I should instead learn how to do ML with XGBoost and pour my credits into downloading the data for that.

My more detailed question, given the above context, is: which method of optimization has more potential out of sample? Given my limited credits (10k a month), I want to dig into the higher-potential option first. I also understand that any form of optimization can “memorize” the data used to “train” it, and that therefore a lot of the work goes into cross-validation, training-set selection, and so on.

If anyone has a specific example of out of sample performance or knows of any studies that would be awesome! If not maybe I can give it a shot over the next few months/years and at least see how both methods perform in/out of sample.


I did not read Duckruck's paper. I am sure he has some good ideas about cross-validation. I would like to expand on that (or probably repeat some of what he has said).

Ultimately, cross-validation is the right way to go. The forum is not really that good for specific ideas: the discussions are too narrowly focused, responses are sporadic, and they tend to be a little adversarial.

A simple cross-validation technique to start with is criss-cross validation.

So if you have 20 years of data, you divide it in two. You train on the first 10 years, then predict the last 10 years and see how your predictions worked by whatever metric you choose (e.g., R^2). Scikit-Learn has a ton of metrics to choose from; RMSE would be another of many.

Then you can also train on the last 10 years and predict the first 10 years (again seeing how the predictions performed by whatever metric you choose).
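The two-way split above can be sketched in a few lines. This uses synthetic data and a `GradientBoostingRegressor` purely as stand-ins for your factor data and model; the point is the swap of training and test halves and scoring each direction.

```python
# Criss-cross validation sketch: split the sample in half by time,
# train on one half, score on the other, then swap the halves.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))                    # stand-in for factor data
y = X @ np.array([0.5, -0.2, 0.1, 0.3]) + rng.normal(scale=0.5, size=400)

half = len(X) // 2
first, second = slice(0, half), slice(half, None)

scores = []
for train, test in [(first, second), (second, first)]:
    model = GradientBoostingRegressor(random_state=0).fit(X[train], y[train])
    pred = model.predict(X[test])
    scores.append(r2_score(y[test], pred))       # R^2 on the held-out half
    rmse = mean_squared_error(y[test], pred) ** 0.5
    print(f"R^2 = {scores[-1]:.3f}, RMSE = {rmse:.3f}")
```

If the two directions give very different scores, that itself is useful information about regime sensitivity.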

But I recommend you tune out the forum and find out what works for you over at Scikit-Learn. I do not think you need my biases about particular methods.

If you try a random forest, oob_score is such an easy validation method that you should just turn it on. But it has its problems.



BTW, I do use Bayesian techniques as part of a larger algorithm. I think it is one good method for shrinkage, but not the only one.

Duckruck has mentioned some other great methods for shrinkage (e.g., elastic net regression). Again, though, I would focus on finding your own method with cross-validation. I would only say that I would not discourage Bayesian methods or XGBoost (with validation).
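For anyone who wants to try the elastic net route, here is a minimal sketch on synthetic data. The L1 + L2 penalties shrink factor coefficients toward zero, and `ElasticNetCV` picks the penalty strength by cross-validation; the data here is invented purely to show the mechanics.

```python
# Elastic net as shrinkage: irrelevant factors get pulled toward
# (or exactly to) zero, with cross-validation choosing the penalty.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))                    # 8 candidate factors
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=300)
# ...only the first two factors actually matter.

model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, random_state=0)
model.fit(X, y)
print(model.coef_)   # coefficients on the noise factors shrink toward 0
```

The appeal versus unregularized fitting is exactly the "memorization" concern raised earlier: shrinkage trades a little in-sample fit for less overfitting out of sample.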


Thanks all! I'll read the paper; it looks helpful. Sounds like implementing ML is just the start, and consistently getting good out-of-sample performance is the meat of the challenge.