Bayesian ranking optimization vs. XGBoost performance

Short version:
How much more out-of-sample performance can I expect from an XGBoost portfolio than from a Bayesian-optimized static ranking system, using the same factors for both?

Long version:
I am fairly new to Portfolio123 and am procrastinating on building my own well-thought-out, well-researched ranking systems by learning more about optimization and machine learning methods such as XGBoost.

Last week I looked into the Python API and, with the help of ChatGPT, coded up a weighting optimizer for a ranking system using Bayesian optimization via gp_minimize from the Python scikit-optimize package (stripped-down sketch below). From what I can see, it samples a bunch of random weightings for the factors, evaluates the performance of each, and uses those results to "look" for the best weightings. I split my universe in two so I can optimize on one half and validate the performance on the other, and I keep the weights that performed best on the validation universe.
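
Here is roughly what that looks like, stripped down. backtest_ranking is a stand-in for the actual P123 API call, the factor names are just examples, and the numbers it returns are made up so the sketch runs on its own:

```python
# Sketch of the weight optimizer. backtest_ranking is a placeholder for the
# real P123 API call that would run a rank-performance test with the given
# factor weights on the given universe and return an annualized return.
from skopt import gp_minimize
from skopt.space import Real

FACTORS = ["EarningsYield", "Momentum6M", "ROE", "AccrualsToAssets"]  # example names only

def backtest_ranking(weights, universe):
    """Stand-in for the real API call; returns a made-up annualized return
    so this sketch is runnable without credits."""
    return 5.0 + 10.0 * weights[1] - 3.0 * weights[3]

def objective(raw_weights):
    # Normalize so the factor weights sum to 100% regardless of what the optimizer proposes.
    total = sum(raw_weights) or 1.0
    weights = [w / total for w in raw_weights]
    ann_return = backtest_ranking(weights, universe="optimize_half")
    return -ann_return  # gp_minimize minimizes, so flip the sign

# One search dimension per factor, each weight on [0, 1] before normalization.
space = [Real(0.0, 1.0, name=f) for f in FACTORS]

result = gp_minimize(objective, space, n_calls=40, n_initial_points=10, random_state=0)
best_weights = [w / (sum(result.x) or 1.0) for w in result.x]

# Re-run the best weights on the held-out half and only keep them if they hold up there.
oos_return = backtest_ranking(best_weights, universe="validate_half")
print(best_weights, oos_return)
```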

I ran out of credits before I could do more than one or two runs, so I am not sure how much more in-sample performance I can squeeze out of it; so far I am only seeing an improvement of a few percent annualized. I plan on running it on the Core Combination ranking system with the composite nodes removed, and on the Small Factor Focus ranking system. However, it uses a lot of API credits! So I am wondering whether I should instead learn how to do ML with XGBoost and pour my credits into downloading the data for that.
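
For reference, what I have in mind for the XGBoost side is roughly the workflow below. The file name, column names, and hyperparameters are all made up; the assumption is just that the downloaded data can be flattened into one row per (date, stock) with factor columns and a forward-return column:

```python
# Rough sketch of the XGBoost side: predict forward returns from factor values,
# then rank stocks by prediction. All names here are placeholders.
import pandas as pd
import xgboost as xgb

df = pd.read_csv("factor_data.csv", parse_dates=["date"])
factor_cols = [c for c in df.columns if c not in ("date", "ticker", "fwd_return")]

# Simple time-based split: fit on earlier years, evaluate on later years.
train = df[df["date"] < "2018-01-01"]
test = df[df["date"] >= "2018-01-01"].copy()

model = xgb.XGBRegressor(
    n_estimators=300,
    max_depth=3,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
)
model.fit(train[factor_cols], train["fwd_return"])

# Rank stocks within each test date by predicted return and take the top decile.
test["pred"] = model.predict(test[factor_cols])
top_decile = test.groupby("date", group_keys=False).apply(
    lambda g: g.nlargest(len(g) // 10, "pred")
)

# Average realized forward return of the top decile across test dates.
print(top_decile.groupby("date")["fwd_return"].mean().mean())
```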

My more detailed question, given the above context, is: which method of optimization has more potential out of sample? Given my limited number of credits (10k a month), I want to dig into the higher-potential option first. I also understand that any form of optimization can "memorize" the data used to "train" it, so a lot of the work goes into cross-validation, training-set selection, and so on.
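
By cross-validation I mean something walk-forward like the sketch below, so that neither the weight optimizer nor the XGBoost model ever sees data from after the period it is scored on. The monthly date range and fold count are arbitrary:

```python
# Walk-forward splits: each fold trains only on dates before its test window.
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

dates = pd.Series(pd.date_range("2005-01-01", "2024-12-01", freq="MS"))  # illustrative monthly dates

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(dates)):
    train_dates, test_dates = dates.iloc[train_idx], dates.iloc[test_idx]
    # Fit (or optimize ranking weights) only on rows whose date falls in
    # train_dates, then evaluate on rows whose date falls in test_dates.
    print(f"fold {fold}: train {train_dates.min():%Y-%m}..{train_dates.max():%Y-%m}  "
          f"test {test_dates.min():%Y-%m}..{test_dates.max():%Y-%m}")
```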

If anyone has a specific example of out-of-sample performance, or knows of any studies, that would be awesome! If not, maybe I can give it a shot over the next few months/years and at least see how both methods perform in and out of sample.