I like the idea of model averaging. It can be done by adding prediction columns in Python (which I have done before) or by using scikit-learn's VotingRegressor. I was introduced to VotingRegressor in this post: Share complete ML Python code.

So I thought I might experiment with the code and try model averaging again. I did cross-validation on two of my models that had similar performance: an ExtraTreesRegressor model and a HistGradientBoostingRegressor model.

While they are both tree-based methods, they are really quite different algorithms, so model averaging seems appropriate.

It made little difference in the total return. You could argue that the returns are less extreme for the VotingRegressor model.

I am not set up to calculate drawdowns, the Sharpe ratio, or alpha for ML done in Python. ChatGPT and I could probably figure that out, and I need to do that. But strictly as far as total returns are concerned, model averaging does not help these two models at all.

It would be interesting to check Sharpe or some other measure that includes variance/risk. You may find that although the returns are about the same, the variance of (weekly, monthly, etc.) returns may be better. Or that one combination of models produces slightly worse returns but a substantial improvement in variance.
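For the measurement side, drawdowns and a Sharpe ratio (here without the risk-free rate, i.e. just mean over standard deviation) can be computed from a returns series in a few lines. The random weekly returns are placeholders, not real model output:

```python
import numpy as np

def sharpe(returns):
    # mean/std of the returns, no risk-free rate subtracted
    r = np.asarray(returns)
    return r.mean() / r.std(ddof=1)

def max_drawdown(returns):
    # build the equity curve, then find the worst peak-to-trough drop
    equity = np.cumprod(1 + np.asarray(returns))
    peaks = np.maximum.accumulate(equity)
    return ((equity - peaks) / peaks).min()

rng = np.random.default_rng(0)
weekly = rng.normal(0.002, 0.02, 260)  # 5 years of fake weekly returns
print(sharpe(weekly), max_drawdown(weekly))
```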

Here is the correlation of two ML models that used the same factors. The Sharpe ratio is calculated for each, as well as for a "Book" (done in Python, not a P123 Book) using both models:

As expected, the Sharpe ratio is improved by combining these two models (which are not 100% correlated and have similar returns) into a "Book." Here, the improvement in the Sharpe ratio is limited by the high correlation of the two models.
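A toy illustration of why the improvement is limited by correlation: the two return streams below share a common component, so the equal-weight "Book" only diversifies away the small independent part (all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two fake model return streams: a shared component plus small independent noise,
# giving similar means/vols and a high correlation
base = rng.normal(0.002, 0.02, 500)
r1 = base + rng.normal(0, 0.005, 500)
r2 = base + rng.normal(0, 0.005, 500)
book = (r1 + r2) / 2  # equal-weight "Book" of the two models

def sharpe(r):
    return r.mean() / r.std(ddof=1)

corr = np.corrcoef(r1, r2)[0, 1]
print(corr, sharpe(r1), sharpe(r2), sharpe(book))
```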

This is a k-fold validation with a 3-year embargo period. Essentially, it is a screen of 15 stocks (done in Python) with no slippage. The universe is the Easy to Trade universe.
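One way to sketch a k-fold split with an embargo is to drop training samples within `embargo` periods on either side of the test fold. This is my own illustration, not the original code; with weekly data, a 3-year embargo would be roughly 156 periods:

```python
import numpy as np

def kfold_with_embargo(n_samples, n_splits, embargo):
    """Yield (train_idx, test_idx) pairs where training samples within
    `embargo` periods of the test fold are excluded."""
    indices = np.arange(n_samples)
    for test_idx in np.array_split(indices, n_splits):
        lo, hi = test_idx[0] - embargo, test_idx[-1] + embargo
        yield indices[(indices < lo) | (indices > hi)], test_idx

# Example: 10 years of weekly data, 5 folds, ~3-year embargo
splits = list(kfold_with_embargo(n_samples=520, n_splits=5, embargo=156))
```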

"Sharpe ratio" calculations did not include the risk-free-rate and are actually just the mean/(standard deviation) of the returns.