As requested, here is the performance of Numerai's two market-neutral hedge funds through the end of 2023, so that everyone can see that even Numerai can get it badly wrong (especially since both are supposed to be market-neutral funds). It may be due to overfitting of their AI/ML models.
I think the 2023 performance was so bad that they have now stopped publishing the monthly numbers.
The machine learning community has always loved overfitting. That's why XGBoost is always the algorithm used by the winners on Kaggle, but it doesn't work well in P123's AI system. The secret is that the Kaggle winners just got lucky!
The fact is that even the simplest linear models can do well. We need more shrinkage, not more fitting.
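A minimal sketch of what "more shrinkage" means in practice: as the Ridge penalty `alpha` grows, the fitted coefficients are pulled toward zero. The data here are synthetic and for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: only the first two features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

# Total coefficient magnitude shrinks as the penalty increases.
norms = []
for alpha in (0.01, 10.0, 1000.0):
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(float(np.abs(coef).sum()))

print(norms)  # decreasing: heavier shrinkage, smaller coefficients
```

The point is that shrinkage trades a little bias for a large reduction in variance, which is exactly what noisy financial data rewards.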
Nice point that I had not fully considered in this context.
High-variance models get lucky and will win when there are a lot of trials (or entrants in a Kaggle competition), for sure.
Just as small schools do better (but also worse) when you look at their results (more variance due to the smaller student sample), a 5-stock model will get lucky and be the best if you run a lot of trials (and that used to be quite the rage at P123).
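The small-school / 5-stock effect above can be simulated in a few lines: with many trials, the most extreme average outcomes almost always come from the smallest samples, purely because of their higher variance. The numbers are synthetic.

```python
import numpy as np

rng = np.random.default_rng(42)
n_trials = 2000

# Each row is one "model": the mean return of 5 stocks vs. 100 stocks,
# drawn from the same zero-mean distribution (no real skill anywhere).
small = rng.normal(size=(n_trials, 5)).mean(axis=1)
large = rng.normal(size=(n_trials, 100)).mean(axis=1)

# The best-looking trial comes from the 5-stock group, not because it is
# better, but because its averages are noisier.
print("best 5-stock:", small.max(), " best 100-stock:", large.max())
print("std 5-stock:", small.std(), " std 100-stock:", large.std())
```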
For Extra Trees you reduce the variance by setting min_samples_split or min_samples_leaf to a high number, I think. I believe this extension to Extra Trees Regressors is consistent with ZGWZ's point, if you like the Extra Trees Regressor.
It seems that the default Extra Trees hyperparameters are good enough that my further tweaking only makes things worse. And for some reason, my highest-ranked models are always linear models.
IMO, XGBoost can be very strong on P123 as well. But like Jrinne mentioned with extra trees, there are a lot of hyperparams that are important to help with the overfitting e.g. min_child_weight, subsample, colsample_bytree, gamma, alpha, lambda.
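As a hedged illustration of the hyperparameters listed above, here is what a regularization-heavy starting point might look like using XGBoost's scikit-learn parameter names (`alpha` and `lambda` become `reg_alpha` and `reg_lambda` in that API). The values are illustrative assumptions, not tuned for any particular dataset.

```python
# Illustrative anti-overfitting settings for an XGBoost regressor.
# Every value here is an assumption for demonstration, not a recommendation.
params = {
    "min_child_weight": 10,   # require more weight per leaf before splitting
    "subsample": 0.7,         # row subsampling per tree
    "colsample_bytree": 0.5,  # feature subsampling per tree
    "gamma": 1.0,             # minimum loss reduction required to split
    "reg_alpha": 1.0,         # L1 penalty on leaf weights ("alpha")
    "reg_lambda": 5.0,        # L2 penalty on leaf weights ("lambda")
    "max_depth": 3,           # shallow trees further limit variance
    "learning_rate": 0.05,
}

# With the xgboost package installed, this would be passed as:
#   xgboost.XGBRegressor(**params).fit(X, y)
print(sorted(params))
```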
Hyperparameter search is just another source of overfitting. I would rather use the default parameters for robustness.
The triumph of XGBoost on Kaggle is the result of a combination of the competition format's inability to adequately account for model uncertainty and XGBoost's very large hyperparameter space.
FWIW, my Extra Trees Regressor does best with min_samples_split=4000.
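A sketch of that high-bias / low-variance setup with scikit-learn. The 4000 figure above presumably assumes a large panel of stock-date rows; the toy dataset here is much smaller, so the split threshold is scaled down accordingly (an assumption for the demo, not the poster's actual pipeline).

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data: one informative feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] + 0.1 * rng.normal(size=1000)

# A large min_samples_split forces shallow, smooth trees (less variance).
# 200 on n=1000 is the toy-scale analogue of 4000 on a large panel.
model = ExtraTreesRegressor(
    n_estimators=100,
    min_samples_split=200,
    random_state=0,
)
model.fit(X, y)

preds = model.predict(X[:5])
print(preds.shape)
```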
Note that the Extra Trees Regressor is sensitive to noise variables: because it does not select features according to the optimal split, it can be more likely to split on a noise variable. So it can overfit to noise variables if you include them in your feature set. Random forests, which do select features using the best split, have their own problems with overfitting, as you know.
Linear models, especially with regularization, are inherently resistant to overfitting, as I am sure you already know.
But ZGWZ, you seem to understand this well, and you are probably familiar with the "No Free Lunch Theorem." As I understand it, this is a mathematical proof and therefore unarguably true. It states that there is no single best model for every situation. That certainly applies to the Extra Trees Regressor not always being the best model.
BTW, Claude 3 says this about the No Free Lunch Theorem proving that the Extra Trees Regressor is not always the best (and I hope I have never implied that in my posts): "The theorem is proven mathematically and applies to all optimization problems, including supervised learning tasks in machine learning."
Financial data seems to be well suited to fitting with linear models. Even in the case where I used the logarithm of the raw data as features and expected a nonlinear model to shine, the best model was still Lasso/ENet.
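A minimal sketch of the Lasso/ENet approach described above, with log-transformed features. The data, feature count, and cross-validation settings are all illustrative assumptions, not the poster's actual setup.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic positive "raw" data, log-transformed into features,
# with signal only in the first two columns.
rng = np.random.default_rng(1)
raw = rng.lognormal(size=(500, 8))
X = np.log(raw)
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.2, size=500)

# ElasticNetCV chooses the penalty strength by cross-validation and
# zeroes out coefficients of uninformative features.
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, y)
n_selected = int(np.sum(enet.coef_ != 0))
print("features kept:", n_selected, "of", X.shape[1])
```

The L1 component of the penalty is what produces the sparsity: noise features get exactly zero weight rather than small nonzero ones.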