Update on AI factor normalization upgrades

The Benjamini-Hochberg (BH) procedure might be applicable here. In genetics, researchers routinely face similar feature selection challenges when analyzing genome-wide association studies (GWAS) with tens of thousands of genes. BH helps them identify significant genes while controlling the false discovery rate (FDR), that is, the expected proportion of false positives among all rejected null hypotheses.

However, there are two key considerations for financial data:

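  1. BH is guaranteed to control FDR when the tests are independent (or positively dependent), while financial metrics (like your 23 normalizations) often have complex dependencies.
  2. For dependent features, the Benjamini-Yekutieli variant was specifically developed to maintain FDR control under arbitrary dependence, at the cost of being more conservative.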

This could be particularly relevant when dealing with your multiple normalizations of the same base metric, which would likely have strong dependencies.
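To make the difference between the two procedures concrete, here is a minimal sketch in plain Python. The function name `fdr_reject`, its `dependent` flag, and the p-values are made up for illustration; they do not come from the paper or from any library. BH rejects every hypothesis up to the largest rank k with p_(k) <= (k/m) * alpha, and Benjamini-Yekutieli runs the same step-up rule with alpha divided by the harmonic sum 1 + 1/2 + ... + 1/m.

```python
def fdr_reject(pvals, alpha=0.05, dependent=False):
    """Return a list of booleans: True where the null hypothesis is rejected.

    dependent=False -> Benjamini-Hochberg step-up procedure.
    dependent=True  -> Benjamini-Yekutieli: the same procedure with alpha
                       divided by the harmonic sum 1 + 1/2 + ... + 1/m.
    """
    m = len(pvals)
    if m == 0:
        return []
    if dependent:
        alpha = alpha / sum(1.0 / i for i in range(1, m + 1))

    # Indices of the p-values, sorted from smallest to largest p.
    order = sorted(range(m), key=lambda i: pvals[i])

    # Largest 1-based rank k with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            k_max = rank

    # Reject everything ranked at or below k_max.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Made-up p-values, for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
bh = fdr_reject(pvals, alpha=0.05)
by = fdr_reject(pvals, alpha=0.05, dependent=True)
print("BH keeps", sum(bh), "factors; BY keeps", sum(by))
```

On these toy numbers BY keeps fewer factors than BH, which is the price paid for a guarantee that holds under dependence.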

Marco, this would be ideal and could be automated, I believe. I think the paper's results would have been improved by removing the obvious noise introduced by adding that many random factors and by reducing the dimensionality. 18,000 is a lot of dimensions if one has any belief in the curse of dimensionality at all. See a discussion of the curse of dimensionality in ML here.

I get that the paper's authors think their results are not as good because others were doing it wrong and they have now corrected all of that. But maybe the authors are making some mistakes in their own paper, or have some poor assumptions that account for some of their INFERIOR PERFORMANCE (e.g., a lot of noise and some pretty large dimensionality in their method).

In other words, I am not sure we should go out of our way to get the same inferior performance that the authors claim is a feature unless we fully understand what is going on and avoid simply accepting this single paper as being exhaustive on the subject.

TL;DR: I think you could reduce computer time as well as time spent with spreadsheets while improving results with this well-established method.

This previous post provides a link to the BH procedure, which is simple enough that it can be done in a spreadsheet or, as the post suggests, you can have ChatGPT do it for you: Everyone can do machine learning using ChatGPT - #18 by Jrinne
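For anyone who wants to see what the spreadsheet version amounts to, here is a rough sketch of the columns one would build: sort by p-value, add a rank column, a BH threshold column (rank / m * alpha), and an adjusted p-value ("q") column. The function name `bh_table` and the four p-values are hypothetical, chosen only to show the layout.

```python
def bh_table(pvals, alpha=0.05):
    """Build the rows a spreadsheet would hold, sorted by p-value."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    rows = []
    for rank, idx in enumerate(order, start=1):
        rows.append({
            "factor": idx,                  # position in the original list
            "p": pvals[idx],
            "rank": rank,
            "threshold": rank / m * alpha,  # BH cutoff for this rank
        })
    # Adjusted p-value: p * m / rank, then a running minimum taken
    # from the bottom row upward so the column never decreases.
    running_min = 1.0
    for row in reversed(rows):
        running_min = min(running_min, row["p"] * m / row["rank"])
        row["q"] = running_min
    return rows


# Made-up p-values for four factors, for illustration only.
example_p = [0.20, 0.01, 0.03, 0.005]
table = bh_table(example_p, alpha=0.05)
selected = sorted(row["factor"] for row in table if row["q"] <= 0.05)
for row in table:
    print(row)
print("selected factors:", selected)
```

Keeping the factors with q at or below alpha gives the same selection as the step-up rule on the threshold column, so either column works in a spreadsheet.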

The BH procedure is also automated in scikit-learn, as SelectFdr in sklearn.feature_selection.
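A minimal sketch of how SelectFdr might be used, on synthetic data rather than anything from the paper. The sample sizes, the number of columns, and the choice of `f_regression` as the scoring function are my assumptions for the example; the right univariate test for real factor data is a separate question.

```python
import numpy as np
from sklearn.feature_selection import SelectFdr, f_regression

# Synthetic data (an assumption for illustration): 200 candidate features,
# of which only the first 5 actually drive the target.
rng = np.random.default_rng(0)
n_samples, n_features = 500, 200
X = rng.normal(size=(n_samples, n_features))
y = X[:, :5].sum(axis=1) + rng.normal(size=n_samples)

# SelectFdr applies Benjamini-Hochberg to the p-values from score_func.
selector = SelectFdr(score_func=f_regression, alpha=0.05)
X_reduced = selector.fit_transform(X, y)

kept = selector.get_support(indices=True)
print("features kept:", kept)
print("shape before:", X.shape, "after:", X_reduced.shape)
```

Note that SelectFdr uses BH, not Benjamini-Yekutieli, so the dependence caveat above still applies when many columns are normalizations of the same base metric.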

The BH procedure is discussed in many finance papers that members have linked to in the forum. For example, many have linked to this paper, which discusses the BH procedure (among others): Is There a Replication Crisis in Finance?