The goal is similar to yours: spot redundant factors to simplify things. But it's tailored for ML workflows, where high correlations can make models unstable or slow, even if they don't hurt a traditional linear ranking system as much.
Like most “AI” users here, I’m still learning, but I’ll try to explain the differences as well as I can.
The program works on the raw z-scored factor values across all stocks and dates in the download (not the ranks from a specific ranking system run). So, it's looking at the "raw signals" rather than how they combine in a composite rank. This makes it more about inherent data redundancy (e.g., if two factors move together across the market) than rank-specific behavior.
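To make the idea concrete, here is a minimal sketch (toy data and made-up factor names, not the tool's actual code) of z-scoring raw factor values cross-sectionally per date and then measuring how the raw signals correlate across all stocks and dates:

```python
import numpy as np
import pandas as pd

# Toy panel: rows = (date, stock), columns = raw factor values.
# "momentum" and "volatility" share a common driver; "value" is independent.
rng = np.random.default_rng(0)
n = 1000
base = rng.normal(size=n)
df = pd.DataFrame({
    "date": np.repeat(pd.date_range("2024-01-01", periods=10), 100),
    "momentum": base + rng.normal(scale=0.3, size=n),
    "volatility": base + rng.normal(scale=0.3, size=n),
    "value": rng.normal(size=n),
})

# Z-score each factor within each date (cross-sectional normalization).
factors = ["momentum", "volatility", "value"]
z = df.groupby("date")[factors].transform(lambda s: (s - s.mean()) / s.std())

# Pairwise correlations of the raw z-scored signals across all stocks/dates.
corr = z.corr()
print(corr.round(2))
```

In this toy setup the momentum/volatility pair shows up as highly correlated while value stays near zero, which is exactly the kind of inherent data redundancy the tool flags.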
It auto-detects high correlations (>0.7 by default, adjustable) and suggests removals systematically (e.g., in correlated pairs, keep the one with lower VIF or higher ML importance if provided). It also handles "clusters" of multiple overlapping factors (e.g., if A correlates with B and C, it might remove just one or two to fix the group).
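A greedy pruning pass along those lines could look like this. This is my own sketch of the general approach (threshold on |corr|, drop the more redundant member of each flagged pair so a cluster like A~B~C loses only as many members as needed); the function name and the mean-correlation tiebreak are my assumptions, not the tool's exact logic:

```python
import numpy as np
import pandas as pd

def suggest_removals(z: pd.DataFrame, threshold: float = 0.7) -> list[str]:
    """Greedy sketch: visit factor pairs from most to least correlated and,
    for each pair above the threshold, drop the factor with the higher
    average absolute correlation to all other factors (a simple proxy
    for overall redundancy)."""
    corr = z.corr().abs()
    mean_corr = corr.mean()
    removed: list[str] = []
    # Upper triangle only, so each pair is visited once.
    pairs = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
    for (a, b), c in pairs.sort_values(ascending=False).items():
        if c <= threshold or a in removed or b in removed:
            continue
        removed.append(a if mean_corr[a] >= mean_corr[b] else b)
    return removed

# Toy example: A and B are near-duplicates, C is independent.
rng = np.random.default_rng(1)
base = rng.normal(size=500)
z = pd.DataFrame({
    "A": base + rng.normal(scale=0.2, size=500),
    "B": base + rng.normal(scale=0.2, size=500),
    "C": rng.normal(size=500),
})
print(suggest_removals(z))  # drops one of A/B, keeps C
```

A VIF- or ML-importance-based tiebreak (as the tool uses when available) would slot in where `mean_corr` is compared.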
In traditional ranking, equal weights mean correlations might not hurt much (as long as factors add unique edges). In ML (like LightGBM), correlated features can inflate importance scores, cause instability, or slow training. This tool preps for that by suggesting a leaner set.
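The instability point can be seen even without LightGBM. A tiny numpy illustration (plain OLS, not gradient boosting, so only an analogy): with two near-duplicate features, the individual coefficients swing wildly across bootstrap resamples, while their sum (the weight on the shared signal) stays stable. Tree models show the analogous effect as importance being split arbitrarily between correlated features:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
base = rng.normal(size=n)
x1 = base + rng.normal(scale=0.05, size=n)   # near-duplicate features
x2 = base + rng.normal(scale=0.05, size=n)
y = base + rng.normal(scale=0.5, size=n)     # target driven by the shared signal

coefs = []
for _ in range(50):
    idx = rng.integers(0, n, size=n)          # bootstrap resample
    X = np.column_stack([x1[idx], x2[idx]])
    beta, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
    coefs.append(beta)
coefs = np.array(coefs)

# Individual coefficients are unstable; their sum is not.
print("std of beta1:        ", round(coefs[:, 0].std(), 2))
print("std of beta1 + beta2:", round(coefs.sum(axis=1).std(), 2))
```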
The tool is not perfect; common sense still applies. For example, it will very likely find that volatility and momentum factors are highly correlated and suggest removing one of them, which would probably hurt your model.
In short, your method is rank-performance-centric and great for P123 sims/backtests; this one is data-centric and geared toward ML input prep.
If I get some spare time, I will try to put the code up as a web page so anyone can use it without any Python knowledge. I think that is what Marco wants to achieve with the Jupyter Notebook setup.