Python program to find correlations and multicollinearity

The goal is similar to yours: spot redundant factors to simplify things. But it's tailored for ML workflows, where high correlations can make models unstable or slow, even if they don't hurt a traditional linear ranking system as much.

Like most “AI” users here, I’m still learning, but I’ll try to explain the differences as well as I can.

The program works on the raw z-scored factor values across all stocks and dates in the download (not the ranks from a specific ranking system run). So, it's looking at the "raw signals" rather than how they combine in a composite rank. This makes it more about inherent data redundancy (e.g., if two factors move together across the market) than rank-specific behavior.
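To make the "z-scored across all stocks and dates" idea concrete, here is a minimal sketch of cross-sectional z-scoring with pandas. The column names and toy data are my own illustration, not the actual schema the program uses:

```python
import numpy as np
import pandas as pd

# Toy panel of (date, ticker) rows with raw factor values.
# Column names here are illustrative, not the tool's actual schema.
rng = np.random.default_rng(2)
dates = ["2024-01-31", "2024-02-29", "2024-03-31"]
panel = pd.DataFrame({
    "date": np.repeat(dates, 4),
    "ticker": ["A", "B", "C", "D"] * 3,
    "value": rng.normal(10, 3, size=12),
    "momentum": rng.normal(0, 1, size=12),
})

# Z-score each factor within each date (cross-sectionally), so the
# pooled values are comparable across dates before correlating them.
factor_cols = ["value", "momentum"]
z = panel.groupby("date")[factor_cols].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0))
```

After this step, every factor column has mean 0 and standard deviation 1 within each date, so pooling all rows into one correlation matrix doesn't mix scales across dates.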

It auto-detects high correlations (>0.7 by default, adjustable) and suggests removals systematically (e.g., in correlated pairs, keep the one with lower VIF or higher ML importance if provided). It also handles "clusters" of multiple overlapping factors (e.g., if A correlates with B and C, it might remove just one or two to fix the group).
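For readers who want to see the shape of this logic, here is a hedged sketch of the pair-detection and VIF-based removal idea using only numpy/pandas. It uses the fact that each factor's VIF is the corresponding diagonal entry of the inverse correlation matrix; the function name and exact tie-breaking are my own, not the original tool's:

```python
import numpy as np
import pandas as pd

def suggest_removals(df, threshold=0.7):
    """Suggest one factor to drop from each highly correlated pair,
    preferring to drop the one with the higher VIF. Rows are pooled
    stock-date observations; columns are z-scored factors.
    (An illustrative sketch, not the original tool's exact logic.)"""
    corr = df.corr()
    # VIF for each factor is the corresponding diagonal entry of the
    # inverse correlation matrix: VIF_j = 1 / (1 - R_j^2).
    vif = pd.Series(np.diag(np.linalg.inv(corr.values)), index=df.columns)
    drop = set()
    cols = list(df.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if abs(corr.loc[a, b]) > threshold and a not in drop and b not in drop:
                drop.add(a if vif[a] > vif[b] else b)
    return sorted(drop), vif

# Toy data: f2 is nearly a copy of f1, f3 is an independent factor.
rng = np.random.default_rng(0)
f1 = rng.normal(size=500)
df = pd.DataFrame({"f1": f1,
                   "f2": f1 + 0.1 * rng.normal(size=500),
                   "f3": rng.normal(size=500)})
drop, vif = suggest_removals(df)  # flags one of f1/f2, keeps f3
```

A real cluster-aware version would iterate (recompute VIFs after each drop) rather than scan pairs once, but the pair scan shows the core mechanic.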

In traditional ranking, equal weights mean correlations might not hurt much (as long as factors add unique edges). In ML (like LightGBM), correlated features can inflate importance scores, cause instability, or slow training. This tool preps for that by suggesting a leaner set.

The tool is not perfect; common sense still applies. For example, it will very likely find that volatility and momentum factors are highly correlated and suggest removing one of them. Doing so would probably hurt your model…

In short, your method is rank-performance-centric and great for P123 sims/backtests. This is data-centric and geared toward ML input prep.

If I get some time to spare I will try to upload the code as a web page so anyone can use it without any Python knowledge. I think that is what Marco wants to achieve with the Jupyter Notebook setup.

Yuval — just so I understand the discussion:

Are you using a single snapshot from one time period to compute the correlation matrix? Or are you taking multiple snapshots over time and averaging them? Or is it something else entirely?

I’d also be curious how Pitmaster and Algoman would characterize their approaches. Are their results based on average correlations across a time period, or is the output more nuanced?

Correlation structures can shift meaningfully over time and across different market regimes — so this feels like it could be a key difference between the methods, depending on how people answer.

To me, this might be the most important distinction — setting aside any additional advantages that VIF or hierarchical clustering might offer.

And it seems like this would matter regardless of whether you’re using ML or P123 classic.

Yes, that's precisely what I'm doing, and using rank rather than z-score. So, for example, ranking 4,000 companies on market cap and on total assets will have a correlation greater than 0.8, and ranking them on 6-month momentum and 8-month momentum will probably have a correlation close to 0.95. It's very simple, really, and perhaps too simple. I use it to get the number of nodes I have in my ranking systems down to below the maximum allowed.


Your reply was extremely helpful. Thank you!

Thanks for sharing your method. FWIW, I’ve been using ranks for the features in my models too, and I also think correlation of ranks works well. It’s mathematically equivalent to Spearman’s rank correlation when ties are handled by assigning average ranks, as I’m sure you already know. It’s only slightly different with P123 ranks, where ties are broken randomly instead. Probably not a meaningful difference in most cases.