Clustering 6800 criteria with MiniBatchKMeans

Over time, I have collected a good set of criteria (6800). Naturally, many of these will overlap, even if they aren't identical. When creating a ranking system and attempting to optimize it for a universe, I understand that with so many criteria, and with 80-100 criteria running in each simulation, there's very little statistical chance that 3000 simulations will cover even a small fraction of the possible combinations.

Therefore, I've taken a technique from machine learning and text analysis (Natural Language Processing - NLP) to group these ~6800 formulas into 120 "thematic" clusters. The goal is to understand what each formula is about, so the script can choose a varied mix of strategies.

But optimizing with these clusters actually worsened my performance, and I was wondering whether there might be better methods?


Step 1: "Cleaning" - Preparing the Formula Texts

First, the script looks at the raw XML code for each of the 6800 formulas.

It first "cleans" the text, for example, by:

  1. Converting everything to lowercase: sales(0,ttm)/sales(4,ttm) > 1.2

  2. Removing unnecessary "noise", such as quotation marks.

  3. Replacing all numbers with a generic symbol (#).

The formula above becomes sales(#,ttm)/sales(#,ttm) > #.#.

Thus, the script understands that Sales(0,TTM)/Sales(4,TTM) and Sales(1,TTM)/Sales(5,TTM) are the same concept: sales growth over four quarters. It focuses on the meaningful words (sales, ttm) and ignores the specific numbers, which are just variations on the same theme.
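The cleaning step above can be sketched in a few lines (a minimal illustration, not the original script):

```python
import re

def clean_formula(text: str) -> str:
    """Normalize a formula string: lowercase, strip quote noise, mask numbers."""
    text = text.lower()                            # 1. lowercase everything
    text = text.replace('"', "").replace("'", "")  # 2. remove quotation marks
    text = re.sub(r"\d+", "#", text)               # 3. every digit run -> #
    return text

print(clean_formula("Sales(0,TTM)/Sales(4,TTM) > 1.2"))
# -> sales(#,ttm)/sales(#,ttm) > #.#
```

With this normalization, the two sales-growth variants from the text collapse to the same string.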

Step 2: "Weighting" - Finding the Most Important Words (TF-IDF)

Now the script has 6800 "cleaned" text strings. The next step is to convert each text into a list of numbers representing how important each word is. This is done with TfidfVectorizer.

TF-IDF stands for Term Frequency–Inverse Document Frequency.

  • Term Frequency (TF): How often a word appears in a single formula. In a formula about sales growth, the word "sales" will have a high frequency.

  • Inverse Document Frequency (IDF): How rare a word is across all 6800 formulas. A very specific word like "FCFGr%" (Free Cash Flow Growth %) is rarer and therefore gets a high IDF score.

The combination (TF-IDF) gives the highest score to words that are important in a specific formula but rare overall. The result is that each formula is represented as a mathematical vector.

Step 3: "Grouping" - Placing the Formulas in 120 'Boxes' (MiniBatchKMeans)

Now that all the formulas are translated into number vectors, the MiniBatchKMeans clustering algorithm takes over:

  1. Places 120 random "centers" in the mathematical space where all the formula vectors are located.

  2. Assigns each formula to the center it is closest to. Now we have 120 rough groups.

  3. Moves each center to the midpoint (average) of all the formulas assigned to it.

  4. Repeats steps 2 and 3 several times. With each round, the groups get better and the centers move less, until they stabilize.

The result is 120 well-defined clusters. For example:

  • Cluster 5: Will contain formulas with words like price, sales, book. This is a "Value" cluster.

  • Cluster 22: Will contain formulas with rsi, sma. This is a "Momentum/Technical" cluster.

  • Cluster 48: Will contain formulas with roe%, roic%, grossprofit. This is a "Quality/Profitability" cluster.
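Putting steps 2 and 3 together, a minimal scikit-learn sketch looks like this (the toy corpus and n_clusters=3 are stand-ins for the real 6800 formulas and 120 clusters):

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Six toy formulas spanning three rough themes
docs = [
    "price / sales(#,ttm)", "price / bookvalue",   # value-ish
    "rsi(#) > #", "close(#) > sma(#)",             # technical
    "roe%(#,ttm) > #", "roic%(#,ttm) > #",         # quality-ish
]

X = TfidfVectorizer(token_pattern=r"[a-z#%]+").fit_transform(docs)

# 3 clusters for the toy corpus; the real script would use n_clusters=120
km = MiniBatchKMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)   # cluster id assigned to each formula
```

fit_predict runs exactly the loop described above: random centers, assignment, recentering, repeat until stable.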


Here's an alternative.

Value: all factors that include market cap, EV, or price as a numerator or denominator, such that the other term is not market cap, EV, or price.

Growth: all factors that compare a line item to the same line item in a different period, including Gr% factors. Limit this to income statement line items.

Quality: all OTHER factors based on line items.

Sentiment: all factors based on analyst estimates.

Momentum: all factors based purely on price or price-based formulas.

Stability: all factors based on SD or RSD or factors that rank middling values best.

Size: all factors that are simply market cap, volume, sales, # of analysts, etc., with lower values better.

Maybe you could use Claude Code to do this for you, especially if you give it access to the factor reference, which groups all formulas and functions by category. At any rate, it can be automated: you don't have to do this by hand.
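For what it's worth, a rough version of this scheme can be written as plain keyword rules; the patterns and factor names below are illustrative guesses, not a complete mapping of the factor reference:

```python
import re

# Ordered rules: first match wins. Keyword lists are hypothetical examples.
RULES = [
    ("Momentum",  r"\brsi\b|\bsma\b|\bema\b|%chg"),
    ("Sentiment", r"estimate|surprise|analyst"),
    ("Stability", r"\bsd\b|\brsd\b|stddev"),
    ("Value",     r"(price|mktcap|ev)\s*/|/\s*(price|mktcap|ev)"),
    ("Growth",    r"gr%"),
]

def classify(formula: str) -> str:
    f = formula.lower()
    for label, pattern in RULES:
        if re.search(pattern, f):
            return label
    return "Quality"   # catch-all: all OTHER factors based on line items

print(classify("Price / Sales(0,TTM)"))   # Value
print(classify("SalesGr%(0,TTM)"))        # Growth
print(classify("ROE%(0,TTM)"))            # Quality
```

A real version would need many more patterns (and the Size and implied-period cases), but it shows that the categorization can be automated rather than done by hand.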


The flow looks reasonable, but the cleaning seems to be doing both too much and too little. The handling of numbers stands out, as numbers can play multiple roles. The following is a mind dump; please note that trying all of these at once is likely to be error-prone.

  • The # character is used in some parameters like #inclna and #exclna. Does that cause a problem for you?
  • The Boolean values are integer 0 and 1. Are they mapped to the generic # symbol or to true or false? I suggest true/false. Also look at implied true: does it need to be consistently stated, or removed, across all statements?
  • How is the not (!) symbol handled? Does its presence create a new string?
  • Are numbers embedded in a name treated consistently? If one decides to transform them, how do xxx8xxx and xxx13xxx transform? Is it to xxx#xxx and xxx##xxx, or to xxx#xxx in both cases? More generally, do multi-digit numbers map to a one-character placeholder or a multi-character placeholder? I would try leaving them untransformed to retain the information.
  • Integers also play roles as index and offset in expressions like rsi(20, 5). Instead of the generic number character, consider transforming to defined role names, e.g. rsi(index, offset).
  • I frequently don’t specify defaults in an expression when I accept them, so rsi(20) and rsi(20, 0), which mean the same thing, will transform into different text strings. You may want to add the defaults back into your expressions.
  • There are cases where integers are used as keys, such as in specifying sector(“10, 20, 30”) or industry(“8940, 1433”). Decide if you want to leave them alone or replace them with text like sectorcode. The same question applies when you have a code like Energy.
  • When handling numeric comparison values like 32, 32.0, 8, or -2, what do they get transformed to? Text like “comparevalue”, or a pattern like ##.## or ##? Is calling out an embedded decimal point meaningful or a distraction? And are negative numbers handled with the minus sign retained, or merged into the number representation? I have found using modest negative numbers in compares to be sometimes profitable.
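To illustrate two of these points (role names instead of the generic symbol, and adding omitted defaults back in), here is a sketch with a made-up signature table; a real version would be generated from the factor reference:

```python
import re

# Hypothetical signature table: function name -> (role names, default values).
SIGNATURES = {
    "rsi": (["period", "offset"], ["20", "0"]),
    "sma": (["period", "offset"], ["50", "0"]),
}

def normalize_call(expr: str) -> str:
    def repl(match):
        name = match.group(1).lower()
        args = [a.strip() for a in match.group(2).split(",") if a.strip()]
        if name not in SIGNATURES:
            return match.group(0)       # unknown function: leave untouched
        roles, defaults = SIGNATURES[name]
        args += defaults[len(args):]    # restore omitted trailing defaults
        return name + "(" + ", ".join(roles[: len(args)]) + ")"
    return re.sub(r"(\w+)\(([^()]*)\)", repl, expr)

# rsi(20) and rsi(20, 0) now map to the same normalized string
print(normalize_call("RSI(20)"))
print(normalize_call("rsi(20, 0)"))
```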

Like I said, a mind dump. Studies have found that data scrubbing can be upwards of 85% of project time. My experience has supported that observation.

Good Luck,

Rich

The Natural Language Toolkit (NLTK) has a classifier module that could, for example, classify formulas as value and so on.

NLTK could be used to code what Yuval has suggested above if you do not want to use Claude for some reason. Claude and other LLMs use mechanisms like “attention” that you will not get from most Python libraries, but you can probably train a tool like NLTK for this narrow usage.

NLTK has the potential advantage (compared to your present method) of using supervised learning for classification if you do not want to use an LLM for most of this. You still might want to use Claude to create some labelled training data, however.
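As a minimal sketch of that supervised approach, here is NLTK's NaiveBayesClassifier trained on a tiny hand-labelled set (the labels and examples are made up for illustration; in practice you would want far more training data, perhaps generated with an LLM):

```python
from nltk.classify import NaiveBayesClassifier

def features(formula: str) -> dict:
    """Bag-of-words features over a roughly tokenized formula."""
    tokens = (formula.lower()
              .replace("(", " ").replace(")", " ").replace(",", " ")
              .split())
    return {tok: True for tok in tokens}

# Tiny hand-labelled training set, purely illustrative
train = [
    (features("price / sales ttm"), "Value"),
    (features("price / bookvalue"), "Value"),
    (features("rsi 14"), "Momentum"),
    (features("sma 50 close 0"), "Momentum"),
    (features("roe% ttm"), "Quality"),
    (features("roic% grossprofit"), "Quality"),
]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(features("close / sma 200")))
```

Unlike the unsupervised k-means clusters, the labels here are exactly the human-meaningful categories you choose up front.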