Over time, I have collected a good set of criteria (~6800). Naturally, many of these overlap, even if they aren't identical. When creating a ranking system and trying to optimize it for a universe, I understand that with so many criteria, and with 80-100 criteria running in each simulation, 3000 simulations will cover only a vanishingly small fraction of the possible combinations.
Therefore, I've taken a technique from machine learning and text analysis (Natural Language Processing - NLP) to group these ~6800 formulas into 120 "thematic" clusters. The goal is to understand what each formula is about, so the script can choose a varied mix of strategies.
However, using these clusters actually worsened my results, and I'm wondering whether there are better methods?
Step 1: "Cleaning" - Preparing the Formula Texts
First, the script looks at the raw XML code for each of the 6800 formulas.
It first "cleans" the text, for example, by:
- Converting everything to lowercase: sales(0,ttm)/sales(4,ttm) > 1.2
- Removing unnecessary "noise", such as quotation marks.
- Replacing all numbers with a generic symbol (#). The formula above becomes sales(#,ttm)/sales(#,ttm) > #.#.
Thus, the script understands that Sales(0,TTM)/Sales(4,TTM) and Sales(1,TTM)/Sales(5,TTM) are the same concept: sales growth over four quarters. It focuses on the meaningful tokens (sales, ttm) and ignores the specific numbers, which are just variations of the same theme.
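As a sketch, this cleaning step can be done with one regular expression (the function name and exact regex are illustrative, not the actual code):

```python
import re

def clean_formula(raw: str) -> str:
    """Normalize a raw formula string so variants of the same concept
    collapse to one token sequence (illustrative sketch)."""
    text = raw.lower()                # lowercase everything
    text = text.replace('"', '')      # strip quotation-mark "noise"
    text = re.sub(r'\d+', '#', text)  # replace every number with #
    return text

# Two variants of "sales growth over 4 quarters" become identical:
a = clean_formula('Sales(0,TTM)/Sales(4,TTM) > 1.2')
b = clean_formula('Sales(1,TTM)/Sales(5,TTM) > 1.2')
print(a)       # sales(#,ttm)/sales(#,ttm) > #.#
print(a == b)  # True
```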
Step 2: "Weighting" - Finding the Most Important Words (TF-IDF)
Now the script has 6800 "cleaned" text strings. The next step is to convert each text into a list of numbers representing how important each word is. This is done with TfidfVectorizer.
TF-IDF stands for Term Frequency–Inverse Document Frequency.
- Term Frequency (TF): how often a word appears in a single formula. In a formula about sales growth, the word "sales" will have a high frequency.
- Inverse Document Frequency (IDF): how rare a word is across all 6800 formulas. A very specific word like FCFGr% (Free Cash Flow Growth %) is rarer and therefore gets a high IDF score.
The combination (TF-IDF) gives the highest score to words that are important in a specific formula but rare overall. The result is that each formula is represented as a mathematical vector.
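A minimal TfidfVectorizer sketch of this step (the sample formulas and the token_pattern are my assumptions; the post doesn't show the actual settings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A few cleaned formulas standing in for the real ~6800
cleaned = [
    'sales(#,ttm)/sales(#,ttm) > #.#',
    'fcfgr%(#,ttm) > #',
    'price/book < #',
]

# Custom token_pattern so tokens like fcfgr% and # survive tokenization
vectorizer = TfidfVectorizer(token_pattern=r'[a-z%#]+')
X = vectorizer.fit_transform(cleaned)  # sparse matrix: one row (vector) per formula

print(X.shape)                         # (3, number_of_distinct_tokens)
print(sorted(vectorizer.vocabulary_))  # the tokens being weighted
```

Each row of `X` is the mathematical vector for one formula; rare, formula-specific tokens get the largest weights.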
Step 3: "Grouping" - Placing the Formulas in 120 'Boxes' (MiniBatchKMeans)
Now that all the formulas are translated into number vectors, the MiniBatchKMeans clustering algorithm takes over:
1. Places 120 random "centers" in the mathematical space where all the formula vectors live.
2. Assigns each formula to the center it is closest to. Now we have 120 rough groups.
3. Moves each center to the midpoint (average) of all the formulas assigned to it.
4. Repeats steps 2 and 3 several times. With each round, the groups improve and the centers move less, until they stabilize.
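The whole pipeline, condensed into a runnable sketch (the cluster count is reduced and the sample formulas are invented; the real run uses 120 clusters over ~6800 cleaned strings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

cleaned = [
    'sales(#,ttm)/sales(#,ttm) > #.#',  # growth
    'eps(#,ttm)/eps(#,ttm) > #.#',      # growth
    'price/book < #',                   # value
    'price/sales < #',                  # value
    'rsi(#) < #',                       # technical
    'close(#) > sma(#)',                # technical
]

# Step 2: TF-IDF vectors
X = TfidfVectorizer(token_pattern=r'[a-z%#]+').fit_transform(cleaned)

# Step 3: clustering (n_clusters=120 in the real script; 3 for the toy data)
km = MiniBatchKMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)  # cluster id assigned to each formula
```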
The result is 120 well-defined clusters. For example:
- Cluster 5: will contain formulas with words like price, sales, book. This is a "Value" cluster.
- Cluster 22: will contain formulas with rsi, sma. This is a "Momentum/Technical" cluster.
- Cluster 48: will contain formulas with roe%, roic%, grossprofit. This is a "Quality/Profitability" cluster.