AI Factor Parquet file download

Marco,

I’m very interested in the prospect of creating apps for the platform. My experience is that when I build tools for a community, the accountability of others using my code forces me to be more rigorous, which ultimately makes the tools more reliable for my own trading.

My focus is heavily on pre-training analysis: feature selection and normalization methodology. Most of my heavy lifting happens before a single "final" model is trained. However, this work is extremely resource-demanding. To do this right within the P123 ecosystem, access to compute workers and, crucially, the ability to work with raw data before normalization/transformation are both necessities.

While post-training analysis is interesting, it’s rarely as computationally expensive as the "search" for the best data representation.

As an example, I just finished building an engine designed to run tens of thousands of shallow tree-based simulations to solve the exact problem mentioned in this thread:

Exactly. We shouldn't throw 4,000 features at a model and hope it figures it out. We should find which of those 23 versions is the "truest" representation of that specific factor's alpha.

The App/Engine I’ve Developed: I’ve built a Python-based analysis engine (using LightGBM and Joblib for heavy parallelization) that automates this discovery. Developing the engine was the tricky part; fully utilizing a processor's capacity is not easy. Here is the workflow:

  1. Combinatorial Testing: It takes a set of raw features (e.g., 200 metrics) and applies a matrix of normalization methods (Z-Score, Rank, etc.) and transformation methods (Log, Box-Cox, etc.).

  2. Shallow Tree Simulations: For every single variation, it runs 20+ iterations of shallow LightGBM models. To ensure the feature isn't just "getting lucky," the engine pairs the target feature with 15 random low-correlation features to test its predictive power in a "noisy" environment.

  3. Statistical Validation: A typical run (4 normalizations x 4 transformations x 200 features x 20 iterations) comes to 64,000 iterations, which the engine handles on a standard PC. I managed to run a 30k-iteration analysis on my i9 processor in about 4 hours.

  4. The "Winner" Selection: It doesn't just look at the highest return. It uses a Composite Score that weights win rate, margin, consistency, and other metrics.
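To make the workflow above concrete, here is a minimal NumPy-only sketch of steps 1–4. It is a stand-in, not the actual engine: the grid is reduced to 2x2, the shallow-LightGBM evaluation is replaced by a simple correlation margin over random noise features, and the composite-score weights (0.5/0.3/0.2) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: one raw factor plus a target it weakly predicts.
n = 2000
raw = rng.lognormal(mean=0.0, sigma=1.0, size=n)
target = 0.1 * np.log1p(raw) + rng.normal(scale=1.0, size=n)

# Step 1 -- combinatorial matrix of normalizations and transformations
# (a 2x2 subset of the real 4x4 grid, for illustration).
normalizations = {
    "zscore": lambda x: (x - x.mean()) / x.std(),
    "rank":   lambda x: np.argsort(np.argsort(x)) / (len(x) - 1),
}
transformations = {
    "identity": lambda x: x,
    "log1p":    lambda x: np.log1p(x - x.min()),  # shift keeps the log defined
}

def composite_score(variant, y, n_iters=20, n_noise=15):
    """Steps 2-4 for one variant. Each iteration pairs the variant with
    15 random noise features and asks whether it beats that noisy
    environment. The correlation margin here is a cheap proxy for the
    shallow-LightGBM importance used by the real engine."""
    r_var = abs(np.corrcoef(variant, y)[0, 1])
    margins = []
    for _ in range(n_iters):
        noise = rng.normal(size=(len(y), n_noise))
        r_noise = max(abs(np.corrcoef(noise[:, j], y)[0, 1])
                      for j in range(n_noise))
        margins.append(r_var - r_noise)  # margin over best noise feature
    margins = np.array(margins)
    win_rate = (margins > 0).mean()          # how often it beats the noise
    consistency = 1.0 / (1.0 + margins.std())
    # Illustrative weights -- the real Composite Score weighting differs.
    return 0.5 * win_rate + 0.3 * margins.mean() + 0.2 * consistency

# Run the full (here 2x2) grid and pick the "winner".
results = {}
for t_name, transform in transformations.items():
    for n_name, normalize in normalizations.items():
        variant = normalize(transform(raw))
        results[(t_name, n_name)] = composite_score(variant, target)

best = max(results, key=results.get)
print("best variant:", best)
```

In the real engine this grid loop is the part that gets parallelized (with Joblib) across all cores, since every (feature, normalization, transformation) cell is independent of the others.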

An analysis like this requires raw PIT data. Currently, very few people have the FactSet API licenses (private investors can’t get them either) or the local infrastructure to run 60k+ simulations on raw data.

If P123 provided the infrastructure to host an app like this—where we could utilize P123’s compute workers to "brute-force" the search for optimal normalizations and transformations—it would be a total game-changer for serious AI users. We could move from "guessing" which normalization works to having a statistically backed data-prep pipeline.

I’d love to hear your thoughts on how we might eventually integrate these types of apps into the new infrastructure. I have many other ideas that would require heavy compute where the P123 workers could be utilized.

PS, I just saw your post about the AI Factor 2.0 release. Maybe this type of normalization analysis could be implemented straight into the new release?
