P123 compound nodes as AI Factors?

Can that be done now?

If not, it might be a good thing to consider. Use case for now: compound nodes could be set to principal component analysis (PCA) weights for factors, using P123's factor download to find the principal component weights with Python. Those weights could then be used as the factor weights in a node.
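For concreteness, here is a minimal sketch of that Python step, assuming a hypothetical `factors.csv` downloaded from P123 with one column per factor (the file and column names are placeholders, not P123's actual download format):

```python
# Minimal sketch, assuming a hypothetical "factors.csv" factor download
# with one column per factor (file and column names are placeholders).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.read_csv("factors.csv")          # rows = stock-dates, columns = factors
X = StandardScaler().fit_transform(df)   # PCA is scale-sensitive, so standardize first

pca = PCA(n_components=5)                # keep the first 5 principal components
pca.fit(X)

# Each row of components_ is one principal component; its entries are the
# per-factor weights that would go into a compound node.
weights = pd.DataFrame(pca.components_, columns=df.columns)
print(weights.round(3))
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
```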

Those nodes could then be used in a linear regression to do principal component regression (PCR), making it possible to rebalance a PCR model in a port.
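And a rough, self-contained PCR sketch under the same assumptions, with a hypothetical `future_return` column as the target:

```python
# Rough PCR sketch (not a P123 feature), assuming the same hypothetical
# "factors.csv" plus a hypothetical "future_return" target column.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

df = pd.read_csv("factors.csv")
y = df.pop("future_return")              # hypothetical target column

# Standardize, project onto principal components, then fit a linear regression.
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
pcr.fit(df, y)
print("In-sample R^2:", round(pcr.score(df, y), 3))
```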

This addresses known issues with collinearity for linear regression and the curse of dimensionality for ML models in general (i.e., it would be a method of dimensionality reduction, if desired, for any ML regression model).

Hi @Jrinne

I thought about this and am unsure if it matters. The algos, I think, will find weights or weighted paths that should result in the same thing, theoretically.

Are you sure this isn't automatically accomplished implicitly by the AI models?

LASSO Regression and Ridge Regression are nice additions to the ML models. LASSO Regression reduces dimensionality somewhat, and both are excellent for addressing multicollinearity. P123 has done a nice job by providing these ML methods, and I am sure P123 will make it possible to see the coefficients for these models in the near future. P123 has done a nice job with AI/ML in general, IMHO.
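For anyone who wants to see the difference in behavior, here is a small illustration on synthetic, nearly collinear data (not P123 output): LASSO tends to drive redundant coefficients exactly to zero, while Ridge shrinks them without zeroing them.

```python
# Illustration only, on synthetic data with a nearly collinear pair of features.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X = np.column_stack([
    x1,
    x1 + 0.01 * rng.normal(size=500),   # near-duplicate of x1 (collinear)
    rng.normal(size=500),
])
y = 2 * x1 + X[:, 2] + rng.normal(size=500)

print("LASSO:", Lasso(alpha=0.1).fit(X, y).coef_.round(2))  # typically zeros one of the pair
print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_.round(2))  # shrinks both, keeps them nonzero
```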

But nowhere have I read that LASSO Regression and Ridge Regression are the same thing as Principal Component Regression, even if they accomplish some of the same goals. Sklearn puts it in a different category anyway: Principal Component Regression vs Partial Least Squares Regression.

I am not saying P123 should make this a priority.

I am just wondering whether compound nodes can be used as features in P123's AI Factors. It is mostly a question, with some explanation of why it interests me.

Also, for all types of regression (not just linear models), on the curse of dimensionality:

"The common theme of these problems is that when the dimensionality increases, ……the amount of data needed often grows exponentially with the dimensionality."

A rule of thumb is that amount_of_data_needed = constant * k^number_of_features. For the math geeks: the number of dimensions is the same as the number of features.

So if you use 300 factors at P123, k^300 could mean a lot of data is needed to get good answers, no matter what the empirical value of k is. So this is not just an academic point for some at P123. We may be operating under a curse and not know it!
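A quick back-of-the-envelope illustration, using an arbitrary k = 2 purely for the arithmetic:

```python
# Back-of-the-envelope illustration of the rule of thumb N ≈ constant * k**d.
# Even with a modest k = 2, the implied sample size explodes with the feature count.
for d in (5, 30, 300):
    print(f"{d} features -> k**d = {2 ** d:.3e}")
```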

I would not mind finding out if I have some sort of math-geek-head curse, myself. :slightly_smiling_face:

Jim

I tried to use latent variables created by Partial Least Squares (PLS) regression as additional predictors for tree-based ML models. Based on my very preliminary research, this approach seems to increase variance with no additional benefit.
PLS latent variables may work better for linear models.
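For reference, here is a sketch of the kind of experiment I mean, on synthetic data with hypothetical shapes: fit PLS, append its latent variables to the original features, then train a tree-based model on the augmented matrix.

```python
# Sketch only (synthetic data, hypothetical shapes), not the exact setup above.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=1000)

pls = PLSRegression(n_components=3).fit(X, y)
X_aug = np.hstack([X, pls.transform(X)])          # original features + PLS latent variables

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_aug, y)
print("Augmented feature count:", X_aug.shape[1])
```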


@Jrinne, PCA is a preprocessing step for modeling. As you correctly noted, it transforms the feature set into a different set of components whose covariance with one another is zero. It is simply a tool and nothing more. We might see improvements, or we might not. In my experience, PCA does not help with the interpretability of financial data. Additionally, there is information loss with PCA, so keep that in mind; it is not a free lunch.
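A quick way to see that information loss is the cumulative explained variance ratio; this is just an illustration on synthetic correlated data.

```python
# Illustration only: keeping a few components discards part of the variance
# in the original (correlated, synthetic) features.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 10)))

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
print("Variance retained by the first 3 components:", round(cum[2], 3))  # the rest is lost
```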

Your most insightful remark is that the AI/ML feature is essentially a grid search. The entire process can be automated as follows (a rough sketch follows the list):

  1. Throw in the kitchen sink (use all available features).
  2. Model with various methods and preprocessing techniques.
  3. Train and test with various look-back periods and methods.
  4. Report the grid results based on objectives and statistical significance.
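Here is a rough scikit-learn sketch of those four steps, with hypothetical file and column names; this is not the P123 implementation.

```python
# Hypothetical sketch of steps 1-4: one pipeline, several preprocessing/model
# combinations, walk-forward splits, and a report of the grid results.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

df = pd.read_csv("factors.csv")          # step 1: hypothetical "kitchen sink" factor download
y = df.pop("future_return")              # hypothetical target column

pipe = Pipeline([("scale", StandardScaler()), ("reduce", "passthrough"), ("model", Ridge())])
grid = {
    "reduce": ["passthrough", PCA(n_components=10)],   # step 2: with and without PCA
    "model": [Ridge(alpha=1.0), Lasso(alpha=0.01)],    # step 2: different regressors
}
search = GridSearchCV(pipe, grid, cv=TimeSeriesSplit(n_splits=5), scoring="r2")  # step 3
search.fit(df, y)

# Step 4: report the grid results.
print(pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "std_test_score"]])
```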

The AI tool provides no insight; it is strictly empirical, so there is not much one can do aside from proper optimization.

My actual quote: