I think Dan and Yuval can continue to help with specific questions regarding the data.
I think factor analysis works well. Because it builds orthogonal vectors from the underlying factors, one can work with aggregated data, which may simplify the data downloads, depending on your assumptions. I won’t go into my assumptions here, or argue that the way I do it is the best way or even correct.
The nice thing, assuming one finds it useful at all, is that it determines the significance of a factor (whether to use it or not), it identifies latent variables which can easily be classified into nodes (e.g., sentiment or value), and the weight of each node or factor is determined automatically by the program.
All of this can be used to create a regular ranking system for P123. I have not found further optimization to add any benefit after using factor analysis to create a ranking system. In other words, it is already optimized. Definitely easier, IMHO.
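For anyone who wants to experiment in Python first, here is a minimal sketch of this kind of workflow using scikit-learn’s FactorAnalysis on synthetic data. The factor names and settings are hypothetical, and this illustrates the general idea rather than my exact procedure:

```python
# Minimal sketch: factor analysis with a varimax rotation on synthetic data.
# The factor names are hypothetical stand-ins for z-scored factor exposures.
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
names = ["pe", "pb", "roe", "momentum_6m", "sentiment"]
X = rng.standard_normal((500, len(names)))  # stand-in for standardized factors

# Extract orthogonal latent factors; the varimax rotation makes the loadings
# easier to interpret (e.g., a "value" node vs. a "sentiment" node).
fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0)
scores = fa.fit_transform(X)  # per-stock scores on each latent factor

loadings = pd.DataFrame(fa.components_.T, index=names,
                        columns=["factor_1", "factor_2"])
print(loadings)  # shows which raw factors load on which latent node
```

The loadings (or the factor scores directly) can then seed the node weights of a ranking system. Note that scikit-learn does not report the factor-significance tests that SPSS or JASP provide.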
SPSS does a better job than Python for factor analysis, I think, and JASP has a free version that is pretty capable.
Andy Field writes well and covers factor analysis extensively in his book on R programming, Discovering Statistics Using R.
Piravi, you may not need to download any data at all (for which you would need a direct data license with FactSet or S&P).
We’re almost ready for a pre-release launch of our AI Factors. It’s all web-based, easy to use, no code, with pretty front ends. And it all ties in directly with the screener or buy/sell rules.
Our initial release will include the following learning and prediction algorithms:
XGBoost
Random Forests
Extra Trees
Linear Regression
Keras Neural Networks
Support Vector Machines
Generalized Additive Models
You will be able to change hyper-parameters for each. We’ll also have several pre-processors for data transformation. Is there something missing from the list above? Any reason you would still want to download data?
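For anyone curious what such a fit-and-predict loop looks like under the hood, here is a rough local sketch using one of the listed algorithms, Extra Trees, on synthetic data. All names and numbers are placeholders; this is not how AI Factors is implemented:

```python
# Rough sketch of a fit/predict loop on synthetic data. In practice X would
# hold z-scored factors and y forward returns; everything here is a placeholder.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))            # stand-in factor matrix
y = X[:, 0] * 0.1 + rng.standard_normal(1000)  # noisy stand-in returns

# Time-ordered split (no shuffling) to avoid look-ahead bias.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False)

model = ExtraTreesRegressor(n_estimators=500,  # tunable hyper-parameters
                            max_depth=6, random_state=0)
model.fit(X_train, y_train)
print("Out-of-sample R^2:", model.score(X_test, y_test))
```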
Absolutely correct. In addition, much less computing power is required (and no GPUs). I would only add that, for classification, Python’s LogisticRegression() also has L1 and L2 penalization, which are equivalent to (or actually the same as) Lasso and Ridge regression, respectively.
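A quick sketch of that equivalence on synthetic data (in practice y would be a binary label, such as beating the universe median return):

```python
# Sketch: L1 vs. L2 penalization in LogisticRegression on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
y = (X[:, 0] + 0.5 * rng.standard_normal(500) > 0).astype(int)

# L1 (Lasso-style): shrinks some coefficients exactly to zero, effectively
# selecting factors; the 'liblinear' solver supports the L1 penalty.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

# L2 (Ridge-style, the default): shrinks all coefficients toward zero
# without eliminating any.
l2_model = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

print("L1 nonzero coefficients:", int((l1_model.coef_ != 0).sum()))
print("L2 nonzero coefficients:", int((l2_model.coef_ != 0).sum()))
```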
For those who wonder about using this with ranks: remember that P123 can also provide z-scores without a data license, and there should be no concerns about using any of this with z-scores.
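And if you ever want to compute cross-sectional z-scores yourself, it is a one-liner in pandas. A sketch with hypothetical column names (again, P123 can supply z-scores directly):

```python
# Sketch: z-scoring each factor within each date so stocks are comparable
# cross-sectionally. Column names and data are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": np.repeat(pd.date_range("2023-01-06", periods=4, freq="W"), 50),
    "pe": rng.standard_normal(200),
    "momentum_6m": rng.standard_normal(200),
})

zscores = df.groupby("date")[["pe", "momentum_6m"]].transform(
    lambda col: (col - col.mean()) / col.std())
print(zscores.head())
```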
Also, these methods are ideal for any continuous data (e.g., returns or much of the technical data available). P123 recently had a feature suggestion that included Pitmaster’s excellent observation about the benefits of L1 and L2 penalization: https://www.portfolio123.com/mvnforum/viewthread_thread,13384
There is code in this thread for using LogisticRegression(), which applies L2 penalization by default (like ridge regression).
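For the continuous case, Lasso and Ridge can be used directly. A short sketch on synthetic data (in practice X would be z-scored factors and y forward returns):

```python
# Sketch: the continuous-target analogues, Lasso (L1) and Ridge (L2),
# on synthetic stand-ins for factors and forward returns.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
y = X[:, 0] * 0.1 + rng.standard_normal(500)

lasso = Lasso(alpha=0.05).fit(X, y)  # drives some factor weights to zero
ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks all weights toward zero

# The surviving nonzero Lasso coefficients are the "selected" factors.
print("Factors kept by Lasso:", int((lasso.coef_ != 0).sum()))
```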
Pitmaster, I think you have a great idea that, at a minimum, might save some computing resources at P123, in addition to being an excellent method of shrinking and selecting factors, as you said.
More generally, what P123 can do with this is pretty limitless.
And none of this is to say that I am not excited about trying XGBoost! A lot of good work must have gone into making that possible.