Machine learning, Portfolio123, historical data and ratios

I would like to use machine learning for stock selection based on fundamentals. It is only an exercise… I know it won’t work :)

Anyway, I want to analyze the NASDAQ and I need:

  1. The historical constituents of the index, i.e. the stocks included in the NASDAQ in 2010, 2011 and so on.
  2. Historical fundamentals, ratios and metrics for each stock, on a quarterly basis.
  3. The historical ratios for each sector and industry (e.g. PE for Semiconductors and so on), also on a quarterly basis.
  4. An easy way to get this data. I use R and Python, but having something easier to use would be a plus.

So my question:
Can Portfolio123 do this?

Or, alternatively, do you know of a service that does?

Thank you

Portfolio123 can offer you the raw data if you take out a license with one of our data providers. But that would likely cost you over $10,000. If that interests you, see and then

On the other hand, we can offer you transformed data without a license with this operation: or this one: These will enable you to download the rank within the NASDAQ of each metric that you want on each date you need.

There are also Data_Universe and Ranks endpoints available in the API if you would rather use that instead of the DataMiner tool.

For more information on the API and DataMiner, take a look at the Knowledgebase and the API documentation.

Hi Francesco,

I think Dan and Yuval can continue to help with specific questions regarding the data.

I think factor analysis works well. Because of the vectors used and the creation of orthogonal vectors, one can use aggregated data, which may simplify the data downloads, depending on your assumptions. I won’t go into my assumptions here or even argue that the way I do it is the best way or even correct.

The nice thing, assuming one finds it useful at all, is that it determines the significance of a factor (whether to use it or not), it determines latent variables which can easily be classified into nodes (e.g., sentiment or value), and the weight of the node or factor is automatically determined by the program.

All this can be used to create a regular ranking system for P123. I have not found further optimization to add any additional benefit after using factor analysis to create a ranking system. In other words, already optimized. Definitely easier, IMHO.
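
For readers who want to see the general idea in Python, here is a minimal sketch on synthetic data. It is not the workflow described above, whose assumptions were left unstated. The factor names, the two latent drivers and the choice of scikit-learn's FactorAnalysis with a varimax rotation are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Hypothetical factor names, invented for this illustration only.
FACTOR_NAMES = ["earn_yield", "book_yield", "sales_yield",
                "est_revision", "eps_surprise", "reco_change"]


def make_toy_data(n_rows=2000, seed=0):
    """Simulate six observed factors driven by two hidden ones (synthetic data)."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n_rows, 2))  # columns play the role of "value" and "sentiment"
    true_loadings = np.array([[0.9, 0.8, 0.7, 0.0, 0.0, 0.0],
                              [0.0, 0.0, 0.0, 0.9, 0.8, 0.7]])
    noise = 0.5 * rng.normal(size=(n_rows, 6))
    return latent @ true_loadings + noise


def assign_nodes(X, n_nodes=2):
    """Fit a varimax-rotated factor model and map each column to its strongest latent factor."""
    X_std = StandardScaler().fit_transform(X)
    fa = FactorAnalysis(n_components=n_nodes, rotation="varimax", random_state=0)
    fa.fit(X_std)
    loadings = fa.components_                # shape: (n_nodes, n_columns)
    nodes = np.abs(loadings).argmax(axis=0)  # index of the strongest latent factor per column
    return nodes, loadings


if __name__ == "__main__":
    nodes, loadings = assign_nodes(make_toy_data())
    for name, node, col in zip(FACTOR_NAMES, nodes, loadings.T):
        print(f"{name:13s} -> node {node}  loadings {np.round(col, 2)}")
```

Columns that load on the same latent factor would become one node of a ranking system. Turning the loadings into node weights is a further modelling choice that this sketch does not make.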

SPSS does a better job than Python for factor analysis, I think, and JASP has a free version that is pretty capable.

Andy Field writes well and covers factor analysis extensively in the book for R programming: Discovering Statistics Using R

Or this one for SPSS: Discovering Statistics Using IBM SPSS Statistics: North American

Python works well for other things, from logistic regression to neural nets. But your question was about something that was possibly easier.


Piravi, you may not need to download any data at all (for which you would need a direct data license with FactSet or S&P).

We’re almost ready for a pre-release launch of our AI Factors. It’s all web-based, easy to use, no code, with pretty front ends. And it all ties in directly with the screener or buy/sell rules.

Our initial release will include the following learning and prediction algorithms:

Random Forests
Extra Trees
Linear Regression
Keras Neural Networks
Support Vector Machines
Generalized Additive Models

You will be able to change hyper-parameters for each. We’ll also have several pre-processors for data transformation. Is there something missing from the list above? Any reason you would still want to download data?


Thanks, Marco, for the update.
Why not include lasso/ridge regression? They are good at discarding irrelevant factors and are unlikely to overfit.

Absolutely correct. In addition, much less computing power is required (and no GPUs). I would only add that for classification scikit-learn’s LogisticRegression() in Python also has L1 and L2 penalization, which are the same penalties used in Lasso and Ridge regression respectively.
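
To make the L1 versus L2 point concrete, here is a small sketch on synthetic data, assuming scikit-learn is installed. The data, the seed and the value of C are made up for illustration, and the argument spelling follows the scikit-learn releases I am familiar with. Only the first two of six features drive the label, so the other four are irrelevant factors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def compare_penalties(n_rows=1000, seed=0):
    """Fit L1- and L2-penalized logistic regressions on the same synthetic data."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, 6))
    # Only columns 0 and 1 carry signal; columns 2..5 are pure noise.
    logits = 2.0 * X[:, 0] - 2.0 * X[:, 1]
    prob_up = 1.0 / (1.0 + np.exp(-logits))
    y = (rng.random(n_rows) < prob_up).astype(int)

    # Small C means strong regularization. liblinear is a solver that supports L1.
    l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.01).fit(X, y)
    # The default penalty is L2, the same penalty ridge regression uses.
    l2_model = LogisticRegression(C=0.01).fit(X, y)
    return l1_model.coef_.ravel(), l2_model.coef_.ravel()


if __name__ == "__main__":
    l1_coef, l2_coef = compare_penalties()
    print("L1 coefficients:", np.round(l1_coef, 3))
    print("L2 coefficients:", np.round(l2_coef, 3))
```

The L1 fit tends to set the coefficients of the noise columns to exactly zero, which is the factor-discarding behaviour mentioned above, while the L2 fit shrinks every coefficient but keeps them all.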

For those who wonder about using this with ranks, remember that P123 can also provide z-scores without a data license, and there should be no concerns about using any of this with z-scores.
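
As a reminder of what a z-score is, here is a minimal sketch using the textbook definition (value minus mean, divided by standard deviation) across the stocks on one date. How P123 computes its z-scores is not stated in this thread, so the use of the population standard deviation and the sample numbers below are assumptions.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one date's cross-section of a metric to mean 0 and std 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # Every stock has the same value, so no stock stands out.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    pe_on_one_date = [8.0, 12.0, 20.0, 40.0]  # made-up numbers
    print([round(z, 3) for z in zscores(pe_on_one_date)])
```

Standardized inputs suit penalized regressions because the L1 and L2 penalties depend on the scale of each feature.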

Also, these methods are ideal for any continuous data (e.g., returns or much of the technical data available). P123 recently had a feature suggestion that included Pitmaster’s excellent observation about the benefits of L1 and L2 penalization.

There is code in this thread for the use of LogisticRegression() which uses L2 penalization as the default (like ridge regression).

Pitmaster, I think you have a great idea that, at a minimum, might save some computer resources at P123, in addition to being an excellent method of shrinking and selecting factors, as you said.

More generally, what P123 can do with this is pretty limitless.

And none of this is to say that I am not excited about trying XGBoost!!! There must have been a lot of good work that went into making that possible.

Awesome! Really looking forward to this Marco!