A very simple method, but could this work:

I have downloaded my factors and have the following overview, with normalized values for each factor (feature or node):

In column D, I have the three-month future return relative to SPY (FutureRel%Chg_D(65, GetSeries("SPY"))).

What if I

  1. Kept only the 5,000 best performers in column D, out of more than 100,000 rows.
  2. Then added a new row 1 that counts, for each factor (feature), how many of its normalized values equal 1. (I may have misunderstood this, but I believe 1 is the highest score a factor can achieve.)
  3. Then created a ranking system based on, for example, the 40 nodes with the most 1s among the 5,000 best performers.
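The three steps could be sketched in pandas roughly as follows. This is only an illustration: the function name `top_features_by_ones`, the column name `future_rel_ret`, and the toy factor names are made up, and it assumes one row per stock/date with every other column holding a normalized factor value.

```python
import numpy as np
import pandas as pd

def top_features_by_ones(df, return_col, n_best=5000, n_features=40):
    """Count how often each factor equals 1 among the best performers."""
    feature_cols = [c for c in df.columns if c != return_col]
    # Step 1: keep the n_best rows with the highest future relative return.
    best = df.nlargest(n_best, return_col)
    # Step 2: count the 1s per factor (isclose guards against float rounding).
    is_one = np.isclose(best[feature_cols].to_numpy(dtype=float), 1.0)
    counts = pd.Series(is_one.sum(axis=0), index=feature_cols)
    # Step 3: keep the factors with the most 1s.
    return counts.sort_values(ascending=False).head(n_features)

# Toy data standing in for the downloaded sheet (hypothetical values).
toy = pd.DataFrame({
    "future_rel_ret": [0.30, 0.20, 0.10, -0.05, -0.20],
    "value":          [1.0, 1.0, 0.4, 1.0, 0.2],
    "momentum":       [1.0, 0.5, 0.6, 0.1, 1.0],
    "insider":        [0.2, 0.3, 0.1, 1.0, 1.0],
})
result = top_features_by_ones(toy, "future_rel_ret", n_best=3, n_features=2)
print(result)
```

On the real download you would call it with `n_best=5000` and `n_features=40`.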

Wouldn't that provide a very good indication of which of my 300 features (nodes) perform best in relation to future three-month returns?

The idea is that the features with the most 1s among the 5,000 best performers indicate which characteristics the winning stocks tend to share.

It is better to perform this operation in reverse, for the 5,000 worst performers.

Why? And how would you do it?
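One possible reading of the "in reverse" suggestion, as a sketch rather than what the poster necessarily meant: count the 1s among the worst performers as well, and compare. A factor that scores 1 often among both winners and losers may just be picking up volatile stocks, so the difference in shares can say more than either count alone. The function name `ones_share_spread` and all column names below are hypothetical, and the index is assumed to be unique.

```python
import numpy as np
import pandas as pd

def ones_share_spread(df, return_col, n_tail=5000):
    """Share of 1s among the best rows minus the share among the worst rows."""
    feature_cols = [c for c in df.columns if c != return_col]
    is_one = pd.DataFrame(
        np.isclose(df[feature_cols].to_numpy(dtype=float), 1.0),
        index=df.index,
        columns=feature_cols,
    )
    best_idx = df[return_col].nlargest(n_tail).index
    worst_idx = df[return_col].nsmallest(n_tail).index
    # Mean of a boolean column is the fraction of rows where the factor is 1.
    spread = is_one.loc[best_idx].mean() - is_one.loc[worst_idx].mean()
    return spread.sort_values(ascending=False)

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "future_rel_ret": [0.30, 0.20, 0.10, -0.05, -0.20],
    "value":          [1.0, 1.0, 0.4, 1.0, 0.2],
    "momentum":       [1.0, 0.5, 0.6, 0.1, 1.0],
    "insider":        [0.2, 0.3, 0.1, 1.0, 1.0],
})
spread = ones_share_spread(toy, "future_rel_ret", n_tail=2)
print(spread)
```

Factors at the bottom of the sorted result are the ones that show up mostly among the losers, which may be useful as negative signals.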

Fantastic insights, I shall be bashing the bishop all day over those insights.

Thank you, but the idea came from a conversation with Test_user. Keep us updated on how your testing goes.

But would a large enough sample, even more than 5,000, help steer the model away from fitting the strategy to the "winning lottery tickets"?

One question: when you normalize numbers, am I correct in understanding that 1 is the highest? So if the stock factor (node) is a measure of high insider buying, the stock with the most insider buying in the sample set will get a normalized value of 1?
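I don't know which normalization the download actually uses, so the following is only a sketch of two common schemes, min-max scaling and percentile rank. In both, the stock with the largest raw value ends up at 1. The tickers and numbers are made up.

```python
import pandas as pd

# Hypothetical raw insider-buying values for four stocks.
insider_buying = pd.Series(
    [0.0, 12.5, 3.0, 40.0], index=["AAA", "BBB", "CCC", "DDD"]
)

# Min-max scaling: the smallest value maps to 0, the largest to 1.
min_max = (insider_buying - insider_buying.min()) / (
    insider_buying.max() - insider_buying.min()
)

# Percentile rank: rank divided by the number of stocks, so the top rank is 1.
pct_rank = insider_buying.rank(pct=True)

print(pd.DataFrame({"raw": insider_buying, "min_max": min_max, "pct_rank": pct_rank}))
```

The two schemes differ at the bottom: the lowest stock gets 0 under min-max but 1/N under percentile rank.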

As I recall, the machine learning model trained on the 1/5th of the stocks with the highest future returns in the paper I'm talking about got OK returns, but they were extremely volatile, and the return-to-risk ratio was not even as good as the model trained on the 1/5th of the stocks with the second highest future returns.

These are the 50 features with the highest number of 1s, equally weighted.

Not the best result:

image
