Extreme number of nodes (300): does it damage the ranking system?

Still new to P123, I’ve grasped that a high number of nodes can be important for building a robust system. What I have been working on has 80 nodes and is used against the Easy Trade US Universe. It has yielded good results.
But to take it to the extreme, and just as a test, I increased the number of nodes to 300. These are taken from danp’s overview.
Surprisingly, the system with 300 nodes gives a slightly better result than the one with 82.

But, beyond the fact that it is slow to test such a large system, are there any dangers with so many nodes?

I’ve seen that Yuval has published his number of nodes in the “Design Model” description. But how many do you have in your own system?

When you are using that many factors/nodes, some of them are going to be correlated.

P123 staff have posted at least two ways of dealing with that. Yuval has mentioned his method of looking at correlations. Dan has expressed an interest in correlations as well, but I am not sure what he is actively looking at or may have implemented in that regard.

Marco says he will be making partial least squares (PLS) available in the near future (the second clear implementation by P123 of a method that looks at correlation, Yuval’s being the first).
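For anyone curious what PLS does with correlated factors, here is a minimal sketch using scikit-learn. This is only an illustration of the general technique, not P123’s implementation; the factor names and data are made up.

```python
import numpy as np
import pandas as pd
from sklearn.cross_decomposition import PLSRegression

# Made-up data: two highly correlated cash-flow factors plus one independent factor.
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
factors = pd.DataFrame({
    "FCF2EV": signal + rng.normal(scale=0.2, size=n),
    "FCF2MktCap": signal + rng.normal(scale=0.2, size=n),
    "DbtReduction": rng.normal(size=n),
})
fwd_return = 0.5 * signal + 0.3 * factors["DbtReduction"] + rng.normal(size=n)

# PLS compresses the inputs into a few components chosen for their covariance
# with the target, so near-duplicate factors load on the same component instead
# of fighting over unstable individual weights.
pls = PLSRegression(n_components=2)
pls.fit(factors, fwd_return)
print(pd.Series(pls.coef_.ravel(), index=factors.columns).round(3))
```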

With over 300 nodes (and, by definition of a node, more than 300 factors), it seems possible that you are using more than one type of cash flow, and maybe more than one value ratio that includes the same cash flow factor, like FCF/EV and FCF/MarketCap.

It is hard to get away from the idea that there might be some correlation among your factors. That is not a trivial problem when trying to optimize the weights of your factors in a ranking system (or some other algorithm).
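One quick sanity check, whichever method you end up using, is simply to look at the rank correlations between your factor columns. Here is a sketch with made-up data and placeholder column names; in practice you would export the factor values from a screen.

```python
import numpy as np
import pandas as pd

# Made-up cross-sectional factor values for one rebalance date; the column
# names echo the FCF/EV vs. FCF/MarketCap example above.
rng = np.random.default_rng(1)
fcf = rng.normal(size=200)
df = pd.DataFrame({
    "FCF2EV": fcf + rng.normal(scale=0.1, size=200),
    "FCF2MktCap": fcf + rng.normal(scale=0.1, size=200),
    "DbtReduction": rng.normal(size=200),
})

# Spearman (rank) correlation matches what a ranking system cares about: ordering.
# Pairs well above, say, 0.8 are largely doing the same job.
print(df.corr(method="spearman").round(2))
```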

So my answer would be there is nothing wrong with 300 nodes. But the question is how do you weight the nodes and factors with the correlations in mind?

I think Yuval and Marco both have good ideas on this. You can find other ways to do it and/or use something of a hybrid and maybe add a few ideas of your own.

To some extent this is a multicollinearity problem, one that has been discussed frequently (basically everywhere in the literature for 100 years). Marc Gerstein has addressed this topic in the forum along with its evil twin, misspecification. He apparently attended some of those lectures when he got his finance degree, and some of that training went into developing P123’s present methods, which are already pretty advanced.

This is not a new topic. Marco has decided to implement some solutions. Another example of Marco helping with this is the observation that boosting and random forests are robust to this problem.
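A toy illustration of that last point, with scikit-learn standing in for whatever P123 ends up shipping: tree ensembles keep working when two inputs are near-duplicates, whereas ordinary regression coefficients become unstable.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Toy example: a factor and a near-duplicate of it, plus a noisy target.
rng = np.random.default_rng(2)
x = rng.normal(size=(1000, 1))
X = np.hstack([x, x + rng.normal(scale=0.01, size=(1000, 1))])  # collinear pair
y = x.ravel() + rng.normal(scale=0.5, size=1000)

# With near-duplicate inputs, the individual OLS coefficients are poorly
# determined (roughly, only their sum is pinned down)...
print(LinearRegression().fit(X, y).coef_)

# ...while the forest simply shares the duplicated signal across its splits;
# neither its fit nor its feature importances blow up.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(rf.feature_importances_)
```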

Jim

I’ve always said, the more nodes the better, as long as they make good financial sense.

Let’s look at some extremely correlated factors as an example: those based on free cash flow. You can start with free cash flow margin, free cash flow yield, unlevered free cash flow to enterprise value, free cash flow to assets, and free cash flow growth. You can compare against universes, sectors, and industries. You can look at the latest quarter, TTM, annual, 3-yr, or 5-yr average. And you can use several different specifications of free cash flow: FCF (operating cash flow minus capex), NetFCF (ditto minus dividends paid), or OperCashFl+CashFrInvest (this will take into account acquisitions, divestitures, and other investing cash flow–for example, Hertz used to put all their spending on new cars in CapEx and all their income from sales of old cars in other investing cash flow). So, doing the math, right there we have 225 possible FCF factors.
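To make that arithmetic explicit, the count is just the product of the four lists of choices above (a throwaway sketch, with shorthand labels):

```python
from itertools import product

# The choices listed above, abbreviated.
ratios = ["FCF margin", "FCF yield", "unlevered FCF/EV", "FCF/assets", "FCF growth"]
scopes = ["universe", "sector", "industry"]
periods = ["latest Q", "TTM", "annual", "3-yr avg", "5-yr avg"]
fcf_defs = ["OperCashFl - CapEx", "NetFCF", "OperCashFl + CashFrInvest"]

combos = list(product(ratios, scopes, periods, fcf_defs))
print(len(combos))  # 5 * 3 * 5 * 3 = 225
```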

How do you choose? NOT BY BACKTESTING ALONE!!! You have to think these factors through. Does it make more sense, when you look at a company’s financial statements, to use OperCashFl+CashFrInvest or just OperCashFl-CapEx? Does it make more sense to compare FCF margin to other companies in the same industry or to the universe as a whole? Should value ratios in general (FCF yield, unlevered FCF to EV) look at only the latest quarter, or a 5-year average, or should they stick to TTM numbers? Your lookback period should also depend a great deal on how long you’re holding the stock. Does it make sense to use quarterly values for a one-year holding period? Probably not. For a six-week holding period? Maybe so. Should you also take into account the variability of FCF, which is one of the most variable factors out there? Which of these formulas goes best with the other formulas you’re using? What do you do about companies like Starbucks, which had negative FCF for decades, all while posting astonishing growth and very solid returns? Should you consider using plain old operating cash flow in addition to or instead of FCF? A lot of analysts say you should only deduct maintenance capex, not growth capex, to get FCF, and this can only be done on a company-by-company basis. Or can it? Some people have proposed a formula to determine growth capex . . . and so on . . .

In short, there are an infinite number of factors out there. A wise choice is better than choosing them all. But looking at a factor from several different angles, even if they’re closely correlated, is probably better than just choosing one. And don’t rely only on backtests. Think things through.

Thanks Yuval!

But let me ask a little more practically then: let’s say I have 80 nodes spread across all the known factors, and I’m not sure whether I should add another node, DbtTot2CapGr%PQ.

How would you go about considering whether this should form a part of your ranking system?

What I see is that DbtTot2CapGr%PQ gives a better return for my ranking system in a backtest, but what else would you consider when deciding whether the node should be included in your system?

I’m assuming you’re looking at this factor with lower numbers being better. In general, anything that measures debt reduction would, in my opinion, be a good factor to include. Is this the best expression of debt reduction? I usually favor PYQ factors over PQ factors because of seasonality, but you may be able to make a good argument here. And perhaps using DbtTot2CapGr%PQ is better than DbtTotGr%PYQ. It hadn’t occurred to me before now.

What matters is your reason for including it. Are you including it because you feel that your system needs to have some debt-reduction measure in it, and that this is really the best one because [I’m sure you can come up with a good argument for it]? In that case, great. Or maybe you have another debt-reduction measure in there and you think this would be a nice supplement to it. That would be fine too. In short, I can’t really think of many reasons not to include a debt-reduction factor that you think would work well with your system, especially if it gives your system a better backtested return. But, of course, you want to look at the alternatives and decide what debt-reduction factors make the most sense to you from a financial standpoint.

Lastly, you’ll want to look at NAs here. If a company has no debt to reduce, it should ideally rank higher on this factor than a company that reduces its debt but still has a lot of it. So run a screen with this factor and look at the debt load of companies that score NA. If all of them are debt-free, perhaps you’ll want to use IsNA(DbtTot2CapGr%PYQ,-1000000) so that those companies have the lowest score.
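To see why that IsNA trick matters, here is a tiny pandas illustration of the ordering logic (not how P123 ranks internally):

```python
import numpy as np
import pandas as pd

# Toy illustration: debt-to-capital growth, lower is better,
# and a debt-free company comes through as NA.
growth = pd.Series({"A": 5.0, "B": -12.0, "C": np.nan})  # C carries no debt at all

# Left as NA, the debt-free company simply gets no rank here.
print(growth.rank(ascending=True))

# Filled with a very low value, as with IsNA(DbtTot2CapGr%PYQ,-1000000),
# the debt-free company moves to the top of a lower-is-better factor.
print(growth.fillna(-1_000_000).rank(ascending=True))
```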

One other thing: don’t use PQ factors on European stocks. Too many of them report semiannually, and PQ will be NA for those. PYQ and TTM should work fine.