Fewer parameters make better models.

Denny,
The five-year bull market started on 3/9/2009, when the S&P hit its lowest level. All models would have done well since then, so this is not really a test. Just holding SPY provided a 25% annualized return without any trading effort at all. The systems you refer to are all bull-market systems, which imploded during the last recession. As you say, “They all pretty much failed during the recession, but then so did everyone else’s.” So if systems are only going to work during bull markets, then we might as well forget about P123 and R2G, stay with buy-and-hold, and only get out of the market prior to recessions.

This bull market is not going to last forever; it will be interesting to see how the models perform when the market climate changes.
Georg

Geov,
Basically, I agree.
BUT, the first thing I look at in an R2G model is the time frames of 2008/2009 and 2000. These were disaster years for the stock market, and I want to know how the model did during those periods. If it did decently well, I may be interested. If it collapsed, no thank you.
There are some models that did well during those times (and not only because of “market timing”).
Werner

Georg,

What might be fun/interesting is to push the thinking forward into more specificity on ‘good complexity’ versus ‘bad complexity’: first, a) how to classify rules into ‘types’ (say on a ‘complexity scale’ from 1-10), and then b) how to make better choices about portfolio dollar allocations based on total system ‘complexity’, which in turn is based on the various types of parameters and their complexity scores.

Might be an interesting learning exercise/discussion? How would we create a ‘parameter’ complexity scale for single rules? If testing factors on a stand-alone basis, rank on elements like:

  1. Fewer than 1000 stocks pass the rule. 0/1.
  2. The spread in annual (AR%-AR%Bench)/Downside Deviation between the rule’s best year and its worst year is ‘too large.’ 0/1.
  3. Varying the ‘setting’ (weight or numerical value) of the rule by 20% or more changes annual system alpha by more than 20%. 0/1.
  4. Varying the ‘setting’ (weight or numerical value) of the rule by more than 50% changes annual system alpha by more than 50%. 0/1.
  5. Rule cuts peak DD by more than 10%/yr. 0/1.
  6. Rule cuts peak DD by more than 20%/yr. 0/1.
  7. Rule has fewer than 100 transactions per factor in backtesting. 0/1.
  8. Rule is based on a statistical relationship only, with no ‘common sense’ underpinning that has a long-term research basis, that I understand, and that has been widely written about. 0/1.
  9. Rule boosts AR% or alpha by more than 5% on the underlying universe. 0/1.

I just made the values up, but the idea is that something like that becomes a scoring system a designer can use: a modified ‘Piotroski’-style sum of scores gives the parameter ‘complexity,’ and the parameter complexities are then summed to create total system complexity rankings.
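Here is a rough Python sketch of what that Piotroski-style summing could look like; the criterion names and 0/1 values are hypothetical placeholders, not tested settings.

```python
# Rough sketch (not P123 code) of the proposed scoring: each criterion above
# becomes a 0/1 flag, per-rule flags are summed Piotroski-style, and the
# per-rule scores are summed into a total system complexity. All names and
# flag values below are hypothetical.

def rule_complexity(flags):
    """Sum the 0/1 complexity flags for a single rule/parameter."""
    return sum(int(bool(v)) for v in flags.values())

def system_complexity(rules):
    """Total system complexity = sum of per-rule complexities."""
    return sum(rule_complexity(flags) for flags in rules)

value_rule = {
    "fewer_than_1000_stocks_pass": 1,
    "best_minus_worst_year_spread_too_large": 0,
    "alpha_moves_20pct_on_20pct_setting_change": 1,
    "alpha_moves_50pct_on_50pct_setting_change": 0,
    "cuts_peak_dd_by_more_than_10pct": 0,
    "cuts_peak_dd_by_more_than_20pct": 0,
    "fewer_than_100_transactions": 1,
    "statistical_only_no_common_sense_basis": 0,
    "boosts_ar_or_alpha_by_more_than_5pct": 1,
}
timing_rule = dict.fromkeys(value_rule, 0)
timing_rule["statistical_only_no_common_sense_basis"] = 1

print(rule_complexity(value_rule))                    # 4
print(system_complexity([value_rule, timing_rule]))   # 5
```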

Might be an interesting topic?

Best,
Tom

Sometimes, I use a pretty tight screen to pre-qualify the group of stocks to be ranked, in which case, a very small number of ranking parameters can be fine; I even use one model in which I have a single-parameter sort, which can be executed as a one-factor ranking system in sim or a “quick rank” in screener. Other times, I screen leniently (not so much to identify potential winners but more to eliminate potential dumpster fires) and run those results through a more comprehensive ranking system.

Use of many modeling parameters can do two important things for you that cannot be done with fewer parameters:

  1. Factor diversification can help mitigate the risk of an inadvertently mis-specified model, i.e., one where a particular numeric relationship (factor or formula) often tells stories different from what you think it tells. One example might be the sales decline that is not telling you about bad business trends but is instead telling you about the elimination of a money-losing business and a consequent increase in profitability. Another example might be the strong cash flow growth that isn’t really telling you about a potential increase in shareholder wealth but is instead signaling insufficient capital investment and the likelihood of deteriorating profitability going forward. Not only is it impossible for any single factor to tell you everything you need to support an investment decision, it’s not even possible for any single factor to consistently tell you what conventional wisdom suggests the factor ought to be telling you. A constellation of factors relating to a particular theme enhances the probability of a model being able to override aberrations here and there and more effectively assess companies.

  2. Use of many factors (particularly when they reflect stylistic diversification) also opens the way to entire classes of stocks that cannot be selected by more focused models. It’s the distinction between the generalist and the specialist. When you trim down to a small number of factors, you’re going the specialist route. In such cases, you are no longer able to say “I want to make money in the market.” You’re instead saying “I want to make money in the market, but that’s not so important to me as the desire to make money in the market this specific way.” Also, by limiting the number of factors, you’re in effect saying “I will only consider companies that are excellent – spectacular – at this small number of things.” And considering that many people use just five stocks, and most probably 30 or fewer, we necessarily have to be talking about really extreme excellence (that being the only way a company can get a high enough rank to make such a model). There’s nothing wrong with that. But there’s also nothing wrong with generalists: companies that may not be spectacular in any single area but are pretty good in many respects. Disqualification of the latter is not always a good way to go.

Both sets of considerations are likely to diminish simulated performance. When predicting the past, we (or at least the database) already know with absolute certainty who the biggest winners were. So the way to spruce up such a prediction is to home in, to the extent one’s detective skills permit, on the traits that were held in common during the sample period(s) by those winners. Models with a lot of parameters tend to dilute that effort and, hence, reduce simulated alpha, etc. But it’s a very different ballgame when we turn 180 degrees and look to the future, where nobody and no database knows who the winners will be and what sorts of traits they’ll share.

There are three things I do to address this issue, a bit differently than what’s being talked about. Along the lines of Marc’s point, I think about which factors bring statistically independent views of a candidate stock or trade setup.

First, when developing a system I think of factors in eight different categories:

  1. Company related (financial statement stuff unique to the company)
  2. Price (and price derivatives, e.g. moving averages, RSI, historic volatility, valuation, etc.)
  3. Volume
  4. VIX (independently priced option market)
  5. Broad Benchmarks (stock can’t be a major component, e.g. AAPL is > 10% of QQQ)
  6. Analysts (independent people analyzing the same or privileged data)
  7. Intermarket pricing (bonds, commodities, etc. are independent competition for capital)
  8. Insiders (privileged information)

I group every factor and trading rule into one of these categories and count up the number of CATEGORIES, not factors (never more than 8). If you’re familiar with principal component analysis, these become the principal components. Within each category I focus on the essence of what I’m trying to screen out, a “theory of the rule.” Often similar rules (e.g., 4 or 5 different valuation metrics) can be simplified without much impact.

Second, I try to have 100 (but at the very minimum 30) TRADES per category. This is just statistics… 100 can detect 2 sigma (95.4%) events (.045*100 = 4.5 instances).
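As a quick sanity check on those numbers, here is a small sketch assuming roughly normal trade outcomes:

```python
# Expected count of beyond-±2-sigma trades in a sample of n, assuming roughly
# normally distributed trade outcomes (a simplifying assumption).
from scipy.stats import norm

tail_prob = 2 * norm.sf(2)   # ~0.0455: share of observations outside ±2 sigma
for n in (30, 100, 400):
    print(n, tail_prob * n)  # ≈ 1.4, 4.6, and 18.2 expected tail events
```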

Third, I spend a lot of time in Excel pulling months out of order and trades out of order (Monte Carlo) to understand what happens if the system performs like the past but in different sequences.

For example, if I were using Balanced4, I’d group the ranking:

  1. Company (EPS Consistency, Industry Leadership)
  2. Price (TechRank, Valuation)

I think of this as 2, not 17, “risks of curve-fitting.” If I add a benchmark-timing rule and a volume rule, there would be 4 categories overall. I would want to see at least 400 trades, independent of time. I would then put the 400 trades in an Excel “shoe box” and pull them out in different orders to create “synthetic years” to see if things hold up.
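For anyone who prefers Python to Excel, here is a minimal sketch of that “shoe box” resampling; the trade returns below are synthetic stand-ins, not actual backtest trades.

```python
# Sketch of the "shoe box": resample per-trade returns in random order to
# build "synthetic years" and look at the spread of outcomes. Compounding
# trades sequentially is a simplification (it ignores overlapping positions).
import numpy as np

rng = np.random.default_rng(42)
trade_returns = rng.normal(0.01, 0.08, size=400)   # stand-in for ~400 backtested trades
trades_per_year = 100                              # e.g. 25 stocks at ~400% turnover

synthetic_years = []
for _ in range(10_000):
    sample = rng.choice(trade_returns, size=trades_per_year, replace=True)
    synthetic_years.append(np.prod(1 + sample) - 1)  # one synthetic-year return

synthetic_years = np.array(synthetic_years)
print("median synthetic year:", round(np.median(synthetic_years), 3))
print("5th percentile year:  ", round(np.percentile(synthetic_years, 5), 3))
```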

Factors come and go over time, but basic value, momentum, and small cap continue to be recognized as anomalies by even the EMH hardcores. I recently added #AnalystsCurFY < 4 to a sim fishing in the Russell 2K, and it was detrimental over the long term (less stuff to buy) but helped TTM tremendously. That makes sense, since this bull market is getting a little long and capital is looking for more nooks and crannies, like FCEL, BLDP, and PLUG, which have tripled in the last few weeks on no real changes to their financials!

I would be very interested to know, Denny, how these systems have performed from 2014 to 2023. :slight_smile: Do you still have the RS from way back?

I tried to find them. I’m not sure if it’s the exact same RS that you tested. I set some minimum requirements as to the number of stocks, the universe, turnover, and liquidity, just to make sure it was not tilted to the extreme.

Besides one (Dan’s), the results were not impressive.

I can fully agree with this statement: As a rule, the more degrees of freedom (more factors, with regard to linear models/RS) a model has, the greater the sample size that is required to prove statistical significance.

However, I like to analyse my RS based on the bias-variance trade-off, nicely described by Prof. Trevor Hastie in ‘An Introduction to Statistical Learning.’

‘Variance refers to the amount by which f̂ (a function that estimates your target variable) would change if we estimated it using a different training data set.’ In our case, a different training set can be obtained by bootstrapping your dataset. On top of that, you could use a similar (but non-overlapping) universe (e.g. Canada vs. US), different periods (1 week, 4 weeks, etc.), a shifted start date, or some added noise in the form of 10% random stocks from a non-overlapping universe. Then optimise your RS for each different training data set and analyse the variability in your parameters, or use the same parameters and analyse the variability in a performance metric.

Bias refers to the error that is introduced by approximating a real-life problem. In our case, I would assume that the higher the SR or Omega, the lower the bias.

Ultimately, you would like to track how variance increases and bias decreases as you add new factors or degrees of freedom (more trees in a random forest). For example, you would add a new factor to the RS only if the increase in variance is suitably compensated by a reduction in bias (higher return).
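A minimal Python sketch of that idea, using bootstrap resampling to see how unstable (“high variance”) the fitted factor weights are; the factor matrix and returns are synthetic placeholders, not P123 data.

```python
# Sketch: estimate how much fitted factor weights move under bootstrap
# resampling of the training data, a proxy for the "variance" side of the
# bias-variance trade-off. X and y are synthetic placeholders for a factor
# matrix and forward returns; only the first three factors carry signal.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_obs, n_factors = 2_000, 10
X = rng.normal(size=(n_obs, n_factors))
true_w = np.array([0.5, 0.3, 0.2] + [0.0] * (n_factors - 3))
y = X @ true_w + rng.normal(scale=2.0, size=n_obs)      # noisy forward returns

boot_coefs = []
for _ in range(200):
    idx = rng.integers(0, n_obs, size=n_obs)            # one bootstrap sample
    boot_coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)

# Per-factor standard deviation of the fitted weight across bootstrap fits;
# large values flag factors whose estimated contribution is unstable.
print(np.round(np.std(boot_coefs, axis=0), 3))
```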

In my experience, it’s possible to create very good systems using around 35 factors. Much more than that and the factors become too diluted, introducing too much noise. When I first joined P123 I had 100+ factor systems with sub-1% allocations for each factor… I was able to curve-fit some amazing systems that performed terribly. Now my best out-of-sample performers use around 35 factors, which I would consider complex. I also think you need to use an external program to help you test systems with that many factors.

Tony

So we have few, if any, independent factors in our ranking systems. For example, FCF/P is not an “independent factor” if your system also has FCF/EV. EBITDA/EV is not fully independent if you use FCF/EV in your ranking system either. This is a collinearity problem.

There are automated solutions to this, including recursive feature elimination, which can be done in Python with linear models, random forests, XGBoost, etc.

For linear models, LASSO regression can be used to eliminate factors, as can principal component analysis.
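A minimal scikit-learn sketch of the LASSO route on synthetic data, where one “factor” is deliberately built as a near-duplicate of another (the FCF/P vs. FCF/EV situation); nothing here is P123-specific.

```python
# Sketch: LASSO with a cross-validated penalty as an automated way to drop
# redundant (collinear) factors. X and y are synthetic placeholders for
# factor exposures and forward returns; factor 1 is a near-duplicate of
# factor 0, and factor 3 is pure noise.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_obs = 3_000
base = rng.normal(size=(n_obs, 3))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=n_obs),   # near-duplicate of factor 0
    base[:, 1],
    base[:, 2],
])
y = 0.4 * base[:, 0] + 0.2 * base[:, 1] + rng.normal(scale=1.0, size=n_obs)

X_std = StandardScaler().fit_transform(X)
model = LassoCV(cv=5).fit(X_std, y)
# One of the two near-duplicates typically carries the weight while the other
# is shrunk toward zero; the pure-noise factor is shrunk as well.
print(np.round(model.coef_, 3))
```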

As a practical matter, I do not know the ideal number. But I agree with Tony that around 35 factors can work with cross-validation and out-of-sample testing.

Many do use machine learning now, and P123 will be providing it. The number of factors I use is determined by the results of a time-series validation. My machine learning method is unique, however (not linear regression, not a random forest, not anything you would have heard of).

When available through P123, I think a random forest does pretty well and could be used with recursive feature elimination, though that is pretty resource-intensive.

Jim

Are those “35-factor” ranking systems better because they use fewer factors? Or because they have better factors, or more weight on better factors, in the OOS period?

Regardless, it would be very helpful in evaluating designer models if P123 displayed the number of nodes in the ranking system and the number of degrees of freedom used in the buy/sell rules, e.g. the number of factors or functions used in those rules.

Personally, I’m less concerned with overfitting in the ranking system than I am with the buy/sell rules. Those Boolean filters make it very easy to curve-fit, and I tend to prefer systems with only a single rank-based sell rule and zero-to-minimal buy rules (maybe just some use of {Sec,Ind}{Count,Weight} to avoid overconcentration).

Currently I have to rely on the designers’ text descriptions to get a sense of their rule constructions, but it would be incredibly helpful to expose these degrees of freedom to potential purchasers, giving them a look behind the curtain without fully exposing the implementation.

Better yet, I’d still prefer to be able to subscribe to a ranking system (and not a strategy), so that I could combine that ranking system with my own, or build my own trading rules around the ranking system based on my trading preferences.

If you want to know the true degrees of freedom, you might consider an orthogonal factor analysis.

This may seem like it is in the weeds, because it is. Degrees of freedom is a difficult concept. Speaking personally, I have a math degree and used the concept frequently, and I can honestly say I never fully understood it. But I could use it in a “cookbook-like” way, mimicking my professor or the textbooks when necessary. I got good grades.

So, bypassing a bunch of notation and mathematical proof that I admit I do not really understand anyway: you will have trouble finding more than 8 to 10 significant “latent variables” in an orthogonal factor analysis. I think this is a measure of the true degrees of freedom most of us have based on the number of factors alone.
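To make that concrete, here is one way to count “effective” orthogonal components with a quick PCA sketch; the data is synthetic, built from 8 latent drivers, so it only illustrates the idea rather than describing any particular ranking system.

```python
# Sketch: count how many orthogonal principal components are needed to explain
# most of the variance in a set of correlated factor exposures, as a rough
# proxy for the "true" degrees of freedom of the factor set. X is synthetic:
# 40 noisy factors built from only 8 latent drivers.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
n_obs, n_latent, n_factors = 2_000, 8, 40
latent = rng.normal(size=(n_obs, n_latent))
loadings = rng.normal(size=(n_latent, n_factors))
X = latent @ loadings + 0.3 * rng.normal(size=(n_obs, n_factors))

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)
# Typically lands near the 8 latent drivers, far below the 40 raw factors.
print("components for 90% of variance:", int(np.searchsorted(cum_var, 0.90)) + 1)
```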

Granted, adding factors one at a time and optimizing (and overfitting) the individual weights DOES increase the degrees of freedom, as do Boolean buy rules, as Feldy suggests. How you weight the factors adds to the degrees of freedom. As I said, this is difficult stuff, and that is my only point about degrees of freedom if you want to use it.

It is not my point that we do not manage to increase the degrees of freedom and overfit at P123. We do.

But ultimately, the degrees of freedom are not closely related to the number of factors by itself. And for sure, adding a factor that is highly correlated with other factors in a ranking system does not count as a whole degree of freedom.

Personally I use all good factors (with the devil being in the details of how you define a good factor). Something like elastic net regression or cross-validation can address the issues of collinearity and overfitting once you have selected what you think are good factors.

TL;DR: I am not sure degrees of freedom helps this discussion much. Anyone willing to take the time to understand it (most of my adult life so far) could probably understand cross-validation instead. Maybe it’s just me, but I believe they will find cross-validation more useful, and probably a hell of a lot easier.

Jim


@feldy As Yuval has mentioned in the forum, any factor with a weight below 2% is noise, and that has also been my experience. If you follow that rule with 50 factors, then no factor could be above 2% (50 factors at a 2% minimum already account for the full 100%). Some factors are clearly better than others; for example, MarketCap may be between 5% and 10%. Given that you will have several of these “better factors” that warrant a >2% allocation, you cannot have 50 factors. I don’t know what the optimal number is yet, but I suspect it’s between 25 and 35. Of course, the quality of your factors is the most important part. It’s important to draw a distinction between factors and ranking-system nodes. I’m really talking about nodes here, each of which may contain multiple factors.

Just to clarify what’s being discussed. Is this one factor or four?

OpIncAftDepr(0,TTM) / (NetPlant(0,qtr) + Recvbl(0,qtr) + Inventory(0,qtr))

And with a database that may have many missing entries, does one need to use more factors?

I consider that 4 factors in 1 node. If you have factors that may be missing, you can use a relevant default value if it makes sense for your particular factor.

And the 2% min guideline is for nodes and not factors?

Yes, since percentages are set in RS nodes. I don’t know of a way to weight individual factors within a single node, although that could be interesting.

Tony,

I fully agree that 35 factors can work. No disagreement there.

But I wonder a little bit about what should count as a factor, or possibly what Yuval’s present position is on factors and noise. Specifically, these are Yuval’s words describing the Crazy Returns Microcap Model: “The ranking system relies on over 150 different factors,…”

Anyway, whatever Yuval might say about what he is doing with his model, I do agree with you. I also think the key is using good factors, not counting how many. Yuval spends a lot of time finding good factors, and I am not questioning his methods either (no matter how he arrived at the 150 number).

Jim

A fascinating discussion. Searching the forum, I see there have been similar discussions before, but without any clear answer.

How many nodes do you use in your system, abwillingham?

And maybe others are willing to share how many nodes they are using?

I am one of the people who are overdoing it. I have 112 nodes and even more factors, though some overlap each other. I have been using the last 10 years to optimise and the 10 years before that as an out-of-sample period, and for some reason my out-of-sample result is not bad. With a 25-stock portfolio, a turnover of 300-350%, variable slippage, and 0.2 commission, I get 65% in-sample and 61% out of sample.

I also test it with other parameters, like:

10/23/13 10/23/23
10/23/01 10/23/23
10/23/06 10/23/13
10/23/10 10/23/23
10/23/01 10/23/10
Mod(StockID,5) = 0
Mod(StockID,5) = 1
Mod(StockID,5) = 2
Mod(StockID,5) = 3
Mod(StockID,5) = 4
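For anyone running this kind of check outside P123, here is a minimal Python sketch of the Mod(StockID,5) split; the stock IDs and the per-stock metric are made-up placeholders.

```python
# Sketch of the Mod(StockID, 5) idea outside P123: split the universe into five
# disjoint sub-universes by ID modulo and compare a performance metric across
# them. The stock IDs and per-stock metric below are made-up placeholders.
import numpy as np

rng = np.random.default_rng(3)
stock_ids = np.arange(10_000, 12_000)                          # placeholder universe
metric_by_stock = rng.normal(0.08, 0.25, size=stock_ids.size)  # e.g. simulated 1-yr alpha

for fold in range(5):
    mask = (stock_ids % 5) == fold
    print(f"Mod(StockID,5) = {fold}: n = {mask.sum()}, "
          f"mean metric = {metric_by_stock[mask].mean():+.3f}")
```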

I have a lot of nodes, about 60-70, and some of these have really low weights. My thinking has been that it’s probably harmless to fine-tune your system, as long as it’s done in the ranking system, the testing/optimizing procedure is “robust,” and the nodes make financial sense.

That being said, a while back, I made a ranking system optimized solely for the utility sector. Given the small universe, I decided to stick to node weights in multiples of 2,5%. The final number of nodes was much small than what I’m used to, but the clarity and the low effort required to optimize the weights really made me reconsider my many-nodes strategy. It’s tempting to return to my main system and try to make something simpler. Maybe next year!