How to Test for Robustness (and avoid curve fitting)

o806 · December 10, 2007, 4:53pm

Hello fellow P123 users:

I would like to start a discussion on robustness. What can we do to design ranking systems and strategies that will work in the future about as well as they have worked during testing? To get things going I will outline my current approach. I look forward to comments on what I am doing, and equally to hearing what others are doing to increase robustness.

Before stress testing in simulation mode, I do several tests using P123’s Ranking Performance. This saves time in simulation testing. This first post is going to be long enough just looking at Ranking Performance, so I will leave my ideas for simulation stress testing for another post; perhaps someone else will get that started.

RANKING PERFORMANCE FOR ROBUSTNESS

I like to test ranking systems over 4 time periods and over subgroups of market capitalizations.

My preliminary test is to run a new ranking idea on the entire P123 history (2001march-present) using my 4K custom universe. This custom universe excludes stocks that I would never invest real money in by requiring a minimum price, volume and market cap; I also exclude OTC stocks. If a ranking idea does well on this, I go on to the following tests for robustness.

I see how it does on the 2,000 (approx) smaller caps (mktcap 25-1000) and the 2,000 larger caps (mktcap>1000). Almost every ranking system works better with small caps, but ideally I like to see that it still works, albeit not as well, with larger caps. Often the ranking system does a lot better with the 2K small caps than with the 4K group, but sometimes it is about the same. This tells me whether including the large caps “weakens” the ranking system, or whether the ranking system is smart enough to produce equally good returns when given the chance to put some large caps into the top buckets. This gives a heads up on what mktcap filters to use when I get to the strategy simulation stage of development.

Next I do time segmentations. I want to see if the system can do well in the 3 major market conditions for which we have P123 data. So I rerun the ranking performance tests for the following time periods:

2001 - Present (got to do well or I don’t bother doing further tests)
2001march-2003march (bear period with 3 bear rallies)
2003march-2004oct (rocketing bull – every long system does well)
2004oct-2007oct (“typical” bull with pullbacks)

I take a careful look at the bear period 2001march-2003march. At a minimum a system must not do worse than the benchmarks (SP500 and Russell 2000), but ideally it should make money on the short-term rallies during the bear. I don’t pay much attention to 2003march-2004oct because we will rarely see that type of extreme bull run, and if we do it will only be after a major bear (which will give ample advance notice that conditions are ripe for another rocketing bull). I also look at the last 3 years (2004oct-2007oct). Once we get another year of data, I might divide this into 2004oct-2006oct and 2006oct-2008oct.

The two most important periods are 2001mar-2003mar and 2004oct-2007oct, which give me a handle on a system’s consistency over time.
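For anyone who exports a return series and wants to repeat the time segmentation outside P123, here is a minimal sketch. It assumes a list of (date, weekly return) pairs; the function names, the weekly frequency and the choice of the 1st of each month as the boundary are my assumptions, not anything P123 provides.

```python
from datetime import date

# Period boundaries taken from the post; the day of month (1st) is an assumption.
PERIODS = {
    "bear 2001mar-2003mar": (date(2001, 3, 1), date(2003, 3, 1)),
    "rocketing bull 2003mar-2004oct": (date(2003, 3, 1), date(2004, 10, 1)),
    "typical bull 2004oct-2007oct": (date(2004, 10, 1), date(2007, 10, 1)),
}


def annualized_return(returns, periods_per_year=52):
    """Compound a list of periodic returns and annualize the result."""
    if not returns:
        return None  # no data in this period
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    years = len(returns) / periods_per_year
    return growth ** (1.0 / years) - 1.0


def segment_returns(dated_returns, periods=PERIODS):
    """Annualized return per named period (start inclusive, end exclusive)."""
    result = {}
    for name, (start, end) in periods.items():
        chunk = [r for d, r in dated_returns if start <= d < end]
        result[name] = annualized_return(chunk)
    return result
```

Comparing the per-period numbers against the same calculation for a benchmark series gives the "did it at least match the benchmark in the bear" check described above.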

LOOKING FOR RANDOMNESS

I also keep an eye open for randomness. At what point does market “noise” or “randomness” start to overwhelm the power of the ranking system I am considering? When doing ranking performance tests I like to see how the system performs with bucket sizes of 5, 10 and 20 stocks.

Using my 4K custom universe of liquid stocks, a 200 bucket test has 20 stocks per bucket. Generally this gives a fairly smooth curve for the top 3 to 5 buckets (I only care about the top buckets). So to seek out the line where randomness appears, I use some smaller custom universes based on market caps:

4K_all = stocks meeting basic liquidity (also no OTC, no ADR)
I may or may not add ADRs back in. A ranking system should do well with or without ADRs, so this is not a big issue for me when testing for robustness with ranking performance tests.
This gives 20 stocks to each of the 200 buckets.

2K_smaller cap (mktcap>25 and <1000) (same liquidity as 4K_all)
2K_larger cap (mktcap > 1000) (same liquidity as 4K_all)
These give 10 stocks to each of the 200 buckets.

1K_micro cap (mktcap>25 and <350) (same liquidity as 4K_all)
1K_small cap (mktcap>350 and <1000) (same liquidity as 4K_all)
These give 5 stocks to each of the 200 buckets.
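The universe definitions and the bucket arithmetic above can be written down as a short sketch. The dictionary keys and function names are made up for illustration; the market-cap bounds (in $ millions) are the ones listed above, and the liquidity rules are left out.

```python
# Market-cap bounds in $ millions as listed above; None means no upper bound.
# The strict inequalities follow the post, so a stock at exactly 1000 is in neither half.
UNIVERSES = {
    "2K_smaller_cap": (25, 1000),
    "2K_larger_cap": (1000, None),
    "1K_micro_cap": (25, 350),
    "1K_small_cap": (350, 1000),
}


def in_universe(mktcap, bounds):
    """True if the market cap lies strictly inside the given bounds."""
    low, high = bounds
    return mktcap > low and (high is None or mktcap < high)


def stocks_per_bucket(universe_size, buckets=200):
    """Average number of stocks landing in each ranking bucket."""
    return universe_size / buckets


for size in (4000, 2000, 1000):
    print(size, "stocks ->", stocks_per_bucket(size), "per bucket")
```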

Some ranking systems begin to show randomness affecting the top buckets when the size gets down to 10 stocks, and many do for 5 stocks. This is especially true when one is looking at 1.5-2 year time periods (see my time periods above). Implications: even if a 5 stock bucket does better than a 10 or 20 stock bucket over the full 6.75 years of P123 data, one needs to be aware that one may have to endure some short-term periods of poor performance caused by randomness.
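A toy simulation can show why smaller buckets look noisier. This is only a sketch under strong assumptions: every stock in the bucket has the same small true edge, the noise is independent and normally distributed, and the edge and noise values are made up. Real stocks are correlated, so treat it as an illustration of the averaging effect, not as a model of any ranking system.

```python
import random
import statistics


def bucket_mean_spread(stocks_per_bucket, trials=2000, edge=0.02, noise=0.10, seed=1):
    """Standard deviation of a bucket's average return across simulated periods.

    edge and noise are hypothetical per-period values chosen for illustration.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        returns = [rng.gauss(edge, noise) for _ in range(stocks_per_bucket)]
        means.append(sum(returns) / stocks_per_bucket)
    return statistics.pstdev(means)


for n in (5, 10, 20):
    print(n, "stocks per bucket, spread of bucket average:", round(bucket_mean_spread(n), 4))
```

Under these assumptions the spread shrinks roughly with the square root of the bucket size, so a 5 stock bucket needs about twice the edge of a 20 stock bucket to stand out from the noise equally clearly.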

Personal perspective: Given my personal risk tolerance (or lack thereof), I consider it very important to know when randomness can be expected to overwhelm the native power of the ranking system. I get a handle on this from tests using segmentations of time and market cap. This helps me decide on the size of portfolio to use when testing in simulations (and that is a topic for another post).

Comments? Could I be doing this differently to get better results, or to save time and effort? Am I missing something I should be looking at when doing ranking performance testing?

Brian (o806)

olikea · December 10, 2007, 5:59pm

First off, a couple of book recommendations:

Way of the Turtle
Design, Testing, and Optimization of Trading Systems

Curve fitting:

Curve fitting is a valid process of attempting to discover a relationship based on empirical observation. The problem arises when you “overfit”.

I have a scientific background, so I am familiar with the process. Imagine you have 10 data points on an X-Y plot, but each one has some “error” associated with it. This can be considered to be white noise. You are looking for a relationship between X and Y.

Fortunately, through visual inspection, you can see that it looks like the relationship might be a linear one, i.e. a straight line. So you fit a line with the high-school equation: y=mx+c

Great, but the line will never go through all the points; it may not actually go through any of them. So instead you attempt to fit the line as best you can, using statistical methods like “least squares” to get a “best fit”, until eventually you have a relationship that you believe is as close as you can get to the “true” underlying relationship.
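As a minimal sketch of the "best fit" step, here is ordinary least squares for y=mx+c written out by hand. The function name is made up; any statistics package would do the same job.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = m*x + c; returns (m, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Points lying exactly on y = 2x + 1 recover the slope and intercept.
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))
```

With noisy points the fitted line will not pass through every point, which is exactly the behaviour you want from a model with only 2 degrees of freedom.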

This is analogous to optimising a trading system. The better the fit, the better the annualised returns etc.

Now here comes the problem. What if you see a series of points, but they blatantly don’t form a straight line? They look more like a curve. You could try a second order polynomial, something like y=Ax^2+Bx+C, and here lies the path to hell:

It is a mathematical fact that if the polynomial has at least as many coefficients as there are data points (that is, its order is at least the number of points minus one, with all the x values distinct), you can do better than a “best fit”: you can have an “EXACT FIT”! Think about it: if you have a number of points on a graph, then you can always draw a wiggly line through all those points. But intuitively, you know that such a line has absolutely no value whatsoever. You have “overfit” the data.
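The exact fit is easy to demonstrate. The sketch below uses Lagrange interpolation, which is one standard way (my choice, not necessarily what anyone here uses) to build the polynomial that passes through every point. The five data points are made up: they sit near the line y=2x+1 with a little noise added.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate the polynomial of degree len(xs)-1 that passes through all points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Illustrative data only: y = 2x + 1 plus small made-up errors.
xs = [0, 1, 2, 3, 4]
ys = [1.2, 2.7, 5.3, 6.8, 9.1]

# In sample, the "wiggly line" reproduces every point.
for xi, yi in zip(xs, ys):
    print(xi, yi, lagrange_predict(xs, ys, xi))

# Out of sample, compare with the underlying line, which gives 2*6 + 1 = 13.
print("prediction at x=6:", lagrange_predict(xs, ys, 6))
```

The in-sample errors are zero, yet the prediction just two steps beyond the data lands far from the value the underlying line gives, which is the whole problem with an exact fit.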

Now how do you avoid this? Well, there is a simple way to consider it: look at the ratio of the number of data points to the degrees of freedom. A second order polynomial has 3 degrees of freedom, namely the coefficients A, B, C. A straight line has 2 degrees of freedom. If you fit a straight line through 2 points, it seems pretty obvious that while it “might” represent a true relationship, it’s pretty hard to say that it is conclusively true. If you have a 3rd point that also falls on the straight line, then this acts as more evidence.

In stock trading, the number of data points is rather like the number of trades (though this can be debated), and the number of trading rules is rather like the number of degrees of freedom (this can also be debated).

This brings me back to what I have been saying for a while, LESS IS MORE. People like to invent complicated trading systems with all sorts of rules. However, the more rules, the less the robustness (for a given number of trades). Another issue is that if you reduce the number of holdings in the sim/port, then this is like reducing the number of data points, getting close to the point of “overfitting”. In fact, you should be looking to large stock portfolios, and trying to throw away rules as much as possible.

As Einstein said, make things as simple as they can be, but no simpler.

A scientist will always look to fit the simplest model to the data, and only add in extra degrees of freedom (i.e. higher order polynomial etc.) if a simple solution (such as a straight line) really doesn’t cut it.

The analogy with trading works very well. You can imagine that you are trying to build a model. Each point on the graph represents a trade, your curve represents the model. You get more money by getting your curve closer to all the points. Therefore the top performing model has the highest number of degrees of freedom, as it exactly fits all the data. Unfortunately, like drawing a wiggly line through the data, it has no predictive value whatsoever.

I am not trying to pick on anyone, but the following example shows exactly the problem of “overfitting”:

http://www.portfolio123.com/port_summary.jsp?portid=332135

Wow, incredible returns. (This was developed a few years ago.) Let’s see how it performs out of sample:

http://www.portfolio123.com/port_summary.jsp?portid=332136

I think EVERYONE on p123 should review this. Be careful about chasing those high return sims.

Then we come to the issue of rankings. I am very excited about rankings: being able to rank factors into buckets is very useful because you are using a huge amount of data. It is impossible to argue that, say, Price-to-sales has no correlation to stock performance. This was analysed in Dan Paraquette’s excellent spreadsheets on top factors. (see http://www.portfolio123.com/mvnforum/viewthread?thread=1742#6882 )

I always use a bottom up approach to ranking systems, like Dan: starting with how individual factors work on their own, then combining them and looking for synergies. This is how I came up with my Third Generation ranking system, and created a sim with an emphasis on robustness: http://www.portfolio123.com/port_summary.jsp?portid=314980

At this point I will go slightly counter to what I said before. I do not believe that adding more factors into the ranking necessarily makes it less robust. If the factors work on their own, then I think adding them makes the ranking system more robust, simply because if one particular factor fails, the others will make up for it. However, the true “degrees of freedom” can be in the weightings of the ranking, and this is where overfitting is possible. An overfitted ranking system may not produce poor results in the future, but they will be considerably inferior to the indicated results. This also makes smaller stock portfolios look a lot better in backtesting than they are likely to perform in real time, and I think if people knew what the real risk/reward characteristics would look like in real time, then they wouldn’t go for it. To illustrate the point, take a look at these sims:

http://www.portfolio123.com/port_summary.jsp?portid=332143

That is a 5 stock sim based on Dan’s “TF12 system”, which is probably one of the most optimised ranking systems on p123. Looks very impressive, annualised return over 119%! I have only run it up until 2006 because this was the time during which it was developed.

Now just for fun, let’s look at a 50 stock sim:

http://www.portfolio123.com/port_summary.jsp?portid=332144

Still very impressive, returns of 77% per annum. And here comes the issue: What do you choose, 119% per annum, or 77% per annum? It’s a tougher question than you think. Let’s walk forward a bit. This is interesting because I am confident that 2006 onwards represents “out of sample” data for the TF12 system. Let’s see what happens. Here is the five stock sim:

http://www.portfolio123.com/port_summary.jsp?portid=332146

Disaster! A very VERY bumpy ride, drawdown over 40% (greater than the “in sample” period) and at the end of it all, you have actually underperformed the market, and have LOST money!

What happened to the 50 stock sim? :

http://www.portfolio123.com/port_summary.jsp?portid=332145

Well, it hasn’t performed as well as in the “in sample” period, but it has returned over 30% per annum with a much more manageable 21% max drawdown. I’d be happy with that.

The reason is simple: With a small number of stocks, the stocks that are chosen are highly dependent on the exact weightings of the ranking system. They are subject to “over fitting”. With a larger number of stocks, the exact weightings matter much less. You have reduced the degrees of freedom AND increased the number of data points, making it significantly more robust.
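One way to probe how much the picks depend on the exact weightings is to nudge the weights and count how many of the top N stocks survive. The sketch below is a toy with two random factors and hypothetical weights (50/50 changed to 60/40); it says nothing about TF12 or about returns, it only measures how stable the selected list is.

```python
import random


def top_n(scores, n):
    """Indices of the n highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:n])


def overlap_after_reweighting(n_picks, universe=1000, w_old=(0.5, 0.5),
                              w_new=(0.6, 0.4), seed=7):
    """Fraction of the top n_picks stocks that stay selected after the weights change."""
    rng = random.Random(seed)
    # Two made-up factor scores per stock, uniform on [0, 1).
    factors = [(rng.random(), rng.random()) for _ in range(universe)]
    old = [w_old[0] * a + w_old[1] * b for a, b in factors]
    new = [w_new[0] * a + w_new[1] * b for a, b in factors]
    return len(top_n(old, n_picks) & top_n(new, n_picks)) / n_picks


for n in (5, 10, 50):
    print(n, "stocks: fraction unchanged =", overlap_after_reweighting(n))
```

Running the same check on a real ranking system with slightly perturbed weights would be a simple stability test: if a 5 stock list changes a lot under a small nudge, its backtest is leaning heavily on the exact weights.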

This is one of the reasons why I said that a sim containing fewer than 10 stocks should be ignored!

Ultimately, there are still a great deal of “unknown unknowns” as well as known unknowns. The challenge is to make simple, robust systems. More data is better. Simpler is better. Endless optimisation is unlikely to pay off.

o806 · December 10, 2007, 7:36pm

Oliver:

Your reply is exceptionally insightful. I appreciate the time you took to write it up.

After digesting its content I hope to add some more to this discussion.

Oh, I also would recommend the Way of the Turtle. Easy to read. Helpful illustrations about drawdown risk, even if they come from the futures world, which has a different risk structure than non-margined stocks. The most important thing I took away from the book was the necessity of consistency when using mechanical systems. That is one of the reasons I like P123’s weekly email. Even when life gets super busy, there is the reminder to rebalance in my email. Actually I also have my PDA set to remind me each week, so the P123 email is a nice back up.

Brian (o806)

charles123 · December 10, 2007, 7:50pm

Oliver,

This goes back to a previous debate, but would you say it’s more robust to have 30 to 50 stocks in a portfolio based on something like your 3rd generation ranking that has many diverse factors, OR to have, as Denny suggests, three to five 10-stock portfolios based on independent and diverse ranking systems?

o806 · December 10, 2007, 8:24pm

Charles and Oliver:

My intuitive sense, which might be wrong, is that five 10-stock portfolios using independent ranking systems would be more robust than one 50-stock portfolio even if the latter included several diverse factors. This assumes that the five small 10 stock portfolios also test well as 20 and 30 stock portfolios. That extra assumption should help remove the possibility that five small portfolios were “overly” curve fit.

Over a 20 year period, there might be little or no difference between 50 stocks in 5 small portfolios or all 50 being in a single portfolio. However, in the short term, I would expect the 5 small portfolios to be easier to handle psychologically, at least for me. One of my challenges is continuing to follow a system that is temporarily underperforming the overall market for 1 or 2 years. If I had all my eggs in one basket, this is likely to happen. If I have 5 strategies then at least 1 or 2 should be outperforming the market at any given time. That would make it easier for me to tell myself: “Stick with all 5 strategies because they were sound in the past and as proof 2 or more are doing well right now.”

So I am balancing robustness of the methods with my ability to stay with the plan. Multiple strategies help me. At present I have 4 P123 strategies of 20 stocks each. Plus an AAII strategy with 10 stocks that I might discontinue at the end of the year. So I am actually going for diverse strategies using relatively larger portfolio sizes.

Brian (o806)

olikea · December 10, 2007, 9:28pm

Ok, if you read “Way of the Turtle”, he very succinctly explains why optimisation of factors is worthwhile, but also why the more you optimise, the more you increase the likelihood that “real” returns will be lower than the backtested returns: in the backtesting you have the benefit of hindsight to find “peaks” in the performance of factors.

Ultimately, here is the issue I have: If the system is OVERFIT, i.e. like in my above examples, the real time performance is not simply “diminished” compared to backtested performance, it actually is completely unrelated.

I.e., the sims based on very small stock portfolios appear to have no predictive power whatsoever, and even worse, are subject to severe underperformance.

So the question of diversification among many portfolios arises: what good is it if all of the portfolios show this sort of devastating real time performance vs. backtested performance? (see: http://www.portfolio123.com/port_summary.jsp?portid=332143 vs. http://www.portfolio123.com/port_summary.jsp?portid=332146 ).

I do just wonder to what extent a lot of the tiny stock portfolios out there are a form of “fool’s gold”. It is true that one or two of them might post exceptional real time performance. But then that is like choosing stocks in a portfolio: one or two may give exceptional performance, one or two dismal, and the rest mediocre, and it all averages out to an index fund.

To be honest, I don’t really know the answer. The fact of the matter is, a large stock portfolio shows lower degradation in performance in real time (see: http://www.portfolio123.com/port_summary.jsp?portid=332144 vs. http://www.portfolio123.com/port_summary.jsp?portid=332145 )

Please remember, 30% per annum is still a return higher than what many billionaires have achieved over their lifetimes.

As a result, it seems prudent to “bet on both”, i.e. have one large stock portfolio (like my 50 stock 3rd Gen) as the “core” and have satellite portfolios that are more speculative, and perhaps equal weight across all stocks, so the more concentrated portfolios naturally get a lower weighting (each 10 stock port has 1/5 the weight of the 50 stock portfolio).
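The weighting arithmetic is simple enough to sketch. The example assumes one 50 stock core and five 10 stock satellites; the number of satellites and the portfolio names are my assumptions for illustration.

```python
def equal_weight_allocations(portfolio_sizes):
    """Capital fraction per portfolio when every stock gets the same weight."""
    total_stocks = sum(portfolio_sizes.values())
    return {name: size / total_stocks for name, size in portfolio_sizes.items()}


# Hypothetical line-up: one 50 stock core plus five 10 stock satellites.
sizes = {"core_50": 50}
for k in range(1, 6):
    sizes["satellite_%d" % k] = 10

alloc = equal_weight_allocations(sizes)
for name, fraction in alloc.items():
    print(name, fraction)
```

With this line-up each satellite ends up with one fifth of the core's weight, because weight simply follows the stock count.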

Of course ultimately this involves having a large number of stocks in total, maybe more than 100. However, with deep discount brokers such as Interactive Brokers, I really don’t see why this is an issue any more: commissions are sufficiently low that unless you really have a tiny portfolio, there are no extraordinary costs to holding so many stocks.

lucabol · December 10, 2007, 11:51pm

I very much agree with Oliver.

I go one step further. Extensive reading on the topic convinced me that the historical data that we have in P123 at this point is not enough to give me any confidence on the predictive power of the identified factors given the medium/long term horizon of my systems. BTW: I personally don’t believe in short term trading for various reasons.

My solution is to just use factors that have been identified in several academic studies over very long term horizons (i.e. 20+ years). And I pretty much equal weight them. I don’t use any buy rule or sell rules apart from liquidity related ones for buy rules and Rank/Time rules for Sell rules. I use these Sell rules mainly for the sake of giving me a 90-130% turnover.
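To make "equal weight them" concrete, here is a minimal sketch of an equal-weighted composite rank. It is not P123's ranking engine: the percentile scheme, the tie handling and the function names are simplifying assumptions.

```python
def percentile_ranks(values, higher_is_better=True):
    """Rank each value from 0 (worst) to 100 (best); ties are broken by position."""
    n = len(values)
    # Worst value first, so its position (0) maps to rank 0.
    order = sorted(range(n), key=lambda i: values[i], reverse=not higher_is_better)
    ranks = [0.0] * n
    for position, i in enumerate(order):
        ranks[i] = 100.0 * position / (n - 1) if n > 1 else 100.0
    return ranks


def equal_weight_rank(factor_columns):
    """Average the percentile ranks of several factors with equal weights.

    factor_columns: list of (values, higher_is_better) pairs, one per factor,
    all covering the same stocks in the same order.
    """
    ranked = [percentile_ranks(values, hib) for values, hib in factor_columns]
    n_stocks = len(ranked[0])
    return [sum(col[i] for col in ranked) / len(ranked) for i in range(n_stocks)]
```

Because no weights are tuned, the only choices left are which factors to include and which direction counts as "better" for each.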

Overall, I simply use P123 to confirm the results of the academic studies, to help manage my portfolios and to optimize small things at the periphery (i.e. number of stocks between 15 and 25 and diversification constraints).

I certainly don’t get the fantastic back tested results that are often mentioned in these forums, but I hope for my systems to be more robust. Time will tell.

Maybe I will use the system more as a way to discover new rankings when we get more historical data. I’d love for that to be the main focus of Marco’s team.

.luca

o806 · December 11, 2007, 5:09am

luca:

I understand where you are coming from. Your comment opens the door to discussing another aspect of robustness. Each of my four P123 systems that has real money on the line is based on concepts that have been tested over 10 to 20 years by others using more extensive data. However, I do not follow their parameters slavishly (and you probably are not either so do not interpret the following as criticism of your post).

I am comfortable as long as any modifications stay within the ball park of the original concept that has an extensive history of good performance. Also any changes I use have to pass my own battery of robustness tests so as to avoid over-curve fitting to random noise in the short data history we currently have in P123.

A story will illustrate my dependence, but not slavish dependence, upon long term studies by others.

Some years ago, O’Shaughnessy in What Works On Wall Street reported great results for a micro cap strategy with a price2sales < 1 filter and a 52 week price change ranking to select the 50 top stocks. O’Shaughnessy did not spend much time on this strategy in his book, perhaps because liquidity concerns made it impractical for use in his mutual funds. But what is illiquid to a fund manager who has to move millions in and out of a position can be quite liquid to individual investors who only need to move 1/1,000th of that (i.e., a few thousand) in and out. Also, O’Shaughnessy appears to have tested this with a yearly rebalance, which from other comments in his book appears to have been a limitation of the database he was using a decade ago. This microcap low price2sales strategy has become known as O’Shaughnessy’s TinyTitans even though he did not use this title, at least not in the earlier editions of the book.

AAII has done a forward test of TinyTitans starting in 1998. I consider the AAII test to be very, very important for three reasons. First, it is a forward test, which eliminates questions about survivorship biases in databases. Second, it covers 3 1/2 more years than we can currently do with P123 data, and those extra years include the 1998 crash as well as the 1999-2000 bubble and bust. Third, AAII’s forward testing uses a monthly rebalance period with a smaller 25 stock portfolio. AAII’s results are higher than O’Shaughnessy’s. It is impressive when a variation forward tested on out of sample data gets better results than the in-sample results.

So in my eyes the concept of “Value on the Move” (p2s<1 with 52 week price delta) has a back test period of about 20 years from O’Shaughnessy and 10 years of forward testing from AAII. (There may be some overlap in those years.) That is about as good as one is ever going to get.

When using P123 I could slavishly follow O’Shaughnessy’s method with 50 stocks and a yearly rebalancing, or AAII’s “improved” variation using 25 stocks and monthly rebalancing. But it seems wiser to try more variations since AAII showed that variations can be significant improvements.

Also, variation testing is an important confidence builder for me. First, if O’Shaughnessy has really found a strategy that taps into a market reality, other variations should also work. If most variations do not work, then the strategy is at best very fragile or at worst an illusion based on over-curve fitting. Good test results for most variations would confirm the basic strategy concept is sound. Second, some of the variations may well be better than the one O’Shaughnessy presents. Nothing requires a writer like O’Shaughnessy to present his best strategies; he might be sharing his 2nd or 3rd best variation.

Using P123 I could vary the rebalancing period from 1 year down to 1 week. There was a clear pattern of better results with short rebalancing periods. That confirmed AAII’s monthly was better than O’Shaughnessy’s yearly. But it also showed that every 2 weeks was better than AAII’s monthly, and weekly was better than 2 weeks. Good results of close variations give confirmation to each other.

But why stop with just improving the rebalancing period? Why not move the p2sales item from a “filter” to be part of the ranking system? P123 makes that possible. O’Shaughnessy’s original software did not allow such complex ranking. O’Shaughnessy’s most recent edition of What Works on Wall Street indicates that he is starting to explore multi-dimensional ranking systems. Just think of that. With P123, you and I have tools in the same ball park as the big boys (although they still have longer data histories until P123 extends its history). It will be no surprise to P123 users that putting p2sales along with price momentum improves the ranking system.

But should I limit myself to using a 52 week price delta for price momentum? A simple 52 week price delta looks like another artifact of limited computer power from 10 years ago when O’Shaughnessy was doing his studies and AAII was just starting their forward testing.

Why not try the 4 quarters method used in many of the models provided by P123? Or the 4, 13, 26 and 52 week price deltas provided by Reuters to P123? Why not see what a more computationally intense sma(20)/sma(100) formula might produce?
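For readers who want to see what these momentum variants compute, here is a rough sketch over a plain list of closing prices, oldest first. The helper names are made up and this is not P123 syntax; whether the bars are daily or weekly is up to the data you feed in.

```python
def price_delta(prices, periods):
    """Fractional price change over the last `periods` bars."""
    if len(prices) <= periods:
        raise ValueError("not enough price history")
    return prices[-1] / prices[-1 - periods] - 1.0


def sma(prices, window):
    """Simple moving average of the most recent `window` prices."""
    if len(prices) < window:
        raise ValueError("not enough price history")
    return sum(prices[-window:]) / window


def sma_ratio(prices, short=20, long=100):
    """Short moving average divided by long moving average; above 1 means an uptrend."""
    return sma(prices, short) / sma(prices, long)


# With weekly closes, the 4, 13, 26 and 52 week deltas are simply:
# [price_delta(weekly_closes, n) for n in (4, 13, 26, 52)]
```

All of these are different ways of measuring the same medium to long term trend, which is why it is reassuring rather than surprising when most of them "work" in the same strategy.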

As it turns out, virtually all medium and long term price momentum formulas “work” in Tiny Titan variations (shorter term one month deltas are not so hot). The strategy appears to be robust since it can withstand significant alterations in the price momentum specifics.

Furthermore, the “value” side of TinyTitans also looks remarkably robust. Indeed, if TinyTitans is tapping into a real market inefficiency related to value investments, then one should be able to vary the value factor and still get OK results. And that turns out to be so. One can get OK to good results by replacing price2sales with price2book, price2cashflow, price2freecashflow, price2earnings, or price2projectedearnings. Now the results for some variations are better than others, but all the variations “work” better than randomly throwing darts at a list of stocks.

Which value factor will work best in the next 5 to 10 years? That is impossible to know ahead of time. But given the robustness of test results, any one of the value factors should do better than a random pick from all stocks. The option favoured by some with more experience than I seems to be picking the variation that has done the best in the past. Alternatively, one could pick two or three of the better variations and go with those. I am currently taking the latter course. I may be giving up a few percentage points of annual gain, but I feel I am reducing the risk of randomness doing significant harm to my trading account. My (unproven) assumption is that using a diversity of strategy variations will be more robust at the possible sacrifice of a few percent of profit.

Brian (o806)

o806 · December 11, 2007, 6:49am

Oliver:

What a super post you provided. Here are a few comments:

I have not read Pardo’s book from 1992, but I have read several more recent ones on the topic of testing and optimization. I am sure I have a lot more to learn. As for the Way of the Turtle, it illustrates some great general principles which, if the easy reading fluff were removed, could be outlined in three or four pages. Still I am glad to have it on my bookshelf and plan to recommend it to one of my relatives.

I like this distinction between reasonable/responsible curve fitting, which illumines real underlying principles, versus overfitting resulting from an irresponsible increase in variables.

I get this. In fact, I am very suspicious of a system that has several buy rules. The first thing I do is deactivate those rules and test the simulation with 20 or 30 stocks. Only if it still “works” will I invest more time in studying it. Surprisingly, I have seen simulations with complex rules that work just as well when the rules are deactivated, which makes one wonder if some people like complexity for its own sake.
Like you I find my confidence is greater for simple systems than for complex systems with equal test results. When all other things are equal, simple is always better. Personally my “gut” prefers a simple system to a complex one even if the latter has slightly better test results.
Typically my simulations and portfolios are nearly a naked ranking system with the addition of a couple basic liquidity filters and a simple exit based on rank < X. I do most of my robust testing in ranking performance because like you say it gives so many “data points” that it is hard to over-fit unless one starts giving precise values to the weighting of each ranking factor. Once I have a good ranking system (simple, robust, and profitable), I only spend a little time in simulation runs to check out the drawdowns (which rank testing does not display) and to determine what rank < x setting to use for an exit rule.

Once I get into simulation testing, I have the Excel addon do tests for portfolio sizes for 30, 25, 20, 15, 10, 7, and 5 stocks. Generally profits increase for smaller sizes, but so do draw downs. So after the Excel addon has all results in, I add a column that divides Annual Gain by Maximum Draw Down. That is the number I use to determine if the higher gains of smaller portfolios is worth the increased pain of draw downs.
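The extra column is easy to reproduce outside Excel. The sketch below computes maximum drawdown from an equity curve and then the gain-to-drawdown ratio; the equity values are made up purely to show the calculation.

```python
def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, returned as a positive fraction."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def gain_to_drawdown(annual_gain, max_dd):
    """Annual gain divided by maximum drawdown; higher means less pain per unit of gain."""
    if max_dd == 0:
        return float("inf")
    return annual_gain / max_dd


# Illustrative equity curve only: the peak of 120 followed by 90 is a 25% drawdown.
curve = [100, 120, 90, 130, 117]
dd = max_drawdown(curve)
print("max drawdown:", dd)
print("gain/drawdown for a hypothetical 50% annual gain:", gain_to_drawdown(0.50, dd))
```

Computing this ratio for each portfolio size (30, 25, 20, 15, 10, 7 and 5 stocks) gives one number per size to compare, instead of weighing gain and drawdown by eye.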

I agree. Every P123 user should see this. I hope you have those two simulations in your “never delete” category. They are extremely helpful. Once we get more data history in P123, we will be able to test our systems on some out of sample data. I will look to see how my systems perform on the bubble-bust (1999summer - 2001summer). But I am even more interested in seeing results for the 1996-1998 bull period, since that is likely more typical than the bubble-bust. The sharp correction in 1998 will also provide a valuable stress test for our strategies.

Agreed. P123’s ranking performance testing is so useful for developing robust systems.

I am not sure I agree about more factors being harmless.
However, I fully agree that fine tuning the weightings of several factors can be a Pandora’s box of trouble. Furthermore, I expect to see a lot of “over-fit” weightings in ranking systems when/if P123 releases an Excel addon to automate testing ranking variations.

Oliver, you make so many good points. I appreciate the discussion.

Brian (o806)

o806 · December 11, 2007, 7:17am

Oliver:

The last part of your post is just so important. And I am afraid some might miss seeing it. I am putting a copy below so it will really stand out. Every new P123 user needs to read this.

Brian (o806)

… This also makes smaller stock portfolios look a lot better in backtesting than they are likely to perform in real time, and I think if people knew what the real risk/reward characteristics would look like in real time, then they wouldn’t go for it. To illustrate the point, take a look at these sims:
http://www.portfolio123.com/port_summary.jsp?portid=332143
That is a 5 stock sim based on Dan’s “TF12 system”, which is probably one of the most optimised ranking systems on p123. Looks very impressive, annualised return over 119%! I have only run it up until 2006 because this was the time during which it was developed.

Now just for fun, let’s look at a 50 stock sim:
http://www.portfolio123.com/port_summary.jsp?portid=332144
Still very impressive, returns of 77% per annum.

And here comes the issue: What do you choose, 119% per annum, or 77% per annum? It’s a tougher question than you think. Let’s walk forward a bit. This is interesting because I am confident that 2006 onwards represents “out of sample” data for the TF12 system. Let’s see what happens.

Here is the five stock sim:
http://www.portfolio123.com/port_summary.jsp?portid=332146
Disaster! A very VERY bumpy ride, drawdown over 40% (greater than the “in sample” period) and at the end of it all, you have actually underperformed the market, and have LOST money!

What happened to the 50 stock sim? :
http://www.portfolio123.com/port_summary.jsp?portid=332145
Well, it hasn’t performed as well as in the “in sample” period, but it has returned over 30% per annum with a much more manageable 21% max drawdown. I’d be happy with that.

The reason is simple: With a small number of stocks, which stocks are chosen is highly dependent on the exact weightings of the ranking system. They are subject to “over fitting”. With a larger number of stocks, the exact weightings matter much less. You have reduced the degrees of freedom, AND increased the number of data points, making it significantly more robust.
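The effect can be sketched with a toy Monte Carlo. This is not the TF12 system or any P123 data: it assumes each stock has a small true edge buried in much larger noise, picks the top N stocks by their noisy in-sample result, and then measures how much of that result disappears on fresh data. The universe size, noise level and function name are all made up for illustration.

```python
import random
import statistics


def selection_gap(n_picks, universe=500, noise_sd=3.0, trials=100, seed=0):
    """Average (in-sample minus out-of-sample) result of the top n_picks stocks.

    Assumption: result = true edge (sd 1) + independent noise (sd noise_sd).
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        true_edge = [rng.gauss(0.0, 1.0) for _ in range(universe)]
        in_sample = [e + rng.gauss(0.0, noise_sd) for e in true_edge]
        out_sample = [e + rng.gauss(0.0, noise_sd) for e in true_edge]
        # Select on the in-sample numbers only, as a backtest does.
        picks = sorted(range(universe), key=lambda i: in_sample[i], reverse=True)[:n_picks]
        in_mean = statistics.mean(in_sample[i] for i in picks)
        out_mean = statistics.mean(out_sample[i] for i in picks)
        gaps.append(in_mean - out_mean)
    return statistics.mean(gaps)


for n in (5, 50):
    print(f"top {n:>2} stocks: average in-sample overstatement = {selection_gap(n):.2f}")
```

Under these assumptions the 5 stock selection overstates its future result by more than the 50 stock selection, because the very top of a noisy ranking is where the lucky noise concentrates.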

This is one of the reasons why I said that a sim containing fewer than 10 stocks should be ignored!

Ultimately, there are still a great deal of “unknown unknowns” as well as known unknowns.

The challenge is to make simple, robust systems.

More data is better.

Simpler is better.

Endless optimisation is unlikely to pay off.

olikea · December 11, 2007, 11:50am

When optimising a ranking system, you can imagine it rather like building a hill. Naturally, if you tune the factors, you can change the hill from being a smooth rolling hill much more into a mountainous spike (much like the TF-12). If you then run a sim, you will always find reducing the number of stocks gives you higher returns - it allows you to climb further up the hill.

The problem then arises that the optimal weightings in the future are almost certainly not going to be the optimal weightings in the past. Firstly, there is the issue of white noise; secondly, there is the issue that the market is evolving and certain factors come into and out of favour. As a result of all this, the location of your “spike” is going to be in a different place. You can genuinely see this if you look at the pictures I will upload, showing the TF12 system ranked up until 2007, then during 2007-to-date. You can see that the top performing bucket has actually shifted its location.

As a result, a highly concentrated portfolio has done very badly; it has been trying to find a hilltop in the wrong location. A larger portfolio fares much better - it casts a wider net and as a result it is much more likely that the hilltop will lie within its range. Unfortunately, those trading a small port, who are “off the hill”, will conclude the ranking system has died. It hasn’t, it still has value, but you need a wider net to capture it.
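The hill picture can be put into a few lines of code. This is only a toy: the bell-shaped “hill”, the peak locations (98th percentile in sample, 94th out of sample) and the width are invented numbers, not measurements of the TF12 system. It compares a narrow net (top 2% of ranks) with a wide net (top 10%) before and after the peak moves.

```python
import math


def hill(rank_pct, peak, width):
    """Toy excess return at a given rank percentile: a bell-shaped hill."""
    return math.exp(-0.5 * ((rank_pct - peak) / width) ** 2)


def net_return(top_pct, peak, width=1.0, steps=200):
    """Average height of the hill over the top `top_pct` percent of ranks."""
    low = 100.0 - top_pct
    # Midpoint grid across the slice of ranks the portfolio holds.
    ranks = [low + top_pct * (i + 0.5) / steps for i in range(steps)]
    return sum(hill(r, peak, width) for r in ranks) / steps


IN_SAMPLE_PEAK = 98.0      # assumed location of the hilltop in the backtest
OUT_OF_SAMPLE_PEAK = 94.0  # assumed location after the hill has shifted

for top_pct in (2.0, 10.0):
    before = net_return(top_pct, IN_SAMPLE_PEAK)
    after = net_return(top_pct, OUT_OF_SAMPLE_PEAK)
    print(f"top {top_pct:4.1f}% net: in-sample {before:.3f}, out-of-sample {after:.3f}")
```

With these made-up numbers the narrow net has the higher in-sample average but loses almost all of it once the peak moves, while the wide net still contains the hilltop and barely changes.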

Firstly: Quite a lot of factors are highly correlated, e.g. price-to-cashflow should be correlated to earnings yield. A high rank in one likely means a high rank in the other. Factors that aren’t correlated, like value and momentum, are orthogonal. Effectively, “momentum value” is a 2 factor ranking, because the subranks are correlated. Indeed, the choice of multiple correlated factors seems reasonable as it helps confirm the “real” situation. If a stock has a low P/E ratio, I would be very suspicious if it didn’t also have a low P/CFL ratio too.
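One way to check whether two factors are really one factor in disguise is to compute the rank correlation between them across the universe. Below is a minimal sketch of Spearman rank correlation in plain Python. It assumes there are no tied values (ties are simply broken by position), and the factor numbers at the bottom are invented for illustration, not real stock data.

```python
def ranks(values):
    """Rank from 1 (smallest) to n (largest); ties broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for position, index in enumerate(order):
        out[index] = position + 1
    return out


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented factor values for six hypothetical stocks.
earnings_yield = [0.12, 0.08, 0.05, 0.10, 0.02, 0.07]
cashflow_yield = [0.15, 0.09, 0.06, 0.11, 0.03, 0.10]
momentum_26wk = [0.05, 0.30, -0.10, 0.12, 0.20, -0.02]

print("earnings yield vs cashflow yield:", round(spearman(earnings_yield, cashflow_yield), 2))
print("earnings yield vs momentum:", round(spearman(earnings_yield, momentum_26wk), 2))
```

A value near 1 suggests the two factors are mostly confirming each other; a value near 0 suggests they are close to orthogonal in the sense used above.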

The other issue is factor death - some factors do go out of favour, as this year price-to-sales has done (see: http://www.portfolio123.com/mvnforum/viewthread?thread=2922 ). If you have multiple factors, this reduces this risk. Is this better than having a large number of ports with single factors? I don’t know. But the latter will lack any of the synergies that can form by combining orthogonal factors (like momentum and value).

Another issue I want to raise is that of drawdowns. Recently there was a post about “Goals”, (http://www.portfolio123.com/mvnforum/viewthread?thread=2987 ) and I was disturbed by the idea you could manage a 15% max drawdown.

Remember that the out-of-sample drawdown of the 5 stock port was 40%, much more vicious than the ~25% DD in the backtest. And the out of sample was just over a 2 year period with relatively mild market conditions (compared to history). The issue that your max drawdown is frequently underestimated is a problem that caused LTCM to go bust, caused a lot of problems in the long/short hedge funds, etc. etc. The problem is you have a drawdown much larger than you expect, and at that moment, you quit. That may not be the best thing to do, at a market bottom everyone is quitting because they never realised they could lose so much money, and are completely terrified. If you really want to reduce your drawdown, the only method is to reduce the exposure to the market. It is as simple as that!
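For anyone who wants to check drawdown numbers outside of P123, here is a minimal sketch of the calculation, plus the exposure point from the last sentence. It assumes the part of the account not exposed to the market sits in cash earning nothing, and the period returns are invented for illustration.

```python
def equity_curve(period_returns, exposure=1.0, start=1.0):
    """Compound period returns, with only `exposure` of the account invested.

    Assumption: the uninvested part sits in cash earning zero.
    """
    curve = [start]
    for r in period_returns:
        curve.append(curve[-1] * (1.0 + exposure * r))
    return curve


def max_drawdown(curve):
    """Largest peak-to-trough fall, as a fraction of the peak."""
    peak = curve[0]
    worst = 0.0
    for value in curve:
        peak = max(peak, value)
        worst = max(worst, 1.0 - value / peak)
    return worst


returns = [0.10, -0.20, -0.15, 0.05]  # invented period returns
for exposure in (1.0, 0.5):
    dd = max_drawdown(equity_curve(returns, exposure))
    print(f"exposure {exposure:.0%}: max drawdown {dd:.1%}")
```

Halving the exposure roughly halves the drawdown in this toy series, and it reduces the upside in the same way.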

decoder · December 11, 2007, 11:55am

Oliver and Brian,

Thank you for your important (crucial) posts in this thread.

I have two comments/questions which try to merge some important ideas from some diverse (diversified) people:

(1) It makes sense to me that a smaller number of stocks in a port results in both a higher return and a higher DD. The higher return is due to the fact that we are dealing with an average rank (over all stocks in the port) which is higher than in larger ports - as Denny has pointed out. The higher DD would, I think, be reduced if we could run several small sims together as one aggregate sim (not a priority, just a comment).

I say this because I think that a smaller number of stocks is actually more diversified than it appears - as long as one can view it as “time-diversified” rather than “broadly diversified at any given slice of time”. So, if we have several small ports, we have both breadth-diversification and time-diversification, and we have the added boost that our rankings are all very high, and we have a diversification of strategies. Does this make sense or am I missing something?

(2) I think that Oliver’s new port (value momentum projected PE) improves on the O’Shaughnessy model in a way that is very similar to Brian’s description of his own research. Are there other simple ideas that have been backtested for a few decades, which could be developed in a similar way in P123? Can anyone point me to 1 or 2 or 5 of them in the literature?
Or better yet, some existing sims or ports on p123 in whatever stage of development they are in?

Thanks for any leads,

Art

decoder · December 11, 2007, 12:19pm

Oliver,

I see that your post arrived 5 minutes before mine, and it raised the issue of the shifting hill which I had not thought of (so my question 1 is answered - yes I was missing something). Still, maybe there is a balance somewhere in the middle - have 3 or 4 medium sized ports, each accepting the top 98% rank or whatever makes sense (hoping that the hill does not move too far away, as in your charts).

But my question 2 (about what other strategies are out there) remains regardless. Some of the strategies (potential ports) identified in response to that question will hopefully be different enough from the others to provide port-level diversification (a simple wish but I suspect not a snap to realize). Thanks again,

Art

olikea · December 11, 2007, 12:19pm

BTW - just a quick note on additional data

A lot of factor performance I knew about before coming to p123. Obviously, “Value” is highly publicised, momentum is known about, ROE is frequently popping up etc.

Therefore it is possible to look at data that is outside of the p123 scope.
Sources:

-Multiple academic papers on the P/E effect. I suggest you do a search. Interestingly they also conclude it is stronger for smaller cap stocks

-A few academic studies showing the underperformance of highly shorted stocks (if requested I can try and find links)

-What Works on Wall Street - again excellent book everyone needs to have it

-Joel Greenblatt’s The Little Book That Beats the Market - simple quant ranking that shows high returns based on P/E and ROC

-Navellier’s The Little Book That Makes You Rich - somewhat more complicated quant factors based on EPS gain year-over-year

-backtest.org - a good, if simple, FREE site that allows you to backtest rudimentary screens, back as far as 1986. For example, you can choose “R26 top 20”, which will pick the top 20 stocks with the highest 26 week relative strength (see Screen Builder), and set the rebalancing period. This site allowed me to conduct experiments so I already knew a fair amount about momentum before p123 (the timescales etc.). The logic of the fields allows you to do things like “CPE Bottom 20%”, then another rule like “R26 top 10”, so you can pick the top 10 stocks by RS from the bottom 20% bucket of all stocks by current P/E ratio. Obviously it pales in comparison to the sophistication of p123, but going back to 1986 makes it very interesting. The universe is the Value Line universe, which is limited to about 1700 stocks, but the flip side is they are all likely to be liquid.
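The two-rule screen described in the last item (“CPE Bottom 20%”, then “R26 top 10”) can be sketched like this. It is not backtest.org’s code: the field names `ticker`, `pe` and `rs26` are made up, and dropping stocks with a missing or negative P/E before ranking is my own assumption about how such a screen would treat them.

```python
def two_step_screen(stocks, pe_bottom_fraction=0.20, rs_top_n=10):
    """Keep the cheapest fraction by P/E, then the top N of those by 26 week RS."""
    # Assumption: stocks with no earnings or negative earnings are excluded.
    with_pe = [s for s in stocks if s["pe"] is not None and s["pe"] > 0]
    by_pe = sorted(with_pe, key=lambda s: s["pe"])
    cutoff = max(1, int(len(by_pe) * pe_bottom_fraction))
    cheap = by_pe[:cutoff]
    # Within the cheap bucket, the highest relative strength comes first.
    by_rs = sorted(cheap, key=lambda s: s["rs26"], reverse=True)
    return [s["ticker"] for s in by_rs[:rs_top_n]]


# Invented universe of ten hypothetical stocks.
universe = [
    {"ticker": f"T{i}", "pe": float(i), "rs26": float(i % 3)}
    for i in range(1, 11)
]
print(two_step_screen(universe, pe_bottom_fraction=0.20, rs_top_n=1))
```

The order of the rules matters: the second rule only ever sees the stocks that survived the first.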

Ultimately, P123 allows you to fine tune a lot of this information, so in reality, I feel like I am using more than just the 7 years of data. Nevertheless, more data would be very useful (hint hint).

lucabol · December 12, 2007, 12:28am

Brian:

Thanks for your reply to my post. I appreciate the discussion.

I do exactly what you do. I don’t follow the literature slavishly, but I’m very careful in my modifications and experiments. I too translate hard rules (i.e. PS < 1) to ranking systems. I too tend to merge similar factors in the academic literature (i.e. PE, PC) to create a common node (i.e. Value).

Regarding shortening the rebalancing period: my personal experiments, including taxes, show me lower overall returns with shorter periods (albeit with lower drawdown). Usually my with-taxes breakeven point is at about 6-13 months. Moreover I don’t enjoy the mechanics of trading low volume stocks (it takes time to get a good fill). For these reasons I tend to pick longer rebalancing periods.

Something I haven’t seen mentioned in this very good thread is the use of the screener. I use it a lot as it allows me to test a ranking system in each possible starting week of the historical data with performance against an index. It is somewhat more cumbersome to do the same with simulations or the add-in.
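The idea of testing every possible starting week can be sketched as follows. This is not how the P123 screener works internally; it simply assumes you already have weekly returns for the strategy and for the index as two equal-length lists, and the function names are made up.

```python
def compound(returns):
    """Total return from compounding a list of period returns."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0


def rolling_start_excess(strategy, benchmark, window):
    """Excess return over the benchmark for every possible starting week."""
    if len(strategy) != len(benchmark):
        raise ValueError("strategy and benchmark must cover the same weeks")
    excess = []
    for start in range(len(strategy) - window + 1):
        s = compound(strategy[start:start + window])
        b = compound(benchmark[start:start + window])
        excess.append(s - b)
    return excess


def win_rate(excess):
    """Fraction of starting weeks in which the strategy beat the benchmark."""
    return sum(1 for e in excess if e > 0) / len(excess)


# Invented weekly returns for illustration.
strategy = [0.02, -0.01, 0.03, 0.00, -0.02, 0.04, 0.01, -0.01]
index = [0.01, 0.00, 0.01, 0.01, -0.01, 0.01, 0.00, 0.00]
results = rolling_start_excess(strategy, index, window=4)
print(f"{len(results)} start weeks, win rate {win_rate(results):.0%}")
```

Looking at the whole distribution of start weeks, rather than one start date, shows whether a good backtest depends on a lucky entry point.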

Is anyone else doing the same?

.luca

dwpeters · December 12, 2007, 2:54am

Hi,
I recently created a ranking and sim and I think the process I used was effective. I started with Dan’s list of top ranking factors:
http://www.portfolio123.com/mvnforum/viewthread?thread=1742#6882 (Thanks Dan!)

I then retested the top ranked factors, plugged them into his spreadsheet and obtained a new score for the 5 years through the middle of 2007. I sorted the factors according to the ranking from the spreadsheet, then went down the list selecting only a few value factors and then the top factors from other categories. I added a couple other factors I like, then added a liquidity filter - I don’t have a custom universe and I think it is important to test the factors at least excluding the very low liquidity stocks that I would not trade. (fyi, I also tested factors with the large cap universe and the results were surprising, some top factors actually had negative scores when tested for large cap).

Except for liquidity I assigned each factor equal weight and went through the factors one at a time, assigning a factor 0 weight and noting whether it improved or hurt the ranking performance. After this I threw out the few that didn’t really seem to help, increased the weight of the one that helped the most and left the others at equal weight. I created 2 variations in this manner, one yielding about 100% and one about 120% in the top bucket.
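The one-at-a-time test described above is a leave-one-out check, and the bookkeeping can be sketched in code. The `evaluate` function is a stand-in for whatever you actually do to score a ranking system (for example, running Ranking Performance by hand and noting the top bucket return); the toy scorer and its per-factor numbers below are invented so the example runs on its own.

```python
def leave_one_out(weights, evaluate):
    """Change in score when each factor's weight is set to 0, one at a time.

    A positive number means the ranking scored better without that factor.
    """
    baseline = evaluate(weights)
    impact = {}
    for name in weights:
        trial = dict(weights)
        trial[name] = 0.0
        impact[name] = evaluate(trial) - baseline
    return impact


# Hypothetical per-factor edge, used only by the toy scorer below.
TOY_EDGE = {"value": 0.30, "momentum": 0.25, "noise_factor": -0.05}


def toy_evaluate(weights):
    """Toy scorer: weight-normalised average of the made-up factor edges."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(w * TOY_EDGE[name] for name, w in weights.items()) / total


equal_weights = {name: 1.0 for name in TOY_EDGE}
for name, change in leave_one_out(equal_weights, toy_evaluate).items():
    verdict = "candidate to drop" if change > 0 else "keep"
    print(f"{name:>12}: {change:+.3f} ({verdict})")
```

Keeping the weights equal and only asking “does it help or hurt?” tunes far fewer numbers than optimising every weight, which is part of why the process stays simple.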

From these ranking systems I tested 10 stock sims using typical buy/sell rules that I use with little adjustment - pretty aggressive, 20-30 day holding period. From there I tested 20 stock, 5 stock, and 3 stock variations. The higher ranking system had a wicked drawdown in 2004 across all sims. The lower ranking system had very good returns across all sims with very reasonable drawdowns. This sim has consistently good performance year to year. I’m trading a 5 stock version using restrictive buy rules. (and after reading this thread I tried a 50 stock version: 63% AR, 17.5% DD over 5 years using 3 buy/sell rules)

This method was not too time consuming, I didn’t even try to optimize the factor weights and I only use a few buy/sell rules in the 10 to 20 stock versions. I added a few more rules to the 5 stock version to help control risk but it does well without them too. I did specifically focus my efforts on the ranking system, after that the sims were easy.

Given the dramatic shift away from certain value factors, I’m thinking this is an exercise that should be repeated regularly. Next time I will split current history apart from prior years to see which factors hold up across different market environments, or even to target factors that are currently performing well.

This is an excellent thread and in particular Oliver’s post re. out of sample results was very illuminating. It’s much more difficult to get good results in real life than it is in the backtests. I’m interested in comments or critiques of this process - I’d rather take my shots here than in the market.

Don

o806 · December 12, 2007, 3:00am

Although those of us living north of the border usually pay more in taxes, we get a really good deal on capital gains. Up here there is no distinction between short and long term capital gains. All capital gains are taxed at 1/2 of one’s normal tax rate. So if my tax rate is 30%, I pay 15% on capital gains whether I have held the security for 5 days, 5 months or 5 years.
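As a quick worked version of that arithmetic, here is a small sketch using only the rule as stated in this post (half of the normal rate, regardless of holding period). The function name is made up, and treating a loss as simply owing no tax is an assumption of the sketch.

```python
def capital_gains_tax(gain, normal_rate, fraction_of_rate=0.5):
    """Tax on a gain when capital gains are taxed at a fraction of the normal rate.

    Assumption: a loss (or zero gain) simply owes nothing in this sketch.
    """
    if gain <= 0:
        return 0.0
    return gain * normal_rate * fraction_of_rate


# The example from the post: a 30% normal rate gives 15% on the gain.
tax = capital_gains_tax(1000.0, 0.30)
print(f"tax on a 1000 gain at a 30% normal rate: {tax:.2f}")
```

The holding period never enters the calculation, which is the point being made about 5 days versus 5 years.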

It sure makes trading simpler having a consistent tax rate.

Brian (o806)

probtrader · December 12, 2007, 10:13am

To be fair, we should compare a large portfolio of, say, 50 stocks and 20 factors to a blend of portfolios of 5 stocks each picking the highest (extreme) values for a reduced set of factors.

An interesting point about olikea’s TG ranking system is that it uses a measure of risk (sharpe(1, 120)) as one of the highest weighted factors. This surely helps to pick stocks with smoother returns. I haven’t seen many ranking systems here using volatility factors. Am I wrong?
Some research suggests that lower risk in stocks is actually indicative of higher returns:
http://www.cxoadvisory.com/blog/external/blog4-23-07/default.asp
http://www.cxoadvisory.com/blog/external/blog1-26-06/

o806 · December 12, 2007, 8:44pm

olikea: … -backtest.org - a good, if simple, FREE site that allows you to backtest rudimentary screens, back as far as 1986. For example, you can choose “R26 top 20”, which will pick the top 20 stocks with the highest 26 week relative strength (see Screen Builder), and set the rebalancing period. This site allowed me to conduct experiments so I already knew a fair amount about momentum before p123 (the timescales etc.). The logic of the fields allows you to do things like “CPE Bottom 20%”, then another rule like “R26 top 10”, so you can pick the top 10 stocks by RS from the bottom 20% bucket of all stocks by current P/E ratio. Obviously it pales in comparison to the sophistication of p123, but going back to 1986 makes it very interesting. The universe is the Value Line universe, which is limited to about 1700 stocks, but the flip side is they are all likely to be liquid…

olikea:

Thanks for the tip. Backtest.org looks like it can fill a gap until P123 releases its extended database.

I have some questions I hope you can answer since backtest.org website is a bit confusing. So I have started a new thread which is located here: http://www.portfolio123.com/mvnforum/viewthread?thread=3002

Brian (o806)

jbarnh · December 12, 2007, 10:13pm

Brian,

Since I am a US citizen: how does Canada define a long-term cap gain and a short-term cap gain? Is all trading of less than 1 year considered a short-term cap gain?

Thanx Jbarnh