Be careful of 'pretty' sims and 'smooth' equity curves

Are R2Gs focused on the wrong things?

100% AR sims with dozens of rules, market timing, complex ranks, etc., for finding just the right way to trade small ports of really thin-liquidity stocks, producing super smooth equity curves and super high turnover. Designers want subs. I get it.

But, my time on P123 has shown me that these systems don’t usually hold up.

So…I am attaching a little PowerPoint I made to remind myself to be careful, given the launch of R2Gs and the seeming 'ease' with which people are cranking out 'world-beating' strategies.

This was just one 'randomly chosen' great past sim. I have tracked many amazing sims since I joined. Out of, say, 50 or more 'great sims', maybe two or three have maintained out-of-sample performance. Probably one after I adjust for the kinds of slippage I have seen in my own humble trading. That one may be luck.

I don’t doubt that some people are making a lot. But, it’s people who have invested a lot of time learning, building and testing…and maybe also gotten a little (or a lot) lucky along the way.

Best,
Tom


RobustExample_8_7.pdf (687 KB)

Thanks Tom - this was very interesting, and will be useful to a relative newbie like me.

  • Andy

We are going to publish a robustness checklist, thanks for the ideas.

Let's be grateful 2008 happened; everything got flipped around, especially over-optimized sims. It's valuable data, giving us two very distinct periods. You could say that an "adaptive" system is needed that adjusts weights and whatnot based on recent performance, etc., etc., exponentially increasing the complexity.

But time-tested strategies like Piotroski's, relatively simple and consistent, should do well in either period.

Here are the four years before 2008 and the four years from 2008 on.


Many thanks for the presentation. I have spent the last year coming to the same conclusion through trial and error, and, fortunately, without live portfolios. Nice to get independent (and well documented) validation.

Excellent Tom! Your pdf should be required reading for every R2G subscriber (and designer).

Just one small issue: did you check whether it was the popularity of this system in your case study that reduced performance? If it was, then it would not be illustrative of non-public systems. One way to check whether the problem was popularity is to add a rank rule to buy only the second-best-ranked stocks and see if performance improves (a sketch of this check appears below). The fact that performance went up with higher liquidity does hint that this was the case. So does the fact that it improved with a different ranking system.

(To illustrate how popularity weighs on performance: before Joel Greenblatt's Magic Formula became popular, its highest-ranked stocks did best and there was a strong correlation between rank and return across the buckets. However, since it became popular, which was about the time the managed accounts using this formula were launched, the highest-ranked stocks have just barely beaten the market while the other buckets remain strongly correlated. This suggests that returns went down not so much because the period was out of sample but because the formula attracted too much money, which goes into the top bucket.)
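
To make that check concrete, here is a minimal Python sketch of the idea (offline, not a p123 feature; the CSV file and column names are hypothetical placeholders for an export of closed positions with their rank percentiles at purchase):

    # Hypothetical sketch: compare the top rank bucket to the next bucket down.
    # "oos_trades.csv", "rank_pct" and "pct_return" are placeholder names.
    import pandas as pd

    trades = pd.read_csv("oos_trades.csv")

    top    = trades[trades["rank_pct"] >= 99]                                # best-ranked names
    second = trades[(trades["rank_pct"] >= 98) & (trades["rank_pct"] < 99)]  # next bucket down

    print("Top bucket mean return:    %.2f%%" % top["pct_return"].mean())
    print("Second bucket mean return: %.2f%%" % second["pct_return"].mean())

    # If the second bucket beats the top bucket out of sample, crowding in
    # the most visible names is one plausible explanation for the shortfall.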

I designed over 100 systems when I first joined StockScreen123 (the former sister site); most of them did well over 35%/year with rules that made sense. I didn't invest in any of them because of concerns about out-of-sample performance. My fears were later proven well founded: perhaps three of them continued to do well out of sample (similar to what you found). As Don Peters and others (Olikea?) point out: with buy and sell rules, including market timing, less is more. Ranking systems are much, much more predictable out of sample.

As you described so eloquently in your pdf, a trend rule detracts from out-of-sample results. Trend has performed in reverse since 2008, as I pointed out here. This highlights the problem with buy rules: a single buy rule using a factor that goes out of style can break a sim, and factors go in and out of style.

Thank you Tom, very good work that can save money. I share your perplexity about over-optimization…but I still cannot say where the thin line is between a solid edge in the market and data mining. It's like painting a canvas, never knowing when to stop.

Tom,

What about re-running with 30 or 40 stocks? That way you will see whether the ranking system is really over-optimized. The issue with small portfolios has been debated here a number of times.

Tom - good material. I collected a lot of P123 public systems over the years. I scrapped them all twice, once in 2009, and again in 2011. I started from scratch both times mostly with my own ideas. And I vowed not to be caught up in these micro/small cap stocks that don’t have the liquidity to get out of when you have to. I’m still learning.
Steve

Hi,
Excellent presentation. Back in 2008 I set up some portfolios to track some of my favored rankings. I used liquidity rules of AvgDailyTot(50) > 100,000 and close(0) > .01, and most were switched to All Fundamentals after the Compustat switch. The purpose of the portfolios was to track and compare the rankings, not to prove they are tradable, so I included no fees and no buy or sell rules. These ports simply rotate into the top stocks each week. After some experience trading such low-liquidity stocks, I don't trade them anymore.

Since early 2008 these portfolios have, in real time, with few changes along the way, gained an average of 43%/yr, and that is through the bear market without timing. They are NOT tradable as they are, without fees, but they do show that good ranking systems are predictive of future outperformance. It's up to the designers to create strategies that capture this edge and not simply data mine.

I'll also mention that after seeing the long hold times in some of the high-performing ports, I thought perhaps I was missing something. I recently tested a number of buy/sell rules across a couple of different sims to see what worked. Then, in developing a new strategy for myself and R2G, I tried some of the better buy/sell rules. I still do better without most of them, though my turnover is also still higher than the R2G average. My latest strategy (waiting on the hedging function) uses only 4 buy/sell rules, 2 of which are ranks and one of which is a correlation. After testing hundreds of variations of popular buy/sell rules, my conclusion is as it was before: for ranks, more factors are better; for buy/sell rules, less is better.

Here are the charts; the portfolios and the rankings should all be public.

Don


Zrank ports.pdf (288 KB)

Really excellent and informative thread. Thank you.

I seldom wish to comment about R2G models in Portfolio123.com as I respect all the designers in this forum.

I stand by my position that my investment returns have been much better using my portfolio123.com models than without them.

Ask yourself this question: if you did not have any R2G models or your own models to follow, what else would you do?

I can think of a few possibilities to illustrate the point:

  1. Search all the stocks in the DOW 30 for those with SMA(5) > SMA(20) and stay with them until SMA(5) < SMA(20) before you sell. You can likely achieve the objectives of high liquidity and low slippage. But my question is: what will be your buy price? Exactly the price at which the moving averages crossed each other, or a higher price, since the crossover happened three or four days ago, so that your slippage ends up being a much higher percentage? (Unless you check for the MA crossover every day; a rough crossover check is sketched in code after this list.)

  2. Follow the Dogs of the Dow by buying the few stocks in the DOW that pay the highest dividend yields, exceeding say 3-3.5% payout per year. Do nothing after buying and rebalance yearly by reviewing the dividend payout again. How do you calculate slippage in this case? I really don't know.

  3. Trade the S&P 500, DIA, or Russell 2000 3X ETF, buying when SMA(5) > SMA(20) and selling when SMA(5) < SMA(20). In this case what will be the slippage? I don't know, because it depends on how you measure it (with reference to which point, and with a daily, weekly, or hourly check).
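
For points 1 and 3, here is a rough Python sketch of the SMA(5)/SMA(20) crossover check (pandas only; "prices.csv" with a "close" column is a hypothetical daily price file, not a portfolio123 export):

    # Rough sketch of the SMA(5) > SMA(20) crossover from points 1 and 3.
    import pandas as pd

    prices = pd.read_csv("prices.csv", parse_dates=["date"], index_col="date")

    sma_fast = prices["close"].rolling(5).mean()
    sma_slow = prices["close"].rolling(20).mean()

    in_position  = sma_fast > sma_slow                                   # hold while SMA(5) > SMA(20)
    crossed_up   = in_position & ~in_position.shift(1, fill_value=False)
    crossed_down = ~in_position & in_position.shift(1, fill_value=False)

    # If you only look at this weekly, the signal may already be several
    # days old, which is exactly the slippage question raised above: you
    # do not get the price at the cross, you get Monday's open (or worse).
    print(prices.loc[crossed_up, "close"].tail())      # recent buy signals
    print(prices.loc[crossed_down, "close"].tail())    # recent sell signals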

My opinion is that any R2G models, or our own portfolio123 models, are mathematical models. They have to be based on a quoted price, usually the end of the day, maybe Friday's closing price, since most of us only have time over the weekend to do this rebalancing work. So some form of slippage is built into our way of buying and selling.

So what's the bottom line? My conclusion is that what matters is whether the system we follow enables us to make real profit realistically, and whether the model(s) we follow behave in a predictable way. If the answers are "YES", the question then becomes whether you can accept the risk of following it, since to an R2G subscriber it is black-box trading, with the only known downside being the max drawdown over the last 10-12 years and how the model behaved during 2002, 2008, and 2011, which were very volatile years for stocks. It will be good to revisit this topic two or three years from now, after the R2G models go through a serious downdraft in the stock market, and see how many of them survive that acid test.

In addition, to follow an R2G model, the best approach is still to watch market price movement in the first half hour of trading on Monday morning and complete the trades within the first hour, to try to match the prices in our model. Will we suffer slippage? We definitely will, regardless of which method we follow, even the methods above that do not use portfolio123. My own measure of whether a model works well is the ratio of average gain % per stock to average days held.

E.g.
Case A:
avg gain is 9% and avg holding is 30 days; ratio is 0.30

Case B:
avg gain is 6% and avg holding is 20 days; ratio is 0.30

So in which case do we need to pay more attention to slippage? I think it is very clear that Case B will have more issues with slippage due to its shorter holding period and more frequent trading.

If we have a
Case C (say a micro-cap port with slippage of 1% on the buy side and 1% on the sell side):
avg gain is 12% and avg holding is 40 days; ratio is 0.30
Accounting for the slippage it becomes:
avg gain is 10% and avg holding is 40 days; ratio is 0.25

The question is: are you satisfied with a ratio of 0.25 for this strategy? If the answer is yes, then it is OK to allocate money to it.
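
The same arithmetic in a small Python sketch, with slippage treated as a flat percentage per side (a simplification, matching Case C):

    # Gain%-per-day ratio from Cases A, B and C, with round-trip slippage.
    def gain_per_day(avg_gain_pct, avg_days_held, slippage_pct_per_side=0.0):
        net_gain = avg_gain_pct - 2.0 * slippage_pct_per_side   # buy side + sell side
        return net_gain / avg_days_held

    print(gain_per_day(9.0, 30))          # Case A: 0.30
    print(gain_per_day(6.0, 20))          # Case B: 0.30
    print(gain_per_day(12.0, 40, 1.0))    # Case C after 1% per side: 0.25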

I write all of this not to defend my R2G models but to defend the great job Marco and his team have done in making R2G models available on Portfolio123.com. Please bear in mind that portfolio123.com needs to make a profit as a company as well, in addition to offering this great service to all of us.

In addition, every one of us has a different risk tolerance and a different market temperament. These differences in temperament will also create different slippage percentages in our own trading as we try to mimic the model performance.

So here I rest my case on the heavy focus on liquidity and slippage. Please do some calculations first, compare them with the other strategies I laid out above or with other investing strategies you have, and you will be smart enough to figure out what makes sense for your own stock-trading temperament and whether to follow R2G models or to leave them alone.

Wuu Yean

Terrific work, Tom.

Here are a few of my thoughts:

  1. Although data mining is much discussed, I suspect there is still a lot of it going on, based on what I've seen in some high-performing public models I looked at. Perhaps the best rule of thumb on factor selection comes from James O'Shaughnessy, who constantly dances on the edge of this himself: "if there is no sound theoretical, economic, or intuitive, common sense reason for the relationship, it's most likely a chance occurrence." In other words, if you can't logically explain why you think a factor should work, don't use it. Don't even bother to test it (you may be worse off if it tests well).

  2. Testing involves almost as much, perhaps more, strategy than model construction. Think about this as you evaluate results. I get nervous when I see a rank-performance chart rising at a straight 45-degree angle. To me, a “decent” result on a strategy I believe in trumps a stupendous performance on a strategy about which I’m not all that sure. Also . . .

  3. Be very thoughtful about benchmarks. Recently, for example, I developed a REIT strategy with the idea of writing about it on Seeking Alpha. It looked favorable against all the p123 benchmarks available. But I tested it against IYR (the iShares REIT ETF) and it just tracked that benchmark almost penny for penny. (I ran a backtest of ticker("iyr"), downloaded the test-result series into Excel, and compared it to the strategy performance series, which I also downloaded; a rough sketch of that comparison appears after this list.) As a result, I abandoned the project and won't resuscitate it unless/until I can beat IYR. Notice my R2G "Cherry Picking the Blue Chips" models; they don't look so great. Notice, though, the benchmark I'm using: the S&P 500 Equal Weight index. I could have made them look better by using SPY, but that would not have been appropriate. When selecting just 10 stocks out of the S&P 500 to be equally weighted, the question of a small-cap effect comes into play (even within this universe), so I need a test that controls for that. For the Low-Priced Stock newsletter p123 does together with Forbes, I have to compare the outcomes to the S&P 500 and Russell 2000 because that's expected in the world at large. But to actually evaluate the model on my own, I compare it to a custom index of sorts that I created via a p123 screen, which works with the same low-priced universe as I define it, with the same liquidity rules, but consists of three sub-samples: stocks highly ranked, stocks with mid-level ranks, and stocks with low ranks. I have to do that to make sure my model is really working; doing something more than mining the low-priced group. Indexing is a huge and sophisticated business. Outfits like Dow Jones, MSCI and S&P have thousands of indexes, far more, obviously, than are exposed to the general public. That's because portfolio managers pay up to get very finely tuned, and sometimes even custom, benchmarks that tell them what they really need to know. With so many low-liquidity stocks in high-performing p123 models, it is very important for people to really think about how they are controlling for the small-cap effect; even the Russell 2000 benchmark might not be enough.

  4. Be at least as thoughtful, or maybe even more so, about the time periods over which you test. I'm a quant aberration in that I don't believe in long-period robustness testing. My inclinations are toward an obscure but emerging quant crowd that believes in regime switching. The world changes. That's the one thing we know with absolute 100% certainty. To expect a model to work across all time periods (it's nice if you can get it), you'd have to assume that structural changes in the world are all irrelevant. Often, though, I know that's a flawed assumption. For example, I ignore 2008. It was a crash. Investors fled stocks en masse regardless of technicals or fundamentals (actually, some argue that better stocks sold off worse because those were the assets for which many panicky hedge funds could get bids), and I know full well that if we get another crash like that, either we're out of stocks or we're screwed regardless of what models we're using. I also discount everything that happened between 2000 and 2007. I don't want to know it; I don't want to look at it. That period was the golden age of screening and ranking, when these platforms were just really starting to break out of the previously limited FactSet-type world. So these newly created objective, rational models went out into a world where so many were still trading on the basis of nonsense. Not surprisingly, almost any tolerably decent model performed very well. To a large extent, it wasn't so much the particulars of a model, but the mere fact that one was simply using a competent model, any competent model. Fast forward to today. Many people still don't use platforms like p123, but a heck of a lot more (especially big money) do than was the case a decade ago, enough to burden well-backtested models with the concept known in arbitrage as the "crowded trade." We can still create good models, and many of us do. But we have to work harder (and often consider what the nanosecond algorithmic traders are not picking up) and test harder, and one way I do the latter is to ignore the golden-age period. This, I'm guessing, is the single biggest issue behind models that test well but don't perform well.
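
For what it's worth, the IYR comparison in point 3 boils down to something like this once both performance series are downloaded (a minimal pandas sketch; the file names and "date"/"value" columns are placeholders for whatever the downloads actually contain):

    # Sketch of lining a strategy's performance series up against a simple
    # ETF backtest (IYR) to see whether it adds anything beyond tracking it.
    import pandas as pd

    strategy = pd.read_csv("reit_strategy.csv", parse_dates=["date"], index_col="date")["value"]
    iyr      = pd.read_csv("iyr_backtest.csv",  parse_dates=["date"], index_col="date")["value"]

    returns = pd.DataFrame({"strategy": strategy.pct_change(),
                            "iyr":      iyr.pct_change()}).dropna()

    excess      = returns["strategy"] - returns["iyr"]
    correlation = returns["strategy"].corr(returns["iyr"])

    # The 252 factor assumes a daily series; use 52 for weekly downloads.
    print("Correlation with IYR:        %.3f" % correlation)
    print("Annualized excess return:    %.2f%%" % (excess.mean() * 252 * 100))
    print("Tracking error (annualized): %.2f%%" % (excess.std() * 252 ** 0.5 * 100))

    # Correlation near 1 with excess return near zero means the "strategy"
    # is mostly just re-creating the benchmark, which is what happened here.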

To be clear: I definitely wasn’t trying to attack anyone or P123. Sorry if it felt that way. I love P123. I love trading. And the launch of R2G has given me and the site a lot of new energy and ideas. I have a lot of respect for a lot of traders here and agree Marco and his team have done a great job, and that there are systems (complex and simple) that can make real money.

I'd just like more standards for documentation, testing, and reporting of sim results…as well as for determining the total number of subs and the maximum system liquidity for the reported results. It would be great to have a more detailed 'standard' reporting process for all R2G models that are submitted and cost over a certain amount ($50 a month?).

Best,
Tom

I agree with this measure. It is easily obtainable from the Trading Stats of each R2G model. Any model with a ratio lower than 0.10 should be avoided.

Marc - I believe there is a great need to have an option in P123 to make the benchmark an equal-weighted version of the stock universe being employed.

Steve

mgerstein,

So that leaves you with the periods 1999 and 2009-2013, the bull markets. This provides a test of whether models' high ARs are due to market timing only, or whether they outperform in bull markets too.

Actually, I don't pay much attention to 1999, which is about as irrational a period as the market has had in who-knows-how-long.

But, yes, on the whole, I am left with a fairly short period. And even the post-2009 era can be subdivided into different sub-sets. So ultimately, my go or no-go decisions are based not so much on any statistical calculation(s) as on my understanding of, and degree of comfort with, why the model performed as it did under particular circumstances. Using historical data to make inferences about the future is both art and science.

(And actually, I haven't been using market timing lately in my own work, but I would if I developed timing approaches that could handle the macro-driven market problems we've had periodically post-2008. And even there, unless the stock model was good ex the timing, I'd use the timing just to go in or out of SPY or something like that.)

That would be valuable, but we might want to go further: perhaps the ability to create a model and use that as a benchmark. The model could be one or more tickers, all stocks in a particular universe, all stocks in a particular universe filtered by some enabling rules involving liquidity, sector, market cap, etc., or even a replica of the basic model without one or two rules whose efficacy you'd like to test.

Benchmarks really answer two questions: (1) Does my strategy accomplish something worthwhile? (2) Is my strategy better than a simple, easily available alternative (e.g., my comparison of my REIT model with IYR)? It really is an interesting area.

Will look into it.

This is an excellent post. Thanks for sharing.

I like this, a lot. I truly do. But what can you conclude? I think this is a perfect existence proof. Just as you only need one black swan to prove there are black swans, you only need one disappointing port to prove ports can disappoint.

What would be nice is to study a large sample of out-of-sample ports. Now where can we find that…?

I know. Every R2G port is really an out-of-sample port, or a prospective study. We have enough R2G ports to have statistical significance, although perhaps a relatively short time frame for each. This will get better with time.

I propose that we keep an accurate record of all R2G ports including any that are discontinued and perhaps look at the results statistically.

Regards,

Jim