Be careful of 'pretty' sims and 'smooth' equity curves

DIFFERENCE IN PORTS/SIMS PRE-CRASH AND POST-CRASH

As Marco and Marc Gerstein pointed out, the difference between pre-crisis and post-crisis performance is astounding. So much so that I don’t even bother backtesting before March 9, 2009 (the date I called - in print - the post-crash start of the rally). Remove all the data from December 10, 2007 to March 9, 2009 from your backtesting. What worked in the 2003 - 12/10/07 rally doesn’t work in the current one, and vice-versa.

However, I now add 1 simple factor to sims that worked in the last 4-year bull rally, run them on post-3/9/2009 data, and it produces astounding performance (whether in-sample or out). This factor is probably the last thing you would think of… Care to guess?

As Marc Gerstein noted, conditions change, and Ranking Systems that worked well with the data from 2003-to-Dec. 2007 are not applicable in the current environment, but will be extremely useful after the next consolidation+crash (which should start soon, with a real ‘crash’ beginning the middle of next year, if not before).

I believe that the next crash will be the final crash of this 17-year sideways consolidation, with a 17-year secular bull rally to follow (starting in 2017). Count on the S&P to get back to the 700-800 level. Remember, these crashes and rallies create exactly what value investors look for: volatility. Volatility is our friend, and I’m counting on more government-created chaos in the future!

Tom,

I am still interested in the 40 stock sim. Or, if you tell us the ranking system I could do it myself.
Thanks

Not easy, but for those who want long tests, I suppose one thing that could be done is to develop and test models as if today were 1/2/2009. Everything done should be as if that were the case and make a yes-no decision on that basis. Then, see how it does 1/3/09 thru the present which, I suppose, would be the functional equivalent of an out-of-sample test.

Obviously, that could also be done with shorter periods; i.e. pretend today is 8/1/10, build a model on that basis, and then run a three-year out of sample test. . . . etc.
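The "pretend today is some past date" idea above can be sketched roughly like this. It is only a minimal illustration, assuming you have exported a series of dated period returns from a sim; the function names and the sample numbers are hypothetical placeholders, not real results.

```python
from datetime import date

def split_by_cutoff(returns, cutoff):
    """Split (date, period_return) pairs into in-sample (<= cutoff) and out-of-sample (> cutoff)."""
    in_sample = [(d, r) for d, r in returns if d <= cutoff]
    out_sample = [(d, r) for d, r in returns if d > cutoff]
    return in_sample, out_sample

def compound(returns):
    """Total compounded return of a list of (date, period_return) pairs."""
    total = 1.0
    for _, r in returns:
        total *= 1.0 + r
    return total - 1.0

# Placeholder returns for illustration only, not real sim output.
sample = [
    (date(2008, 6, 30), 0.10),
    (date(2008, 12, 31), -0.05),
    (date(2009, 6, 30), 0.20),
    (date(2009, 12, 31), 0.10),
]

# Build and judge the model using only data up to the pretend "today",
# then look at everything after it as the out-of-sample check.
in_s, out_s = split_by_cutoff(sample, cutoff=date(2009, 1, 2))
print(f"in-sample: {compound(in_s):.4f}, out-of-sample: {compound(out_s):.4f}")
```

The key discipline is that the yes-no decision gets made on the in-sample part alone, before the out-of-sample number is looked at.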

“That would be valuable but we might want to go further: …”

Marc - don’t get too elaborate or it will never be built. Keep in mind that the hedge model has taken a good 5 years to implement and it is apparently still not there yet :slight_smile:

I put my basic liquidity/price/MktCap filtering in my custom universes so I don’t see the need for most of the extra stuff that you are suggesting. I think that as standards go, using the custom universe is important when people are looking at other’s models. We need to see that the model performs better than the underlying universe that stocks are drawn from.
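Comparing a model against the universe its stocks are drawn from could look something like the sketch below. It assumes an equal-weighted universe and per-period returns you have exported yourself; the tickers, numbers, and function names are made up for illustration.

```python
def equal_weight_return(stock_returns):
    """Average of the per-stock returns for one period (equal-weighted universe return)."""
    return sum(stock_returns.values()) / len(stock_returns)

def excess_over_universe(model_returns, universe_returns_by_period):
    """Per-period model return minus the equal-weighted universe return."""
    return [
        m - equal_weight_return(u)
        for m, u in zip(model_returns, universe_returns_by_period)
    ]

# Placeholder data: two periods, three hypothetical tickers.
universe = [
    {"AAA": 0.02, "BBB": -0.01, "CCC": 0.05},
    {"AAA": -0.03, "BBB": 0.00, "CCC": 0.03},
]
model = [0.04, 0.01]

excess = excess_over_universe(model, universe)
print(excess)
```

If the excess is not reliably positive, the filtering in the universe may be doing most of the work rather than the ranking or the rules.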

As for using another model for your detailed testing, well you can always do one simulation with some rules disabled then run another one with them enabled. I do that right now to see what effect the rules have. I consider this to be detailed testing and perhaps a tool could be created to make it easier, but I don’t think it is necessary for capturing a benchmark that the public would see.

K.I.S.S.

Steve

cheyenne,
I don’t know what you are talking about. My Best10(S&P1500) R2G model has nearly identical returns in both bull market periods. In the pre-crash period AR = 53.75% with a max DD = -15.4%, and in the post-crash period AR = 53.60% with a max DD = -24.20%. So what worked pre-crash also worked post-crash.


Best10 1-3-2003 to 12-10-07.png


Best10 3-9-09 to 8-8-13.png
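For anyone who wants to check AR and max DD on their own downloaded equity curves, here is a minimal sketch. It assumes AR means compound annualized return and max DD means the worst peak-to-trough decline; the curve below is a placeholder, not the Best10 data.

```python
def annualized_return(equity, years):
    """Compound annual growth rate from the first to the last equity value."""
    return (equity[-1] / equity[0]) ** (1.0 / years) - 1.0

def max_drawdown(equity):
    """Worst peak-to-trough decline, returned as a negative fraction."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = min(worst, value / peak - 1.0)
    return worst

# Placeholder equity curve for illustration only.
curve = [100.0, 120.0, 90.0, 150.0, 135.0]
print(f"AR over 2 years: {annualized_return(curve, 2.0):.2%}")
print(f"max DD: {max_drawdown(curve):.2%}")
```

Running both numbers on each sub-period separately makes it easier to see whether two periods with similar AR also carried similar risk.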

I posted this idea before but seems pertinent:

I do my initial testing over the last five years. I then run an out-of-sample test from 1999 to 2007.

Advantages:

  1. sims run faster (5 years)

  2. large out of sample test

  3. don’t waste time on “great ideas” that just no longer work

While it is true that old ideas may not work now, my limited experience shows that what works now would have worked in the past.

In any case, if it hasn’t worked for 5 years I’m not going to try it now.

Jim

All,

I don’t understand testing on less than the total available period. Sure, we have had very significant changes in approaches to investing due to changes in tools and technology, but a Sim that performs well over very different conditions I feel will more likely perform better in the future when it all changes again than Sims developed over only the 2 market conditions of the last 5 years. It is hard to compare out of sample time periods that have different market conditions. Is the Sim’s different performance because the Sim was over-optimized, or because the market conditions changed?

All of my Sim development is performed over the full time period without market timing. However, I do it all with EvenID =1. When I feel I have a potentially good Sim that I have run through my robustness tests, I change it to EvenID = 0 to test it with 100% out of sample stocks. That way I am testing over the same market conditions over time. If the performance is similar between EvenID =1 and EvenID =0 then there is little significant data mining. If there is a big falloff in performance and/or increase in drawdown, then I have over optimized, and I go back to EvenID = 1 and try additional development by reducing buy/sell rule limits etc.

If EvenID =1 and EvenID =0 have similar results then I re-run it on all stocks. Every time, I get improved performance over the 2 cases with limited stocks. The last changes I might make, depending on the type of Sim I am testing, is to add my independently developed market timing rules and tighten up Rank buy and/or sell rules if the Sim uses them, since the Sim is now buying higher ranked stocks. That is the background to all of my Sim development since I was out of the market in 2008.
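The even/odd holdout described above can be sketched outside the platform as follows. This assumes EvenID simply flags whether a stock's numeric ID is even; the helper names and the 25% falloff tolerance are arbitrary choices for illustration, not anything the platform defines.

```python
def split_even_odd(universe):
    """Split {stock_id: value} into an even-ID half and an odd-ID half."""
    even = {sid: v for sid, v in universe.items() if sid % 2 == 0}
    odd = {sid: v for sid, v in universe.items() if sid % 2 == 1}
    return even, odd

def relative_falloff(dev_result, holdout_result):
    """Fractional drop in performance from the development half to the held-out half."""
    return (dev_result - holdout_result) / abs(dev_result)

def looks_overfit(dev_result, holdout_result, tolerance=0.25):
    """Flag a sim whose held-out result falls off by more than `tolerance` (arbitrary threshold)."""
    return relative_falloff(dev_result, holdout_result) > tolerance

# Hypothetical stock IDs mapped to tickers.
even_half, odd_half = split_even_odd({101: "AAA", 102: "BBB", 103: "CCC", 104: "DDD"})
print(sorted(even_half), sorted(odd_half))

# Placeholder annual returns: developed on one half, re-run on the other.
print(looks_overfit(dev_result=0.40, holdout_result=0.36))
print(looks_overfit(dev_result=0.40, holdout_result=0.20))
```

Because both halves cover the same dates, a big falloff points at over-optimization rather than at a change in market conditions.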

I also use that approach with ranking system development. I create a custom universe for running the ranking system Performance with my minimum liquidity limits and EvenID =1. After I think I have a good ranking system for the approach I am working on, I change the universe to EvenID = 0. If that is OK I try it without EvenID.

Oh shoot, I just gave you some of my secrets! [:))]
Denny :sunglasses:

Denny,

I try to keep my posts a little short (still long, I know). I was going to say that before I’m done I use EvenID =1 and EvenID =0 over the Max period (you taught me this, thank you). I also go to 15, 20 and 25 stock positions, again over the Max period. I also like RankPos > 5. I would not trade a port unless all of these, present and before 2008, looked good.

Before I’m done, I’m way over-optimized but not too far from that 1999-2008 pure out-of-sample test that looked good (or I would have stopped then).

Edit: The optimizer is also very helpful for testing robustness by showing the effect that changing values has. As you know, sometimes small value changes can affect results greatly; other times pretty large changes have little effect.
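A rough sketch of that kind of sensitivity check, done by hand rather than in the optimizer, might look like this. `toy_sim` is a made-up stand-in for re-running a sim with one rule value changed; nothing here calls the actual platform.

```python
def sensitivity(run_sim, base_value, deltas):
    """Run `run_sim` at base_value + each delta and report the change versus the base result."""
    base = run_sim(base_value)
    return {d: run_sim(base_value + d) - base for d in deltas}

def toy_sim(rank_threshold):
    """Made-up stand-in for a sim: result peaks at a threshold of 95 and decays around it."""
    return 0.30 - 0.002 * (rank_threshold - 95) ** 2

changes = sensitivity(toy_sim, base_value=95, deltas=[-2, -1, 1, 2])
largest_swing = max(abs(v) for v in changes.values())
print(changes, largest_swing)
```

A value whose neighbors give wildly different results is a warning sign; a broad plateau is more reassuring.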

Thanks.

Jim

@DennyHalwes
I think it depends on the type of portfolio you are trying to create. By using EvenID, you are essentially splitting your universe in half. I can see this working to test 5-stock portfolios. But if you are creating a portfolio made for more than 10 stocks, you will have to lower your criteria or you won’t be able to fill a complete portfolio, plus you won’t enjoy the specific risk reduction of holding a decent quantity of stocks. Other than that, I agree with you: I don’t understand people who test on less than the total available period. It doesn’t make sense to me unless you are just looking for something specific (bear markets, for example).

Q,

When I use EvenID I assume that in the end, I will increase my Rank buy and sell rules so that the Sim passes about the same number of stocks and sells them at a higher rank value. That approach tends to maintain the average days held and increases the performance. It doesn’t matter if it is a 5 stock or a 20 stock Sim, it works the same.

Of course if you are not using rank buy or sell rules, you would have to vary other rules to maintain similar number of stocks and holding time. That may negate the previous robustness testing you have done.

Denny :sunglasses:

Stitts:

I really like your idea of using an equally weighted custom universe as a benchmark.  Would you write this up as a feature request so I can vote for it?

Bill

This is one of the best threads I’ve seen in P123 in a long time. It is sure to have the greatest impact on a number of subscribers’ out-of-sample (i.e., real) profitability. Thanks for your analysis and kickoff, Tom!

In light of this thread, I want to push forward (again - sorry) a key feature that we all desperately need, especially with the introduction of R2G and in light of Tom’s eloquent presentation of data mining risks:

https://www.portfolio123.com/vote.jsp?poll=888

Feature N: Convert Ports to ETFs.

Imagine if you could track ports like etfs/stocks. Imagine if you can rank these ETF-ized ports based on # buy/sell rules, #elements ranked (without exposing rules/ranking elements), port volatility, port trendiness, port correlation to a benchmark/each other, the valuation of the ports (Hi vs low PE, PB, PS), across each other, and to themselves over time (remember the Goldman Sachs study called Quantcentration -search in prior posts if you want to download - BTW they took this study off the internet).

I have to do my ETF-ized analysis of P123 ports offline using Amibroker and the downloaded P123 port equity curves as “ETFs”, and quite frankly have not kept it up due to time constraints. Also, without valuation information (PB, PS, PE, other) it is fairly equivalent to actual ETF equity curve trading. Consequently, I have found that trading ETFs based on simple momentum, volatility and correlation rules (see CXO Advisor for various approaches) yields low but dependable profits.
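As a rough idea of what treating downloaded equity curves as “ETFs” involves, here is a minimal sketch of a momentum ranking plus a correlation check. The port names and curves are placeholders, and the lookback is an arbitrary choice; this is not the Amibroker setup described above.

```python
from math import sqrt

def period_returns(curve):
    """Simple period-over-period returns of an equity curve."""
    return [b / a - 1.0 for a, b in zip(curve, curve[1:])]

def momentum(curve, lookback):
    """Total return over the last `lookback` periods."""
    return curve[-1] / curve[-1 - lookback] - 1.0

def correlation(xs, ys):
    """Pearson correlation of two equal-length return series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Placeholder equity curves for two hypothetical ports.
curves = {
    "port_a": [100.0, 102.0, 105.0, 110.0],
    "port_b": [100.0, 99.0, 101.0, 100.0],
}

# Rank ports from strongest to weakest recent momentum.
ranked = sorted(curves, key=lambda name: momentum(curves[name], lookback=3), reverse=True)
corr = correlation(period_returns(curves["port_a"]), period_returns(curves["port_b"]))
print(ranked, round(corr, 3))
```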

I stay plugged into P123 because of the great community discussions, and because I know this product is going in the right direction (hats off to Marco and team).

But I feel that we (I) need to stop looking for holy grail, George Soros/Paul Tudor Jones beating, data mined solutions. I haven’t achieved their fame nor fortune nor track record no matter how good the port(s) I built and traded. And as Tom points out - these fantastic (fantasy?) ports can flame out quickly and flame badly.

True, it appears that some micro cap, low scale solutions may be out there…

Here would be a real cool analysis:

Rank out of sample port performance based on the port number of rules/ranking elements.
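One way to run that analysis, once the data exists, would be a rank correlation between each port's rule count and its out-of-sample return. The sketch below is only the statistic itself (Spearman, assuming no tied values); the inputs would have to come from real port data, which is not included here.

```python
def ranks(values):
    """Rank positions (1 = smallest) for a list with no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation of two equal-length lists, assuming no ties."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Usage, with hypothetical inputs you would supply yourself:
#   spearman(rule_counts, out_of_sample_returns)
# A value near -1 would mean more rules went with worse out-of-sample results;
# a value near 0 would mean rule count told you little.
```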

Anyone want to put some bets on the result of this analysis? Can anyone think of other analyses on ports that would be good to have? We can do them if we have a feature where we track ports like ETFs, along with attributes (mentioned earlier). And we will have a true market edge - at least for a while…

My interest is in looking at the basics that work, when they work and when they don’t (Tortoriello). I don’t think digging deeper and deeper for that nugget, or nugget combo, is going to work that well.

One last point. A group put a lot of effort into tracking factor/factor combos that did well. A lot of work, and a lot of offline analysis. Imagine storing each combination (they are unlimited, but I think a few dozen key combinations would be telling) as ETFs and simply ranking them over time.

If you agree that we need to ETF-ize our ports, and have insights at the valuation and “complexity” levels, please vote for feature N in the poll.

If you disagree, please share why.

And again Tom, thanks for your eloquent analysis. This is the level of quality that makes this site an incredible value.

Carl

Carl - just wondering if this feature request is same as your ETF’izing ports.

Steve

https://www.portfolio123.com/feature_request.jsp?view=open&cat=-1&featureReqID=769

Hi Steve,

Yes, I do believe that is the feature request that is referred to by Item N.

I just took a look at open feature requests and saw that I have a few feature requests that focus on providing the valuation of the ports (PE, PB, PS, etc of the weighted port holdings point-in-time). In the write-up above I also toss in the need to know the “data mining characteristics” of the ports, such as number of rules/ranked elements - this need is crucial given the R2G concerns and heavy risk (presumption?) of data mining.

I think the goal would be to have the ETF (your feature) with available valuation/other parameters (easily calculated from the point in time holdings & weightings and the port parameters). I would see the equity-curve ETF feature as a valuable, quick and urgent first step.

Thanks for bringing this feature up. I thought I had voted for it, and now I have!

Carl

My 2 cents for picking an R2G which hopefully holds up to its simulation performance:

  1. Launched at least a year ago and still outperforming the index (not many models are that ‘old’ to begin with…);
  2. Simulation must be from 1999 onwards and not cutting out the 2000 tech bubble;
  3. Market timing must be limited to a maximum of 10% over the entire simulation period;
  4. No hedging in form of shorting.
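The four criteria above could be applied as a simple filter along these lines. The `Model` fields are hypothetical (this is not an actual R2G data format), and reading the 10% limit as the share of the simulation period spent under market-timing rules is an assumption.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Model:
    name: str
    launched: date
    beats_index_since_launch: bool
    sim_start: date
    timing_fraction: float  # assumed meaning: share of the sim period under market timing
    uses_shorting: bool

def passes(model, today):
    """Apply the four screening criteria to one model."""
    return (
        (today - model.launched).days >= 365          # 1. live for at least a year
        and model.beats_index_since_launch            #    and still ahead of the index
        and model.sim_start <= date(1999, 12, 31)     # 2. sim covers the 2000 tech bubble
        and model.timing_fraction <= 0.10             # 3. limited market timing
        and not model.uses_shorting                   # 4. no short hedge
    )

# Placeholder models for illustration only.
candidates = [
    Model("model_a", date(2013, 1, 1), True, date(1999, 1, 2), 0.05, False),
    Model("model_b", date(2013, 1, 1), True, date(1999, 1, 2), 0.05, True),
    Model("model_c", date(2014, 3, 1), True, date(2003, 1, 2), 0.00, False),
]
shortlist = [m.name for m in candidates if passes(m, today=date(2014, 6, 1))]
print(shortlist)
```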

This leaves fewer than 20 R2G trader models for me to choose from as of today. And most of them are already fully subscribed…

Lastly, every bear market has a different cause and hits some industries harder than others. It’s great to be able to simulate from 1999 onwards, but that still only covers a mere two major downturns. Quite a few more will be required to build more robust models.

Happy investing!

This is a really old thread; before the latest post, the prior one was almost a year and a half ago. But I think it’s great that it was revived. The PowerPoint Tom submitted with his initial post is worthy of review, and what he demonstrated will soon be revived: As part of the new chapter on testing that I’m adding to the new A to Z guide, I included a case study along the same lines; a model that tests wonderfully, survives a bunch of robustness tests, and then tanks out of sample in ways not dissimilar to what Tom’s PowerPoint demonstrated.

But on reviewing this thread, I notice a very glaring omission from the discussion. What was the strategy? It looks like most of it hinged on what was in the ranking system and about that, we’re given some generalities, but we are not shown the details. Tom seemed to pin a lot of the blame for the demo model’s performance on its affinity for illiquid stocks, and that may, indeed, have been a factor. But we need to start thinking and talking more about the full substance of strategies. Often, a case study like what Tom presented can be spotted ahead of time by a knowledgeable investor just by seeing the model - and the dangers can be spotted even before going live.

Except for a simple understanding that if it looks too good to be true, it probably is (hence our long incubation period on super-duper R2G sims), you really can’t count much on so-called robustness tests to steer you away from trouble. There are no statistical short cuts to developing strategies that are just, plain good. And I believe the best things we can do for R2G, now that Marco is working on platform changes that will really make it harder for curve-fit models to overwhelm better ones in user search behavior, is to support R2G designers in developing good strategies that enhance their ability to deliver appropriate live results. The WACC posts I submitted are a first housekeeping step toward my goal of putting a lot more fundamental educational content up on the forums (WACC by itself doesn’t do much but it’s needed to calculate other things that are important). My goal is to combat curve fitting by giving designers better things to offer and giving subscribers better things to demand.

I know I still have that so-far unposted backtesting chapter draft. I’ll get back to it after I post the last WACC piece, later today or tomorrow. And to tell you the truth, I am editing the backtesting chapter carefully because I’m pretty sure it’ll generate some sharp reactions. :slight_smile:

Along these lines, one thing that I have been personally bitten by (my own fault) very recently is the tendency for ranks from different developers to all gravitate to the hottest, latest sector (i.e., energy in this case; is Healthcare next?). I was not paying attention and had completely overweighted in energy (I use 6 ports), and of course they all got whacked recently. After this happened, I looked at the descriptions of the R2Gs that I am using, and they either did disclose (see example below), did not mention anything about limiting Secweight, or openly said that sometimes they had to overweight a sector in order to get performance (I appreciated the honesty of that comment). My guess is that many R2Gs’ ranks contain similar factors, and therefore many can become overweight in the same sectors. So one can think they are getting diversification from using many ports, but unless one puts the R2Gs into a book and looks at allocation (which I don’t use because I have too many), you can get blindsided. (I ended up making one large manual port that includes my 70 stocks. I then look at allocation. It is a pain but it works.)
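The manual "one large port" check described above amounts to pooling the holdings of every port and totaling by sector. A minimal sketch, assuming you can export each port's holdings as (ticker, sector, market value); the tickers, values, and the 30% limit are placeholders.

```python
from collections import defaultdict

def sector_weights(ports):
    """Pool holdings from several ports and return each sector's share of the total value."""
    totals = defaultdict(float)
    for holdings in ports:
        for _ticker, sector, value in holdings:
            totals[sector] += value
    grand_total = sum(totals.values())
    return {sector: value / grand_total for sector, value in totals.items()}

def overweight(weights, limit):
    """Sectors whose combined weight exceeds `limit`."""
    return sorted(s for s, w in weights.items() if w > limit)

# Placeholder holdings for two hypothetical ports.
ports = [
    [("AAA", "Energy", 5000.0), ("BBB", "Healthcare", 5000.0)],
    [("CCC", "Energy", 6000.0), ("DDD", "Technology", 4000.0)],
]
weights = sector_weights(ports)
flagged = overweight(weights, limit=0.30)
print(weights, flagged)
```

Each port can look balanced on its own while the combined book is still concentrated, which is the blind spot this check is meant to catch.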

A way to limit drawdowns is through diversification; so I would like to request R2G developers to be more aware of this and open about how they manage sector overweighting. To me, this should be a best practice to disclose.

I have gotten to the point now that I won’t use an R2G unless the developer talks about it. As a positive example, StockMarket student (Steve Auger) explicitly talks about this in his description for his ‘Stitts Wealth Starter Model - PRussell 3000’ R2G. Good.

David - I think there are a few issues going on. The PRussell Wealth Starter model did well but I have other models along the same theme / sector weighting / ranking system that got hammered. There are two reasons I can find for the PRussell model’s success. One is the simplicity of the buy/sell rules. I wasn’t out to squeeze every last ounce of backtest performance out of it. The concept was “easy to use” for beginners. The second reason is that there is a timing component which causes only stocks from the defensive sectors to be bought during the summer and fall months. Whether this market timing holds up in the future, well, who knows…

What I am finding is that there is more to it than restricting sector weight. What is really needed is diversity of ranking systems, or Marc’s blue chip model, one or the other :slight_smile: One can restrict sector weight but I find that stocks outside the energy sector (for example) still drop at the same time as in the sector crashing, probably because the ranking system is still choosing the same sort of fundamentals, if this makes sense. We can’t handle multiple ranking systems at the model level, but we can at the book level. But we don’t have tools to allow analysis of a book with multiple ranking systems that have a low correlation, excluding buy/sell rules. If we could separate the ranking system from the buy/sell rules then that would be half the battle.

Another point is that if I attempt to create a model with weight restricted to 10% - 13% per sector then my backtest results will almost always be very poor. It is a rare case that I have been able to develop a model with equal contribution from all sectors. It is like trying to fight with both hands tied behind your back. This would only work if all developers were forced into fighting with no hands which I doubt would ever happen.

Steve