I just wrote a very long post on this that got deleted - I wish P123 would autosave replies!
(All of the below numbers are from simple tests run with the ranking system).
Alan,
Thanks for sharing. Over the past 2-3 years, I have read many papers in this area.
EDIT (I attached the wrong paper)
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2249314
This is a paper by ‘smart people.’ They have another paper showing that many of the 360 factors in peer-reviewed journals likely aren’t statistically robust once the number of trials is taken into account. How many of the P123 factors are?
For example, there is a huge debate in the academic literature as to whether the ‘small cap’ effect is real if (a) January is removed, (b) tiny stocks (penny stocks, etc.) are eliminated, and (c) we only look post-1980. It may not be ‘robust.’
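To see the scale of the multiple-testing problem, here is a generic sketch (this is a simple Bonferroni-style correction, not the paper’s actual method, and the trial counts are illustrative): the t-stat a factor must clear rises sharply with the number of trials assumed.

```python
# Illustrative only: how the significance bar moves as the assumed number
# of 'trials' on the data grows. Bonferroni is the crudest correction;
# the paper above uses more sophisticated machinery.
from statistics import NormalDist

def bonferroni_t_threshold(alpha=0.05, n_trials=1):
    # Two-sided per-trial threshold after splitting alpha across n_trials.
    return NormalDist().inv_cdf(1 - alpha / (2 * n_trials))

for n in (1, 100, 10000):
    print(n, round(bonferroni_t_threshold(n_trials=n), 2))
# With 1 trial the familiar ~1.96 bar applies; with thousands of
# (uncounted) trials the bar climbs well past 4.
```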
However, their method for ‘discounting’ Sharpe ratios also likely won’t work.
POINT 1: The idea that you can ever count trials is wrong.
All of these papers miss a fundamental issue. Georg gives a simple example of this. It’s impossible to count the history of an idea.
Take academic research. There are tens of thousands of researchers and PhD students running ‘backtests’ on data every year - and this has been going on for decades, increasingly with computers. Those who find ‘meaningful’ patterns then team up with a mentor and seek academic publication. A near-infinite number of ‘trials’ has been run on the data, with only the ‘successful’ factors being published. This is massive data ‘overfitting.’ (The authors above talk about this.)
Given that ‘counting backtests’ is impossible unless we can perfectly trace the origins of our ideas, we have to assume that every ‘factor’ is the result of infinite permutations having been run on the historical data.
POINT 2: Discounting Sharpe ratios is also wrong.
My time on P123 has taught me that ‘discounting Sharpe ratios’ doesn’t work. The most overfit backtests often have the highest Sharpes to begin with and still (sometimes) pass all kinds of ‘sensitivity’ tests. ‘Math’ is good, but it’s not enough.
So, we should just assume that every system / variable has been the product of an infinite number of trials, and then ask ourselves what ‘parameter sensitivity’ checks we can run after the fact to gain an understanding of the system’s sensitivity to parameter shifts / changes.
POINT 3: Running ‘batch permutations’ on tests - especially on black boxes to be sold to others - is the minimum thing to do.
Simple ‘trading day’ permutations will usually not hold up to parameter shifts. If a 5% gain over 20 days works, but 4% over 19 or 6% over 21 doesn’t, I likely don’t have a ‘robust’ factor. Some highly overoptimized systems will break in these tests. Some won’t.
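That perturbation check can be mechanized. Here is a minimal sketch (the backtest function is a made-up placeholder standing in for a real sim run; the numbers and the 5-point "cliff" threshold are assumptions for illustration):

```python
# Hypothetical parameter-perturbation sweep around a '5% gain over 20 days'
# rule. A robust rule should degrade gracefully at neighboring settings;
# a cliff at the exact base parameters suggests overfitting.
import itertools

def backtest_ar(gain_pct, days):
    # Placeholder for a real backtest run; returns annualized return in %.
    # This fake rule only 'works' at exactly (5, 20) - the overfit case.
    return 15.0 if (gain_pct, days) == (5, 20) else 6.0

base = (5, 20)
grid = list(itertools.product([4, 5, 6], [19, 20, 21]))
results = {p: backtest_ar(*p) for p in grid}

neighbors = [r for p, r in results.items() if p != base]
fragile = results[base] - max(neighbors) > 5.0  # big cliff => likely overfit
print("base AR%:", results[base], "fragile:", fragile)
```

In a real run, `backtest_ar` would call the simulation engine, and the grid would cover every tunable parameter, not just two.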
Many of the most ‘statistically robust’ and highest-Sharpe systems will have the highest turnover. They will be very sensitive to slippage / trade execution - and to others continuing to trade them. Those will be the critical inputs / stress points.
Each system will have its own stress points.
POINT 4: More factors can be good or bad
If I start with the ADT100 $1MM universe and close>$2 and just rebalance 100 stocks annually, I earn 8.2% AR - which beats the SP500 handily.
If I choose a ranking system to do this, and I pick 2 random factors, I won’t be too badly off if I don’t do much else.
My annual returns will equal
a) the Universe returns
minus
b) slippage and trading costs
minus
c) taxes
plus / minus
d) tracking error (which is huge for a low number of holdings and low turnover, but falls with high holdings and/or high turnover)
plus/minus
e) timing / hedge modules
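The decomposition above is just arithmetic, so it can be written down directly. All of the numbers below except the 8.2% universe AR (from the test above) are invented placeholders:

```python
# Back-of-envelope net return decomposition, per the list above.
# Only universe_ar comes from the test in this post; the rest are assumed.
universe_ar  = 8.2   # % AR of the ADT100 $1MM universe (from above)
slippage     = 1.0   # % lost to slippage and trading costs (assumed)
taxes        = 1.5   # % lost to taxes (assumed)
tracking_err = -0.5  # % tracking error vs. the universe (can be +/-)
hedge_effect = 0.0   # % timing / hedge module contribution (can be +/-)

net_ar = universe_ar - slippage - taxes + tracking_err + hedge_effect
print(f"expected net AR: {net_ar:.1f}%")  # prints 5.2% with these inputs
```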
Most R2G’s are so complex, they are almost impossible to evaluate from the outside. Most very-small-holdings, high-turnover systems will perform well below their backtests… some won’t, but they are impossible to select beforehand. That doesn’t make them ‘bad’, but they are hard to allocate large amounts of money to and are very sensitive to very small parameter changes (i.e. assumed slippage or fill rates on the ‘best’ stocks).
POINT 4B: Sometimes more factors are good and sometimes they aren’t
More factors can be used to increase overoptimization and the likelihood of system failure, or to reduce it. Nearly every designer likely thinks they are doing the latter. Are they?
Let’s say I have 2 factors. Factor 1 is a true factor with AR% of 18% for the top 5% of this universe and 14.5% for the top 20% of this universe.
Factor 2 is random. In this case, the top 5% bucket will fall to 14% (about a 20% reduction in ‘true AR’) and the top 20% bucket will fall to 13% (about a 10% fall in AR).
Let’s say now I have 9 factors. 8 are random. One has a true AR% of 18% for top 5% of the uni and 14.5% for top 20% of uni.
Now, the top 5% bucket falls to 11% or so and the top 20% bucket to 12%.
In both cases, trading only the top 5% in a well-designed system can still deliver more alpha.
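The dilution effect in these two examples can be reproduced with a quick Monte Carlo. This sketch is illustrative only - the signal strengths, noise levels and bucket sizes are assumptions, not the actual ranking-system numbers above - but the pattern (top-bucket edge shrinking as random factors are averaged in) is the same:

```python
# Monte Carlo sketch: averaging random factors with one true factor
# dilutes the top bucket's edge. All magnitudes are made up.
import random, statistics
random.seed(0)

def top_bucket_mean(n_random, n_stocks=1000, top_frac=0.05, trials=100):
    means = []
    for _ in range(trials):
        true_f  = [random.gauss(0, 1) for _ in range(n_stocks)]
        returns = [f + random.gauss(0, 3) for f in true_f]  # noisy payoff
        noise   = [[random.gauss(0, 1) for _ in range(n_stocks)]
                   for _ in range(n_random)]
        # Equal-weight composite of the true factor plus n_random noise factors.
        comp = [(true_f[i] + sum(n[i] for n in noise)) / (1 + n_random)
                for i in range(n_stocks)]
        order = sorted(range(n_stocks), key=lambda i: comp[i], reverse=True)
        k = int(n_stocks * top_frac)
        means.append(statistics.mean(returns[i] for i in order[:k]))
    return statistics.mean(means)

pure    = top_bucket_mean(0)  # true factor alone
one_rnd = top_bucket_mean(1)  # one random factor mixed in
eight   = top_bucket_mean(8)  # eight random factors mixed in
print(round(pure, 2), round(one_rnd, 2), round(eight, 2))
# The top-bucket edge shrinks as the random-factor count grows.
```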
So… given that we can assume that many of our factors are likely random, or will behave as if they are random in the out-of-sample period, it makes sense to build some redundancies into every system (at the ranking, universe and/or buy and sell rules - I use them all).
POINT 5: Every ‘factor’ has periods when it works and periods when it won’t. When it doesn’t it can behave negatively or behave as if it’s random. Even if it’s a ‘true factor.’
I may have heard that ‘value works’ and run one test that confirms it - but ‘value works’ came from thousands of backtests over decades confirming it (or curve fitting that found it, followed by a rush to copy it).
We should attempt to understand why it works, but should also be skeptical of our own stories.
For example, maybe value works initially because people over-discount unattractive companies that no one wants. One person learns that baskets of them are good buys because they sell at huge discounts to their ‘intrinsic values.’ She makes millions. But over time more people learn this, and ‘value’ multiples get elevated by lots of people (or computers) trading them because ‘value always works’ relative to growth. At that point value stocks no longer have any real value (they are overpriced relative to intrinsic company value), and value won’t keep working unless new naive value buyers and volume keep coming along. So ‘value works’ will keep working if enough people follow it, even though the real intrinsic-value edge relative to growth is no longer there - but it won’t work if people stop believing in it.
There are countless ‘factors’ we can run these thought experiments for, but nearly all (if not 100%) of them are giant ‘shell games.’
POINT 6: Better book backtesting and ‘rule based’ sub-system weighting/selection functionality is the ONLY reliable fix - it’s how the best multi-asset and multi-system managers have solved this problem for decades. P123 needs a huge upgrade here.
We all (if we are managing money) therefore need to ask how best to construct mechanical systems for backtesting and, in real time, for building more robust portfolios of systems with low/negative ‘peak stress event’ correlations. Having the book feature allow us to adjust weights dynamically based on a ‘basket of underlying’ systems and their trailing correlations, volatility and behavior would be a great starting point.
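One of the simplest rule-based weighting schemes in that family is inverse trailing volatility. This is a minimal sketch of the idea, not a proposal for the exact P123 feature - the return series are invented, and real book logic would also fold in trailing correlations:

```python
# Hedged sketch: weight each sub-system by inverse trailing volatility,
# so calmer systems get more capital. Return series below are invented.
import statistics

def inverse_vol_weights(return_series, lookback=12):
    vols = [statistics.stdev(r[-lookback:]) for r in return_series]
    inv  = [1.0 / v for v in vols]
    total = sum(inv)
    return [w / total for w in inv]

sys_a = [0.02, -0.01, 0.03, 0.01, -0.02, 0.02] * 3   # calmer system
sys_b = [0.08, -0.06, 0.07, -0.05, 0.09, -0.07] * 3  # wilder system
w = inverse_vol_weights([sys_a, sys_b])
print([round(x, 2) for x in w])  # the calmer system gets the larger weight
```

A production version would recompute the weights each rebalance and could swap in risk-parity or correlation-penalized weights without changing the structure.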
Being able to run backtests on a PIT, no-survivorship-bias R2G ‘graveyard’ would be another great step.
I have been beating this drum for years, because it’s the only way to better manage my family’s money. Please vote for this and/or lobby for it.
POINT 7: Unique, proprietary data sets are one of the most reliable sources for ‘alpha’. Alpha is limited by nature.
Upgrading ‘data pack’ add-ons, and/or finding a way for PIT data to be sold by third-party vendors after it passes PIT tests, is something P123 should look at very seriously and the community should get behind. Please vote for this.
Please also vote for ‘user defined’ short-term and long-term tax rates, so that we can model things like tax-loss harvesting and look at sim’s with after-tax returns.
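What ‘user defined tax rates’ would let us model is simple to state. A minimal sketch (all rates, gains and the harvest-offsets-short-term-first assumption are hypothetical inputs, not how P123 would necessarily implement it):

```python
# Hypothetical after-tax calculation with user-defined ST/LT rates and a
# tax-loss-harvesting offset. All numbers here are illustrative.
st_rate, lt_rate = 0.37, 0.20       # user-defined short/long-term rates
st_gain, lt_gain = 4000.0, 6000.0   # realized gains over the year ($)
harvested_loss   = 1500.0           # assumed to offset ST gains first

st_taxable = max(st_gain - harvested_loss, 0.0)
tax = st_taxable * st_rate + lt_gain * lt_rate
after_tax_gain = st_gain + lt_gain - harvested_loss - tax
print(f"tax owed: ${tax:.2f}, after-tax gain: ${after_tax_gain:.2f}")
```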
Everything else - including counting ‘backtests’ or relying on ‘time tested’ fundamentals - misses the point. The market is a multi-player game long divorced from what businesses would ‘buy’ and ‘sell’ at, and we need other traders following rules similar to ours for stocks to be driven higher (or lower if we’re shorting) in predictable ways. We can’t be sure of any single system, so we need to combine large numbers of them in a rule-based way to create reliable return streams.
Best,
Tom