I’ve often wondered how much of my backtesting results is simply due to randomness. Well, I’ve figured out a way to quantify it with a simple test (see below). These are my results; I’m not terribly confident in them, but they feel right to me. The number on the right is a rough margin of error caused by randomness. That margin of error will probably get a lot larger the farther your results stray from the benchmark.
1000 stocks bought and sold: ±26% annually [no, that’s not a typo—a few crazy stocks can easily do this with such a small sample]
2000 stocks: ±3.5% annually
3000 stocks: ±2.3% annually
4000 stocks: ±1.8% annually
5000 stocks: ±1.3% annually
7500 stocks: ±1.25% annually
10,000 stocks: ±0.85% annually
20,000 stocks: ±0.6% annually
30,000 stocks: ±0.5% annually
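If your trade count falls between the rows above, a quick interpolation of the table gives a rough estimate. This is just a convenience sketch over my numbers (the function name and the log-log interpolation are my own choices, not anything P123 provides):

```python
import numpy as np

# Trade counts and margins of error (in % per year) from the table above.
TRADES = np.array([1000, 2000, 3000, 4000, 5000, 7500, 10000, 20000, 30000])
MOE = np.array([26.0, 3.5, 2.3, 1.8, 1.3, 1.25, 0.85, 0.6, 0.5])

def margin_of_error(n_trades: float) -> float:
    """Estimate the +/- margin of error (% per year) for a backtest with
    n_trades stocks bought and sold by interpolating the table above in
    log-log space. Values outside the table are clipped to the nearest row."""
    n = np.clip(n_trades, TRADES[0], TRADES[-1])
    return float(np.exp(np.interp(np.log(n), np.log(TRADES), np.log(MOE))))

print(margin_of_error(2500))  # about 2.8 (%/yr) for 2,500 stocks bought and sold
```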
Just for reference, here are the numbers of stocks bought and sold in some common backtests (a sketch of the arithmetic follows the list):
5-yr backtest, 15 stocks, 4-wk rebalance: 975 stocks
7-yr backtest, 20 stocks, 4-wk rebalance: 1820 stocks
5-yr rolling backtest, 15 stocks, any rebalance: 3900 stocks
5-yr simulation, 15 stocks, 1000% turnover: 750 stocks
10-yr simulation, 20 stocks, 1200% turnover: 2400 stocks
5-yr ranking, 20 buckets, R3000, 4-wk rebalance: 9750 stocks per bucket
5-yr ranking, 100 buckets, R3000, 4-wk rebalance: 1950 stocks per bucket
10-yr ranking, 50 buckets, SP500, 4-wk rebalance: 1300 stocks per bucket
10-yr ranking, 20 buckets, SP500, 3-mo rebalance: 1000 stocks per bucket
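All of the counts above come from the same simple arithmetic: stocks held times the number of rebalances, or stocks held times annual turnover times years. Here’s a minimal sketch of it (the function names are mine):

```python
def trades_fixed_rebalance(years: float, stocks: int, weeks_per_rebalance: int) -> int:
    """Backtest that rebalances every N weeks.
    e.g. 5 yr, 15 stocks, 4-wk rebalance -> 15 * 65 = 975."""
    rebalances = years * 52 / weeks_per_rebalance
    return int(stocks * rebalances)

def trades_rolling(years: float, stocks: int) -> int:
    """Rolling backtest: a new basket starts every week regardless of holding period.
    e.g. 5 yr, 15 stocks -> 15 * 260 = 3900."""
    return int(stocks * years * 52)

def trades_simulation(years: float, stocks: int, annual_turnover_pct: float) -> int:
    """Simulation with a given annual turnover.
    e.g. 5 yr, 15 stocks, 1000% turnover -> 15 * 10 * 5 = 750."""
    return int(stocks * (annual_turnover_pct / 100) * years)

def trades_per_bucket(years: float, universe: int, buckets: int, weeks_per_rebalance: int) -> int:
    """Ranking test: universe split into equal buckets, rebalanced every N weeks.
    e.g. 5 yr, R3000, 20 buckets, 4-wk -> 150 * 65 = 9750 per bucket."""
    return int((universe / buckets) * years * 52 / weeks_per_rebalance)

print(trades_fixed_rebalance(7, 20, 4))    # 1820
print(trades_rolling(5, 15))               # 3900
print(trades_simulation(10, 20, 1200))     # 2400
print(trades_per_bucket(10, 500, 50, 4))   # 1300
```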
In my opinion, any backtest that doesn’t have more than 2000 stocks bought and sold is basically meaningless. The margin of error is too big.
For a backtest to tell you that a difference of 1% per annum between two options is meaningful, you need a margin of error of ±0.5%, which means 30,000 stocks bought and sold. A rolling backtest of the top 60 stocks over a 10-year period will give you that; for the same accuracy over a 5-year period, you’ll need a selection of 120 stocks at a time. Another option is to run, say, six different rolling backtests, each with a different number of stocks and a different holding period. That’s what I do, along with an Excel spreadsheet that converts rolling backtest results into annualized excess returns.

If you’re using the performance button on a ranking system and you break the Russell 3000 down into 10 buckets, rebalancing every four weeks over fifteen years, you’ll be totally fine: that’s 58,500 stocks per bucket. If you want that degree of accuracy for a 15-stock simulation, you’d better use the rolling simulation that P123 rolled out last year. A simple 15-stock simulation with 1000% annual turnover over 15 years is only 2,250 stocks bought and sold, so a rule that gives you a 5% annual improvement in your sim is basically meaningless. One that gives you an 8% improvement, though, is probably a keeper.
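To make that last example concrete, here’s the arithmetic as a quick script. The ±3% figure is read off my table for roughly 2,250 stocks bought and sold, and the “twice the margin of error” rule is my reading of the 1%-versus-±0.5% reasoning above, not a formal test:

```python
# 15-stock sim, 1000% annual turnover, 15 years of history.
trades = 15 * 10 * 15          # 15 stocks * 10x turnover * 15 yrs = 2,250
margin = 3.0                   # +/- % per year, read off the table above

for improvement in (5, 8):
    # Trust a difference only if it is at least twice the margin of error.
    verdict = "probably a keeper" if improvement >= 2 * margin else "could be noise"
    print(f"{improvement}% improvement with a ±{margin}% margin: {verdict}")
```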
Here’s my method for calculating these margins of error. I used a ranking system consisting of one factor only: “Random.” I tested the Russell 3000 over a ten-year period using the rolling backtest with a 52-week holding period. I varied the number of stocks in my basket, tested each basket size ten to twenty times, and noted the largest and smallest average results. I then compared those to the average result of the entire Russell 3000 and took the biggest difference as my margin of error. Not the most scientific method, I admit. But I couldn’t come up with a better one.
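If you want to reproduce something like this outside of P123, here’s a rough sketch of the same idea against a generic matrix of annual stock returns. The returns data here are random placeholders, and picking one basket per year is a simplification of the rolling backtest I actually used, so treat it as an illustration of the procedure rather than a replica:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder universe: annual returns (in %) for 3,000 stocks over 10 years.
# In the real test this was the Russell 3000 via P123's rolling backtest
# with a one-factor "Random" ranking system.
returns = rng.normal(loc=10, scale=40, size=(10, 3000))

def random_basket_spread(returns: np.ndarray, basket_size: int, trials: int = 20) -> float:
    """Pick basket_size stocks at random each year, average their returns
    over the whole period, repeat `trials` times, and return the largest
    deviation from the all-stock average (the margin of error above)."""
    universe_avg = returns.mean()
    worst = 0.0
    for _ in range(trials):
        yearly = [rng.choice(year, size=basket_size, replace=False).mean() for year in returns]
        worst = max(worst, abs(float(np.mean(yearly)) - universe_avg))
    return worst

for basket_size in (15, 30, 60):
    # The spread shrinks as the basket (and hence the trade count) grows.
    print(basket_size, round(random_basket_spread(returns, basket_size), 2))
```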
All this, needless to say, tells you nothing about out-of-sample performance. It simply tells you how reliable your in-sample results are.