Reasons for 'Big Gap' in results between a ranking system and a simulated strategy

Greetings all,

Usually when I build a (ranking) strategy I run a ‘quick and dirty’ robustness check that goes as follows:
(1) I try not to rely only on backtesting, but think for myself about which factors to choose and which factors to weight more heavily, based on, for example, the level of noise in the data for that specific factor and my conviction in the factor’s construction.
(2) backtest on multiple universes (for example, a US ‘total’ universe and 5x subsamples of that total universe with Mod(StockID, 5) = 0, 1, …, 4)
(3) backtest in multiple regions (for example, not only in the US, but also with European data).
(4) backtest over multiple time periods and/or use performance measures that capture not only the cumulative return over the period but also its robustness across time.
(5) backtest with multiple methods (ranking, simulations, screens)
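Step (2) can be sketched like this. This is a minimal, hypothetical Python sketch with made-up stock IDs; on Portfolio123 the Mod(StockID, 5) rule is of course applied server-side in the universe definition:

```python
# Hypothetical sketch of step (2): splitting a universe into five
# disjoint subsample universes by stock ID, mirroring the
# Mod(StockID, 5) = k universe rule. Stock IDs below are made up.
universe = {101: "AAPL", 102: "MSFT", 103: "NVDA",
            104: "AMZN", 105: "GOOG", 106: "META"}

def subsample(universe, k, n=5):
    """Return the subset of the universe whose stock ID is congruent to k mod n."""
    return {sid: tkr for sid, tkr in universe.items() if sid % n == k}

subsamples = [subsample(universe, k) for k in range(5)]
# Every stock lands in exactly one subsample, so the five backtests
# together cover the whole universe without overlap.
```

The point of the modulo split is that the five subsamples are random with respect to any factor, so a robust ranking system should perform similarly on each of them.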

Up until now, whenever robustness criteria (1) - (4) were satisfied for a multi-factor ranking system, (5) was automatically satisfied as well. So when, for example, a ranking system bucket test looked promising, so did the simulations that used that ranking system. But yesterday I encountered a situation where this was not the case, which surprised me.

After constructing a ranking system based on (1) I ran a simple multi factor ranking bucket test on a broad US Universe with a volume constraint as in (2).

That looked ‘good enough’, so I also ran it on the 5x subsample universes, with the following results.

Eyeballing the graphs, I was still satisfied, so I moved on to step (3). Using a broad European universe with the same volume constraint, I got the following result:

Again, using 5x subsamples resulted in

which still looked good enough to me.

Next I used some robustness measures, as in (4), that I described here: Ranking your ranking systems - #16 by Victor1991. These were also satisfied. So I moved on to (5), testing it on the broad US universe I used in step (2), but this time in a simulation.
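As one concrete illustration of what a time-robustness measure from step (4) might look like, here is a minimal sketch (not necessarily one of the measures from the linked post): the fraction of rolling 12-month windows in which the top bucket beat its benchmark.

```python
# A hedged sketch of one possible time-robustness measure: the share of
# rolling windows with a positive cumulative excess return. Inputs are
# monthly excess returns of the top bucket versus the universe.
def rolling_hit_rate(excess_monthly_returns, window=12):
    """Fraction of rolling windows whose cumulative excess return is positive."""
    n = len(excess_monthly_returns)
    hits = 0
    windows = 0
    for start in range(0, n - window + 1):
        cum = 1.0
        for r in excess_monthly_returns[start:start + window]:
            cum *= 1.0 + r
        windows += 1
        if cum > 1.0:
            hits += 1
    return hits / windows if windows else float("nan")
```

A system that earns its cumulative return in one or two lucky years will score much lower on a measure like this than one that outperforms steadily.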

Based on the above numbers, I’m still interested in the strategy. The annual turnover is a bit too high for my liking, but let’s move on and test it on the broad European universe of step (3) for now to see what happens there.

Surely we can expect an annual alpha there of somewhere between 35% and 40%, based on our previous ranking bucket tests and the simulation with US data. Right?

Well, that was actually quite disappointing. What happened here? Looking at this result of step (5), I’m no longer that comfortable with this strategy, even though it seemed quite promising based on steps (1) - (4).

I’m interested to hear your thoughts on what you would do in a situation like this. The gap between the simulation results in the US and in Europe seems very large, even though the ranking bucket results did not seem that far apart. Would you go back to the drawing board and revamp your ranking system? Would you doubt your simulation settings (see below for mine)?

Let me know if you need any more information about my procedures to formulate an answer to the question above. I would love to get some input here.

I have developed quite a few systems based on procedures very much like the one I described here in steps (1) - (5), and because this time the ranking system failed only at (5), I’m starting to doubt my procedure as a whole and think I need more robustness checks, even if I use it only as a rough first test.



First, you should have been tipped off to this with the bucket ranks. Your top bucket got 38% in the US and only 16% in Europe.

Second, take a look at the factors and see how many of them are N/A or not appropriate for Europe. For example, any factors using short interest, insider buying, or institutional holdings are going to be NA for Europe, at least for the moment. So will certain estimate revision factors (those that are updated weekly). Also check to see if your factors are commensurate with semiannual reporting. See Adjusting a ranking system for European (semiannually reporting) companies.

There might be a host of other reasons why the ranking system is working better in the US than in Europe, and it’s definitely worth exploring. Try varying the weights of the factors radically and then repeat your experiment. Perhaps the weighting of value factors will make a difference? Or the momentum factors? Try changing your universe rules and see what that does.

And let us know what you find out!

Hi Yuval, where did you see the 16% for the top bucket ranks in Europe? I see 35%+ return for the top buckets in each of the subsamples. Perhaps I’m missing something :thinking:.

I will check out your other suggestions and see what happens. Thanks!

EDIT: I have edited the thread title from (paraphrasing) ‘What action steps to take when a strategy fails your robustness tests’ to ‘Reasons for ‘Big Gap’ in results between a ranking system and a simulated strategy’.

Sorry, I was deeply mistaken. Not sure how I saw 16.

To make this thread a bit more concrete, I’m trying to come up with reasons why the annualized return of a Simulated Strategy (SS) might come in more than 10% below that of the ‘top bucket’ of a Ranking System (RS) that is used as a component of that same SS, when using the same universe.

Other threads on this topic:

Some reasons I came up with off the top of my head:
(a) High-turnover strategies can show a big gap between an RS and an SS when slippage is not accounted for in the RS. So lower-turnover strategies should have a smaller chance of a ‘big gap’ in annual returns.
(b) The RS is not as robust in the top buckets. This should show up when using more than 20 buckets (e.g. 100 buckets). It could be that the top 5% of stocks is made up of, for example, 2.5% of stocks that outperformed greatly and 2.5% that actually did poorly. I think this is quite unlikely, but it can happen.
(c) Buying/selling rules within the SS that do not give enough, or not consistent enough, exposure to the top-ranked stocks in the underlying RS.
(d) Too few stocks held (e.g. < 10), increasing the risk of a bad pick in the SS. An RS only takes quantitative measures into account; if the CEO happens to go to jail (information not captured by the RS), that will likely affect the stock price.
(e) Different settings in the SS, or settings in the SS that override those in the RS, like the chosen currency (unlikely to explain 10%+ return differences in most cases) or overriding NAs from negative to neutral or vice versa.
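Check (b) can be made concrete by re-cutting the ranked universe into finer buckets and inspecting how the top 5% decomposes. Below is a hedged Python sketch of that idea; the rank/return pairs it expects are illustrative inputs, not real Portfolio123 output:

```python
# Sketch of check (b): compute average period returns per rank bucket
# for a configurable bucket count, so a 20-bucket test can be compared
# with a 100-bucket test over the same (rank, return) data.
def bucket_returns(ranked_returns, n_buckets):
    """ranked_returns: list of (percentile_rank, period_return) pairs,
    with percentile_rank in [0, 100). Returns the average return per
    bucket; bucket 0 holds the lowest-ranked stocks."""
    buckets = [[] for _ in range(n_buckets)]
    for rank, ret in ranked_returns:
        idx = min(int(rank / 100.0 * n_buckets), n_buckets - 1)
        buckets[idx].append(ret)
    return [sum(b) / len(b) if b else None for b in buckets]
```

If the last bucket of `bucket_returns(data, 20)` looks strong but the last five buckets of `bucket_returns(data, 100)` disagree wildly with each other, the top 5% is being carried by only part of its members, which is exactly scenario (b).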

For the case I described in this post, (a) doesn’t seem likely, as the gap in returns did not appear for the US universe, only for European data. I’m going to check out (b) and (e) and report back.

Interested to hear if others have (other) ideas about reasons for big differences in results between a RS and a SS.

You mention slippage, and I’d be willing to bet it’s related to slippage and commissions.

Presumably, you’re running a ranking system performance test with zero slippage. If so, try re-running your sim with commissions and slippage both set to zero, and then compare again to the ranking system performance test. If there’s still a big gap, you could move on to items b through e.

Conversely, you could re-run your ranking system performance test with slippage, but this comparison to the sim isn’t as clean. First, there’s no commission parameter. But more importantly, in a 20-bucket test with weekly rebalancing, you can incur lots of excess estimated slippage costs as stocks bounce across bucket boundaries, whereas your sim, with a 2+ month average holding period, may not be as trigger-happy. This problem becomes more pronounced with higher bucket settings.
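The boundary-bouncing effect described above can be illustrated with a toy example (the weekly ranks below are invented numbers, not real data): a stock oscillating just around the top-bucket cutoff gets traded almost every week in a weekly-rebalanced bucket test, while a sim that holds positions for months would simply sit through the noise.

```python
# Toy illustration of bucket-boundary bouncing: each crossing of the
# top-bucket cutoff counts as a trade (an entry or an exit) in a
# weekly-rebalanced bucket test.
weekly_ranks = [94.9, 95.2, 94.8, 95.1, 94.7, 95.3, 94.9, 95.2]  # cutoff at 95

def trades_bucket_test(ranks, boundary=95.0):
    """Count trades a bucket test would make as the rank crosses the cutoff."""
    in_bucket = ranks[0] >= boundary
    trades = 1 if in_bucket else 0  # initial entry, if any
    for r in ranks[1:]:
        if (r >= boundary) != in_bucket:
            in_bucket = r >= boundary
            trades += 1
    return trades
```

Here `trades_bucket_test(weekly_ranks)` returns 7 trades in 8 weeks, each one incurring estimated slippage in the bucket test, while a stock ranked safely inside the bucket (say, steady at 96) would trade only once. With more buckets the boundaries get closer together, so more stocks bounce across them.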

Hi Feldy, I think you are right, even though at first I thought you were not. Let me explain.

At first, I thought:
:one: the annual returns of the top buckets in my ranking systems are about the same for the US and European universes and
:two: the annual turnover (510%) for both universes is about the same

:arrow_forward: Hence, the transaction costs should be about the same.

But actually this doesn’t have to be the case. Let me tell you how I approached this.

I downloaded the transaction information of both simulations (under the ‘Transactions’ tab). I then calculated the weighted average trading price. This can be done in, for example, Excel with the formula =SUM(ABS(F4:F9229))/SUM(ABS(D4:D9229)), where column ‘F’ is ‘Amount’, column ‘D’ is ‘Shares’, and rows 4 through 9229 hold the transactions. This gave me a value of 3.36 for the simulation on the European universe and a value of 8.81 for the US universe.

To double-check, I also calculated it using =SUMPRODUCT(ABS(D4:D9229);E4:E9229)/SUM(ABS(D4:D9229)), where column ‘E’ is ‘Price’. This gave me values of 3.35 and 8.78 respectively, which confirmed the earlier numbers.
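For anyone who prefers to do this outside Excel, here is a Python sketch of the same two calculations on toy transaction rows (the shares/price/amount values below are made up; real exports have many more rows):

```python
# Python equivalents of the two Excel formulas: a share-weighted
# average trade price, computed once from Amount/Shares and once from
# SUMPRODUCT(|Shares|, Price)/SUM(|Shares|). Toy data, not real trades.
transactions = [
    {"shares": 100, "price": 3.40, "amount": 340.0},   # buy
    {"shares": -50, "price": 3.30, "amount": -165.0},  # sell
    {"shares": 200, "price": 3.35, "amount": 670.0},   # buy
]

total_abs_shares = sum(abs(t["shares"]) for t in transactions)

# =SUM(ABS(Amount)) / SUM(ABS(Shares))
wavg_from_amount = sum(abs(t["amount"]) for t in transactions) / total_abs_shares

# =SUMPRODUCT(ABS(Shares); Price) / SUM(ABS(Shares))
wavg_from_price = sum(abs(t["shares"]) * t["price"]
                      for t in transactions) / total_abs_shares
```

The two results agree up to rounding in the Amount column, which is exactly the small 3.36 vs 3.35 and 8.81 vs 8.78 discrepancy seen above.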

Next, I calculated the average trading fee as a percentage of the amount traded with =SUM(G107:G9331)/SUM(ABS(F107:F9331)), where column ‘G’ is ‘TotFees’. This gave me a value of about 0.01 for the US universe and a value of about 0.021 for the European universe.
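The mechanism behind those two fee numbers can be sketched as follows. This assumes a fixed commission per share (the 0.005 below is an illustrative figure, not the actual commission setting), under which fees per dollar traded scale inversely with the average trade price:

```python
# Sketch of why fee fractions differ: with a fixed commission per
# share, fees as a fraction of the amount traded are
# (commission per share) / (price per share), so a lower average
# trade price means a higher fee fraction.
fee_per_share = 0.005  # assumed, illustrative per-share commission

def fee_fraction(avg_price, fee_per_share=fee_per_share):
    """Fees paid per dollar traded."""
    return fee_per_share / avg_price

us = fee_fraction(8.81)  # higher average price -> lower fee fraction
eu = fee_fraction(3.36)  # lower average price -> higher fee fraction
```

The ratio eu/us equals the price ratio 8.81/3.36 ≈ 2.6, which is roughly the 0.021 vs 0.01 ratio observed in the two simulations; the remaining difference would come from whatever is not a flat per-share charge.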

Those numbers made sense to me: when the average purchase price is about half as high and I pay a fixed commission per share, my fees paid as a fraction of the amount traded will be about twice as high. This isn’t taking into account the variable slippage yet, which is based on the volume of the stock to be bought, from what I understand. I will check that out later.

Anyway: I’m starting to think that in the future I will have to take the commission to be paid into account in my formula weighting, and probably the slippage to be paid as well. That also means I will have to take the estimated benefits (the expected returns, excluding trading costs) into account in the formula weighting. I will come back to that.

Thanks again for the input :slight_smile: