Ranking Annualized Returns vs. Performance

A ranking system optimized for a tiny universe, no matter how sensibly defined, is always going to be less persistent than one optimized for a large universe, simply because of the amount of data you’re working with.

I do think that developing ranking systems for small universes (e.g. banks) is a very good idea, but you should try to keep those universes as large as possible for testing/optimization purposes (eliminating, of course, low-liquidity stocks), and only narrow them down by adding screening rules (like historical profitability and dividend payments) later. In addition, you should recognize that the ranking systems are going to be more heavily data-mined, and will be more subject to failure, than a ranking system developed for a large universe.

Yuval,

I cannot tell you how much I agree.

Just a nomenclature thing. What you say/do about persistence is VERY much like cross-validation. A VERY VERY good thing.

Take credit (for whatever you call it), and please add it to P123 wherever possible. Maybe see what else de Prado has to say about cross validation/backtesting.

That's my last comment on this post about that; add whatever you wish. But my last word(s): Good work!!!

-Jim

It’s been a lifetime since I used pure statistical math, and the fog of time has certainly dimmed my memory of specifics and my abilities. But some precautions come to mind:

Statistics in every way ABSOLUTELY depends on sample and population size. Too few in a sample can lead to erroneous conclusions. Limit the population too much and the same can happen. So, how can you check for sample validity? Others might suggest specific statistical tests of significance to perform on the size of the filtered universe and each bucket. In some instances and for some measurements, I detest the use of statistics because we can incorrectly conclude that they are all powerful. They are not, and even after performing all of the right statistical tests we might find that our analysis doesn’t hold up going forward in time. In my thinking, that is usually due to our ignorance of hidden variables whose impact was not evident in the original data.
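
To make the sample-size point concrete, here is a minimal Python sketch (the return numbers are made up, nothing from P123) showing how the standard error of a bucket's mean return shrinks with the number of stocks in the bucket:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical monthly stock returns: 1% mean, 8% cross-sectional volatility.
mu, sigma = 0.01, 0.08

for n in (10, 50, 250, 1000):  # stocks per bucket
    se = sigma / np.sqrt(n)  # standard error of the bucket's mean return
    sample_mean = rng.normal(mu, sigma, size=n).mean()  # one simulated bucket average
    print(f"n={n:5d}  std error of mean={se:.4f}  one sampled bucket mean={sample_mean:.4f}")
```

With 10 stocks per bucket, the noise around the mean is five times larger than with 250, which is the whole problem with tiny universes.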

There is nothing wrong with filtering the universe or performing the ranking tests as you have done, as long as you accept a higher probability of spurious results due to small universe and sample (bucket) sizes. Some things I would check in an attempt to reduce that probability:

Use the rank performance test log to download and analyze the median return of each bucket and the standard deviation of individual returns within each bucket. There are few enough stocks in each bucket that the log should, with your small universe, include all stocks, not just the first 30 in each bucket. These numbers are mostly independent of the regression between buckets and should help support a decision to use one ranking approach over another. Tightly spaced returns in each bucket (small standard deviation) and median returns close to the mean returns are preferred, of course. There might also be some value in regressing the set of bucket median returns or the set of standard deviations. The median and standard deviation of each bucket are statistics I would love to see reported in ranking tests, and they would be especially beneficial for situations like yours. I thought I requested them once upon a time, but I can't find such a request. Performing this analysis requires significant spreadsheet work; a code sketch follows below.
***Edit: I ended up voting for a Ranking request by Vladinvest instead of creating another one. Jerrodmason also requested some stats. If you're interested, I suggest checking the outstanding feature requests for any that appeal to you.
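
For anyone who would rather do that spreadsheet work in code, here is a minimal pandas sketch. It assumes the log has been downloaded to a CSV with columns named `Bucket` and `Return` (hypothetical names; adjust them to whatever the actual export uses):

```python
import pandas as pd

# Hypothetical file and column names; match them to your actual download.
log = pd.read_csv("rank_performance_log.csv")

stats = (
    log.groupby("Bucket")["Return"]
       .agg(mean="mean", median="median", std="std", count="count")
)

# Buckets with a small standard deviation and a median close to the mean
# are the ones described above as preferable.
stats["median_minus_mean"] = stats["median"] - stats["mean"]
print(stats.round(4))
```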

I believe you have already compared the ranking results in different periods of time. That can be important to consider.

[quote]
Randomness typically has poor r^2 even with high AR but a case can be made that it’s possible.
[/quote] One very important precaution here: in statistics, EVERY POSSIBLE OUTCOME has a non-zero probability. Every day we easily rationalize (“randomness typically has poor r^2”) why something we see (a high r^2) is statistically relevant. But it actually might not be! “Caution, Will Robinson!”
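
To put a number on that, here is a quick pure-noise simulation (a sketch with arbitrary parameters, nothing more) that counts how often an impressive-looking r^2 shows up by chance alone:

```python
import numpy as np

rng = np.random.default_rng(42)

n_buckets, n_trials = 10, 10_000
x = np.arange(n_buckets)  # bucket rank: 0 = worst ... 9 = best
high_r2 = 0

for _ in range(n_trials):
    y = rng.normal(size=n_buckets)  # pure-noise "bucket returns"
    r = np.corrcoef(x, y)[0, 1]
    if r**2 > 0.5:  # an arbitrary "impressive-looking" threshold
        high_r2 += 1

print(f"{high_r2 / n_trials:.2%} of pure-noise trials showed r^2 > 0.5")
```

Even with only ten noise buckets, roughly a couple percent of trials clear that bar, and that's before any optimization has gone hunting for them.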

I do agree with trying to keep it as big as possible (caveat: within the scope of the project), but this case was something on a Canadian exchange, and it's hard to keep any universe big on the TSX, at least one limited to what I consider tradeable. It's a test project, nothing more at the moment, but it yielded an interesting finding that I wanted to share :slight_smile: I'm glad I did, because it got me thinking about how I can apply other comparative tests. I read your piece on ultimate omega and will likely look into that further when I get some time, so thank you for that.

[quote]
In some instances and for some measurements, I detest the use of statistics because we can incorrectly conclude that they are all powerful. They are not, and even after performing all of the right statistical tests we might find that our analysis doesn’t hold up going forward in time.
[/quote]

I agree that you can’t rely too much on the statistical data we look at without thinking about the bigger picture surrounding that data. I think of the data as a photograph: a single moment in time. Post-process the photograph and you can find a lot of interesting information, but a movie of the same moment would tell you much more about how valid the photograph was. So I try to examine trends in weights and across different time periods; if I can find a consistent positive relationship, I get more comfortable accepting the best picture. I also agree that the highest AR doesn’t usually translate into the best system. Take a step back: what are we trying to accomplish? Finding the best predictability, not the best AR (maybe reward:risk is a better way to say it). OOS is the ultimate test, but large data samples are desired, and so are current economic conditions, which means discounting older data …

There are a LOT of ways to attack the data that we have, and we may not be monkeys on typewriters, but we aren't far from it :wink:

Tony,

Thank you for raising this in your post; it got me looking more closely at the issue.

Here is a paper that uses Spearman’s rank correlation as its chief metric. Link HERE. You can skip to page 5, section 3.4, and the first paragraph of “Performance Measures” to find his use of Spearman’s rank correlation as a performance measure.

P123 makes heavy use of ranks for the factors. Does it make sense to rank the returns too (as the Rank Correlation does)?

If you had a crystal ball, you would like to know the order of the future stock returns, from the best stocks to the worst, in descending order. Then you would pick the top 5, 10, … 25 stocks. Rebalance regularly. This is the P123 way (without the crystal ball).

Spearman’s Rank Correlation is specifically designed to see how good your predictions are in this regard. It tells you if your method is successful at putting the future stock returns in descending order.

Ultimately that is all you can act on. All you could ever hope to do is pick the top 5 stocks every time. Anything other than the order is just interesting.

Spearman’s rank correlation may be the best metric for this. Regular correlation has the nonlinearity problem. MSE (mean squared error), RMSE (root mean squared error), and MAE (mean absolute error) are popular, but they have problems when the data is not i.i.d. When things are not identically distributed, a high RMSE, for example, may not be reflective of the best ordering.
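
A minimal sketch of how one might compute this with scipy (the numbers are made up, purely to illustrate the call):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(7)

# Hypothetical data: the ranks a system assigned to 100 stocks at a
# rebalance date, and the returns those stocks went on to deliver.
predicted_rank = np.arange(100)
realized_return = 0.001 * predicted_rank + rng.normal(0, 0.05, size=100)

rho, p_value = spearmanr(predicted_rank, realized_return)
print(f"Spearman rho = {rho:.3f}, p-value = {p_value:.4f}")
# rho near +1: the system ordered future returns well.
# rho near 0: the ordering carried no information.
```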

Okay, I admit it. Knowing the return (and the volatility), in light of the transaction costs, may help you decide whether any of this is worth it and whether you shouldn’t just put your money into some diverse ETFs. Maybe there are other things that are more than “just interesting.”

So this paper also used the returns of the top quintile minus the bottom quintile. Our sims and screens give us this type of information already, just with slightly different groupings of the data. Maybe we compare the sim’s returns to the benchmark’s instead of to the returns of the lowest quintile, but the methods are very much equivalent, IMHO. We are all probably using sims and screens in a good way (for good reason).
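
For completeness, a sketch of that quintile-spread measure (hypothetical data and column names):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical data: 500 stocks with rank scores and subsequent returns
# for one rebalance period.
df = pd.DataFrame({
    "score": rng.uniform(0, 100, size=500),
    "fwd_return": rng.normal(0.01, 0.08, size=500),
})

# Quintiles by rank score: 1 = lowest scores, 5 = highest.
df["quintile"] = pd.qcut(df["score"], 5, labels=[1, 2, 3, 4, 5])

by_q = df.groupby("quintile", observed=True)["fwd_return"].mean()
print(by_q.round(4))
print(f"Q5 - Q1 spread: {by_q.loc[5] - by_q.loc[1]:.4f}")
```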

FWIW.

-Jim

I just wanted to update this ranking system a bit to show that a small, focused system can still have some merit. I finished the ranking system and created a port sim. It uses a simple 50% hedge with TLT to smooth things out with little sacrifice to AR. It's set with variable slippage and a $0.01/share commission. I used a 2003 start date just because TLT's inception wasn't until summer 2002, and it's easier to press MAX and then change the year on the simulator. Remember, this universe is dynamic and held as few as 1 stock through the 2008 crash; currently I think there are about 70. These are all Canadian dividend stocks, so either a 2- or 4-week rebalance seems to work well; any longer than that and you lose most of the benefit of the hedge. This has been a fun project, and I think I'll dig a little deeper into it, since I believe there's a market for this within an RRSP, where you're not taxed on the dividend proceeds.
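
For anyone curious what a static 50% TLT hedge does arithmetically, here is a toy sketch with made-up return series (not the sim's actual numbers); the blend cuts volatility substantially while giving up part of the return:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical weekly returns for the stock sleeve and for TLT.
strategy = rng.normal(0.004, 0.030, size=520)  # roughly ten years of weeks
tlt = rng.normal(0.001, 0.015, size=520)

# A simple 50/50 blend, rebalanced back to target weights every period.
hedged = 0.5 * strategy + 0.5 * tlt

for name, r in (("unhedged", strategy), ("50% TLT hedge", hedged)):
    ann_ret = (1 + r).prod() ** (52 / len(r)) - 1
    ann_vol = r.std() * np.sqrt(52)
    print(f"{name:14s}  ann. return {ann_ret:6.2%}  ann. vol {ann_vol:6.2%}")
```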


Does the US have a common tax shelter where dividends aren't taxed?

[quote]
Does the US have a common tax shelter where dividends aren't taxed?
[/quote] Preface to my reply: I am not a tax expert and might be wrong. Roth IRAs should be able to avoid taxes on dividends received, as long as they are not from ADRs. That's all I'm aware of. Regarding ADRs that pay dividends: the taxation of ADR dividends in the USA can be complex, and that's likely true regardless of the account type holding them.