P123 ranking systems underperform the market?

Well, the market cap lower-is-better factor is only 5% of the ranking system. You could eliminate that easily with little effect. There is no volume lower-is-better factor. There’s share turnover, which doesn’t reward small caps at all. The companies with the lowest share turnover in the S&P 500 are Walmart, Johnson & Johnson, and Exxon Mobil.

I wouldn’t dream of doing so. In fact I regularly use the opposite factors (sales higher is better, assets higher is better) to moderate the “market cap lower is better” factor, which can have pernicious effects on occasion.

Here I really disagree. The object of using P123 is to improve your probability of getting good returns. You can’t get good returns by curve-fitting; on that we agree. But if you figure out how to incrementally improve a good strategy’s backtested results, you’re also incrementally improving your own odds. Improving a ranking system by improving its past performance gives you a slightly higher chance of improving out-of-sample returns than sticking with an unimproved ranking system. The logical conclusion is that you should improve your ranking system as much as possible and aim for the highest backtested returns, as that will give you the highest probability of high out-of-sample returns. You just have to use robust backtesting, with rolling tests and very large sample sizes (weekly is better than monthly, 100 stocks better than fifty, large universes better than small, very long backtest periods better than short ones), to ensure you’re not curve-fitting; you have to look carefully at your returns distribution; and you have to carefully research all your factors and rules to make sure they make good financial sense. (I have frequently “improved” a factor by making it make more “sense” even if the backtested returns went down as a result.)
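As a concrete illustration of the rolling tests mentioned above, here is a minimal sketch (my own illustration, not P123’s implementation) of generating overlapping test windows, so a strategy has to hold up across many sub-periods rather than one lucky span:

```python
def rolling_windows(n_periods, window, step):
    """Yield (start, end) index pairs for overlapping test windows,
    so a strategy is judged across many sub-periods, not one span."""
    for start in range(0, n_periods - window + 1, step):
        yield start, start + window

# e.g. 20 years of weekly data: 3-year windows advanced one quarter at a time
windows = list(rolling_windows(n_periods=1040, window=156, step=13))
print(len(windows))  # 69 overlapping test windows
```

The window and step sizes here are arbitrary; the point is simply that each test period overlaps its neighbors, giving many chances to catch a strategy that only worked in one regime.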

I realize that you can design a very attractive backtest using complicated buy and sell rules and a small number of stocks, and that those strategies are basically worthless. But that shouldn’t invalidate efforts to incrementally improve backtested returns in order to incrementally improve a strategy using robust methods that correlate well between in-sample and out-of-sample periods. And if those backtests show alphas of 40%, fine.

Yuval,

All good, IMHO.

Have you considered using the “out-of-sample” as a formal cross-validation (as in de Prado’s book)?

For example, I joined P123 in 2013. Now, when looking at any new systems I develop them using data from 2000 until 2013.

Once I have optimized my ranking system to my satisfaction, I run the sim or rank performance out-of-sample from 2013 to now and compare it to the systems I have been running live since 2013. If the new system’s cross-validation (since 2013) doesn’t beat what I have been doing live (since 2013), I throw it in the wastebasket and stick with my present ports.

If I decide to use the new ranking system I do a final optimization with all of the data up until today.
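The procedure described above amounts to a hold-out-in-time validation. A minimal sketch, assuming a weekly return series (the helper name and dates are hypothetical; P123 exposes no such API):

```python
import pandas as pd

def holdout_in_time(returns: pd.Series, split_date: str):
    """Develop on data strictly before split_date; validate on data after.
    The development window never sees the validation window."""
    develop = returns[returns.index < split_date]    # optimize ranking here
    validate = returns[returns.index >= split_date]  # compare vs. live ports here
    return develop, validate

# toy weekly series from 2000 to late 2023, split at the 2013 join date
dates = pd.date_range("2000-01-08", "2023-12-30", freq="W-SAT")
returns = pd.Series(0.001, index=dates)
develop, validate = holdout_in_time(returns, "2013-01-01")
# if the validated system beats the live ports, re-optimize on the full series
```

The final step in the text, re-optimizing on all data up to today, would then simply refit on `develop` and `validate` combined.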

Probably just a different name for some of the things you are already doing. And as I said—all good.

-Jim


No, I’m not interested in checking the performance of anything. The goal is to learn large-cap strategy design, not to be a prisoner of the moment, pumping backtested performance over a specific interval.

Those tests are built around the assumption that small stocks are better than large stocks. That does not change just because one can point to a period during which large stocks outperformed.

The goal is to understand what the market and these stocks are doing and, more importantly, WHY. If folks continue to present backtests and feel good about good results, then nothing is being learned.

Come on, folks, this is a learning exercise. If one is not interested in participating in it, that’s fine. There are many things in which we’re all not interested (including me . . . there are lots of topics about which I don’t care). But if one does want to try to learn to deal with large caps, then one should allow oneself to learn without the pressure of trying to present great equity curves.

That is not only wrong; it’s dangerously wrong.

The only way to get a “clear correlation between in-sample and out-of-sample returns” is if one is lucky enough to be in a period in which trends persist and the future resembles the past.

I’ve discussed this more than enough times over years and don’t have the time or energy to keep rehashing. If anyone wants to pursue the topic further, there’s plenty in the on-line strategy design class and on my blog at actiquant.com.

Sorry, Marc, I am not getting the point of what you are saying. Can you please summarize?

Steve-

You caught me; jumping around is exactly what I’m doing here. I realize that. But I’m working on understanding how this whole environment works, and I think I need to make some jumps.

Thanks a lot for your explanation, I think I get the point.

I’m going to use the “Performance” tab in the rankings more, but I do not fully understand the “buckets” system yet.

So, I have a couple of questions, for you or for whoever who wants to answer:

  • Is there any difference between setting the Universe to “All Stocks US” and then excluding the OTC market in the buy rules, versus setting the Universe to “No OTC Exchange”?

I guess the “buckets” distribute the stocks that the ranking returns from highest to lowest, but if we set it to 5 buckets:

  • What is inside the 0-20 bucket? Is it the first through twentieth stocks that the ranking returns, for instance?

  • If you do not set the ranking system’s N/A handling to “neutral” for model deployment… which setting do you use?
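On the bucket question: as I understand it, rank performance buckets are percentile slices of the ranked universe rather than fixed counts of stocks, so the 0-20 bucket holds roughly the lowest-ranked fifth of the universe. A rough sketch of how five buckets would slice a universe (my own illustration, not P123’s code):

```python
import numpy as np

def rank_buckets(scores, n_buckets=5):
    """Assign each stock a bucket by percentile rank: bucket 0 holds the
    lowest-ranked ~20% of the universe, bucket 4 the highest-ranked ~20%."""
    ranks = np.argsort(np.argsort(scores))   # 0 .. n-1; low score = low rank
    pct = ranks / len(scores)                # percentile in [0, 1)
    return np.minimum((pct * n_buckets).astype(int), n_buckets - 1)

scores = np.array([0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.6, 0.5, 1.0])
print(rank_buckets(scores))  # [0 4 1 3 0 3 1 2 2 4]
```

So with 500 stocks and 5 buckets, each bucket would hold about 100 stocks, regardless of the universe size.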

Yuval-

Your ranking system is great!! Love it.

What do you mean here??

I think a better way to do “out-of-sample” tests is to use the evenid = 0 and evenid = 1 universes. Another way is to use the more recent period as your in-sample and a period prior to that as your out-of-sample period. I would hesitate to use a strategy that was developed without regard to recent changes in the way people invest. As I’ve written before, recency bias can be profitable. If I were to choose between a strategy optimized over the last eight or ten years and one optimized over the eight or ten years before that, I would unhesitatingly choose the former.

One of the designer models boasted average annual returns of over 100% with a five-stock portfolio. I don’t KNOW that there were complicated buy and sell rules, but it’s not easy to get 35X turnover using ranking alone. Since it was launched five years ago, it has had a 4% annualized return (compared to the S&P 500’s 9%).

Yuval,

I used even/odd universes extensively at one time and it clearly adds something. But it should be no surprise if data from the same time period behaves the same and gives the same results; one is fooling oneself if one thinks this is much proof of a good system. de Prado cemented my view on the limitations of this method in his book (Advances in Financial Machine Learning).

I am sure you did not use this method for your correlation studies that you mention—for good reason I would say.

But I encourage the use of even/odd validation if that is anyone’s preferred method.

Of course, no one in machine learning (including de Prado) would run a live port without the latest data. Once the method is validated, all of the data is merged and optimized again: THIS IS TRUE WHETHER YOU USE EVEN/ODD UNIVERSES TO VALIDATE OR WHETHER YOU USE OUT-OF-TIME DATA. So this really is not a way to separate the two methods. And many, including de Prado, would continue to optimize with the new data before each rebalance. So no one would argue against using recent data, no matter how they chose to validate their methods.

Whatever cross-validation method is used, preventing overfitting is the goal. A good cross-validation would have caught the problem.

-Jim

Then that must be a clear example of overfitting.

Maybe we can avoid that by applying rules that are more generic and closer to the median? For instance, if ROE > 12 is what performs best, then choose a figure around that, but not exactly the same…

Cross-validation—well done—will expose the most egregious problems with overfitting.

The above is a little dense mathematically, and the rest of the chapter involves consideration of correlation and problems with the data not being IID. All under the category of [b]“leakage”[/b] problems. But this is the reason you cannot use just any random sample. If the data were IID you could use an even/odd universe method for cross-validation, for example.

BTW, the fact that our data is not IID (the core problem mathematically) was first presented to me by SUpirate1081 and pvdb. It stung a little bit—at the time—to learn how messed up my techniques were. But their input is much appreciated considering the goal is to learn and make some money in the process. Just giving credit to these guys. CyberJoe understood this too but he grew tired of us and is gone now.

But you do not have to understand every mathematical detail. You could just follow the portion in bold (my emphasis): test on data that is AFTER (in time) the data you use to train. Ultimately this is for reasons similar to the reasons you use PIT data in your models (a no-brainer).

P123 cannot stop overfitting in the Designer Models, so it enforces a valid out-of-sample test by not allowing people to invest in a model until there is (some) valid out-of-sample data. Well done, P123.

But if you are worried about your own models (and your money) you can use your own (preferably valid) out-of-sample cross-validation methods.

-Jim

Jim, can you please let us lesser mortals know what IID stands for. I know what PIT means.


Georg,

IID (independent and identically distributed). It is beyond the scope of my post to detail the importance of this. But I can summarize: IT IS IMPORTANT TO EVERYTHING. Rather than debate this I refer you to any text that tries to cover any of it, including de Prado’s.

Examples:

Use of even/odd universes: these two universes may be identically distributed, but they are not independent. Indeed, they are the most correlated samples (and therefore NOT independent) you will be able to find. Use them, but know the limitations.
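To see why the two halves are not independent, here is a toy simulation (entirely synthetic numbers, not real market data): because every stock loads on the same market factor in the same periods, the even and odd halves’ period returns come out almost perfectly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_periods, n_stocks = 260, 200
market = rng.normal(0.0, 0.02, n_periods)              # shared market factor
noise = rng.normal(0.0, 0.04, (n_periods, n_stocks))   # stock-specific noise
returns = market[:, None] + noise                      # all stocks share the factor

even_half = returns[:, ::2].mean(axis=1)   # the "evenid = 0" universe's period return
odd_half = returns[:, 1::2].mean(axis=1)   # the "evenid = 1" universe's period return
print(np.corrcoef(even_half, odd_half)[0, 1])  # close to 1: same periods, not independent
```

Identically distributed, yes; independent, no. Whatever the market did in a given week, it did to both halves at once.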

I would not paint everything we do with such a broad brush. But it can be dangerous, indeed, when we ignore some basic ideas.

-Jim

Jim,

Thanks for bringing up and explaining the important point that each data point is not independent.

Scott

Simple systems will likely underperform because the market already knows about them, and what little edge remains can’t overcome transaction costs.

Complicated systems are not likely to outperform because of overfitting.

So we’re basically shut out of alpha unless we can develop or leverage robust and unique methods and/or data.

That’s not to say that the journey can’t be enjoyable.

David,

At least one professional would seem to agree:

-Jim

Georg,

My apologies for a short, and somewhat frustrated, answer previously. I do not seem to be very good at writing about some of this. If anyone is truly interested, people smarter (and better at writing) than I have written chapters about this, with editors. Better than a few posts from me if there is true interest.

Indeed, de Prado solidified this for me and could probably clarify any questions. He is a principal at a company that manages more than $217 billion (AQR Capital Management). If you want wisdom from an immortal (a rich and well-trained one), that is truly where you should go (for $15.09 at Amazon). But that should not stop me from trying to give my best answer to a good question.

IID, as I posted before, means independent and identically distributed. The lack of independence gives a preference for how we cross-validate.

An example is best. Let us suppose you wanted to see how a regression strategy (or optimized P123 rank) works and you wanted to TEST the strategy from 2009 to 2010.

I was addressing one simple question. Should you TRAIN the system from 2004 to 2008 or should you TRAIN it from 2011 to 2015?

So it depends on what you are looking for, perhaps. But if you are interested in how well the system is likely to do LOOKING FORWARD then you should LOOK FORWARD. The training data should be before the test data. You should train from 2004 to 2008 and test using data from 2009 to 2010, for this example.

If you use the data from 2004 to 2008, you are training before the recession, which may have changed the market. Maybe interest rates and a bunch of other things changed too. But it is a fair test of the system, because the regression (rank) was trained only on information that was available before the test period.

If you TRAIN with 2011 to 2015 you ARE training on a different market. You are training on data (and information) that you could not have possessed, at any price, in the 2009 to 2010 time period.

It could be that when you test after the recession you will do well but only because you had a set of data (and information) about the market after the recession to train with—which would have been impossible for you in 2009.
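The distinction above can be sketched as a time-ordered split (a toy helper of my own; the years come from the example in this post):

```python
import numpy as np

def time_ordered_split(years, X, y, test_start, test_end):
    """Train only on data strictly before the test window (e.g. 2004-2008),
    then test on the window itself (e.g. 2009-2010). Data from after the
    test window is never shown to the model; that information did not exist yet."""
    train_mask = years < test_start
    test_mask = (years >= test_start) & (years <= test_end)
    return X[train_mask], y[train_mask], X[test_mask], y[test_mask]

years = np.arange(2004, 2016)             # 2004 .. 2015
X = np.arange(len(years)).reshape(-1, 1)  # placeholder features
y = np.ones(len(years))                   # placeholder labels
X_tr, y_tr, X_te, y_te = time_ordered_split(years, X, y, 2009, 2010)
print(len(X_tr), len(X_te))  # 5 2  (train 2004-2008, test 2009-2010; 2011-2015 unused)
```

Training on 2011-2015 instead would mean feeding the model the post-recession information this post warns about.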

Using data that is not PIT is just an example of using INFORMATION that would not have been available at the time. Here you are using information and data to train your system that you could not have gotten through the S&P 500 (or anywhere, for any price). It is also A LOT LIKE look-ahead bias.

Anyway, I just wanted to give the best answer that I could. And, admittedly, I probably did not do as well as a principal at a $217 billion investment firm (de Prado is a principal at AQR Capital Management), someone who had several chapters to develop the idea. He also probably did not make any errors in his math, which I may have done.

This mortal thanks you for your question. De Prado—who, obviously, communicates with the immortals—is a better source if you have a true interest.

And you could just use data that is forward in time if you want to see how your model is likely to perform going forward. This is common sense that a few mortals do possess.

Sorry if that still does not answer your question. But I will stop before I copy and paste an entire chapter and the references;-) You can do that on your own if you are interested.

I appreciate the question.

-Jim

Jim, thank you for this informative response - we can learn a lot here at P123.

So, open question… should we add Yuval’s ranking system to the P123 ranking system list or not?