How to choose a ranking system? A or B

I have tested two different ranking systems over the same 15-year period, with the same portfolio size, turnover, and universe. My live portfolio will hold 10 stocks.
System A gives better results with 10–15 stocks, while ranking system B gives better results with 25–40 stocks.

Which ranking system would you choose? The one that does better with more stocks? And why?

There are two approaches here:

  1. Either ranking system A really is better at finding the 10 stocks that would ultimately make up a live portfolio; its nodes are therefore better suited to smaller, more concentrated portfolios.
  2. Or the performance of a smaller portfolio, and of individual stocks, is so random that the system that does best with 10 stocks has even less chance of repeating those good picks in the future. In other words, a backtest on so few stocks has little value in predicting whether a factor is good at picking future winners.

And if explanation number two is right, for a live 10-stock portfolio, where do you set an upper limit on the number of stocks tested in a backtest? Does anyone have a rule of thumb, such as never testing more than X times the number of stocks you would hold live?

(I know there are no clear answers here, but I’m still interested in how any of you would pick and test the system you are finally going to run live.) (And yes, I also tested the systems in the rank performance test, and tried excluding some of the best stocks in approach 1.)

For me it depends a lot. Does the system hold 10 stocks because it filters out the rest, or because it takes the top 10 from a ranking system?
I would prefer option B, because 10 stocks seems a little over-optimized: you don’t know whether adding stock number 11 would hurt the results a lot. With the second system, on the other hand, randomly picking 10 out of those 40 seems more likely to stay consistent. By the way, that is the test I would have run: split the universe with the “Mod” function and see how the results change.

In a universe of 1,000 stocks, comparing a 10-stock simulation to a 20-stock simulation means comparing the top 1% of stocks to the top 2%. Optimizing a ranking system through backtesting makes it possible to “tune” the system so that the top 1% beats the top 2% in-sample. But out of sample, do you think it is possible to build a ranking system that can predict which stocks will land in the top 1% rather than the top 2%? I’m skeptical, but open to being convinced otherwise.

I prefer using the average return from several subuniverses (I use 20-stock sims in six universes), together with rank performance tests, to compare ranking systems, trying to see which system can “broadly” sort good stocks from less good stocks. I also use a large set of rolling screens that include stocks down to roughly rank 50, but I’m not at all confident about how “low” one should go.
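
To make that concrete, here is a minimal sketch (plain Python, not P123 syntax) of the kind of comparison I mean. The return numbers are made-up placeholders standing in for the annualized return of each sub-universe sim:

```python
# Made-up placeholder numbers: each list holds the annualized return of one
# 20-stock sim run on a different sub-universe (e.g., Mod(StockID,6)=0..5).
results = {
    "system_A": [0.18, 0.07, 0.22, 0.05, 0.15, 0.11],
    "system_B": [0.13, 0.12, 0.14, 0.11, 0.15, 0.12],
}

for system, returns in results.items():
    avg = sum(returns) / len(returns)
    spread = max(returns) - min(returns)
    print(f"{system}: average {avg:.1%}, spread {spread:.1%} across sub-universes")
```

A system whose average holds up with a small spread across sub-universes is sorting stocks “broadly”; a system that wins only in one or two sub-universes probably is not.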

Edit: so I would go with option B, but I would suggest more testing to figure out which system is actually best.

Thanks for the feedback. To follow up on my own post:

I am used to testing and optimizing the strategy in the simulator, but:

  1. I see that noise from individual stocks can affect the results a lot, even in a test with 25 stocks
  2. I can reduce this to some extent with RankPos > 75 and StaleStmt = 0, but that test then has very high turnover; like the rolling test, it shows how good the system is at picking the best stocks for each period
  3. I also think that testing a 10-stock strategy makes little sense, because the return depends far too much on individual stocks, with no way of knowing whether similar stocks will show up in the future.

Rolling return test:
4. Removes noise from single stocks to some extent, because it uses the average of each period (see the toy sketch below)
5. Removes timing luck, showing the system’s ability to choose the best stocks for each period
6. Provides good visibility into the periods where the rolling test does not beat the reference, and into the stability of the performance
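
Here is a toy illustration of point 4 (synthetic data only, nothing from P123): averaging the top-ranked names each period keeps the rank signal while damping single-stock noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_weeks, n_stocks, top_n = 520, 1000, 25

# Synthetic weekly returns: a small edge that improves with rank, plus a lot of noise.
ranks = np.tile(np.arange(n_stocks), (n_weeks, 1))
edge = 0.0005 * (1 - ranks / n_stocks)
returns = edge + rng.normal(0.0, 0.05, size=(n_weeks, n_stocks))

top_basket = returns[:, :top_n].mean(axis=1)   # average the top-N names each week
benchmark = returns.mean(axis=1)               # equal-weight "reference" return

print("average weekly excess return of the top basket:", (top_basket - benchmark).mean())
print("share of weeks the basket beat the reference:  ", (top_basket > benchmark).mean())
```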

Are there any other good arguments for why rolling tests may be the best way to evaluate a ranking system?

Test_user: Is there any way to run this test “Mod(StockID, 4) = 0” on all the subuniverses at once? Or do I have to test them one by one?

This is not a new problem and not a problem just for stock-pickers. Personally, I get a better perspective looking at baseball. Baseball scouts have this same problem. Here too, it can be a million-dollar question. Baseball has ALWAYS had statistics and people making decisions based on those statistics.

Which of these players is the better player? The one who got on base (from hits) 4 times in 10 at-bats, or the one who got on base 300 times in 1,000 at-bats? One player has a 400 batting average (4/10) and the other has a 300 batting average (300/1000). There is more data on the player batting 300, which makes this basically the same type of question. But the details matter: how much data Whycliffes actually has, and how cherry-picked it is.

Me, with the baseball example? I will take the player batting 300 (300/1000) for his hitting skills. Maybe the other player can pitch. I don’t think there is any other rational answer.

John Paciorek has a perfect batting average of 1000 (3 for 3). Is he the best player ever? I mean OMG! Who even cares if he owns a glove if he can bat 1000!!! He is in the Major League Baseball Hall of Fame right?

Here is a book that will tell you who the best-hitter-ever is using Empirical Bayesian Statistics (and how to handle this general problem in the process). But spoiler alert: it is probably not John Paciorek—although you can never be 100% sure. Introduction to Empirical Bayes: Examples from Baseball Statistics

The key to this is using “shrinkage.” Test_user uses a different type of shrinkage by using Mod(); it is different from empirical Bayes and represents a type of “regularization,” I believe. This is a serious problem, and people are always finding new solutions to it.
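
For anyone curious, here is a minimal sketch of the beta-binomial shrinkage described in that book, applied to the batting example. The prior parameters below are made up for illustration; in the book they are estimated from the full population of hitters:

```python
# Empirical-Bayes (beta-binomial) shrinkage sketch. The Beta(alpha0, beta0)
# prior is an assumed stand-in for a league-wide estimate (~.267 average).
alpha0, beta0 = 80.0, 220.0

def shrunk_average(hits, at_bats, a=alpha0, b=beta0):
    """Posterior mean of the 'true' batting average under a beta-binomial model."""
    return (hits + a) / (at_bats + a + b)

print(shrunk_average(4, 10))       # the "400 hitter": 10 at-bats, pulled hard toward the prior (~.271)
print(shrunk_average(300, 1000))   # the "300 hitter": 1000 at-bats, barely moves (~.292)
print(shrunk_average(3, 3))        # John Paciorek, 3 for 3: nowhere near 1.000 (~.274)
```

With any reasonable prior, the player with 1,000 at-bats comes out ahead, which is the whole point of shrinkage.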

Whycliffes, you asked a similar question not too long ago: Previous Post

Yuval, thank you for your excellent answer to Whycliffes’ question about why real-world results tend to “shrink” out of sample. Here is the (re)link to the article from your post; as you know, the paper uses Bayesian statistics: Yuval’s Link

Yuval got it right here! There are other ways to account for and/or predict shrinkage, although Bayesian statistics is probably always the best.

Sometimes I will take a confidence interval and use its lower bound as my estimate of real-world results going forward. That is probably what I would do here if I did not use an actual Bayesian solution. Confidence intervals work because more data, like more at-bats, narrows the interval; the approach is also suggested in the paper Yuval linked to, and I have used it at times. Mod() can work too. This is not a new problem, and people are always coming up with good ideas for solving it.
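
A back-of-the-envelope sketch of that lower-bound idea, using a t-based confidence interval on the mean periodic return. The return series here is a random placeholder; in practice it would be exported from a backtest:

```python
import numpy as np
from scipy import stats

# Placeholder: 5 years of weekly returns. Replace with real backtest output.
weekly_returns = np.random.default_rng(1).normal(0.003, 0.03, size=260)

mean = weekly_returns.mean()
sem = stats.sem(weekly_returns)
low, high = stats.t.interval(0.95, df=len(weekly_returns) - 1, loc=mean, scale=sem)
print(f"mean {mean:.4f}, 95% CI [{low:.4f}, {high:.4f}] -> plan around the lower bound {low:.4f}")
```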

The tongue-in-cheek “proof” that the lower confidence bound is probably better than the upper bound (or even the mean) uses Murphy’s Law, which states that you can always expect the worst. I find it works in practice. More formally, the multiple comparisons problem (lots of trials without much data) can be used in a rigorous argument.

Thank you, Yuval, for your earlier link pointing to what is probably the best mathematical solution to this problem (Bayesian statistics, or possibly confidence intervals as a “back-of-the-envelope” approximation). And again, regularization in all of its forms is a good idea and works very well in practice.

Jim

Sorry about the slow reply!

I don’t know of any way to test multiple Mod() universes at the same time. The closest would be to use a Python program through the API (but no simulations that way, unfortunately), or to use the optimizer.

The optimizer works pretty well: make a simulation with the rule “mod(stockid,4)=0”, then add the alternatives “mod(stockid,4)=1”, etc., in the optimizer. Run the optimizer, then change the contents of the ranking system to whichever alternative you want to test, and run the optimizer again. Finally, compare the results in Excel. This makes it easy to simulate many universes over several time windows without an enormous amount of work.

Test_user,

This seems like an excellent way to do it. Using Mod() has advantages and disadvantages, I think.

The main (and perhaps the only) advantage of using Mod() is that it is reproducible, meaning “mod(stockid, 4)=1” will produce the same universe each time.

In this regard it is like using random.seed() in Python, as I am sure you know.

The limitations of using Mod() are pretty significant, however: you can only produce 4 universes, and each universe is only 1/4 the size of the original. Four smaller universes.

This is a creative way to get something similar to Python’s random seed and pretty cool that someone at P123 thought of that! I first saw Georg use it but he may not have been the first and others probably thought of it independently.

There is a technique called subsampling that is often used in machine learning; it is what XGBoost uses, and P123 will be providing an implementation of XGBoost shortly. To summarize the mathematics of subsampling: it is a really good thing. It is probably about as good as bootstrapping (which has been discussed previously in the forum) and it is less computationally intensive than bootstrapping.

If you turn on subsampling with XGBoost in P123 you will technically be performing “stochastic gradient boosting” on the P123 platform. You may want to become familiar with subsampling for that reason alone; you may get better results with P123’s AI or machine learning.

Subsampling has been around for a long time, much has been written about it, and pretty much everyone writing about it likes it. And IMHO they like it for good reasons.

The advantage of subsampling is that each universe can be any fraction of the original universe, and there can be as many universes as you want.

True, it would be better to find a larger sample, and if you have a larger sample you should use it. Maybe you do; maybe you also try your models on European data, for example. I am not saying subsampling is perfect, or that there aren’t other things you should try to get a larger sample when you can. Subsampling is clearly limited in how much it can accomplish.

You would implement it by putting Random() < 0.5 in the universe rules to get a random universe half the size of the original. Run it as many times as you want to get as many different universes as you want. Stocks will show up in multiple universes, but they are selected randomly, alongside other randomly selected stocks, so the repeats are less of a problem, especially if this is repeated a large number of times.
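
A toy sketch of the idea in plain Python (synthetic data, not the P123 Random() rule): draw a random half of the universe many times and watch how much the statistic you care about moves from one subsample to the next.

```python
import numpy as np

rng = np.random.default_rng(42)
n_stocks = 1000
stock_returns = rng.normal(0.08, 0.30, size=n_stocks)   # synthetic "annual returns"

top_decile_means = []
for _ in range(200):
    # Draw a random half of the universe, the rough equivalent of Random() < 0.5.
    half = rng.choice(n_stocks, size=n_stocks // 2, replace=False)
    best = np.sort(stock_returns[half])[::-1][: n_stocks // 20]   # top decile of that half
    top_decile_means.append(best.mean())

print("top-decile mean across 200 subsamples:", np.mean(top_decile_means))
print("spread (std) across subsamples:       ", np.std(top_decile_means))
```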

Here is the original paper supporting subsampling: “Stochastic Gradient Boosting” by Jerome H. Friedman, 1999. Googling “subsampling” turns up newer, easier, and ultimately better explanations. The XGBoost documentation also discusses subsampling of the data (rows) and of the factors (columns).

But to summarize, they notice, as you have, that it can reduce overfitting. Random forests generally stick with bootstrapping; neural nets generally use dropout. Boosting and neural nets can also use other regularization techniques.

Mod() and subsampling are two good techniques that we can use at P123. And as you say, it is nice that one can run a sim (or multiple sims) with either one! P123 is a great platform; much can be done with it and more is coming.

I assure you the math behind subsampling is sound, and unless you need reproducibility and do not have the time to run this on more than 4 small universes, subsampling is the better way to go. If time and computing resources allow for more and larger universes, it is worth investigating further.

BTW, I wonder whether P123 uses pseudo-randomization with a random seed for its Random() function, and if so, whether they would want to expose control of (or the setting of) the seed to users (if there is any interest in using subsampling for sims and rank performance tests this way)?

Jim

You can tweak Mod(StockID,4)=1 to create as many small universes as you want. Mod(StockID,17) will give you 17 discrete universes (one for each remainder); Mod(StockID,2) will give you only 2. You could, if you want, create universes of different sizes by using Mod(StockID,100) and combining various remainders (e.g., Between(Mod(StockID,100),1,8) might be one universe, Between(Mod(StockID,100),9,21) another, etc.).
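
For anyone who wants to see the arithmetic, here is a small sketch of how these Mod-style rules partition IDs into discrete, reproducible buckets of whatever size you like (the StockIDs below are just illustrative integers):

```python
# Illustrative StockIDs only; P123's actual IDs are arbitrary integers too.
stock_ids = range(1, 1001)

# Mod(StockID,17)-style rule: 17 discrete, reproducible buckets of roughly equal size.
buckets = {k: [s for s in stock_ids if s % 17 == k] for k in range(17)}
print(sorted(len(v) for v in buckets.values()))   # each bucket holds ~59 of the 1000 IDs

# Between(Mod(StockID,100),1,8)-style rule: unequal-sized universes from one divisor.
small  = [s for s in stock_ids if 1 <= s % 100 <= 8]    # ~8% of the universe
larger = [s for s in stock_ids if 9 <= s % 100 <= 21]   # ~13% of the universe
print(len(small), len(larger))
```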

The problem with using Random is that your universe will change with every rebalance. That’s why you need something stationary like Mod(StockID,X) instead.

I’ve also been creating universes made up of subindustries. This is not random; I choose which subindustries to put in each universe. This way I cover the entire spectrum of stocks while having quite different and discrete universes. The Mod(StockID) universes all resemble each other a good deal; universes composed of subindustries offer more variety (and therefore more checks to make sure your ranking system is universal).

Yuval,

I had not realized that the universe would change at each rebalance for the sim and I appreciate your pointing this out.

All,

I understand that there are times when one would want a static universe, for example when training and validating data over the same time frame. P123’s even/odd universes are another way of doing this; they are equivalent to using Mod(StockID,2), I think.

Also, I can imagine situations where I might use a rule that combines both methods for validation. For example, a rule like Mod(StockID,N)=2 & Random < 0.5 could define the validation data over the same time frame, while the training universe uses Mod(StockID,N)=1, keeping the training and validation data separate. It might be best to use the screener with no buy or sell rules, or the rank performance test, when using this method.

One could also consider using the rule Mod(StockID,N)=1 & Random < 0.5 on the training universe when developing the system.

If you look at Friedman’s paper, or at how XGBoost or a random forest handles this, it is clear that they are NOT subsampling half of the universe and THEN sticking with that half throughout.

XGBoost has a lot of fans, including Marco, it seems, as he and his AI specialist will be making XGBoost available to P123 members according to a recent post. XGBoost clearly does not stick with half of the universe throughout; rather, the universe is reshuffled frequently. Here is a quote from XGBoost’s documentation:

“Subsampling will occur once in every boosting iteration.” In other words: frequently, since XGBoost runs many boosting iterations in its algorithm.

This also may be interesting:

“Typically set subsample >= 0.5 for good results.” In other words, you may not want to use less than half of the universe unless you are forced to.

So random (and frequently re-randomized) subsampling has multiple established uses, and some people will be using it on P123 if the full functionality of XGBoost is made available. If you do consider using it, there are papers and mathematical arguments establishing the best ways to use it, such as subsampling half or more of the universe when possible.
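
For reference, these are the knobs being quoted above. A minimal XGBoost sketch with placeholder data; the parameter values are illustrative assumptions, not recommendations:

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))                     # placeholder factor exposures
y = X[:, 0] * 0.02 + rng.normal(0, 0.1, size=5000)  # placeholder forward returns

model = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    subsample=0.5,         # rows: a fresh random half of the data at each boosting iteration
    colsample_bytree=0.5,  # columns: a random half of the factors for each tree
)
model.fit(X, y)
print(model.predict(X[:5]))
```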

Again, people should use any (and all) methods that suit their needs. But you may find that Random() < 0.5 is adequate or even preferable in some situations. Maybe you would use both in some situations.

I look forward to the release of boosting on the P123 platform, and possibly to a discussion from the AI specialist on the best uses of subsampling, perhaps generalized to cover the uses of subsampling and bootstrapping outside the XGBoost program.

Personally, here is one situation where I would use Random() < 0.5: I would train on all of the data up until a certain date. Then, to test (or validate), I would start on a date after the last date used for training and apply Random() < 0.5 to that test data, perhaps recording each result in a spreadsheet.

This would give you a range of results that you could expect to see out of sample. It is similar in spirit to a Monte Carlo simulation, but actually much better: it is like bootstrapping the results, which has a long history (with multiple books and papers recommending the method), and it avoids the assumption of normality in the resulting confidence interval because it is a non-parametric method. It is well studied and widely accepted as having advantages at this point.
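
A minimal sketch of that bootstrap, with a placeholder return series standing in for the recorded out-of-sample results:

```python
import numpy as np

rng = np.random.default_rng(7)
oos_returns = rng.normal(0.002, 0.03, size=300)   # placeholder: periodic out-of-sample returns

# Non-parametric bootstrap of the mean periodic return.
boot_means = [rng.choice(oos_returns, size=len(oos_returns), replace=True).mean()
              for _ in range(5000)]
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap 95% CI for the mean periodic return: [{low:.4f}, {high:.4f}]")
print(f"conservative planning figure (lower bound): {low:.4f}")
```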

This does address the specific question in the original post. In fact, to address that question completely, I would use the above method and take a lower confidence bound as the likely result of each strategy, then adopt the strategy with the better lower bound. This is an extension of what I said above about shrinkage.

Anyway, I would probably use both Mod() and Random() if I were using either of them for my studies at the present time.

Edit: so I think I have convinced myself to use this, as detailed above, to test the out-of-sample results of a ranking system I developed using data up until 2015, running the test from 2015 onward as described.

And the first impression: wow!!! It definitely works for validation. One strategy held up; the other, which worked well on the entire universe, declined significantly on the random universes, in a way I might have expected but had not been able to demonstrate.

And it answered the question of the original post for my strategies: the strategy with the smaller amount of data declined more (both in the average and in the lower bound).

Jim

If you use Random < 0.5, your universe will change with every rebalance, so your slippage will be enormous and your holding period will be extremely short. Imagine buying a stock because it ranks highly, selling it a week later because it is no longer in the ranked universe, and then buying it again a week after that. The results of your screen or simulation will be crazy. XGBoost changes universes with every iteration, which is great! But using Random will change the universe hundreds of times within every iteration.

Yuval,

Perfect! And 100% correct, I think.

Have you looked at or considered what “force positions into universe” does in a sim? One could halve the number of positions in the sim and adjust the RankPos accordingly, probably using as few buy or sell rules as possible (other than RankPos < 1/2X).

Regardless of what one ultimately decides about using this in a sim, I am less concerned about what you correctly observe happening in a screener or rank performance test, where everything is rebalanced without slippage each rebalance period.

Just make sure to halve the number of stocks in the screener, and probably halve the number of buckets in the rank performance test.

There will be increased variance, and that is actually part of what you are trying to accomplish, I think; that is essentially the point of (and a purpose for using) regularization.

Anyway your observation is 100% correct, I think, and very much appreciated.

Best,

Jim

Thanks Jim,

I’m looking forward to the release of ML at P123, though I suspect I’ll have a lot to learn. The use of mod() during rank optimization is both helpful and a bit simplistic, I guess, though it has worked reasonably well out of sample so far. Are there any books you could recommend for learning more about machine learning (that are relevant to P123)?

Hi Test_user,

Please do not think I dislike Mod(). I would not even call it simplistic. If we are going to label it I think I will call it an excellent, creative and intelligent use of regularization.

Are there (sometimes) better uses of regularization? I do think the answer is yes. Random forests have gained a great deal of popularity because they use bootstrapping, which is perhaps the best way to regularize. In addition, they randomly select a SUBSAMPLE of the factors for each tree and generally run at least hundreds of trees, combining bootstrapping of the data sample (basically the rows of the uploaded CSV) with subsampling of the factors (the columns). That is why they have become so popular; presumably this is working for some machine learning problems.
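
As a concrete (if simplified) sketch of that combination, here is a random forest in scikit-learn on synthetic data, with bootstrapped rows and a random subset of columns at each split:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 15))                         # placeholder factor table (rows x columns)
y = X[:, :3].sum(axis=1) + rng.normal(0, 1, size=2000)  # placeholder target

forest = RandomForestRegressor(
    n_estimators=500,
    bootstrap=True,       # each tree is fit on a bootstrap resample of the rows
    max_features="sqrt",  # each split considers only a random subset of the columns
    random_state=0,
)
forest.fit(X, y)
print(forest.score(X, y))
```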

But I do like Mod().

As far as books about machine learning go, I get the impression that you know some Python? I think this one briefly covers about everything, and you should have no problem running the examples: Introduction to Machine Learning with Python: A Guide for Data Scientists

Boosting can be better than random forests, but it is more difficult to understand and has more hyperparameters to learn and work with. Here is a really excellent book for understanding the theory behind boosting: Machine Learning With Boosting: A Beginner’s Guide

The cover and the title may make it seem overly simplistic. I think it pretty much fully explains the theory.

I will have to relearn neural nets myself; I am not sure I ever fully grasped them in the first place. This is in part because boosting is a close competitor to neural nets for “tabular data,” which roughly means data that can be put into a spreadsheet (and that describes most P123 data).

Neural-nets are good for self-driving cars and recognizing pictures of cats on the web. I do think with a lot of experience one can get a neural-net to compete with and possibly surpass boosting.

I know XGBoost. I am not sure what, exactly, will be offered for neural-nets. Maybe we all can learn that together at P123.

I hope that helps some.

Edit: BTW, if you have ever listened to Pandora that is an AI using boosting to figure out what kind of songs you might like.

Just speaking for myself, I do not know what machine learning tool Amazon music uses but I think it knows me better than I know myself. Now when people ask me what type of music I like I say Amazon figured out that I like………

Jim

Jrinne, I’ve been following your posts with great interest. I’ve never worked with a system that used ML, and I’m also not sure how it will work in P123. Do you know when ML will be available in P123?

Much of what you write is WAY above my head. :slight_smile:

I hope there will be some guidance in the use of ML on P123 - preferably via webinar or video.

I see some have asked for books on this, but do you know of any published videos that explain how ML works?

Whycliffes,

I apologize for the apparent complexity of my posts. But my general idea is already being used and advocated by P123 members in the forum, and that use and advocacy does not seem rare to me.

The main thing I was suggesting was an alternative to using Mod(). I don’t think Random() < 0.5 is any harder or simpler than Mod(), and my reply was to test-user, who is already using Mod(). I don’t necessarily think one is better than the other; indeed, in my opinion either can have an advantage over the other in certain situations.

Mod() was test-user’s idea in this thread. As I understand things, both test-user and Yuval make frequent use of it. AND FOR GOOD REASON.

I basically agree with both Yuval’s and test-user’s posts on the usefulness of what they do, and I have found the modifications I proposed in this thread useful for some of what I am doing now.

For example, I will probably be running a port with more stocks than I was planning in the near future, which is basically test_user’s point and a direct response to the original question in this thread. I believe (agreeing with test_user) that this kind of thing (i.e., regularization) can remove the illusions that a small data sample can create.

Random() was popular a while back; I think Marco popularized it at the time, and he was using it for something similar to what I am proposing, I think. He liked to keep increasing the randomness, e.g., Random() < 0.6, then Random() < 0.7, to see when a system fell apart, as a test of robustness. I am not using it exactly like that, but I think what I am doing serves a similar purpose. People have been adding randomness to their data for a long time, and they keep rediscovering different ways to do it simply because it is useful. Some have put a name to this (regularization) so they can discuss it in the literature and build on previous knowledge.

Probably just me, but I am not sure that I hate having some of the ideas peer-reviewed in the literature either.

Marc Gerstein at one point proposed adding randomness to all of P123’s data. This was to reduce overfitting, which is the usual reason for regularization or for adding randomness to data. To the best of my knowledge he never called it regularization. I do not think it was a terrible idea, but I prefer to control the randomness I am introducing, and to have the option of removing it when I choose. At the end of the day, all things considered, I wasn’t a fan of introducing random errors into the data that could not be removed. Whatever you may think of the idea yourself, it does illustrate how people keep discovering the usefulness of regularization, and how convinced they can be about its usefulness.

The main thing I did that might seem complex was putting a name to all of this in the forum. One word: regularization.

I guess I expanded on that with some references for anyone wanting to explore what has already been written about this outside of this forum and I pointed out that if we used ANY of the machine learning at P123 we would probably be using regularization.

But I was not suggesting anyone needs to use random forests or boosting with P123 data to be a successful investor. The books I recommend were at test_user’s request. Any video I mention below will be at your request.

In fact, while I make use of data downloads and run some of the data through Excel and Python, I am developing an appreciation for what can be done without integrating machine learning into the P123 platform, basically because I now see that things like regularization do not always have to be done in a random forest or with boosting.

A lot of people besides me do a thing or two in Excel, BTW. And if they are doing a regression, they are doing machine learning. Correlations count as statistics, last time I checked.

While not everyone on the forum will want to remember the word regularization I think it is not too complex of a word or idea to be in the forum—especially when it is already being widely used and advocated.

As far as videos go, most (maybe all) of these topics are on YouTube. Here is a StatQuest video about ridge regression (which uses regularization): Stat Quest
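
If you want to see what that regularization actually does, here is a minimal scikit-learn sketch on synthetic data comparing plain regression with ridge regression; the alpha value is an arbitrary illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 2.0 + rng.normal(0, 1, size=100)   # only the first factor matters

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)              # larger alpha -> stronger shrinkage toward zero

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
```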

You just have to Google it.

Coursera is an online university, but many of the courses can be audited for free. You can learn all you need to know about neural nets there for free (I recommend Andrew Ng’s course), for example. BTW, I think Andrew Ng works for Google and is basically promoting TensorFlow and helping to train the next generation of employees at Google and in the tech industry; I think that may help explain why such a great lecturer can be seen for free with such a complete course. TensorFlow is a Google product that is now free to use, and Google even provides free servers at Colab to run TensorFlow. So there is a pattern for sure, but I may not fully understand the reasons.

BTW, dropout is frequently used for regularization of neural nets, but Andrew Ng likes the same method used for ridge regression (mentioned in the video above). It is EVERYWHERE, including some of Google’s recommender systems. But I am happy if anyone thinks it is just me promoting it; I would be proud if I could legitimately take credit for regularization’s near-universal acceptance here at P123 and elsewhere.

Anyway, I don’t think my goal was anything more than providing a name for, and support for, what Yuval and test-user are doing, and maybe suggesting an alternative way to accomplish the same thing that people can look at and adopt in some situations (or not, if they prefer).

Best,

Jim