Are we overfitting?

[quote]
5. Exclude top performers
P123 enables you to exclude the stocks which perform best in a simulation. If a model’s performance crashes down after removing its best performers, odds are it was “lucky” (a.k.a. over-fitting).
[/quote]

How does one exclude top performers?

Run the simulation. Go to the Transactions, Realized tab. Click on the ‘Pct’ column so the top performers are at the top of the list. Make a list of the tickers of the top performers. If you want a large number of tickers, like the top 20, then download the list of Realized Transactions so you can copy the top 20 tickers rather than typing them.

Click Re-Run and go to the Periods and Restrictions tab. Paste the ticker list in the Restricted Buy List. Those tickers will not be bought when the sim is rerun.
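If you download the Realized Transactions file, a few lines of pandas can pull out the top tickers so you don’t have to retype them. This is only a rough sketch: the file name and the ‘Symbol’ and ‘Pct’ column names are assumptions, so adjust them to whatever the export actually uses.

```python
# Minimal sketch: grab the 20 biggest realized winners from a downloaded
# transactions file so they can be pasted into the Restricted Buy List.
# Assumes the export has a numeric 'Pct' column and a ticker 'Symbol' column.
import pandas as pd

trades = pd.read_csv("realized_transactions.csv")
top20 = (trades.sort_values("Pct", ascending=False)
               .head(20)["Symbol"]
               .tolist())
print(", ".join(top20))  # comma-separated list to paste into the Restricted Buy List
```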

Another option is using a Buy rule like Random <= .5 to eliminate 50% of the tickers in the universe at random, and then running the simulation several times. The Optimizer works well for this: just set it up to run the same combination many times, and each run will apply the Random rule. If the returns are consistent, that is a good sign that the system is not over-optimized.
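To make the “are the returns consistent?” part concrete, here is a minimal sketch of the same check outside the platform. run_backtest() is a hypothetical stand-in for whatever produces a return for a given ticker list; it is not a real P123 call.

```python
# Sketch of the consistency check: repeatedly drop ~50% of the universe at
# random, rerun the same strategy, and look at the spread of the results.
import random
import statistics

def consistency_check(universe, run_backtest, n_runs=20, keep=0.5, seed=42):
    rng = random.Random(seed)
    returns = []
    for _ in range(n_runs):
        sample = [t for t in universe if rng.random() <= keep]  # like Random <= .5
        returns.append(run_backtest(sample))
    print(f"min {min(returns):.1%}  max {max(returns):.1%}  "
          f"stdev {statistics.stdev(returns):.1%}")
    return returns
```

The number to focus on is the worst run, not the average.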

Thanks Dan. That is an interesting approach to anti-curve-fitting.
I have been using Yuval’s suggestion of creating multiple (5 in this case) randomized universes like this:
Mod(StockID, 5) = 0
Mod(StockID, 5) = 1
… and so on, through Mod(StockID, 5) = 4
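For what it’s worth, Mod(StockID, 5) = k just partitions the universe into five disjoint, roughly equal buckets keyed off the stable StockID. A toy sketch with made-up IDs:

```python
# Sketch: k = StockID % 5 assigns every stock to exactly one of five
# sub-universes, so the splits never overlap.
stock_ids = [101, 102, 103, 104, 105, 106, 107]  # toy IDs for illustration
subuniverses = {k: [sid for sid in stock_ids if sid % 5 == k] for k in range(5)}
for k, ids in sorted(subuniverses.items()):
    print(k, ids)
```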

But to be honest I am not sure I am using it effectively.
If I have 5 universes, I will test against 3 or 4 of them, and then try the remaining one or two at the end, which always perform much worse and prove I am overfitting, big time. This leaves me scratching my head about what to change in my model.
I think what this means is that the out-of-sample universe that is tested at the end is what will be closest to my actual future results.

Using your approach, it seems that if you run enough iterations, you will end up with the same overfitting problem.
If I do it 1MM times, isn’t that essentially the same as not splitting up the universe at all?
And isn’t testing all about running as many iterations as possible, using as many different combinations of factors and weights as possible?
I think this may be my biggest problem in quant modeling.

It is indeed a problem. Can overfitting be entirely avoided? I don’t think so.

My approach is a little different. I optimize (somewhat roughly) for each of my five subuniverses and then combine those optimized systems for my final system, which will (obviously) backtest very well. There might be significant differences between the optimized systems but they get averaged out. So, in the end, I don’t test on hold-out data. That could be a bad thing. My problem is this: if I were to reserve some data as hold-out data that isn’t tested, what will my threshold be for performance? I’m sure that it’ll be worse. How much worse is acceptable? What will I get from doing this?

Hi Tony,
Regarding the approach I mentioned where I use Random in the buy rules… Looking at the average return from the set of sims is not meaningful, for the reason you gave. You want to see whether the returns are consistent. Look at the return from the worst-performing sim in the group: would you be satisfied if your live port had the same performance?

Yuval, that is exactly my problem: what to do with hold-out data results that look crappy/good/whatever compared to my in-sample data.

Let me restate how I interpret what you said. You have 5 universes and you have a separate, optimized RS for each one.
And you average those 5 RS into one RS. How does one average ranking systems together?

Dan, I think I understand your approach now. Back to my original question: is there not a more automatic way to filter out the extreme outperformers?

On a general note, lots of very smart people have been doing quant modeling for a long time. Seems like an established best-practice method that everyone uses would have emerged by now.

I just want to add that while ideal, a holdout test set is difficult in practice.

If you look at value factors for the last 5 years, don’t you already have an idea of what you will find? There is no true holdout test set for factors you have used before.

Also, it is human nature to peek at the data before you are done. Or do a little tweaking later–after you have already seen the results in the test set.

For some pricing data–where one does not have some idea of what to expect–it is easier.

Jim

Jim, what method do you use to combat over optimizing?

Hi Tony,

So, I have no problem with what Yuval and Dan have said. And I was just trying to support the idea that holdout test sets are difficult. The other problem with holdout test sets is it takes more data to be able to use a test set.

But as far as what I do: I frequently use criss-cross validation, which I found in a textbook a long time ago. Validation is different from a holdout test set, BTW.

So, as an example, I recently trained a model (a random forest classifier) on data from 2006 until 2013. I then validated it on data from 2014 until now. Perhaps that is the “criss” in the nomenclature. Maybe this is the most reliable split, as the training data comes before the validation data in time.

But there is no downside to looking at the “cross”: in other words, training on 2014 until now and then validating on 2006 to 2013.

The other thing that is possibly significant in what I do is that I don’t really tune the hyperparameters much anymore (much of the time, anyway). For random forests, as an example, I just use the defaults (again, much of the time). This reduces overfitting.
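Purely as an illustration of the criss-cross idea with default hyperparameters (not a claim about my exact setup), something along these lines: it assumes a date-indexed feature table X and matching labels y, both of which are made up for the example.

```python
# Sketch of "criss-cross" validation: fit on the early period and validate on
# the late period, then swap. Default RandomForestClassifier settings, i.e.
# no hyperparameter tuning. X is a date-indexed pandas DataFrame of features
# and y the aligned labels -- both are assumptions for illustration.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def criss_cross(X, y, split_date="2014-01-01"):
    early = X.index < split_date
    late = ~early
    scores = {}
    for name, train_mask, valid_mask in [("criss", early, late), ("cross", late, early)]:
        model = RandomForestClassifier(random_state=0)  # defaults, no tuning
        model.fit(X[train_mask], y[train_mask])
        scores[name] = accuracy_score(y[valid_mask], model.predict(X[valid_mask]))
    return scores  # if "criss" and "cross" diverge a lot, be suspicious
```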

When there is no tuning of the hyperparameters, the difference between a test set and a validation set gets a little blurred. But if you have ever done anything like what you are doing now with the same data, you have looked at the data before. If you are honest with yourself, you will not pretend it is a test set.

Obviously, I do different things much of the time (e.g., I do not always use random forests). But the criss-cross is easy for a lot of methods, and sometimes spending too much time tuning the hyperparameters is nothing more than overfitting. This goes for most methods.

Hope that is at least somewhat clear if not helpful.

Best,

Jim

Jim, I have no idea what you just said but it sounds impressive.
Maybe I need to read the book you referenced.

Tony

Pretty basic. Take the systems that work best and average the weights of each node. Some nodes might be missing in some systems, so those would get 0% weight in those systems and maybe 2% or 8% weight in others, and you just use the average.
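In code terms, if each optimized system were just a mapping of node name to weight, the averaging would look roughly like this; the node names and weights below are made up.

```python
# Sketch: average ranking-system node weights across several optimized
# systems. A node missing from a system contributes 0% for that system.
def average_weights(systems):
    all_nodes = {node for weights in systems for node in weights}
    n = len(systems)
    return {node: sum(w.get(node, 0.0) for w in systems) / n for node in all_nodes}

# Toy example with two systems (node names are invented):
sys_a = {"EarningsYield": 8.0, "PriceMomentum": 2.0}
sys_b = {"EarningsYield": 4.0, "Accruals": 6.0}  # no PriceMomentum node -> counts as 0%
print(average_weights([sys_a, sys_b]))
# averages: EarningsYield 6.0, PriceMomentum 1.0, Accruals 3.0
```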

Thanks Yuval. That is very interesting. Is that something you came up with on your own or did you learn it from someone else? I recall reading in one of your past articles about some backtesting methods you picked up from one of the popular investing books. I can’t remember which one. Maybe O’Shaughnessy or O’Neil.

Yuval, I know you’ve written about this in the past. Just curious, though: say you take the full universe and optimize factors/ranks for the full universe. Would the sim results be roughly the same compared to breaking the universe up into 4-5 sub-universes with their respective optimized ranking systems, and taking their average? Then again, maybe it’s the consistency of factor weights across the multiple universes that shows more promise? I’d be curious to hear what your experience has been.

Thanks.

I don’t remember. I think I came up with it myself. Apologies to whomever I stole it from if not.

O’Shaughnessy does this: “We run 100 randomly selected subperiods . . . For each of the 100 iterations of the bootstrap test, we first randomly select 50 percent of the possible monthly dates in our backtest and discard the other 50 percent. We then randomly select 50 percent of the stocks available on each of those dates and discard the rest. This gives us just 25 percent of our original universe on which to run our decile analysis [bucket returns tests]. We do this 100 times for each factor and analyze the decile return spreads. . . . If we discovered that there were large inconsistencies in the bootstrapped data, we would have less confidence in the results and investigate whether there was any evidence of unintentional data mining inherent in the test.”

I did take some inspiration from this . . .
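A rough sketch of that kind of bootstrap, assuming a long-format DataFrame called panel with 'date', 'ticker', 'rank', and 'fwd_return' columns; all of those names are made up for illustration.

```python
# Sketch of an O'Shaughnessy-style bootstrap: each iteration keeps a random
# 50% of the dates, then a random 50% of the stocks on each kept date
# (~25% of the panel), and recomputes the decile return spread.
import numpy as np
import pandas as pd

def bootstrap_spreads(panel, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    dates = panel["date"].unique()
    spreads = []
    for _ in range(n_iter):
        kept = rng.choice(dates, size=len(dates) // 2, replace=False)
        sub = panel[panel["date"].isin(kept)]
        sub = sub.groupby("date").sample(frac=0.5, random_state=int(rng.integers(1 << 31)))
        deciles = pd.qcut(sub["rank"], 10, labels=False)
        by_decile = sub.groupby(deciles)["fwd_return"].mean()
        spreads.append(by_decile.iloc[-1] - by_decile.iloc[0])  # top minus bottom decile
    return spreads  # large inconsistencies across iterations are the red flag
```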

I think they’d be pretty similar. It helps me sleep at night to know that my strategy worked well not only on the main universe but on five random subuniverses. Maybe I’m subjecting myself to too much extra trouble. I don’t know. I’ve been doing it this way for a while now.

So let’s say you’re testing on your whole universe but using five times the holdings you normally would. Do you optimize the weights of your ranking system by increments of less than 2% or 2.5%? That smacks to me of curve-fitting. If not, you’ll never get more than 40 or 50 factors, since the weights have to add up to 100%. (That is, unless you use composite nodes . . .)

If, on the other hand, you optimize on five different universes, you get a variety of different weights that you can average and you can get a lot more than 40 or 50 factors. If you’re like me and you think the more factors the better, then that’s a good thing. It also gives me a valuable perspective on how mutable my ranking system can be and how differently it can work for different groups of stocks.

Another thing I do is I look for statistical ties. You take all your results, find the standard deviation, multiply it by 1.96, and divide by the square root of the number of tests you did. Then you look at your top few results. If the difference between them is less than that number I just told you about, then they’re statistically tied, and you can average not just the best, but the second best, third best, and so on.

(I don’t remember how I came up with the 1.96 times part. I must have read it somewhere.)
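In formula terms, the tie threshold is 1.96 times the standard deviation of all the results, divided by the square root of the number of tests, which is roughly a 95% confidence half-width. A tiny sketch with made-up numbers:

```python
# Sketch: treat results as "statistically tied" with the best one if the gap
# is smaller than 1.96 * stdev(all results) / sqrt(number of tests).
import math
import statistics

def tie_threshold(results):
    return 1.96 * statistics.stdev(results) / math.sqrt(len(results))

def tied_with_best(results):
    threshold = tie_threshold(results)
    best = max(results)
    return [r for r in results if best - r < threshold]

annual_returns = [0.21, 0.20, 0.198, 0.17, 0.15, 0.12]  # toy numbers only
print(tied_with_best(annual_returns))  # the results you could average together
```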

What would certainly be different–and maybe better–than dividing your universe randomly would be to divide it by subindustry or size or something else and then take the average of the optimized systems. You’d get more variety that way. The key is to have relatively equally sized universes and not have stocks migrate from one to another.

The goal here is laid out in this article: The Magic of Combination: How Mixing Strategies Can Improve Results - Portfolio123 Blog

Long-term OOS performance and a causal explanation for the edge is one answer.

My best long term OOS models are small cap models.

They have done well since 2011.

They do well because there is a reason for it. Big funds cannot play this stuff, and value momentum (Olikea ranking!) is one of the most stable factors out there, especially in the small-cap area.

Hard to trade, yes: regular 20% DDs and occasional 50% DDs.

A variation of that Olikea ranking system (adding industry momentum, accruals, some quality factors, and EPS estimates) has been doing well out of sample since 2019.

All said, this is just the beginning. The best models will not help you if you cannot trade them; the human factor is the much harder thing to master.

I am working on a project which will take those systems and find an allocation based on signal strategies as input.
Those signal strategies define the cash level and which systems to trade (e.g., how capital is allocated to cash or to the different strategies in the set).

I am asking myself whether this is not too much optimization; we will see in the OOS performance!

Yuval,

Just in case you happen to be interested, 1.96 has its own article on Wikipedia:

“In probability and statistics, 1.96 is the approximate value of the 97.5 percentile point of the standard normal distribution.”

Link: 1.96

Jim

All,

For new members who have not been around long: there are posts above from someone who lost all of the money he had for investing when there was a drawdown in his account. He has shared this in the forum before, so I am not revealing any secrets. I am grateful that he has freely shared so much on this forum. He started investing again with money from his income as he continued to work, and he is doing well now by all accounts (as I understand it with the information I have). Good. He is sharing the perspective he has gained about drawdowns above. Double good.

Thank you everyone for sharing your experiences. Much can be learned from this forum, including an understanding that people can learn different lessons from the same experiences, or at least come away with different perspectives. It also helps to keep in mind that, like everything on the internet, some details are understandably left out for brevity. It is not the internet’s fault (rather it is mine) when I can’t remember what has been written before (or what was left out entirely at times) and bring some perspective to what is being written now; this applies to news stories even more than it does to the P123 forum.

Thank you everyone for sharing and it is a true learning experience that goes beyond investing at times.

For those who think Daniel Kahneman (Nobel Prize winning author of Thinking Fast and Slow) might occasionally have some words of wisdom: he calls the inability to bring outside information into a discussion (or thought process) “WYSIATI,” an acronym for What You See Is All There Is. I am not going to apologize for not making WYSIATI a personal policy if anyone objects to some perspective on this subject.

Best,

Jim

That’s great Yuval, thanks. Your SA piece clarifies things for me, but a couple of questions (I also commented on your SA piece):

  1. If you took the top 5 or 6 ranked stocks from each strategy (for a total of 20-24), how do the results differ from the combined strategy of 25 stocks?
  2. Are the ranking systems you used “public” by chance? Would be great to see the details. If not, I get it :wink:

Thanks,
Ryan

I can’t find those ranking systems, I’m afraid. Sorry about that.