Over-Optimization

Hi Georg,

Thank you for this contribution (great as usual).

Have you tried adding a layer of seasonality on top of your current system? For example, rotating into a defensive ETF like XLP in the summer, as you have pointed out in other posts.

Also: the results would probably be somewhat weaker if one did not want to take any risk on the long-term evolution of interest rates (historically falling over the backtest period, but who knows about the future). It would be interesting to see what happens when replacing UST with, say, SHY (or IEI?).

Jerome

It was more an exercise to see whether I could add factors that made sense and that produced the desired result of putting LULU in the 99th percentile bucket, i.e., “What would a company’s fundamentals look like if its brand were so strong it could convince people to pay $150 for sweatpants?” But as you go through the trial and error of adding and removing factors to achieve that result, the second question comes up a lot.

I haven’t looked at KELYA or any similar companies lately so I have no opinion as to whether it is a value or a value trap.

But I would suggest that if it is, indeed, “a superb value no matter how you look at it . . . massively underpriced by every measure,” adopt a guilty-until-proven-innocent framework. IOW, start with the assumption that it is a value trap and don’t change your mind until you find good reasons to do so; if you can’t find anything one way or the other, move on to other things.

It’s not as if you alone have discovered that the stock is priced low relative to a bunch of metrics. Everybody else sees it too. So why aren’t they jumping on the stock? Usually, if it’s a bona fide value play, it’s because they are underestimating something about the company. If you know why you disagree with Mr. Market, that’s fine. But if you disagree without knowing Mr. Market’s side of the story, that can get dangerous.

A lot of us don’t have the inclination to dig into the details stock by stock all the time. But if we don’t do it the old-fashioned way, we should at least do it the new-fashioned way (algorithmically). We can do that with multi-style ranking systems, but that’s an iffy approach: stocks can bubble up to the top even when they are lacking in key criteria, as long as they are extreme in some other. Hence the benefit of screening/buy rules; you can set minimum thresholds for growth or quality that, if breached, will disqualify a stock no matter how great a value it seems to be.

Of course, you can tell whether you have overfit your ranking system if you have been using it for a while. You can tell whether you have added a feature (or features) that has caused overfitting.

For example, there are some systems that I have been using since 2014.

I simply do a rank performance test from 2014 until now and look at the performance. I then remove a factor (just a single factor at a time) and renormalize the weights of the other factors.

If performance improves over the out-of-sample period when you remove one of the factors, then perhaps you have overfit your model.

This is cross-validation. More specifically, it is cross-validation for feature selection.
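
To make that concrete, here is a minimal sketch of the leave-one-factor-out check in Python. The data layout (a table of per-stock factor ranks with forward returns), the file name, the factor names, and the weights are all hypothetical; only the procedure of removing one factor at a time, renormalizing, and re-scoring the 2014-to-now period comes from the description above.

```python
# Minimal sketch of the leave-one-factor-out check. The file name, column
# names, and weights below are hypothetical placeholders.
import pandas as pd

# One row per stock per date: factor ranks (0-100) plus next-period return.
ranks = pd.read_csv("factor_ranks.csv", parse_dates=["date"])  # hypothetical file

factors = {"value_rank": 0.4, "quality_rank": 0.3, "momentum_rank": 0.3}

def top_bucket_return(df: pd.DataFrame, weights: dict, pct: float = 0.01) -> float:
    """Average forward return of the top `pct` bucket by composite rank, per date."""
    total = sum(weights.values())
    composite = sum(df[f] * (w / total) for f, w in weights.items())  # renormalize
    scored = df.assign(score=composite)
    per_date = scored.groupby("date").apply(
        lambda g: g.loc[g["score"] >= g["score"].quantile(1 - pct), "fwd_ret"].mean()
    )
    return per_date.mean()

oos = ranks[ranks["date"] >= "2014-01-01"]          # out-of-sample period
baseline = top_bucket_return(oos, factors)
for f in factors:
    reduced = {k: w for k, w in factors.items() if k != f}
    if top_bucket_return(oos, reduced) > baseline:
        print(f"Removing {f} improves out-of-sample performance -> possible overfit")
```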

So, some might debate what to do going forward. Removing the factor and re-optimizing the system over the max period (2000 or 2005 to now) would be one possible way.

But whatever you do, you may not need the “experts” at P123 to tell you whether you have overfit your system. And you can do this with a hold-out period before you start to run the system live.

If you do this with a hold-out period, you would optimize from 2000 to 2013, say. Once optimized, you would run the system from 2014 to now and remove factors one at a time, renormalizing the others, as outlined above.

If the system’s performance improves after removing some factors, you would consider (it is up to you) not using those factors in your final system. You would then re-optimize the system over the max period, from 2000 (or 2005) until now, without the factors that might be causing overfitting.

Or not. But if not, you must not believe that overfitting is a potential problem. You must think it is just a fun topic for the forum, I guess.

Make no mistake, this is from Statistics 101 or a first course in Econometrics. It is a simple topic that belongs in the forum.

And honestly, someone (anyone) at P123 should be more than familiar with this topic. Some upper-division undergraduate material, along with a few things that use a standard deviation, would not be a bad thing.

In any case, there should be no resistance to an idea that is USED EVERYWHERE.

To be fair, this has been on the forum before. Denny Halwes advocated this, but he used even/odd universes. That is not appropriate because of a non-PIT type of problem. Support for this being a problem can be found in “Advances in Financial Machine Learning” by Marcos López de Prado. Even/odd splits work with some types of data, but not stock data.

Denny was ahead of his time. But I wonder whether his Designer Models didn’t have some overfitting problems. Perhaps they had problems from the use of even/odd universes.

Whether you use multiple regression, kernel regression, econometrics, or the P123 method with stock data, a pro or an academic will use something like what is outlined above. Walk-forward validation is a more advanced, but similar, method. The main advantage of walk-forward and related methods is that they avoid the non-PIT problem.

-Jim

Hey Yuval, is there a way I can PM you? Or, if you have my email, feel free to email me a way to contact you. Thanks,

Michael - My email is yuval@portfolio123.com.

Jim -

I’m extremely grateful to you for your discussion of walk-forward optimization.

From what I understand about walk-forward optimization (I’ve found a few articles on it on the Internet, one focused on sunspot prediction), this would be the process for doing so if we had data going back to 1999 and if the minimum optimization period were eight years.

1. Optimize a strategy based solely on the data available for the first eight years (1999-2006). Test it on the 2007-2008 period.
2. If it “works,” optimize it on the 2001-2008 period. Test it on the 2009-2010 period.
3. If it “works,” optimize it on the 2003-2010 period. Test it on the 2011-2012 period.
4. If it “works,” optimize it on the 2005-2012 period. Test it on the 2013-2014 period.
5. If it “works,” optimize it on the 2007-2014 period. Test it on the 2015-2016 period.
6. If it “works,” optimize it on the 2009-2016 period. Test it on the 2017-2018 period.
7. If it “works,” optimize it on the 2011-2018 period. Use it.
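
Expressed as a sketch (assuming nothing beyond the dates above), the schedule is an 8-year optimization window followed by a 2-year test window, rolled forward 2 years at a time:

```python
# Prints the rolling walk-forward schedule described above: an 8-year
# optimization window, a 2-year test window, stepped forward 2 years.
window, step, start, last = 8, 2, 1999, 2018

train_start = start
while True:
    train_end = train_start + window - 1
    test_end = train_end + step
    if test_end > last:
        print(f"final: optimize {train_start}-{train_end}, then use it")
        break
    print(f"optimize {train_start}-{train_end}, test {train_end + 1}-{test_end}")
    train_start += step
```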

Here’s what I don’t understand. What do you do if the strategy is a total failure in one of the two-year out-of-sample periods? What do you do then?

Yuval,

I mostly recommend the simple test for overfitting I outlined above. I believe this works.

I might add that it very often says a factor belongs in a system; that is, it can confirm that all of the factors belong in the ranking system.

Walk-forward has been too time-intensive for me to do on P123. It may work for you. You are truly more organized with your spreadsheets than I am; that is just a fact and I am sincere about it.

I do think it would only take a little more Python ability than I have to do this. In fact, I believe it is built into scikit-learn. So I think you and/or P123 could use this. I have severe data availability problems; we all do. I understand you have your own limitations, and I don’t need the details; they may not be different from mine.

With the proviso that I have never completed a walk-forward validation, let me give my understanding. There are also better and more complete sources than anything I could do here on my best day.

I think I would do best with a P123-like example. Walk-forward would be as if I optimized up to January 1, 2014, then used the system to select 25 stocks for the next week and recorded their performance.

Next you re-optimize with all of the data INCLUDING THE FIRST WEEK IN JANUARY 2014. You use all of the data to make predictions for the second week in January and record the results.

Repeat. You can see that doing this weekly would be hard for me. But it is not beyond what Marco is capable of.
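
A bare-bones sketch of that weekly loop follows; `optimize_ranking`, `top_n_stocks`, and `weekly_return` are hypothetical placeholders standing in for whatever optimizer and data access you actually have.

```python
# Sketch of the weekly walk-forward loop: refit on everything available,
# pick next week's 25 stocks, record their return, repeat. The three helper
# functions are hypothetical placeholders.
import pandas as pd

weeks = pd.date_range("2014-01-06", "2018-12-31", freq="W-MON")
weekly_rets = []
for week in weeks:
    system = optimize_ranking(data_through=week)     # refit on ALL data up to this week
    picks = top_n_stocks(system, as_of=week, n=25)   # top-25 picks for the coming week
    weekly_rets.append(weekly_return(picks, week))   # record how those picks did

# Judge the ENTIRE walk-forward record, not any single week.
total = pd.Series(weekly_rets).add(1).prod()
print("annualized:", total ** (52 / len(weekly_rets)) - 1)
```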

But in the end you just see how the ENTIRE SYSTEM WORKS. How did the stock picking go OVERALL?

Having some bad periods is to be expected and would not be a reason to throw out a system.

I might add that I have been overly concerned about metrics recently. If the annualized returns are good, look closer in your usual manner. Just look at THE ENTIRE COMBINED RESULT.

EDIT: It is as if you manually put the top 25 predictions into a port (selling the ones that were no longer in the top 25). SEE HOW THE ENTIRE PORT DOES.

I think walk-forward adds to my simpler method for quants who constantly re-optimize. The simple method I described is just as good for those who never re-optimize. If someone re-optimized once a year, then walk-forward could probably add to my simple method and still be doable.

I am out of town using an iPhone. I would be surprised if I did not miss something in your post or make more than a few autocompletion errors or just write poorly. Let me apologize ahead of time.

I will assume you are looking at this for your personal use. I do think that in the FAR FUTURE (not pertinent to this discussion) P123 could help some beginners by automating some of this. Not for discussion here.

I also think some of what you do when looking at persistence duplicates some of this already. I have long had a theory that this is one of many reasons your systems do well. But I could be wrong (obviously).

I do think that if you could get a general understanding of cross-validation and walk-forward, there would be something there for all of us.

Use this or any modification of it, and the result IS OUT-OF-SAMPLE. That is the biggest benefit.

Again, a bit disjointed (more than usual) on an iPhone, but I hope it helps a little.

-Jim

Yuval,

I have done a terrible job of reading your post and answering your specific questions.

When you are done, you optimize over EVERYTHING STARTING IN 1999, or as far back as you think is relevant. The port that runs with real money uses the system optimized over that full period.

The purpose is to test out-of-sample performance. The test/validation sample is forward in time and out-of-sample. It gives a better idea of how the port will do if you use a holdout TEST sample.

It is also used for feature selection and to avoid overfitting when cross-validating.

I could have read your post better. I blame my small iPhone screen.

Happy to do email if you want, especially when I get to my computer.

If it is for your own use, there is no need for me to sell you on this. I am confident it could be useful (to anyone), however.

-Jim

Jim -

I’ve been thinking about this pretty hard. While the walk-forward practice makes perfect sense for machine learning, it’s more problematic for humans like me. The reason is that I can’t possibly go back to 1999 and figure out what factors I would have used back then. If I optimize for the 1999-2007 period using factors that I’ve developed over the last four years, I’m basically cheating. Of course it’s going to work out of sample, because the factors I’m using were developed using that “out of sample” period. I have no idea whether I’d have used share turnover in a model designed in 1999!

My conclusion is that in the end, as long as you’re optimizing, you can’t totally avoid over-optimizing. There are some good and bad practices, but you’re never going to be able to get as clean a result as you would if you were just feeding data into a machine and having it draw its own conclusions, blind to what comes after (which is perhaps the ultimate advantage of machine learning!). The only true out of sample for us P123 users would be if we were to test our systems on pre-1999 data or on Japanese or Indian stocks–or if we were to just sit on our systems without revising them and then see if they work over the next few years. (It is interesting to do this. Look at some systems that you designed two or three or more years ago and haven’t revised since, and test them over the subsequent period. How did they do?)

Uh, yeah.

If you already have out-of-sample data why would you need this anyway?

Kind of a “straw man” argument, where you set up a fallacy and easily knock it down.

I have factors that have proven themselves out-of-sample over the last 4 years and I just use them. So yeah.

But there are real things that can be done with this.

Any new system I am considering gets optimized until 2013.

It has to prove itself better out-of-sample (2014 until now) than what I am already using. If I optimize the new system over the max range, there is no way to compare it to what I am now using. This is the holdout test-sample use.

Furthermore, each factor that I am using in my live system can be looked at now to see whether it is really contributing or whether removing it over the last 4 years would have been better.

Including factors that are not contributing is the definition of overfitting. This can very easily be addressed. This is the cross-validation for feature selection use.
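
For what it is worth, here is a minimal sketch of that hold-out comparison; `optimize` and `backtest` are hypothetical stand-ins for your actual optimizer and rank-performance/sim runs, and only the dates come from the post.

```python
# Hold-out comparison sketch: the candidate is optimized only on data
# through 2013, then both systems are scored on 2014-to-now. The
# `optimize` and `backtest` functions are hypothetical placeholders.
candidate = optimize(candidate_spec, start="2000-01-01", end="2013-12-31")

incumbent_oos = backtest(incumbent_system, start="2014-01-01")  # already live
candidate_oos = backtest(candidate, start="2014-01-01")         # true hold-out

print("adopt candidate" if candidate_oos > incumbent_oos else "keep incumbent")
```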

If the topic of this thread is overfitting and there is a method to see if we are actually overfitting then… Of course, we need to switch subjects!!!

I am happy to discuss what I am actually advocating at any time.

There are also books and courses you could consider if it is really a topic of interest. My posts will not replace a University course.

Whoever might win a debate among members in the forum, P123 should start to bring in some new and well-developed ideas from the universities.

Maybe Marco is working behind the scenes with people who have more degrees in this area than I have. I truly hope so.

With that hope in mind let me say I appreciate the discussion.

-Jim

There is another way that I have used to check for over-optimization. After you have formed your trading system, look at the realized transactions and sort them by percent return to see your biggest winners. Put your top contributor on a restriction list and run the sim again. Do at least 10 iterations of this and compare how the performance looks (the number of iterations is universe-size dependent). This is different from rolling tests (I like those too). You can also break your sim up into time periods and perform the same exercise. If you want to be fair, you can add both your top and bottom contributors to the list. Granted, this doesn’t eliminate the bias we have with optimization, but it does ensure that you aren’t relying on a few good picks to make the returns. The reason you run iterations is that eliminating the top 10 all at once doesn’t have the same effect.

I have found several “fool’s gold” sims with this method alone.
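
A rough sketch of that restriction-list loop, with `run_sim` and its result object as hypothetical stand-ins for your backtester (assumed to return realized transactions and an annualized return):

```python
# Restriction-list test sketch: repeatedly ban the single biggest realized
# winner and re-run the sim, to see whether returns lean on a few lucky
# picks. `run_sim` and its result object are hypothetical placeholders.
restricted = set()
result = run_sim(restriction_list=restricted)
history = [result.annualized_return]

for _ in range(10):                                   # ~10 iterations, universe-size dependent
    best = max(result.transactions, key=lambda t: t.pct_return)
    restricted.add(best.ticker)                       # ban the top contributor
    result = run_sim(restriction_list=restricted)     # re-run without it
    history.append(result.annualized_return)

print(history)    # a steep drop-off suggests a "fool's gold" sim
```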

Right. In fact it INCREASES bias but reduces variance. Overall a good thing.

Cross-validation also increases bias but reduces variance.

An overfit system has low bias. But it does not work well out-of-sample. It is fit TOO WELL to the sample you are fitting (or optimizing) on. So if you can stop it from fitting to outliers, that will usually help too.

Anyway just a bunch of jargon confirming what you already know. I like it!!!

-Jim

I think the best way to avoid over-optimization is to not over-optimize. I am guilty of over-optimizing, but as I look back, I feel like optimization tools helped me to explore general patterns and develop intuitions on the drivers of returns. I don’t believe that the countless hours I spent tweaking weights and parameters did any good, except to reinforce the basics of good data science: a smooth optimum is better than a global optimum. I.e., it’s better to be approximately right than exactly wrong.

All this has been discussed at length in this thread:
https://www.portfolio123.com/mvnforum/viewthread_thread,6939

There is a statistical criterion called the Akaike Information Criterion (AIC).
http://en.wikipedia.org/wiki/Akaike_information_criterion
https://www.portfolio123.com/mvnforum/viewthread_thread,6939_offset,30

Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value. Hence AIC not only rewards goodness of fit, but also includes a penalty that is an increasing function of the number of parameters used. This penalty discourages overfitting.
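
As a sketch of how the penalty works in practice: for a least-squares fit with n observations, k fitted parameters, and residual sum of squares RSS, AIC reduces (up to an additive constant) to n*ln(RSS/n) + 2k, so the example below just compares that quantity across candidate factor models. The residual arrays are assumed to come from whatever fitting you have already done.

```python
# AIC for least-squares fits: n*ln(RSS/n) + 2k (up to a constant).
# Lower is better; the +2k term is the penalty for extra parameters.
import numpy as np

def aic_least_squares(residuals: np.ndarray, n_params: int) -> float:
    n = len(residuals)
    rss = float(np.sum(np.asarray(residuals) ** 2))
    return n * np.log(rss / n) + 2 * n_params

# Hypothetical usage: residuals from a 3-factor vs. a 5-factor model
# fit to the same return series; prefer the model with the smaller AIC.
# aic3 = aic_least_squares(returns - preds_3factor, n_params=3)
# aic5 = aic_least_squares(returns - preds_5factor, n_params=5)
```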

AIC is good. ROUTINELY taught in Statistics 101 for feature selection (as is cross-validation). I have used it and it works.

And Georg provided the formula for AIC, which could be extended to what we do, a long time ago (2013).

There is already real knowledge out there; no need to reinvent the wheel. I MEAN WE SHOULD NOT HAVE TO, AS GEORG SUGGESTS.

It might require Marc to brush up on his econometrics, or for others to take a course or two, but we should not have to constantly reinvent the wheel.

Georg understands statistics. THANK YOU GEORG (AGAIN)!!!

-Jim

I like this :wink:

Right.

But when you use the following approach you should be open to ideas for addressing overfitting. Maybe more than one method.

Could there be one factor (just one) that might represent over-optimization when you use this technique?

Just one?

Personally, I have no problem with the method.

But there are more than a few (outside of P123) who would then go on to use a little cross-validation or an AIC to find that one factor that represents overfitting.

-Jim

Thanks for the link to the previous thread, which is a great thread with some fantastically informative posts. I wasn’t familiar with that thread when I started this one, or else I probably wouldn’t have started it.

On page 5 of that thread, o806 pointed out some real problems with using the AIC, and on page 7 is the best comment of all, by DennyHalwes, in which he posits that “the ranking system by its very nature is not appropriate for statistical measures.” I don’t think Denny meant that we shouldn’t be using statistics, but that we shouldn’t be using statistics to determine whether or not a ranking system is overfit.

Of course, Denny was a HUGE proponent of cross-validation FOR RANKING SYSTEMS.

We have gotten some depth of knowledge from Georg. Much appreciated.

-Jim