Over-Optimization

Equally dangerous, or even more dangerous, is not including factors that really make a difference. (I guess that’s “underfitting”?) In statistical language, including extraneous factors is a Type I error and not including important factors is a Type II error. Both increase risk. But Type II errors are a lot harder to catch!

So you remove a factor from a ranking system, run it over the last 5 years (from 2014 in the discussion above), and the performance improves: YOU ARE WORRIED ABOUT A TYPE II ERROR AND DO NOT WANT TO REMOVE THE FACTOR.

Have you ever calculated the probability of a Type I or Type II error? One would have to calculate and use a p-value to do that: so no.

Does everyone understand now? Optimize “over 100 factors” and “randomly” increase the weights. NEVER CHECK FOR OVERFITTING.

EVEN IF OVERFITTING IS SHOWN, DO NOT REMOVE THE FACTOR (BECAUSE OF THE POSSIBILITY OF A TYPE II ERROR).

Never “cross-validate”, and never calculate a p-value (even though we are talking about Type I and Type II errors, which is weird). Also, never do anything with a standard deviation (including a Sharpe Ratio).

Just look at the buckets ON IN-SAMPLE DATA.
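
(For the record, here is a minimal sketch, with made-up numbers, of what a hold-out check with a p-value could look like. The return series, hold-out period, and everything else here are purely hypothetical:)

```python
# Minimal sketch (made-up numbers) of a hold-out check with a p-value.
# Idea: choose weights in-sample, then ask whether the "extra" factor changes
# returns on a hold-out period, instead of eyeballing in-sample buckets.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Pretend these are weekly hold-out returns (e.g., 2014-2018) for the same
# strategy with and without the factor in question. Purely hypothetical.
with_factor = rng.normal(0.0020, 0.02, 260)
without_factor = rng.normal(0.0015, 0.02, 260)

# Paired t-test on the hold-out difference: is the factor's contribution
# distinguishable from noise?
t_stat, p_value = stats.ttest_rel(with_factor, without_factor)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# Small p-value: keeping the factor looks justified out-of-sample (the Type II worry).
# Large p-value: the in-sample improvement may just be overfitting (the Type I worry).
```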

Uh. Okey-dokey. P123 is now officially a fringe organization.

Yuval, you should take some courses and not rely on forum threads from 2013 (or sunspot data on the web for that matter) as your sources of information.

You should calculate a p-value in class at least once. Maybe a few other things too: AIC, perhaps?

I suspect Chicago has multiple resources.

Maybe Marco would offer some continuing education.

In the meantime, I think we can do without your final pronouncement on AIC (Georg’s proposal).

-Jim

It’s almost as if some users have been begging for this for years…

Personally, I see little use for pre-1999 data. I typically don’t use any data pre-2000 anyhow, since there are several factors that aren’t usable pre-2000 (from my understanding). Besides, the average retail trader couldn’t access the markets online, and overall market volume was substantially less than we have now. If you look at the trading volume of the S&P 500 for 1999, it’s in the 15B/month range, and 2019 is in the 80B/month range. That’s roughly 5X less volume in 1999 vs. 2019. IMO that means we have a lot more signal to work with now, since I think more volume improves the market’s signal-to-noise ratio. I’m not sure how far back people here would want to go, but if you go back just an additional 10 years to 1989, you’re looking at about 20X less volume than current. Not to mention the tools investors used back then were less sophisticated; might we call that a different regime?

To me, current data is more valuable than past data, so the further back we go the less I care. I also feel that if I can’t build a ranking / trading system with 20 years of the most current data, then what makes me think I can build a better system with less relevant data? 20 years of weekly rebalances is >1,000 samples. If I can’t find an edge after 1,000 tries with the most current samples, then I think it’s safe to say that more samples aren’t going to reveal an edge worthy of using tomorrow.

Tony,

I believe you have clearly missed the purpose of this.

Whatever you think its purpose was, it clearly was not a serious proposal. It is not something anyone at P123 is considering.

I would call it a debating technique: a variation of the “straw man” argument. Maybe something to make you feel hopeless about any endeavor to prevent overfitting. That is possible.

A serious effort to address overfitting: not a chance.

-Jim

Same thing with international data. If I tested my ranking system against Australian data and it blew up, did I prove that my ranking system is over-fit, or just that it doesn’t work in Australia? What if it’s a matter of using GAAP accounting factors in an IFRS environment? Okay, now I ignore Australia and try to find a robust GAAP country more comparable to the US, and my ranking system works. But wait, am I curve-fitting because I’m ignoring Australia?

But let’s just say I don’t even want to use international data to cross-reference against ranking systems built on US data. What if I just wanted to use Australian data to build an awesome Australia-focused ranking?* Apples to apples. What could I even glean from Australian data looking ahead, when most public companies are natural resource miners (or banks that lend to natural resource miners) who just happened to be supplying a global, generational economic boom in China over the last 20 years? There are no magic bullets to be had anywhere. Every answer opens up new questions.

*Assuming there is a commercially available dataset out there that has stored 20 years’ worth of accumulated quarterly reporting for every line item for every publicly traded Australian company, like we’re accustomed to with Compustat US data.

And what if it does work in Australia? I’m interested in international data btw, not as much pre-1999.

A few disorganized thoughts.
1/ Overfitting is not a crime, as long as you make sure it doesn’t increase the risk and, more importantly, as long as you don’t believe a simulated performance is an expected performance.
2/ To make sure it doesn’t increase the risk, the best way is to move variables one by one and look for discontinuities. Big output changes from small input changes are red flags (a rough sketch of this check follows the list).
3/ I would never waste time optimizing weights (a consequence of my second point).
4/ As soon as P123 gave us Canadian data, I tested all my stock models OOS in the Canadian market.
5/ I am skeptical about scientific references and super-powerful statistical tests that I don’t understand (and don’t try to, because life is short). Modeling investment strategies is about modeling the expected reaction of a group (“the market”) to input information. As a subject of research, it should be classified as experimental psychology. We are far from hard science. The caveman’s common sense is more useful than statistical tests. I sometimes read things from very smart people who have lost the caveman’s common sense.
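
A rough sketch of the check in point 2, with a stand-in backtest function (hypothetical; you would plug in your own sim or screen results):

```python
# Rough sketch of the point-2 check: sweep one node weight and flag big output
# jumps from small input changes. backtest_annual_return is a hypothetical
# stand-in for a real sim or screen backtest.
import numpy as np

def backtest_annual_return(value_weight: float) -> float:
    """Stand-in: annualized return as a function of one node weight."""
    base = 0.10 + 0.05 * np.exp(-((value_weight - 0.35) ** 2) / 0.02)
    return base + (0.04 if value_weight > 0.72 else 0.0)   # a deliberate jump, to illustrate

weights = np.arange(0.0, 1.01, 0.05)                 # sweep the weight in 5% steps
results = np.array([backtest_annual_return(w) for w in weights])

# Red flag: a large change in output from a small change in input.
jumps = np.abs(np.diff(results))
for w, jump in zip(weights[1:], jumps):
    if jump > 0.02:                                  # the threshold is a judgment call
        print(f"Possible discontinuity near weight {w:.2f}: "
              f"return moved {jump:.1%} for a 5% weight step")
```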

A caveman (with PhD)

Here is a reference regarding regression-to-the-mean: Thinking, Fast and Slow by Daniel Kahneman. Although it may place his opinion in question at P123: he has won a Nobel Prize in Economics.

One of the chapters (chapter 17) provides a layman-level understanding of regression-to-the-mean.

This SHOULD be required reading for anyone who pays for a Designer Model. Needless to say, it is my opinion that any professional at P123 should read (and understand) it, and frankly, have a deeper understanding than this book provides.

You may wonder why a Designer Model (usually) does not do as well once you purchase it. You could read some complex articles about this.

But if you read and understand that chapter, you will stop researching this question.

When asked why it occurs you will simply say: “Regression-to-the-mean. How could it be any other way?”

This was a New York Times Best Seller. Millions have read and understood this. NOT (EVEN) STATISTICS 101.

No PhD required.
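
If you prefer a toy demonstration to a book chapter, here is a minimal simulation sketch (all numbers made up) of why the best-looking backtests tend to fall back to earth:

```python
# Minimal simulation sketch (all numbers made up) of regression to the mean:
# pick the best-looking models from a noisy backtest period, then look at the
# same models in the next period.
import numpy as np

rng = np.random.default_rng(42)
n_models = 500
true_skill = rng.normal(0.00, 0.02, n_models)          # each model's true annual edge
period1 = true_skill + rng.normal(0, 0.10, n_models)   # backtest = skill + luck
period2 = true_skill + rng.normal(0, 0.10, n_models)   # next period = same skill, new luck

top = np.argsort(period1)[-25:]                        # the 25 best-looking backtests
print(f"Top models, backtest period:   {period1[top].mean():+.1%}")
print(f"Same models, following period: {period2[top].mean():+.1%}")
# The follow-up average falls back toward the population mean because part of the
# original outperformance was luck. How could it be any other way?
```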

-Jim

I agree with everything except #3. Weight optimization should be viewed as trend optimization. If you adjust a weight and plot the result, you should be able to see an inverted-U response; otherwise one of the factors/formulas is not needed. I think of it more in terms of optimizing relationship trends rather than peaking for the highest AR. That’s another way to help reduce over-optimization, IMO.
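
A hedged sketch of that inverted-U check, using a stand-in backtest function (hypothetical; substitute your own sim results):

```python
# Hedged sketch of the inverted-U check: sweep one weight, fit a quadratic, and
# ask whether there is a concave interior peak. backtest_annual_return is a
# hypothetical stand-in for a real sim result.
import numpy as np

def backtest_annual_return(weight: float) -> float:
    """Stand-in: annualized return as a function of one factor weight."""
    return 0.08 + 0.12 * weight - 0.15 * weight ** 2   # hypothetical response

weights = np.arange(0.0, 1.01, 0.10)                   # 10% weight steps
returns = np.array([backtest_annual_return(w) for w in weights])

c2, c1, c0 = np.polyfit(weights, returns, 2)           # quadratic fit, highest power first
peak = -c1 / (2 * c2) if c2 != 0 else float("nan")

if c2 < 0 and 0 < peak < 1:
    print(f"Inverted U: return peaks near weight {peak:.0%}, so the factor adds something")
else:
    print("No interior peak: the factor may not be needed at any weight")
```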

Why do you feel that keeping everything equal weight is the best approach?

Barn, I don’t waste time on optimization, but my approach is not equal weight. As a fake example, I may use 30/30/40 for value/quality/technicals, and 30/30/20/20 for value terminal nodes. I may try a few variants for each node and choose one based on bucket charts, but I would not spend hours on a systematic optimization process that others have described.

By the way, I may add a 6th point: I prefer screens to sims because screens are ergodic (maybe not the appropriate word; in my caveman’s common sense it means buy and sell rules have possible butterfly effects). I use asymmetrical buy and sell rules as “hopefully harmless curve-fitting”, starting from a screen that works without them.

Edit: (source: Wikipedia)
In probability theory, an ergodic dynamical system is one that, broadly speaking, has the same behavior averaged over time as averaged over the space (…) The state of an ergodic process after a long time is nearly independent of its initial state.

So I have no point. I will ask an open question and won’t follow up.

Ergodicity (along with mixing) is pretty important to whether one thinks taking a sample is a valid thing to do.

Anyway, any general comment on the validity of what we do here would be appreciated. It may (or may not) alleviate some of Marc’s concerns too.

Thanks

-Jim

Personal opinion: I read this last week and it was a real eye-opener. I second Jim’s recommendation.

I see what you mean now. I tackle the weighting in a similar way, by grouping themes and weighting them together. I too don’t go with a resolution finer than 10%. The market doesn’t work with that level of precision, and IMO neither should I.

Ergodic … I learn something new every day, and yes, I don’t see how one would create a system that isn’t ergodic. If it weren’t, wouldn’t it just be random?

In his book “Skin in the Game,” Nassim Taleb has a chapter on ergodicity. I think he oversimplifies the concept a bit. But he gives the clearest example of a sim or port that is not ergodic: one that goes bankrupt.

Edward O. Thorp co-edited a book called “The Kelly Capital Growth Investment Criterion”, which has a strong foundation in mathematical theorems.

Every few pages you can find something like: “Assume it is an ergodic process that is adequately mixing.”

The bottom line is that this (or a stronger assumption) is necessary to prove or use the central limit theorem.

I am not recommending that anyone actually use this. But I think that is how it gets into the mathematical side of finance.

-Jim

An ergodic process is also random, but all possible paths converge (we don’t know to what). The possible paths of a sim may not converge.
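
A minimal sketch (hypothetical bet sizes) of how a multiplicative process can have a healthy ensemble average while the typical path decays; this is the sense in which a sim that can blow up is not ergodic:

```python
# Minimal sketch (hypothetical bet) of a multiplicative process whose ensemble
# average looks healthy while the typical path decays; a sim that can blow up
# behaves like this and is not ergodic.
import numpy as np

rng = np.random.default_rng(7)
n_paths, n_periods = 10_000, 20

# Each period returns +60% or -50% with equal probability: the arithmetic
# expectation is +5% per period, but the typical (log) growth is negative.
steps = rng.choice([1.60, 0.50], size=(n_paths, n_periods))
wealth = steps.prod(axis=1)

print(f"Ensemble average final wealth: {wealth.mean():.2f}x")      # averaged over paths
print(f"Median final wealth:           {np.median(wealth):.2f}x")  # the typical path
print(f"Typical growth per period:     {np.exp(np.log(steps).mean()) - 1:+.1%}")
```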

LOL. As Ron White would say, “I was wrong.” MLHR (Herman Miller) is up 17% today on earnings :wink:

Be careful with the Kelly criterion in investing. If you take daily returns as a “Kelly game”, a 1x ETF and a 3x ETF on the same underlying have the same Kelly criterion. The formula is invariant to leverage. Would you allocate the same % of your capital to each?

I could not agree with you more. At this point I just like the math of it.

I read my brother’s copy of “Beat the Dealer” before I finished the Dr. Seuss books my mom gave me. Okay I am exaggerating but not by much.

When I was in college we would make road trips to Harvey’s casino in Lake Tahoe/Reno to count cards. I have to admit the Kelly Criterion did not work that well for me even then, with the simple probabilities of blackjack. I probably earned about $3 doing that (if you don’t count the gas). It could not have been my card-counting skills ;-)

Thanks for the good advice.

-Jim

This is not true. Thorp sets this up as a quadratic optimization problem in which too much leverage decreases the expected growth rate.

The assumptions of GBM (geometric Brownian motion) are of course fantasy math (see “Question about quadratic form of f* in the Continuous Kelly Criterion” on SE Quant for a deeper inquiry into this). However, the result of this fantasy is consistent with all of our intuitions about risk and ruin. Leverage increases the expected rate of capital growth until it doesn’t.

Something that may be interesting: if you include non-constant rates of return (i.e., by integrating the instantaneous rate with decay) and transaction costs in the optimal rate function (which is all the Kelly criterion really is), then you don’t actually need GBM with fantasy re-weighting assumptions to set up a quadratic optimization.

The takeaway is that Kelly really does provide a good basis for capital allocation, but this assumes our beliefs about rates, variances, and covariances are somewhat correct.
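
For anyone who wants to see the point in numbers, here is a small sketch of the textbook continuous-time Kelly growth rate, g(f) = r + f(mu - r) - 0.5 f^2 sigma^2, maximized at f* = (mu - r) / sigma^2. All parameter values are made up, and borrowing costs and the 3x fund’s volatility decay are ignored, so treat it as an illustration rather than a recommendation:

```python
# Sketch of the textbook continuous-time Kelly fraction, f* = (mu - r) / sigma**2,
# applied to a hypothetical underlying and a rough 3x leveraged version of it.
# All parameter values are made up; borrowing costs and volatility decay are ignored.

def kelly_fraction(mu: float, sigma: float, r: float = 0.02) -> float:
    """Growth-optimal fraction of capital for drift mu, volatility sigma, risk-free r."""
    return (mu - r) / sigma ** 2

mu, sigma, r = 0.08, 0.16, 0.02                        # hypothetical: 8% drift, 16% vol

f_1x = kelly_fraction(mu, sigma, r)
f_3x = kelly_fraction(3 * mu - 2 * r, 3 * sigma, r)    # crude 3x ETF approximation

print(f"Kelly fraction in the 1x ETF: {f_1x:.2f} of capital")
print(f"Kelly fraction in the 3x ETF: {f_3x:.2f} of capital")
print(f"Implied exposure to the underlying: {f_1x:.2f} vs {3 * f_3x:.2f}")
# The optimal capital fraction is not invariant to leverage, but the implied
# exposure to the underlying is (in this idealized setting).
```

In this idealized setting the disagreement above is partly semantic: the optimal fraction of capital does change with leverage, while the implied exposure to the underlying does not.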