All that glitters is not gold: Comparing backtest and out-of-sample performance on a large cohort of trading algorithms

Jrinne · February 17, 2020, 6:23pm

Absolutely true.

And I do not know when or if it will happen but those value models will pop after any significant downturn, I think.

-Jim

geov · February 17, 2020, 6:30pm

Jeff, which is the right factor now? I don’t think anybody knows.
But avoid all Vanguard factor ETFs please.

Why Vanguard Should Retire The U.S. Momentum Factor ETF
https://imarketsignals.com/2020/vanguard-retire-u-s-momentum-factor-etf-vfmo/

ustonapc · February 17, 2020, 7:01pm

Jeff,

Georg definitely has made a point here. There will always be some factors that work in any given timeframe.

If you want to know the recent performance of factor investing, just look at the returns at AQR during the last 2 years. Its AUM has fallen by more than 35% during this period. Even Cliff Asness himself says it is a crappy time for factor investing, according to this article from Bloomberg.

https://www.bloomberg.com/news/articles/2019-05-15/aqr-s-asness-is-right-it-s-a-crappy-time-for-factor-investing

Regards
James

Schm1347 · February 17, 2020, 7:04pm

I stay away from factor ETFs. I do think momentum can work if paired with cheap valuation, longer term underperformance, and low volatility. Basically you try to ride the mean reversion back out of the trough. I question how well pure momentum works. I have never been able to produce reasonable returns in backtesting with most momentum strategies.

Jeff

geov · February 17, 2020, 9:55pm

One can do a quick backtest on P123 to check which of the ETF providers does best, because if one holds all four factors (Momentum, Value, Quality, and min Volatility) equal weight the return should be the same as for the benchmark VTI.

iShares does best; its 4 factor funds match the performance of VTI 100%.
Fidelity is marginally lower.
Vanguard is a disaster.
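
The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the return series are made-up placeholders (not actual iShares, Fidelity, Vanguard, or VTI figures), the function names are hypothetical, and the portfolio is assumed to be rebalanced to equal weight every period.

```python
from math import prod


def equal_weight_returns(fund_returns):
    """Per-period return of an equal-weight blend, rebalanced every period (assumption)."""
    funds = list(fund_returns.values())
    n_periods = len(funds[0])
    return [sum(f[t] for f in funds) / len(funds) for t in range(n_periods)]


def cumulative_return(returns):
    """Compound a list of periodic returns into one total return."""
    return prod(1.0 + r for r in returns) - 1.0


# PLACEHOLDER periodic returns, for illustration only (not real fund data).
factor_funds = {
    "momentum": [0.02, -0.01, 0.03],
    "value": [0.00, 0.01, 0.01],
    "quality": [0.01, 0.00, 0.02],
    "min_vol": [0.01, 0.00, 0.00],
}
benchmark = [0.01, 0.00, 0.015]  # placeholder stand-in for VTI

blend = equal_weight_returns(factor_funds)
gap = cumulative_return(blend) - cumulative_return(benchmark)
print(f"blend minus benchmark: {gap:.4%}")
```

A gap near zero would mean the four funds together track the benchmark; a persistently negative gap is the kind of shortfall being described for the weaker provider.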

yuvaltaylor · February 18, 2020, 2:19am

Aren’t your expectations a bit high, sir? The number of people who have made an excess return of more than 10% in each of the last two years has got to be pretty tiny, especially when 2019’s benchmarks all made over 30%.

Personally, I’m totally fine with the fact that I massively underperformed the S&P 500 in 2019. The S&P 500 was nutso. I beat the market handily in 2016, 2017, and 2018, and I have every expectation of doing so again in 2020, unless it’s another year like 2019, in which case I’ll still make plenty of money. And because of the magic of compounding, I made more money in 2019 than I did in 2018 or 2016, even though I would have made twice as much investing in SPY.

As for designer models, some of them did just fine in 2019, and some of them didn’t. Just like mutual funds and hedge funds and ETFs.

Just keep things straight. How many people have outperformed the market by 10% year after year? Warren Buffett, Shelby Davis, George Soros, David Einhorn, Peter Lynch, Charlie Munger, Stanley Druckenmiller, Jim Rogers, and Joel Greenblatt all had AVERAGE outperformance of greater than 10%. But year after year, it gets harder. In some years, I would guess that almost none of them beat the market. Look at Buffett’s outperformance. Over the 55 years between 1957 and 2012, it varied between 45% and -20%; in 24 out of those 55 years, it was less than 10%. Excess returns are great, but consistent returns are better, and when markets get crazy, that means underperformance.

Jrinne · February 18, 2020, 4:52am

Yuval,

I think I must agree with you or why would I be here? I do not even have to think about it.

But then there is the question of risk. That can be controlled to some extent. If my ports ended up being more volatile than I expected then I can decrease my exposure. Still the risk is greater with ports compared to passive investing, I think.

What is the expected reward? You are right 2 years is probably enough. 5 years is probably better. 10 better still unless you think the market is changing, of course.

In medicine they hit you early, hit you hard and never stop with the question of risk/reward. For every drug, procedure and test.

I say here, at P123, it is the predicted reward versus risk, cost and time spent (opportunity cost really) going forward.

I am still running those spreadsheet calculations.

Yep. It is an ongoing calculation that can change.

-Jim

ustonapc · February 18, 2020, 6:44am

I think Georg’s analysis of the Designer Models (please see attached) shows their negative performance relative to SPY.

Regards
James

P123_R2G_List+12-9-2019.xlsx (56.4 KB)

Schm1347 · February 19, 2020, 2:09am

Georg’s analysis raises a lot of concerns. Thanks Georg.

Is the problem overfit models with impossible-to-sustain gains plus impossibly low drawdowns? How many DMs do we suppose rely heavily on ranking vs screening factors? I personally think over-reliance on perfected ranking factors could be problematic, but maybe I am wrong. It seems like heavy reliance on ranking factors can easily shift a model out of workability.

It would be useful if someone could do deeper analysis in those models. Of course that individual would need access to those models. Do better performing models have more or fewer factors? Are they based on absolute or relative rankings? Is there a factor slant? How inclusive or exclusive are universes? Are rules price based or valuation based? How market timing dependent are they?

I think we really need to do some post mortem analysis to understand if anything drives strength and weaknesses in models. Perhaps there isn’t anything and success is random, which would be concerning to say the least.

I don’t think this can be brushed aside.

Jeff

geov · February 19, 2020, 2:44am

Jeff, none of the designers went out to design a model which would under-perform the broader market. These people are not idiots, but they believed that the available P123 tools used in the models would assure good returns going forward based on historic performance.

So the question is why did most of them fail to beat the benchmark SPY?

When one designs a strategy you basically lock in a set of rules which you assume will work permanently going forward and produce good returns. I think that this is the problem, the same set of rules cannot indefinitely produce good returns, that’s pretty obvious. Just like, for example, there are some periods when value will perform better than momentum, and vice versa.

That is why my preference is ETF models with market timing based on economic indicators. As economic conditions change the model adapts to those changes; it is not locked into some straitjacket of rules that don’t work.

The exception to this are small-cap models. I think that is where one can do better, but only with relatively small investments.

Schm1347 · February 19, 2020, 3:09am

Georg,

I hope I didn’t convey that I thought these models were created by idiots. If I did I apologize. What concerns me is that they were probably well thought out and created by intelligent people.

On an interesting side note, it seems that using a 50-day EMA vs a 200-day SMA on the S&P 500 gives remarkably good timing signals if you switch between that and bonds. Not sure if I’d trust it, though, as it seems too good to be true. It might be worth considering rather than the straight 50-day SMA vs 200-day SMA.

Jeff
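
For anyone who wants to test the EMA-vs-SMA idea above outside P123, here is a minimal sketch in plain Python. Several things in it are assumptions rather than anything stated in the post: the EMA uses the common smoothing factor 2 / (span + 1) and is seeded with the first price, the rule holds bonds until the slow average has enough data, and the function names are hypothetical.

```python
def sma(prices, window):
    """Simple moving average; None until `window` prices are available."""
    out = []
    running = 0.0
    for i, price in enumerate(prices):
        running += price
        if i >= window:
            running -= prices[i - window]  # drop the price that left the window
        out.append(running / window if i >= window - 1 else None)
    return out


def ema(prices, span):
    """Exponential moving average, alpha = 2 / (span + 1), seeded with the first price."""
    alpha = 2.0 / (span + 1)
    out = [prices[0]]
    for price in prices[1:]:
        out.append(alpha * price + (1.0 - alpha) * out[-1])
    return out


def timing_signal(prices, fast=50, slow=200):
    """'stocks' when the fast EMA is above the slow SMA, otherwise 'bonds'.

    Assumption: hold bonds while the slow SMA is not yet defined.
    """
    fast_line = ema(prices, fast)
    slow_line = sma(prices, slow)
    return [
        "stocks" if s is not None and f > s else "bonds"
        for f, s in zip(fast_line, slow_line)
    ]
```

Swapping `ema` for `sma` on the fast line gives the straight 50/200 golden-cross rule, so the two variants can be compared on the same price series.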

Jrinne · February 19, 2020, 9:52am

Jeff, didn’t you already have the idea of using regularization (randomization) with the models? This is a useful idea. You did not research this idea for expanded use by P123, however. The present uses available on P123 are not ideal and could be improved upon. Of course, this is not your fault. The idea seemed to die the usual death of good ideas at P123. Seemed to be killed in the forum.

We thought P123 had no plans to make regularization available on the platform. We were wrong about this, however (see below).

More generally, people already know the limitations of the models we use and ways to address these limitations. Much of this has been known for decades (again, the value of randomization that you have recommended is a prime example).

People have done a deep analysis of this and that analysis is ongoing. You can get a Ph.D. in this stuff. People are working on a Ph.D. thesis on some aspect of this subject as we speak.

Yuval has used a form of cross validation for a long time. The theory for this and any refinements in the method can be studied at any university and people can go to Coursera or buy a textbook.

Yuval now uses bootstrapping (or a method inspired by bootstrapping) for the express purpose of reducing overfitting.

Bootstrapping serves as a regularizer (among other benefits). The way Yuval uses bootstrapping, it also serves as an ensemble method. And regularization is an idea you have already recognized as useful, Jeff.
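
To make the general idea concrete, here is a minimal sketch of bootstrapping used as an ensemble: refit on many resamples drawn with replacement, then average the fits. This is the textbook procedure, not a description of Yuval's actual workflow; the function name, the `fit` callback, and the sample returns are all hypothetical.

```python
import random
from statistics import mean


def bootstrap_ensemble(samples, fit, n_models=200, seed=0):
    """Fit on `n_models` bootstrap resamples and average the fitted values.

    `fit` is any function mapping a list of samples to a number
    (a hypothetical stand-in for whatever is being estimated).
    """
    rng = random.Random(seed)
    n = len(samples)
    fits = []
    for _ in range(n_models):
        # Draw n samples with replacement from the original data.
        resample = [samples[rng.randrange(n)] for _ in range(n)]
        fits.append(fit(resample))
    return mean(fits), fits


# PLACEHOLDER monthly excess returns, for illustration only.
monthly_excess = [0.02, -0.01, 0.03, 0.00, -0.02, 0.04, 0.01, -0.03]
estimate, fits = bootstrap_ensemble(monthly_excess, mean)
print(f"ensemble estimate: {estimate:.4f}, range: {min(fits):.4f} to {max(fits):.4f}")
```

The spread of the individual fits gives a rough sense of how much a result depends on which periods happened to be in the sample, which is where the regularizing effect comes from.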

At times in the forum–and certainly for his own models–it is evident that Yuval understands and believes in some of these methods. For sure they take a lot of effort and he must believe there are things that can improve the present standard P123 methods or he would not go through the effort:

Dear redshield:

I’m grateful for this. I’ve been using an iterative approach to optimization on partial universes and time periods and then averaging the results for my final model. The iterative approach to optimization is extremely time-consuming. It involves taking a weighted ranking system with forty weights (2.5% each) spread over 100 factors (obviously the large majority would get 0 weight), randomly increasing the weight of one factor (from 0 to 2.5 or from 2.5 to 5 etc) and decreasing the weight of another, then seeing if performance is improved; if so, using the new weights going forward and if not reverting to the old weights. I have found this to be a very effective procedure in terms of getting good results (I made 45% in 2016, 58% in 2017, and I’m up 32% YTD), but it’s a huge job. I don’t honestly know if I want to delve into multivariable or discriminant analysis or whether I should simply stick with my method; I can imagine that coordinating P123 backtests with multivariable analysis (using R, XLSTAT, RUBY, or some other program) would be just as big a job, if not bigger, than the way I’m doing it now, and I’m not sure of the advantages it would offer. I’d be grateful for your feedback on this question.

As for equal weights being suboptimal, I’m with you there. Imagine a ranking system with three factors: 1-year accruals, 3-year accruals, and price-to-sales. Only an idiot would weight them equally. That’s an extreme example, of course, but there are so many factor interrelationships that equal weighting doesn’t take into consideration. And, of course, equal weighting takes no account of how many factors of each type (value, growth, quality, size, sentiment, technical) you have.

Thanks,

Yuval
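
The iterative weight search described in the quoted post can be sketched roughly as follows. This is only a reading of that description: weights are held as whole units of 2.5%, one unit is moved from a randomly chosen factor to another, and the move is kept only if the score improves. The `score` function here is a toy stand-in for a backtest, and the averaging over partial universes and time periods is left out.

```python
import random


def hill_climb_weights(n_factors, n_units, score, n_steps, seed=0):
    """Randomly shift one weight unit at a time, keeping only improvements."""
    rng = random.Random(seed)
    units = [0] * n_factors
    for i in range(n_units):  # arbitrary starting point: spread units evenly
        units[i % n_factors] += 1
    best = score(units)
    for _ in range(n_steps):
        src = rng.choice([i for i, u in enumerate(units) if u > 0])
        dst = rng.choice([i for i in range(n_factors) if i != src])
        units[src] -= 1
        units[dst] += 1
        trial = score(units)
        if trial > best:
            best = trial  # improvement: keep the new weights
        else:
            units[src] += 1  # no improvement: revert to the old weights
            units[dst] -= 1
    return units, best


def toy_score(units):
    """HYPOTHETICAL stand-in for a backtest: closeness to an arbitrary target."""
    target = [12, 10, 8, 6, 4, 0, 0, 0, 0, 0]
    return -sum((u - t) ** 2 for u, t in zip(units, target))


units, best = hill_climb_weights(n_factors=10, n_units=40, score=toy_score, n_steps=2000)
weights_pct = [2.5 * u for u in units]  # 40 units of 2.5% sum to 100%
```

In practice each call to `score` would be a full backtest, which is why the procedure is so time-consuming, and accepting every in-sample improvement is exactly where overfitting can creep in.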

But I think there are more things that can be done and it would be nice if some of this were automated. Fortunately, soon enough, we can discover, on our own, what is useful without having to get Yuval and Marc’s endorsement of an idea.

Marco has a power-user that recognizes the benefit of some of this. This power user wants additional methods and he would like to have these methods automated where possible.

These methods may come at a higher price according to Marco, but some of those methods are also likely to filter down to what is available on P123 at the regular price, I would think.

Thank you Korr123, Georg, Jeff and James for moving the discussion forward.

Marco has the good sense not to reinvent the wheel and to go with well-developed programming that is already available in Python and its libraries. He will be implementing methods that have been established and are already well known.

Thank you Marco for moving the platform forward. It is hard work and will take additional resources to do that. I have no problem if Marco needs to charge a higher fee for that.

But there really is no longer a debate about this. Just a question of whether people want to learn about this—as Yuval has obviously already done for his own portfolios. And whether people will want to use what will become available on the P123 platform.

And my apologies for thinking any of my ideas–here in the forum–would be the impetus for progress here at P123. Turns out Marco already knew how to do it and just needed the funding.

In any case there is no longer any reason for debate on this. People will be free to use (or not use) ideas that they find valuable as they see fit–like regularization, bootstrapping, ensemble methods, cross-validation etc. Free to use those ideas without an endorsement from Marc, Yuval or the members of the forum.

If people can beat the 5-year record of Designer Models without these methods and ideas: Great. I do not mean to suggest that I have even scratched the surface as far as the different kinds of things that could be done.

-Jim

geov · February 19, 2020, 1:12pm

Jeff, nobody thought that, and it was not my intention to imply it.

Perhaps better DMs can be constructed using the “piggy-back” method which we discussed before.

The 50-day SMA vs 200-day SMA is not a good system. The best long-term signal is a recession indicator, as per this article:
Market Timing with the S&P 500 Golden-Cross and a Recession Indicator
https://www.advisorperspectives.com/articles/2020/02/10/market-timing-with-the-s-p-500-golden-cross-and-a-recession-indicator?bt_ee=WcSnIP2aQCaZ97oeveT%2BSObziicKrgXVjhIahNsIfMs%3D&bt_ts=1581419214491

ustonapc · February 21, 2020, 5:15pm

The 30-year Treasury yield just hit an all-time low and the Nasdaq dropped more than 1.5%.

This is exactly the time to see the out-of-sample performance of a model, when all the backtesting could not have predicted this.

Regards
James

ustonapc · February 21, 2020, 6:08pm

Since my last post, the Nasdaq has dropped more than 2% and my profit for the whole of February has been wiped out.

Let’s see if it is now time to risk-off our portfolios and switch to GLD or TLT.

Regards
James

Schm1347 · February 21, 2020, 6:56pm

I just keep holding gold and treasury calls and roll them over. I looked at market timing you sent me but it didn’t seem to outperform simple buy and hold. That being said with the calls my portfolio is flat today. My portfolio is also fairly defensive already and has actually shifted more underweight on tech the last month or so.

ustonapc · February 21, 2020, 7:22pm

Jeff,

The market timing tools I sent you reduced max drawdowns by over 50%, and you end up getting an improved risk-adjusted return and a better Sharpe ratio compared to buy and hold. As you mentioned in another thread, using the golden cross (50/200 MA crossover) or another recession indicator would have helped you avoid all the recession cycles of the past few decades.

If you have kept rolling your GLD and TLT calls, you must have seriously underperformed the market during the past 10 years of the bull market.

Regards
James
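
For comparing a timing model with buy and hold on the two measures mentioned above, max drawdown and the Sharpe ratio, a minimal sketch follows. The assumptions are mine: returns are periodic simple returns, the Sharpe ratio is annualized by the square root of the number of periods per year, and the risk-free rate defaults to zero.

```python
from math import sqrt
from statistics import mean, stdev


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity = peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=12):
    """Annualized Sharpe ratio from periodic returns (per-period risk-free rate)."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess) * sqrt(periods_per_year)
```

Running both functions on the timed series and on the buy-and-hold series for the same period puts the drawdown reduction and the Sharpe improvement side by side.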

Jrinne · February 21, 2020, 7:23pm

Hi Jeff,

You posted about utilities in the past. I do not know if you ended up increasing your position or not but I certainly found your post interesting.

My XLU and XLP (consumer staples) ETFs are up today as are my overall holdings—partially inspired by your posts.

-Jim

Schm1347 · February 21, 2020, 7:29pm

Jim,

I did start following a small cap utility strategy. It’s actually down a little bit. I do have utility positions in some other strategies which are up, though. But I am glad my portfolio has more utility positions on a day like today. I am heavy in healthcare, at around 30% of the portfolio. A lot of my quality and value strategies have been identifying healthcare companies as good candidates over the last couple of months.

Jeff

Jrinne · February 21, 2020, 7:32pm

I have about 17% healthcare (XLV). Down 0.1% today is why I did not mention it.

-Jim