Evaluating models

Got some right here in the office!

Steve, I enjoy your posts and more importantly learn from them. I look forward to our next discussion. As I said, I have resigned from the statistics police and people can look at whatever they wish.

Good luck in your new role as “Inspector.”

-Jim

Tested a number of my models, comparing the model’s tail risk ratio to the benchmark’s tail risk ratio over 13W, 52W and 3Y periods, and investing in the model when the model’s tail risk ratio was better (higher) than the benchmark’s in at least one of those periods. One definition of the tail risk ratio is 95th percentile return / ABS(5th percentile return). I actually used Max/ABS(Min) returns for 13 weeks, a 97/ABS(3) ratio for 52W, and a 94/ABS(6) ratio for 3Y.
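
Roughly, the in/out rule looks like this in Python (a minimal sketch with illustrative names; the generic tail_risk_ratio helper here is my own simplification of the exact per-window formulas given further down the thread):

```python
import numpy as np

def tail_risk_ratio(weekly_returns, upper=95, lower=5):
    # Generic version: upper-percentile return over the absolute lower-percentile return.
    r = np.asarray(weekly_returns, dtype=float)
    return np.percentile(r, upper) / abs(np.percentile(r, lower))

def stay_invested(port_weekly, bench_weekly, windows=(13, 52, 156)):
    """In the model if its tail risk ratio beats the benchmark's over at
    least one trailing window (13W, 52W, 3Y); otherwise out."""
    return any(
        tail_risk_ratio(port_weekly[-w:]) > tail_risk_ratio(bench_weekly[-w:])
        for w in windows
    )
```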

The Good

  1. On most of the unhedged models, it produced slightly better total returns with slightly better standard deviation.
  2. From 2002 to today, I was only out of the market 26-52 weeks in the models - not too much trading.

The Bad

  1. It did not work on all the models.
  2. In particular, it is not suited to hedged models.

The Ugly

  1. Even in the models where overall returns were slightly better, the excess return during the periods when the models were out of the market was significantly positive.
  2. The slight positive impact to total return is likely offset by trading costs and slippage of getting in & out of the models several times.

Conclusion

Weak predictive ability in terms of total return, likely offset by costs & slippage. Poor predictive ability in terms of excess return.

The search continues.

PS I also looked at Sharpe and Info Ratio over combined 13W, 52W and 3Y time periods. Combined Tail Risk Ratio was much better.


Chaim - this is almost correct. I weight each ranking system in proportion to how many WINS it gets on each of the twenty tests. So if I’m testing eight ranking systems and one gets only one win, it gets 1/20th weight. And if it gets no wins, it gets no weight.

In my limited backtests of doing things this way, a combination of different systems that wins in different universes usually beats one system that wins over the combined universe. Perhaps because more factors are in play . . .
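
A rough sketch of that weighting scheme in Python, if it helps (the data layout is my own illustrative assumption; “win” here just means scoring best on a given test):

```python
from collections import Counter

def win_weights(test_results):
    """test_results: one dict per test, mapping ranking-system name -> score.
    Each system's weight is its share of wins across all tests, so a system
    with one win out of twenty gets 1/20th weight and a system with no wins
    gets no weight."""
    wins = Counter(max(scores, key=scores.get) for scores in test_results)
    systems = set().union(*(scores.keys() for scores in test_results))
    return {s: wins[s] / len(test_results) for s in systems}
```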

I’ve only been choosing my optimal ranking this way for about six months, I think, but my YTD returns are over 50% annualized.

I don’t believe that there exists a metric that can predict performance over such a brief time period. I do believe one can reasonably predict relative performance over the next two or more years by using conventional metrics like the information ratio, CAGR, etc. It comes down to probability rather than prediction, though. Which strategy has the highest probability of producing excess returns? That is something you can actually measure by doing correlation studies and using statistical tests. But you need lookback and look-forward periods that are commensurate with the way you actually trade.

How do you calculate “Tail Risk Ratio”? I haven’t come across this formula before. Googling it doesn’t help. Thanks! - YT

I ran across it here:

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2745220

For 13-week returns in Excel, I used:

=MAX(PORT1:PORT13)/ABS(MIN(PORT1:PORT13)) and compared to =MAX(BENCH1:BENCH13)/ABS(MIN(BENCH1:BENCH13))

For 52-week returns in Excel, I used:

=PERCENTILE(PORT1:PORT52,0.97)/ABS(PERCENTILE(PORT1:PORT52,0.03)) and compared to =PERCENTILE(BENCH1:BENCH52,0.97)/ABS(PERCENTILE(BENCH1:BENCH52,0.03))

For 3Y returns in Excel, I used:

=PERCENTILE(PORT1:PORT156,0.94)/ABS(PERCENTILE(PORT1:PORT156,0.06)) and compared to =PERCENTILE(BENCH1:BENCH156,0.94)/ABS(PERCENTILE(BENCH1:BENCH156,0.06))
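
And for anyone working outside Excel, here is a rough Python equivalent of those three formulas (my own translation; numpy’s percentile interpolation may not match Excel’s PERCENTILE exactly, so the numbers can differ slightly):

```python
import numpy as np

def tail_risk_ratio_13w(weekly):   # =MAX(...)/ABS(MIN(...)) over the last 13 weeks
    r = np.asarray(weekly[-13:], dtype=float)
    return r.max() / abs(r.min())

def tail_risk_ratio_52w(weekly):   # 97th percentile / ABS(3rd percentile) over 52 weeks
    r = np.asarray(weekly[-52:], dtype=float)
    return np.percentile(r, 97) / abs(np.percentile(r, 3))

def tail_risk_ratio_3y(weekly):    # 94th percentile / ABS(6th percentile) over 156 weeks
    r = np.asarray(weekly[-156:], dtype=float)
    return np.percentile(r, 94) / abs(np.percentile(r, 6))

# Compute each ratio for the port and for the bench, then compare them.
```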

This is great. But rather than prescribe metrics which are “predictive of OOS performance”, I think it may be helpful to elicit your intuitions regarding what sorts of things should NOT work.

I’m asking you all to frame your ideas in terms of refutation because, while we can never be statistically certain that any given system will work in the future much as it would have worked in the past, we can easily estimate the likelihood that it is not working as intended. I.e., Science.

Thus:

  • At what point should we admit something is not working?
  • What are the indicators that something is not working?
  • What are some things we expect and/or know not to work? (…and how can that be used to hone in on what is actually going on?)

Here are some haiku for a few of my core intuitions about performance indicators:

While all data counts
Recent data matters more
Don’t ignore the tails

Fair results in sample
Exceed even eye-popping backtests
When the story fits

Factors are fickle
Are they trending or reverting?
Which ones are in style?

Discount statistics
Except when they result in
Unique Information

Logic fundamentals:
Statistics can prove nothing
Refutation is key

Valuation models
Built from first principles
Are likely robust

Yes Primus - I use backtesting to look at what a sim might have done in the past and will reject the backtest if it doesn’t look or smell right. I believe there is some factor persistence that can rub off into the future. But choosing a system based on backtest results is an art, not a science. We should therefore not get worked up over which rearview indicator is the best as none has dibs on future performance.

Interesting paper, Miro! The results of the study are noteworthy, but keep in mind this was done over a relatively short period of time (~5 years), in pretty much a bull market.

Steve

Yuval,

Thanks for sharing!

I certainly will be looking at the Omega Ratio more closely.

I see from your link that you may have developed some of my aversion to OLS linear regressions based on your use of Kendall’s Tau and Spearman’s rank correlation. If so, I definitely understand. All I can recommend is take it one day at a time and as a fellow previous addict please feel free to call any time;-) But, in truth, slipping off of the wagon and using a linear regression on something that is non-linear may not be as much of a problem as I once thought.

The above is an example where I am now less interested in whether it is statistically correct but instead just want to know if it works as a stock picking method (as a machine learning tool). I think this general method is likely to work well. It should have the robustness of a “Decision Jungle” I would think. I have not looked much at the specifics of what you are doing with this.

Using machine learning might even extend to the selection of ranks and factors. People in machine learning loosen some of the statistical assumptions. For example, they might use a linear regression knowing the “target function” is not linear. In fact, at times they know that TRYING TO FIND THE SHAPE OF THE TARGET FUNCTION IN-SAMPLE WILL LEAD TO OVERFITTING. Using a simple regression, whether linear or nonparametric, is accepted even knowing things are not linear. Indeed, with 100% certainty, using too many polynomial terms to get a perfect fit will lead to overfitting and poor out-of-sample performance. Hence the end of my days with the statistics police.
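
A toy illustration of that point, using entirely made-up data (a nonlinear target plus noise): the high-degree polynomial wins in-sample but typically loses out-of-sample, while the “wrong” linear model holds up better than its in-sample fit suggests.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_target(x):
    # A nonlinear "target function" plus noise -- a stand-in for real, messy data.
    return np.sin(3 * x) + rng.normal(0.0, 0.5, size=x.shape)

x_train = np.linspace(-1, 1, 15)
y_train = noisy_target(x_train)
x_test = np.linspace(-1, 1, 300)       # fresh "out-of-sample" draws from the same process
y_test = noisy_target(x_test)

for degree in (1, 12):
    coefs = np.polyfit(x_train, y_train, degree)
    mse_in = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    mse_out = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: in-sample MSE {mse_in:.2f}, out-of-sample MSE {mse_out:.2f}")

# Typically the degree-12 fit has the far lower in-sample error and the higher
# out-of-sample error -- the "perfect fit" is mostly fitted noise.
```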

I do tend to want to get the statistics right when I finally test what I have done. I do this separately and for me personally it is KISS. But for the machine learning process anything that works can be used as far as I am concerned. And for machine learning the only thing that is certain is that it will never be perfect. Working for the machine learning police is much less demanding.

Interesting stuff with much to learn from!

-Jim

Jim - Inspector Sector (the police) is interested in machine learning as well :slight_smile:

And I will NEVER forget how helpful one of your Excel spreadsheets for optimization was.

I think that in machine learning parlance it might be called an evolutionary algorithm.

But call it what you will: Inspector Sector knows machine learning!!!
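
For anyone curious, here is a bare-bones sketch of what an evolutionary algorithm looks like in this context (my own toy example, not the spreadsheet in question): candidate factor weights are scored, the best survive, and mutated copies replace the rest.

```python
import numpy as np

rng = np.random.default_rng(42)

def fitness(weights, factor_scores, future_returns):
    """Spearman-style rank correlation between the weighted factor score and
    subsequent returns -- higher means the weights rank stocks better."""
    combined = factor_scores @ weights
    return np.corrcoef(combined.argsort().argsort(),
                       future_returns.argsort().argsort())[0, 1]

def evolve(factor_scores, future_returns, generations=50, pop_size=30):
    n_factors = factor_scores.shape[1]
    population = rng.random((pop_size, n_factors))
    for _ in range(generations):
        scores = np.array([fitness(w, factor_scores, future_returns) for w in population])
        survivors = population[np.argsort(scores)[-pop_size // 2:]]        # keep the best half
        mutants = survivors + rng.normal(0.0, 0.05, size=survivors.shape)  # perturb them
        population = np.clip(np.vstack([survivors, mutants]), 0.0, None)
    best = max(population, key=lambda w: fitness(w, factor_scores, future_returns))
    return best / best.sum()   # normalized factor weights
```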

-Jim

This is interesting. But I don’t understand the main concept behind it. At first glance, it seems like a strange idea. If you had two strategies that got the same return, you’d choose the one with a higher standard deviation. Following this idea, the bigger the tails, the better. I like my returns smooth and steady. Don’t you? Doesn’t following this strategy privilege the system with the most outliers? The system that’s least predictable?

I must be fundamentally misunderstanding this. Please help.

On further examination, I don’t find the results of the study particularly meaningful, as there is no discussion of benchmark, a factor that should be integral to both the in-sample and out-of-sample data. If one is to compare systems, then either the benchmark for both systems must be identical, or the performance measurement must be relative to the benchmark. As I stated in an earlier post, the only meaningful benchmark (in my opinion) is one created from the custom universe from which the stock picks are drawn. I haven’t used the Quantopian platform, but I suspect the benchmark(s) do not conform to my requirements.

On the positive side, the study does hint at testing for property transference from IS to OOS. In other words, if your performance measure is risk-adjusted return, then one should be looking for an improved risk-adjusted return in the OOS versus the benchmark, and nothing else. Here at P123 we have a tendency to judge various performance measures (Sharpe, Sortino, Calmar, etc.) against one another. But what are the criteria for judging the best measure on OOS data? For example, it wouldn’t make sense to judge risk-adjusted return IS based on the OOS criterion of total return, or vice versa. So I think we need to change our mindset, and instead of asking the question “Which is the best performance score?”, turn it around and ask “Is there some level of transference of properties from IS to OOS, and to what extent? And how much persistence can we rely on, assuming a decay over time?” If it turns out that there is some level of transference from IS to OOS, then the designer can decide what properties he/she is designing for, i.e. improved risk-adjusted return, low beta, etc.

Steve

Different tools for different situations.

Nonparametric statistics do not assume normality and reduce the effects of skew and outliers among other benefits (and potential problems). The Friedman test, used in my post above, is an example of a nonparametric statistic. So I use (and misuse) nonparametric statistics at times.

I like the Omega ratio too. But it is the opposite with regard to skew and outliers. The full impact of the outliers is in the Omega ratio statistic. That is one of its most highly advertised characteristics. It does not assume normality, however, and is like nonparametric statistics in this regard.

Yuval introduced me to the Omega ratio and to “An Introduction to Omega” by Con Keating and William F. Shadwick. From that article: “It therefore omits none of the information in the distribution ….” None of the information, including the skew and the effect of the outliers (as well as kurtosis).

Link to this paper: HERE

Maybe you do not like outliers or their effect in your analysis. You might use medians, nonparametric statistics, or, more generally, robust statistics. Maybe you use something else.

If you do want to understand the full effect of the outliers and of any skew then the Omega ratio might be considered as one statistic to look at. And it may tell you things the nonparametric statistics do not reveal.

For those who invest in OTC stocks or microcaps where a few stocks can have extreme upside-returns the Omega ratio might show the true benefit of those ports. Something that could be missed by looking at the median or nonparametric statistics alone.
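
For anyone who wants to play with it, here is a minimal discrete version of the Omega ratio as I understand it from the paper (my own sketch; the threshold is the return level you care about, often zero or a minimum acceptable return):

```python
import numpy as np

def omega_ratio(returns, threshold=0.0):
    """Sum of gains above the threshold divided by sum of losses below it.
    Every observation contributes, so outliers and skew are fully reflected."""
    excess = np.asarray(returns, dtype=float) - threshold
    gains = excess[excess > 0].sum()
    losses = -excess[excess < 0].sum()
    return gains / losses if losses > 0 else float("inf")
```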

Which statistic/method is best may depend on the situation. The statistics may even reveal a different story at times.

-Jim

I think there are two ways to compare IS and OOS performance measures. One is to compare the same performance measure between IS and OOS periods. One might find, for example, a low correlation of Sharpe ratio, an even lower correlation of Calmar, and a higher correlation of information ratio, all on the same data and same periods. The other way is to compare to “actual performance,” or CAGR. You say this doesn’t make any sense, but it does for me. What I am hunting for is the risk-adjusted performance measure that will best correlate with (i.e. predict) actual OOS performance, because actual OOS performance is what I’m shooting for. I don’t care about the standard deviation of my returns if it doesn’t correlate to my total returns. (If it does, though, I do care.)
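
To make the comparison concrete, here is the sort of test I have in mind, sketched in Python (the data layout is hypothetical: one in-sample metric value and one realized OOS CAGR per strategy):

```python
import numpy as np
from scipy import stats

def predictiveness(is_metric, oos_cagr):
    """Rank correlation between an in-sample performance measure and realized
    out-of-sample CAGR across a set of strategies. Run once per candidate
    measure (information ratio, Sharpe, Calmar, plain CAGR) on the same
    strategies and periods, then compare the correlations."""
    rho, rho_p = stats.spearmanr(is_metric, oos_cagr)
    tau, tau_p = stats.kendalltau(is_metric, oos_cagr)
    return {"spearman": rho, "spearman_p": rho_p, "kendall": tau, "kendall_p": tau_p}
```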

  - original response deleted -

So the question is - why would you score based on risk-adjusted return (or any of the variants) in an attempt to achieve the best total return OOS?

What would be palatable for me is to theorize the following:

(1) the risk-adjusted return demonstrates some propagation from IS to OOS (taking benchmark into consideration)
(2) risk-adjusted return in OOS translates statistically into higher long-term absolute returns (not adjusted for risk).

I won’t try to tackle number (2)… perhaps it is a statistically sound postulate, but not being a statistics expert, I won’t speculate on that.

What is important however is (1)… Does the risk-adjusted return propagate from IS to OOS to some extent? If it doesn’t propagate then (2) is irrelevant. This argument holds whether we are talking about Sharpe, Sortino, Calmar, Information Ratio, etc.

What we don’t know is to what extent factors and performance scores propagate from IS to OOS and how long that propagation will last. Nothing can be proven mathematically, and proving it empirically would take one hell of a study sample size.

However, all this being said, even the study notes that “machine learning” outperforms other measures. Thus we can conclude that:

(a) there is likely some propagation from IS to OOS but to what extent we don’t know
(b) there is likely some decay in performance over time hence the need for machine learning
(c) this entire activity of creating a forward-looking score is not a “science” but an “art”. If it were a science then there would be a definitive solution, which would immediately dissipate due to overuse/market efficiency. The alternative is to believe that, due to IS–>OOS propagation, the art of the forward-looking score is open to the possibility of combining factors and optimizing strategies. No one can prove it mathematically either way :slight_smile:

How well any of this will work depends on how many good models there are out there.

If most of the Designer Models are good then a t-stat of 2 is likely to find you a good model. We know this because throwing a dart is likely to find a good model if most of the models are good.

If there are only a few good models, a t-stat of 2 is likely to find you a model with no value at all. Setting the threshold at that level, there will be about 12 false positives among the 259 models. And some of the good models will have a bad start and be showing a t-stat of less than 2 early on. When good models are rare, the majority of models with a t-stat of 2 will be poor models.
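
A back-of-the-envelope version of that argument (259 is the model count from the paragraph above; the 5% false-positive rate for a t-stat of 2 and the 80% “power” figure are rough assumptions of mine):

```python
def screen_results(n_models=259, n_good=10, power=0.80, false_positive_rate=0.05):
    """How many models pass a t-stat-of-2 screen, and how many of those
    actually have an edge, given how rare good models are?"""
    n_bad = n_models - n_good
    true_positives = n_good * power
    false_positives = n_bad * false_positive_rate
    share_genuine = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, share_genuine

# With only 10 good models out of 259: roughly 8 genuine passers vs ~12 false
# positives, so most models clearing the screen still have no real edge.
print(screen_results())
```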

So, as far as short paragraphs to sum it up. Hmmm…… I am working on it.

-Jim

The good inspector makes some valid points. Here is my perspective, though perhaps I’m using flawed logic.

  1. There is a positive correlation between in-sample and out-of-sample returns. No matter how you slice the dice, if you take fifty strategies and run them over two distinct periods, there will be some degree of correlation of the results. The strategies that perform better over one period will be more likely to perform better over the other period than the strategies that perform worse. The correlation, however, may be quite low. This is true no matter what performance measure you use, as long as that performance measure is somewhat correlated to CAGR.
  2. There are lots of different ways to measure performance. Some of these methods have better “persistence” than others. The information ratio is more persistent than the Sharpe ratio, which is more persistent than the Calmar ratio. In other words, the correlation, in a fifty-strategy study, is going to be stronger for some performance measures than for others. The information ratio of the returns of the two periods is going to have a higher correlation than the Calmar ratio.
  3. Most risk-adjusted performance measures adjust for outliers in one manner or another. They may do so by dividing returns by standard deviation, or discounting beta, or using median measures, or taking into account distribution of returns. In doing so, they may be MORE predictive of OOS CAGR than plain CAGR. In other words, their correlation, in a fifty-strategy study, may be higher to OOS CAGR than in-sample CAGR is. The underlying aim of a risk-adjusted performance measure is often to be more robust than a simple measure like CAGR or average returns.

Now if there’s a fundamental flaw in my logic, I would really like to know what it is. It’s a problem I’ve been puzzling over for years, and if you poke a hole in my logic, I will be very grateful.

This must be generally true, I think—without trying to identify all of the reasons this is correct. And if it is correct for many reasons: all the better.

Can you assess risk without considering probabilities?

These metrics do look at risk, but they all do more than that. If you have a t-stat of 5 or higher using Parker’s method with the information ratio or the Sharpe ratio, that is a pretty good indication that the model has something about it that makes it outperform. If the model is truly good, its outperformance will likely persist.

The total return should correlate too, but some measures are designed to look directly at the probability that it is a good model.

I use the Sharpe ratio and information ratio as an example only because they can be easily converted to a p-value: an insight (albeit an incomplete one) into the probability that the model is good. I like the Omega ratio (and other metrics) too. In fact, the Omega ratio is a “weighted ratio of probabilities” at a certain threshold return. Your odds of success are right there in that one number. I would suggest that this is one of the reasons it is a good metric. And one of its potential strengths is that it makes a full accounting of the outliers’ effect (and of skew).
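
Here is roughly what I mean by “easily converted” (my own sketch of the standard approximation: the t-stat on the mean excess return is the per-period Sharpe or information ratio times the square root of the number of observations):

```python
import numpy as np
from scipy import stats

def ratio_to_pvalue(excess_returns):
    """Per-period Sharpe/information ratio, its t-stat, and a one-sided p-value
    for the null hypothesis that the true mean excess return is zero."""
    r = np.asarray(excess_returns, dtype=float)
    n = r.size
    ratio = r.mean() / r.std(ddof=1)
    t_stat = ratio * np.sqrt(n)
    p_value = stats.t.sf(t_stat, df=n - 1)
    return ratio, t_stat, p_value
```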

One thing to consider is that many models are just white noise. You can have the best speakers in the world, but if there is no signal they will not do much for you. You cannot turn static into The Rolling Stones’ Greatest Hits even with the best speakers in the world.

Finding just static on a channel may say nothing about the quality of your speakers (or the receiver).

Likewise, if a metric does not seem to be working we might consider whether there is a signal in the data. There probably is but it may be faint at times. One metric may be better at finding the signal than another but don’t expect to find Beethoven’s 5th Symphony on your road trip through Wyoming. It doesn’t matter how good the radio is.

When there is a strong signal most radios will be able to find it.

-Jim