I think Pitmaster has a point here. People should be able to choose their own benchmark, of course. Maybe they have a model that is heavily weighted with growth features and want to use a growth benchmark. This can be important information for a sub looking to use the model for investing.
But if it is a competition, which it seems to be for fun, maybe each “player” should be required to use their universe for excess returns (in addition to their selected benchmark) for comparison to other models in the competition.
In that case, I would think the information ratio (IR), or even IR * sqrt(number of years), would be a fair comparison, where IR is calculated from excess returns relative to the universe. (IR * sqrt(years) is essentially the t-statistic of the mean excess return, which is where the t-score thresholds discussed below come from.)
That would carry a lot of statistical meaning as well as being a fair, objective comparison, I think. The IR is a common metric in financial analysis.
Sharpe and Sortino ratios are still useful, but they can be misleading when the benchmark itself has had a high Sharpe ratio over the period, like now, with the S&P 500 having surged: a model can post a great Sharpe ratio while still lagging its universe. The IR helps distinguish between doing well and outperforming your universe, which is a key distinction for subscribers and competition scoring alike.
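To make that concrete, here is a minimal sketch of how these numbers could be computed from monthly return series. The function names, the monthly frequency, and the simulated data are my own assumptions for illustration, not anything tied to the site's actual data or methodology.

```python
import numpy as np

def annualized_sharpe(returns, rf, periods_per_year=12):
    """Sharpe ratio: mean excess return over the risk-free rate, per unit of volatility."""
    excess = np.asarray(returns) - np.asarray(rf)
    return excess.mean() / excess.std(ddof=1) * np.sqrt(periods_per_year)

def information_ratio(returns, universe_returns, periods_per_year=12):
    """IR: mean excess return over the universe, per unit of tracking error."""
    active = np.asarray(returns) - np.asarray(universe_returns)
    return active.mean() / active.std(ddof=1) * np.sqrt(periods_per_year)

def t_score(returns, universe_returns, periods_per_year=12):
    """t-statistic of the mean excess return, which works out to IR * sqrt(years)."""
    years = len(returns) / periods_per_year
    return information_ratio(returns, universe_returns, periods_per_year) * np.sqrt(years)

# Illustration with simulated data: 5 years of monthly returns.
rng = np.random.default_rng(0)
universe = rng.normal(0.007, 0.04, 60)          # hypothetical universe returns
model = universe + rng.normal(0.002, 0.01, 60)  # hypothetical model with some alpha
rf = np.full(60, 0.003)                         # flat monthly risk-free rate

print(round(annualized_sharpe(model, rf), 2),
      round(information_ratio(model, universe), 2),
      round(t_score(model, universe), 2))
```

The point of the IR here is that the denominator is tracking error against the universe rather than total volatility, so a model cannot score well just because the whole market went up.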
For the record, I don’t currently publish any models or subscribe to any. This is just a suggestion. But I can say that if a model approached a t-score of 3.0 with this metric, I’d probably start watching it daily—just like Pitmaster—not just for competition purposes, but potentially as a candidate to sub.
A bit in the weeds, but the biggest issue in model selection is the multiple comparisons problem. If you test enough models or factors, some will appear to perform well purely by chance and most of those will regress to the mean out of sample.
One way to address this is by controlling the False Discovery Rate (FDR)—that is, limiting the proportion of models you select (or even consider) that are likely false positives.
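For anyone who wants to do this formally, the standard tool is the Benjamini-Hochberg procedure. Here is a minimal sketch, assuming you already have a t-score for each model's excess return; the handful of "real alpha" models in the example are simulated, not anyone's actual results, and the t-to-p conversion uses a normal approximation for simplicity.

```python
import numpy as np
from scipy.stats import norm

def benjamini_hochberg(p_values, fdr=0.05):
    """Return a boolean mask of 'discoveries' while controlling the FDR at the given level."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    # Largest rank k (1-indexed) with p_(k) <= (k / m) * fdr; keep the k smallest p-values.
    passed = p[order] <= (np.arange(1, m + 1) / m) * fdr
    selected = np.zeros(m, dtype=bool)
    if passed.any():
        k = passed.nonzero()[0].max() + 1
        selected[order[:k]] = True
    return selected

# Illustration: t-scores for 100 models, a few of which have genuine alpha.
rng = np.random.default_rng(1)
t_scores = rng.normal(0.0, 1.0, 100)      # pure-luck models cluster around t = 0
t_scores[:3] += 4.0                       # pretend three models have a real edge
p_values = 2 * norm.sf(np.abs(t_scores))  # two-sided p-values (normal approximation)
print(np.nonzero(benjamini_hochberg(p_values, fdr=0.05))[0])
```

Typically only the planted models survive this filter; a lucky t-score around 2.5 does not make the cut once 100 comparisons are accounted for.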
And while this is a simplification, using a t-score threshold of ~3.0 or higher (which corresponds to a p-value < 0.003) is a statistically conservative rule of thumb. For a moderate number of comparisons (say, 100 factors or models), this is typically enough to keep your FDR under 5% in practice.
That may be a reasonable simplification for our context, given that there are roughly 100 Designer Models with meaningful out-of-sample performance to evaluate. It puts the t-score > 3.0 threshold on firm statistical ground, without being overly strict or impractical.
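Here is the back-of-the-envelope arithmetic behind that, assuming roughly 100 models and that the ones with no real edge behave like independent nulls; the count of five discoveries is just a made-up example number.

```python
from scipy.stats import norm

n_models = 100
t_threshold = 3.0

# Two-sided p-value for |t| > 3.0 under a normal approximation: about 0.0027.
p_cutoff = 2 * norm.sf(t_threshold)

# If all 100 models were pure luck, this many would clear the bar by chance alone.
expected_false_positives = n_models * p_cutoff        # about 0.27 models

# Suppose 5 models actually clear t > 3.0; the implied FDR is then:
discoveries = 5
implied_fdr = expected_false_positives / discoveries  # about 0.054, i.e. roughly 5%

print(round(p_cutoff, 4), round(expected_false_positives, 2), round(implied_fdr, 3))
```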
This is just a back-of-the-envelope calculation of the FDR with a t-score > 3.0. People would be free to use their own calculations or thresholds, of course.