Steve - I’m not so much worried about experienced P123 users, but newbies are easily distracted. It doesn’t take much for someone to put up an obviously overfit model and secure 80 subscribers at $100 per month; it has been demonstrated over and over again. I really think that blinders need to be in place. In other words, backtest data needs to be hard (but not necessarily impossible) to get at. I also believe that the first year OOS should be in the same category, so the inexperienced don’t flock to the latest “streaky” model.
** EDIT ** Apart from the performance-versus-simulation issue, I think that there is enough information on the live model page without throwing in additional info. And in the future the live model page will undoubtedly expand as people think up new things to put on it. I’m not sure why people are having a hard time accepting that you have to click on another tab to see the simulated backtest.
1) Include simulated performance graph without timing aspects (not talking about rolling tests here, but year-by-year results).
2) Any tests must be done both with and without timing if a system uses timing. Not including the timing results may be misleading if the timing method in question does in fact work.
3) Include 5yr liquidity statistics and historical weekly returns distribution.
Extra: include simulated performance since at least Dec 31, 1999. Did not say Jan 1999 as there is some incomplete data then.
Disagree! You have quite helpful ideas most of the time!
Agree!
Regarding the benchmark discussion:
I think all ideas have their merits:
a) Use a custom benchmark of the model, for example the weighted average of all the stocks that could possibly be bought by the buy screen - that would be the most accurate and model-specific way to check whether the ranking works. Cons: very individualistic, opaque, and possibly complicated to calculate.
b) The gray area: use a broader custom benchmark that reflects only specific aspects of the rules used in a portfolio - the benchmark would still be close to the model, but only cover certain aspects (like industry, liquidity, etc.). Cons: which factors are objective enough to allow for broad indices for any model?
c) Use a standardized benchmark, most probably the S&P 500 for all models. Pros: easy, understandable, transparent and STANDARDIZED for the purpose of comparison. Cons: not specific and thus less relevant for, for example, constant-hedge or market-neutral models.
I struggle with c) myself, for example with this model. The portfolio outperforms a rising stock market - using only a 30% allocation to stocks!
For reasons of simplicity, though, I still tend towards c). But then again, it would be nice to have both a version of a) and c) in different (or the same) charts … however, this would get messy for most end-users. There are some tough decisions to be made in weighing end-user friendliness against professionalism.
fips - I think simply taking the equal-weighted custom universe is conceptually simple and “should be” easily implemented. (I can say that because I am not the one doing the implementing.) Additional liquidity filters from buy/sell rules can be ignored, as this is part of the developer’s discretion. If there is a hedge, it can be added to the benchmark. In my opinion, this is a very clean solution, but of course it doesn’t work for ETFs.
When you look at the bigger picture, the performance of similar models should be compared. So you could have a category such as smallcaps. Choose an index such as the S&P 1500 or Russell 2000. Then any model whose custom benchmark has a correlation > 0.95 with that index (we’d need to experiment with this number) gets lumped into that group. I suspect you don’t even have to look at the makeup of the individual model, i.e. whether it truly is made up of smallcaps.
Using this methodology, the user could choose any index, even sector indices, and compare the models with highly correlated benchmarks. Instead of 5 categories like we have now, we could have a drop-down menu: choose an index category. Maybe there are 20 or 30 indices to choose from. And models would not be boxed into one category; they could be in multiple categories.
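A minimal sketch of that correlation-based grouping, assuming pandas return series for the custom benchmark and the candidate indices (the names and the 0.95 threshold are just placeholders, not anything P123 has built):

```python
# Rough sketch: assign a model to every index category whose returns correlate
# above a threshold with the model's custom (equal-weighted universe) benchmark.
# Series/column names are illustrative, not actual P123 data.
import pandas as pd

def assign_categories(custom_bench, index_returns, threshold=0.95):
    """custom_bench: Series of the model's custom-benchmark returns.
    index_returns: DataFrame with one column per candidate index.
    Returns the list of index categories whose correlation exceeds the threshold."""
    corr = index_returns.corrwith(custom_bench)
    return corr[corr > threshold].index.tolist()

# e.g. assign_categories(model_bench,
#                        pd.DataFrame({'SP1500': sp1500_returns,
#                                      'R2000': r2000_returns,
#                                      'SP500': sp500_returns}))
```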
I would suggest adding a 2-sample paired T-test for the R2G redesign. My impression is that it’s the single most important set of data in evaluating a third party R2G. I think it’s more difficult to beat from a data mining perspective compared to a historical approach that infers probabilities from rolling historical performance.
For the simulation period:
Sample 1 should be the daily log returns of the R2G since inception, and sample 2 should be the daily log returns of the equal-weighted total-return PR3000 index. You’d then subtract sample 2 from sample 1 to get the excess daily log return of the R2G, and that excess daily return stream is what you’d use for the t-statistic. I would then report the following:
1) Probability of the R2G beating the benchmark by more than 0%/yr over a 1-year period.
2) Probability of the R2G beating the benchmark by more than 5%/yr over a 1-year period.
3) Probability of the R2G beating the benchmark by more than 0%/yr over a 3-year period.
4) Probability of the R2G beating the benchmark by more than 5%/yr over a 3-year period.
In the paired t-test t-statistic formula, #1 and #2 would use n=250 to represent a 1 year period. #3 and #4 would use n=750 to represent a 3 year period. In all four cases, the “n” used in the t-statistic formula doesn’t match up with the actual number of values used to calculate the mean and standard deviation. This allows you to report probabilities for a 1yr or 3yr timeframe while still using the full simulation history.
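For what it’s worth, here is one way that calculation could be read, as a sketch in Python. It assumes daily log returns, a t-distribution approximation, and that the annual hurdle is converted to a log hurdle over the horizon; `prob_beating` is just an illustrative name, not anything TheAlgorithm specified.

```python
# Sketch of the probability calculation described above, under one reading of it.
# Assumes iid daily excess *log* returns and a t approximation of their distribution.
import numpy as np
from scipy import stats

def prob_beating(excess_daily_log, horizon_days, annual_hurdle):
    """Approximate P(model beats benchmark by more than `annual_hurdle` per year
    over a horizon of `horizon_days` trading days)."""
    mu = excess_daily_log.mean()          # daily mean excess log return (full history)
    sd = excess_daily_log.std(ddof=1)     # daily std of excess log return (full history)
    years = horizon_days / 250.0
    log_hurdle = np.log(1.0 + annual_hurdle) * years   # hurdle as a total log return
    # n here is the horizon length, not the number of observations used to
    # estimate mu and sd -- the mismatch described in the post above
    t = (horizon_days * mu - log_hurdle) / (sd * np.sqrt(horizon_days))
    return stats.t.cdf(t, df=horizon_days - 1)

# Example usage (r2g and bench are aligned daily simple-return series):
# excess = np.log(1 + r2g) - np.log(1 + bench)
# prob_beating(excess, 250, 0.00)   # beat by >0%/yr over 1 year
# prob_beating(excess, 250, 0.05)   # beat by >5%/yr over 1 year
# prob_beating(excess, 750, 0.00)   # beat by >0%/yr over 3 years
# prob_beating(excess, 750, 0.05)   # beat by >5%/yr over 3 years
```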
I would cap the probabilities reported so that anything under 60% is simply “<60%”. This will help ensure that people don’t start making faulty inferences about what the probabilities actually mean. Also anything over 95% should be capped so it’s only reported as “>95%” given that equity returns have fat tails.
I would also add a large bold footnote that warns users that the statistics underlying these probabilities assumes only one hypothesis is tested while in reality every designer has tested many more hypotheses than that before publishing their R2G. As a result, these probabilities don’t take into account the number of tests the designer has done (or the covariance of the return stream among these tests) so it doesn’t account for the multiple testing bias or the data mining bias. Also it makes other statistical assumptions that aren’t exactly matched to reality. Nevertheless, my impression is this is the single most important data point a 3rd party evaluator could use in choosing between R2Gs.
For the Out of Sample Period:
I would repeat the above process, but only using the out-of-sample data this time. Even though this is out of sample, this section should still have the warning about the multiple testing bias, because if you’re evaluating 100 R2Gs with sufficient out-of-sample data you still have the multiple testing bias: each R2G’s out-of-sample record is a separate test.
I don’t care how reliable statistics are obtained, and TheAlgorithm’s idea seems like a good one. Here is another approach to consider for getting a reliable paired t-test. This would also address the suggestion put forth that the universe be used as the benchmark.
A good way to do a paired t-test would be to pair each actual trade with a randomly selected stock from the universe over the same time period. Personally, I would be happy just to know the average difference between the returns of the actual trades and the randomly selected trades along with the standard deviation and the number of trades. I could then calculate my own statistics as desired.
I believe (and someone please correct me if I am wrong) this is a rigorous and correct way to do a paired t-test. There would be a few loose ends for discussion. How would slippage be handled on a sample of randomly selected stocks? I would be for no slippage at all, which would be the most conservative way to do it. This would also be a practical answer to the question of whether commissions and slippage outweigh any benefits (profits). Also, using the log of returns would probably be the most correct way to do it, but may be unnecessary with returns in the 5% to 10% ballpark. I also leave it to people smarter than I am as to whether the central limit theorem makes using the log necessary for a large sample like this.
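A rough sketch of that pairing and the t-test on the differences, assuming an illustrative data layout (a trade list plus a per-period lookup of universe returns, neither of which is an actual P123 export):

```python
# Pair each actual trade with a randomly drawn stock from the same universe over
# the same holding period, then test the mean difference in returns (no slippage
# applied to the random picks, per the discussion above).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def paired_trade_test(trade_returns, universe_returns_by_period):
    """trade_returns: list of (period_key, realized_return) for each actual trade.
    universe_returns_by_period: dict mapping period_key -> array of returns of
    every universe stock over that same period."""
    diffs = []
    for period, ret in trade_returns:
        random_ret = rng.choice(universe_returns_by_period[period])
        diffs.append(ret - random_ret)
    diffs = np.asarray(diffs)
    t_stat, p_value = stats.ttest_1samp(diffs, 0.0)
    # mean difference, its std dev, and the number of trades, as requested above
    return diffs.mean(), diffs.std(ddof=1), len(diffs), t_stat, p_value
```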
Finally, I would do the statistics over the entire backtested range as well (in addition to TheAlgorithm’s idea). This would address my main question as to whether the port is statistically superior to randomly selected stocks from the universe, using a large sample.
I think this would probably be doable, and after input from the entire community (correcting my mistakes), it would likely be high quality statistics.
I have run some non-paired t-tests, in- and out-of-sample. I think you will find that you get some pretty highly significant p-values on the good ports, and that this will be a selling point (rightfully so) to the quants and math geeks (I only aspire to math geekdom, and this is a compliment) that frequent this site.
What will be the defining criteria for performance, annualized returns or excess returns?
Annualized return will be very inflated for models launched at a market bottom; in fact, their stats will outperform all other models.
Excess return seems more appropriate: since it’s relative performance, the date of launch has no influence. But the benchmark needs to be the same for all.
The benchmark needs to be the same for all within a group of related systems. And in all likelihood, the more volatile the benchmark, the more volatile the model excess performance will be.
I like the suggestion for t-tests from TheAlgorithm. When analyzing my own models, I download factor return series from Kenneth French’s website, and run a multiple regression against my model returns. This helps to confirm exposure to several factors (e.g., value, size, profitability), including p-values. The constant (i.e., “alpha”) should be positive over multiple time frames (with low p-values), before I would feel comfortable with a model.
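As an illustration of that kind of factor regression (not the poster’s exact workflow), here is a sketch using pandas-datareader’s Ken French reader and statsmodels; the dataset name, the /100 scaling, and the monthly alignment of the model returns are assumptions of the sketch.

```python
# Regress a model's excess returns on Fama-French factors and inspect the
# constant ("alpha") and its p-value.
import pandas_datareader.data as web
import statsmodels.api as sm

# Monthly 5-factor data from Ken French's library, converted from percent
factors = web.DataReader('F-F_Research_Data_5_Factors_2x3', 'famafrench')[0] / 100.0

def factor_regression(model_returns, factors):
    """model_returns: monthly model returns indexed to match `factors`
    (frequency conversion and index alignment are left out of this sketch)."""
    excess = model_returns - factors['RF']                       # excess over risk-free
    X = sm.add_constant(factors[['Mkt-RF', 'SMB', 'HML', 'RMW', 'CMA']])
    results = sm.OLS(excess, X, missing='drop').fit()
    return results.summary()   # the 'const' row is the alpha and its p-value
```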
Regarding the proposed R2G changes, perhaps it may be helpful to have separate models that focus on ranking systems versus market timing. Market timing R2G models would only include the market timing signals, which can then be used as overlays to other R2G models (or one’s own models). It would then be much more transparent to evaluate the effectiveness of the underlying ranking system versus market timing or hedging.
I think it is important to point out that it is impossible to know the probability of an R2G beating the market, by any amount, over any period.
It is important to realise the limitations of mathematics - there are so many assumptions behind measures like the t-statistic as to render them useless. As Feynman said, one can only learn through experimentation.
Better to be upfront about the uncertainty of the future than to publish misleading information. There are those who know they don’t know, and those who don’t know they don’t know!
I think you know way more math and statistics than I ever will.
Still, I at least want to make an effort to know when I am getting into the negative (and very real) aspects of probability. I want to know when I am risking (to too high of a degree) gambler’s ruin. I want to know when my bet sizes are too high. I would want to know if I was taking on too much leverage, if I used leverage. Perhaps, more pertinent to my situation, I would want to know if I were making bets in the market with no edge at all. I will want to maximize any edge I think I might have.
Personally, I’m going to keep using whatever tools I have available to try to understand these questions: as flawed as they are.
I like statistical tests, but not on in-sample R2G results, because I believe they will yield meaningless results. The reason is that different types of R2G designs have different relationships between in-sample and out-of-sample results.
For example: buy and sell rules are notorious for failing unexpectedly – even when they have worked for years both in sample and out of sample. There was a forum discussion here about different methods to predict which AAII stock screen was going to outperform going forward; no one was able to come up with a good method. Stock screens = buy and sell rules. Therefore when it comes to buy and sell rules (both when designing my own system and when evaluating other systems), I assume that the rules will add zero going forward and look at what the system would have made with minimal rules. Market timing is in the same category as buy and sell rules in this respect.

Ranking systems are much more robust, but the important metric is how out-of-sample performance compared to in-sample. If a ranking system performed at 100% AR in sample and 50% AR out of sample, then I deduct 50% from simulated returns.

The type of factors used is also an indicator of robustness. Value factors tend to keep working because they measure intrinsic value. Certain types of technical factors can work extremely well for decades before they suddenly work in reverse without warning. For example, price pullbacks and low-priced stocks started working all of a sudden at around the time of the dot-com bubble; in the ’90s and earlier, low-priced stocks and pullbacks had below-market returns. By knowing the basic design of a model I can sometimes tell how reliable it is and whether it risks suddenly ceasing to beat the market without warning.
None of these things show up in the in-sample stats, and if those stats were published it would incentivize designers to optimize the stats instead of the results going forward.
When I evaluate a model I want to see how it would have performed without timing, without rules and without technical factors. By comparing how the model actually did out of sample to that figure I can get a feel for how well the in sample results predict future results.
There is one caveat in using this metric of comparing OOS to in sample. OOS performance over very short time periods (<45 days) is completely meaningless, because over any given day, week or even month, returns are quite random. From around day 45 a pattern may begin to form if the market doesn’t have a correction. For the first year or two, small daily fluctuations in returns can radically change the annualized return since launch from day to day.
Comparing OOS to in sample would also produce meaningless results for periods when the stock market had a correction or even a drawdown. There is sometimes very weak correlation between which systems work in one bear market and the next. For example, any system that avoided tech or loaded up on small value would have done well in the 2002 bear market without timing. In 2008 it was financials. In 2011 it was financials again, but with a twist. Which factors are going to work for the next bear market? Who knows? It’s not always possible to extrapolate from one bear market to another. But bull markets are often more similar to each other, and therefore you can often extrapolate from one to the next. That is why I like to use histograms comparing in sample to OOS: they don’t fluctuate from day to day and don’t get completely distorted by a single bear market.
Oliver raises a good point that statistics cannot “prove” anything. I personally like using quantitative methods such as regressions, but as one of many data points to help evaluate a model. But perhaps showing official-looking stats through P123 would be misleading.
Just a quick note: I like the statistics we have. We have annual returns and annualized standard deviations for all sims, R2G ports and their benchmarks. It is not too hard to use these to make a t-test. For individual ports, at least, the trades can be downloaded into Excel and statistics done that way. One could even download random trades with the screener, using the sim’s universe, to get good comparisons: Tom has done this with some interesting posts. Of course, the simulation itself is a statistical tool.
In any case, my real point is that one of the most established and accepted statistical tests is the 10- or 20-bucket rank performance test. Although a little qualitative, it is in every book I have read related to what we do at P123. I understand the problems this has for R2G ports that use a lot of buy/sell rules, and I have no opinion on how it should figure in the R2G model presentation.
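For reference, a minimal sketch of a bucket rank-performance test of that kind, assuming a flat table of per-period ranks and forward returns (the column names are illustrative, not a P123 export format):

```python
# Sort stocks into N rank buckets each period and average forward returns per bucket.
import pandas as pd

def bucket_performance(df, n_buckets=10):
    """df: DataFrame with columns ['date', 'rank', 'fwd_return'], where 'rank' is
    the ranking-system score and 'fwd_return' the next-period return.
    Returns the mean forward return for each rank bucket."""
    df = df.copy()
    df['bucket'] = df.groupby('date')['rank'].transform(
        lambda r: pd.qcut(r, n_buckets, labels=False, duplicates='drop'))
    return df.groupby('bucket')['fwd_return'].mean()

# A good ranking system should show returns rising roughly monotonically
# from the lowest bucket to the highest.
```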
Not to mention Sharpe ratio, Alpha, Beta, Sortino Ratio and correlation with the benchmark.
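A quick sketch of how those metrics could be computed from weekly return series, assuming a zero risk-free rate and 52 periods per year (simplifications for illustration, not necessarily how P123 computes them):

```python
# Basic performance metrics from aligned weekly return arrays (model vs. benchmark).
import numpy as np

def summary_stats(model, bench, periods_per_year=52):
    sharpe = np.sqrt(periods_per_year) * model.mean() / model.std(ddof=1)
    downside = model[model < 0].std(ddof=1)                 # downside deviation
    sortino = np.sqrt(periods_per_year) * model.mean() / downside
    beta = np.cov(model, bench, ddof=1)[0, 1] / np.var(bench, ddof=1)
    alpha = periods_per_year * (model.mean() - beta * bench.mean())
    corr = np.corrcoef(model, bench)[0, 1]
    return {'Sharpe': sharpe, 'Sortino': sortino, 'Alpha': alpha,
            'Beta': beta, 'Correlation': corr}
```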
Statistics are not useless; everything must simply be taken in its own context and understood properly (and rarely do people want to learn anyway).
The below is of course a “straw man” argument as I realize it takes a quote out of context, but it does provide an opportunity:
“Statistics cannot prove anything” - this is true but deceiving. The scientific method itself can never truly prove anything. Is science useless? … Hopefully everyone can understand that this is not the case.
The closest thing we have in science to a fact is a theory, which is the result of thorough experimentation and reasoning.
If I drop a ball 1000 times and it always falls to the ground, one can theorize that it will continue to do so. Statistics, depending on how they are generated to begin with, are basically one of the many ways to draw conclusions from experimentation itself.
Now… what happens if the ball is filled with helium? In that case, would science be useless? No, one would simply have discovered an additional phenomenon and would update or replace the theory by considering additional factors, such as density, to account for it.
The question then is not whether statistics can be useful when used with the required knowledge, but rather whether we can trust that everyone can understand and use them properly. I believe that upon reflection everyone knows the answer to this question.
Now, as to the matter at hand, I believe that if someone wants to produce their own statistics it is OK, but in the end asking P123 to put all the effort into complex tests that subscribers for the most part are not asking for is not a good business decision (although I would like that test). For now the weekly distributions can be used instead.
It is perhaps better to let each person experiment with his/her own methods and focus on other matters.
This is very true. I have run the same port each day of the week since September 2013: 5 ports each rebalanced weekly. The best day has returns nearly twice as good as the worst day. These ports are not even independent: they always hold many of the same stocks!!! So I would say that 45 days is meaningless and it takes much longer than I would have guessed to get meaningful out-of-sample data.
I have been considering that there may be a day of the week effect but I doubt it (and ANOVA tests do not support this idea). BTW, Monday is not the best day and quality of the data is not the answer.
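For anyone curious, that day-of-the-week check could be done with a one-way ANOVA along these lines (the return arrays are placeholders for the actual port series, not my data):

```python
# One-way ANOVA across the five weekly-rebalanced ports, one per rebalance day.
from scipy import stats

def day_of_week_anova(mon, tue, wed, thu, fri):
    """Each argument: array of weekly returns for the port rebalanced on that day.
    A large p-value means no evidence of a day-of-week effect."""
    f_stat, p_value = stats.f_oneway(mon, tue, wed, thu, fri)
    return f_stat, p_value
```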
There is a lot of luck in out-of-sample data too.
Edit: I was about to write that the above was a 5-stock port, where variability would be expected, and give you the results of my daily 10-stock ports started in June of 2013. Problem is, the best port does 2.4 times better than the worst for the 10-stock ports!!! Note: all the above ports had positive returns, with the worst barely beating the SP500 benchmark, so these are probably not too weird as ports go.
You don’t know what you don’t know for any data with a small sample size (unless you have run a power test).