Renaissance Technologies

Thank you Yuval. Very well reasoned and objective, I believe…

Speaking only for myself, there is one additional thing I look at, besides my own performance.

I ask myself just how special I really am. I have nothing to complain about as far as my returns go.

But I have the experience of a large sample of Designer Models, models that are doing poorly even over the last 5 years. I have to consider that luck could have played a part in my returns.

Luck that may not hold up if I am not much better than the Designers (on average). I still hold positions in my ports that have done well out-of-sample (if not so well recently), hoping for a rebound.

After all, maybe I am special and can do better than the Designers. Not by much (if at all) would be my realistic guess.

Hmmm. Reading my own post, maybe I should close my positions now.

-Jim

You can do it in Excel.
t Stat = -11.09546279
P(T<=t) one-tail = 1.04972E-17

WHOA!!!
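For anyone who wants to check this outside Excel, here is a minimal sketch of the same one-sample, one-tailed t-test in Python. The excess-return figures below are synthetic placeholders, not the actual Designer Model data.

```python
# Minimal sketch of a one-sample, one-tailed t-test on model excess
# returns. The data here is a synthetic placeholder for illustration;
# the real excess returns are in Georg's spreadsheet.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
excess = rng.normal(loc=-0.30, scale=0.25, size=75)  # one return per model

# H0: mean excess return vs. SPY is zero; one tail in the "less" direction.
t_stat, p_one_tail = stats.ttest_1samp(excess, popmean=0.0, alternative="less")
print(f"t Stat = {t_stat:.4f}, one-tail p = {p_one_tail:.3g}")
```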

-Jim

Jim, sorry to answer so late.

My post does not relate to black-box systems. I know I am shooting myself in the foot here, but I personally would never trade a black-box system, and I actually would not recommend it.

I might trade one, but only if the provider gave almost everything away (for example, in a 1:1 Zoom session where he or she reveals at least the ranking system and runs the robustness tests live together with me).

The reason: you need to have 100% trust in your system if you want to trade it, and I at least need to test it for robustness (1, 3, 5, 10, 20, 50, 100, 200 stocks; different caps (nano, small, mid, big); EvenID = 1 and EvenID = 0; different universes (US, Canada, Nasdaq 100, S&P 500); at least all sectors, if not 20 or more industries and sub-industries) and then look at the equity curve (no statistics, because they do not capture the tails, at least the ones I know).
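A minimal sketch of what such a robustness sweep could look like in code; `run_backtest` is a stub standing in for whatever backtest call you actually use, not a real P123 function:

```python
# Hypothetical sketch of the robustness sweep described above.
# run_backtest() is a stub, NOT a real P123 function.
from itertools import product
import random

def run_backtest(universe, cap, positions):
    """Stub: returns a fake monthly equity curve for illustration."""
    rnd = random.Random(hash((universe, cap, positions)))
    equity, curve = 1.0, []
    for _ in range(60):  # 60 months
        equity *= 1.0 + rnd.gauss(0.01, 0.05)
        curve.append(equity)
    return curve

position_counts = [1, 3, 5, 10, 20, 50, 100, 200]
cap_buckets = ["nano", "small", "mid", "big"]
universes = ["US", "Canada", "Nasdaq 100", "SP500"]

for n, cap, uni in product(position_counts, cap_buckets, universes):
    curve = run_backtest(uni, cap, n)
    # Eyeball each curve rather than relying on summary statistics,
    # which do not capture the tails.
    print(f"{uni:10s} {cap:5s} {n:3d} stocks -> final equity {curve[-1]:.2f}")
```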

Also, for every factor I use, I need to find at least a dozen research papers from academia (my favorite is OSAM) if I cannot backtest it over much longer timeframes (like back to 1870).

Also, I always make sure that the system I trade has a lot of degrees of freedom. That means as few buy and sell rules as possible and a lot of stocks left in the universe (I run the buy rules through the screener; if fewer than 1,000 stocks are left, I do not use the system and get rid of buy rules).

For example: I have a trading system that works great on the overall market and almost perfectly on the healthcare sector, but I do not use it (the healthcare version) simply because the stock count in healthcare is too low for me. I am too scared that something might happen to the sector and leave me dead in the water. I give up performance because I do not want to get even near over-optimization. I know I might leave something on the table, but the closer I get to optimization, the less confident I will be trading it.

The last step is to find a hypothesis for why the factor works, and if it cannot be explained by cycle behavior or simply emotions, I do not use it, no matter how good the backtest and the robustness tests were (though this step is highly subjective).

That is also the reason why I would never trade a system where I do not understand the methods that are used, for example, AI stuff where even the backtest changes from run to run.

The job of my coach is not to understand all this, but to keep me in the process. He asks: OK, let's go through the process and make sure you have done everything according to your rules. Then he says: OK, trust yourself (in the event of a drawdown that is hard for me).

My main point is: for me, following the above rules, P123 is already perfect (I might want a function where the system buys more stocks when it is 20% down; if you know how to do this, help me :-), and a sketch of one possibility follows below), also because it is already a lot of work to follow these rules. Something new I would probably not use until I have mastered the functionality set of today's P123, and I have not yet; maybe I am at 30% (though I am looking forward to international stocks and to how my models do there, because that is perfect for robustness testing). I want to master what is in front of me (and that is 80% my mindset and only 20% P123). It is me, not a missing function on P123; that is my point.
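Here is a hypothetical sketch of such a drawdown rule; the function name, thresholds, and position counts are all illustrative, not existing P123 functionality:

```python
# Hypothetical sketch of the rule Andreas asks about: hold more stocks
# once the portfolio is 20% below its high-water mark. Illustrative
# only; this is not an existing P123 function.
def target_position_count(equity, high_water, base_n=20, stressed_n=30):
    drawdown = equity / high_water - 1.0
    return stressed_n if drawdown <= -0.20 else base_n

print(target_position_count(100.0, 100.0))  # 20: no drawdown
print(target_position_count(75.0, 100.0))   # 30: down 25%, hold more stocks
```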

This process is 100% based on what I have learned since 2010 here with the P123 community.

Best Regards

Andreas

Andreas, I love you, but I think you do “shoot yourself in the foot.”

Ethically, I think you should (at P123) give a disclaimer. I really do. You certainly have a financial incentive since you have retired, and financial disclosures are required almost everywhere.

Since P123 does not require it yet, here it is:

The mean 2-year excess return of all of your systems: -20.5%. I have not looked at your 5-year returns (as Georg has averaged for all Designer Models).

I look forward to out-of-sample results on your new system. You surely have fitted the backtest, including timing. Maybe the results will be good. We will see.

Believe me, I understand that a t-score of -11 (as Georg verifies) is hard to overcome (p-value < 10^-17). I wish you the best with your efforts in this regard.

The probability of someone winning the lottery 2 times in a lifetime is greater than this.

I will add that I really like your systems and even hope mine are similar. Yours is one of the (group of) systems that I do not believe I can beat long term. Sure, I could get lucky for a year or two, but the law of averages will eventually catch up with me. One of the reasons I have cut my positions significantly.

-Jim

That p-value is just to match SPY.
For a 5-year return 20% higher than SPY, Excel calculates a t Stat = -16.16 and a two-tail p-value = 5.45E-26 for the 75 models under consideration.

I think a better measure to calculate the probability of a model performing better than SPY is to derive it from the number of models.
Number performing better = 6
Number performing worse = 69

Probability that a model performs better than SPY = 6/75 = 8.0%
95% Confidence Interval: 3.0% to 16.6%.

For models performing 20% better than SPY:
Number performing better = 3
Number performing worse = 72

Probability that a model performs 20% better than SPY = 3/75 = 4.0%
95% Confidence Interval: 0.8% to 11.3%.

Georg, Thanks for the other way to look at it. -Jim

Forgetting about the benchmark, there are 17 models showing a negative total return, the average being -21.2% over 5 years.
That means the probability of a model losing money is 17/75 = 22.7%, with a 95% confidence interval of 13.79% to 33.79%.
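The confidence intervals in this and the previous post are consistent with exact (Clopper-Pearson) binomial intervals. A minimal sketch that reproduces them, assuming that is the method used:

```python
# Exact (Clopper-Pearson) binomial confidence intervals; these
# reproduce the intervals quoted above (assuming that is the method).
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """95% CI (by default) for a proportion of k successes in n trials."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

print(clopper_pearson(6, 75))   # ~ (0.030, 0.166): beating SPY
print(clopper_pearson(3, 75))   # ~ (0.008, 0.113): beating SPY by 20%
print(clopper_pearson(17, 75))  # ~ (0.138, 0.338): losing money
```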

Therefore, the upper bound of the probability that a DM will lose money over 5 years is 34%.

That’s not a good bet. We have to re-think the design process of our models.
Any good suggestions would be welcome.

In the discussion so far we’ve been placing the blame for the failure of the designer models on over-optimization. So I decided to create a simulation without ANY optimization at all to see how it would have done over the last five years. You can find it here (it’s public): https://www.portfolio123.com/port_summary.jsp?portid=1592851 It’s a very simple system: it just buys the top 25 stocks ranked by the old QVGM ranking system and holds them until the rank goes below 95; the universe is the Prussell 3000. Well, lo and behold, it underperformed the S&P 500 by 42.25% over the last five years. EXACTLY the same as the designer models.
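In rule form, the system boils down to something like the following minimal sketch (illustrative only, not the actual P123 implementation):

```python
# Minimal sketch of the rule described above: buy the top 25 stocks by
# QVGM rank, hold each until its rank drops below 95. Illustrative
# only; not the actual P123 implementation.
def rebalance(holdings, ranks, top_n=25, sell_below=95.0):
    # ranks: ticker -> percentile rank (0-100) at this rebalance
    holdings = {t for t in holdings if ranks.get(t, 0.0) >= sell_below}
    for t in sorted(ranks, key=ranks.get, reverse=True):
        if len(holdings) >= top_n:
            break
        holdings.add(t)  # refill open slots from the highest-ranked stocks
    return holdings

print(rebalance({"CCC"}, {"AAA": 99.1, "BBB": 97.4, "CCC": 42.0}, top_n=2))
```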

Draw your own conclusions. I would hazard the following guesses:

  1. The old, tried-and-true factors like ROE and P/E have largely been arbitraged away, and stocks are priced in such a way that there are very few values to be found by using those factors. Investors are now so used to pricing stocks by using these metrics that it’s hard to find any advantage in using them.
  2. We may be in a regime that resembles the late 1990s, when overpriced high-growth tech stocks ruled the market and S&P slaughtered all other strategies in sight. As we all know, that was not meant to last.
  3. While there certainly have been some egregiously over-optimized designer models, given the “base rate” I have established, we cannot blame over-optimization entirely for their overall failure to perform. Some of the “failed” models may actually be very good ones that are just going through a blue streak. After all, every single tried-and-true strategy and every single great investor has had some long periods of underperformance.
    [Edit: somehow the chart below got lopped off at the right. The 2019 results are 13.95 for model, 27.20 for benchmark, and -13.25 for excess.]

Thank you, Yuval. EVERYTHING gets blamed on over-optimization. Yuval presents evidence that the answer may be more nuanced. Nuanced enough that I am sure I do not have an answer now. Maybe someone else does. But it should be looked at rationally, as Yuval is doing here.

If P123 can do it, they should run the Designer Models WITH NO SLIPPAGE and see if that takes care of the problem. It could be that our results are fairly random with the guaranteed drag of slippage. It may be a simple answer.

Yuval posted an excellent reason that timing should not work most of the time (the market is usually up). P123 should see if they can look at this as a potential problem.

We now have some pretty convincing evidence that optimization of sims has some sort of problem, whatever that problem is. GEORG IS JUST CORRECT ABOUT THIS, PERIOD. Yuval presents anecdotal evidence that optimization may not be the problem, or at least not the entire problem. Please correct me if that is not what this study shows or if there is another, more important lesson from the study.

P123 hired an AI expert who looked at Random Forests. They were understandably frustrated with the fact that this was no better than random. RANDOM IS LOOKING GOOD NOW, AS THE EVIDENCE WE NOW HAVE SHOWS THAT RANDOM FORESTS ARE BETTER THAN SIMS.

Sims are not Random but are statistically inferior. The “null hypothesis” that sims are equivalent to a random selection of stocks from the universe can probably be rejected.

This specific experiment should be done to be sure this statement is true. Someone with knowledge of this LIKE GEORG should do a paired comparison of sims with equal weighting of factors that seem good against a random selection of stocks from the universe. Perhaps P123 could randomly select from the 50 factors that they gave to the AI expert for this. Do this enough times to get an answer (this is NOT rocket science); a sketch follows below.
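A minimal sketch of such a paired experiment; all the returns here are synthetic placeholders, not real sim or Designer Model results:

```python
# Minimal sketch of the paired comparison proposed above: pair each
# equal-weighted-factor sim with one random draw from the same universe
# and test the paired differences. Synthetic placeholder data only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sim_returns = rng.normal(0.02, 0.10, size=100)   # hypothetical sim results
rand_returns = rng.normal(0.05, 0.10, size=100)  # hypothetical random picks

t_stat, p_val = stats.ttest_rel(sim_returns, rand_returns)
print(f"paired t = {t_stat:.2f}, two-tail p = {p_val:.3g}")
```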

P123 was going to look at support vector machines (hire the AI expert to do this). I think this can work ALONG WITH FEATURE SELECTION. The SVM must look at nonlinear solutions for this to work.
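A minimal sketch of that idea using scikit-learn: an RBF kernel gives the nonlinear fit. The data is synthetic, with signal only in the first factor:

```python
# Minimal sketch of a nonlinear SVM (RBF kernel) on a factor array.
# Synthetic data: only the first "factor" carries a (nonlinear) signal.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                    # 10 candidate factors
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)  # nonlinear in factor 0

svr = SVR(kernel="rbf", C=1.0).fit(X, y)
print(f"in-sample R^2: {svr.score(X, y):.2f}")
```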

P123 should look for proof of concept and see if machine learning can work. Once that is shown, let the P123 members find the best solutions and sell them as an expanded version of Designer Models.

The first attempt at machine learning, with a Random Forest, seems to be better than sims already. At least it was not shown to be a way to steadily lose money.

I have few answers. I have some anecdotal evidence that machine learning can outperform sims. I certainly do not have access to enough data to prove it.

If P123 could find something that is proven to work with an intelligent selection of factors, it would be helpful to their business model. This may be a proprietary method owned by Renaissance Technologies at this point. Or maybe not: knowledge is not owned by anyone. But it is NOT rocket science.

I do not think hiding the fact that a port is generally a method of gradually giving away your money is a good business model. A t-score of -11 is a tough handicap. People will eventually leave P123, because they have no money left to invest if for no other reason. I miss Oliver Keating. Some of his GREAT IDEAS are performing terribly. Why did he leave? He is smart enough to know when something does not work. I would have bet that DennyHawles’ models would have worked. Why didn’t they? Just market timing?

I had extreme respect for Oliver Keating!!! If his sims cannot perform, then mine cannot either (long term). That is a simple fact. Since he is gone now, because he has a large number of sims (24), and because there is no recent survivorship bias, we could look at his stuff (along with other good samples). 2-year excess return: -26%. I have not looked at his 5-year excess returns. Denny’s models speak for themselves without any statistics.

I have been a big fan of P123, but I am not going to chase a bunch of cherry-picked backtests. If the sims I have now (that have done okay out-of-sample up until now) do not rebound in a year or two, I will move completely to a method that uses price data to select Sector ETFs. I will probably stay with P123 to get this data. But Yahoo has this data too.

I would check slippage and timing first. I am not saying I know, here in December of 2019. But I will know at some point (it is NOT rocket science). I do not know if P123 will be with me then. So far they have been hostile toward finding out, with Yuval’s study here being (perhaps) a sign of a change at P123. It is a first step.

-Jim

Keating ranking, 3-month rebalance, not changed since then…


Used the same ranking from 2011–14 too… not optimized… was able to beat transaction costs in real time too!

Yes. It appears all quants are “over-optimized,” even the old-time pros like Asness and O’Shaughnessy, if they’ve been buying value or quality or small caps the last few years.

Andreas,

Make sure to share some of your abilities in the Designer Models when you get a chance. You have enough models that your average result would be good if you had a secret that you were sharing with us in these models.

I am happy to keep the discussion to Olikea, or to all of the Designers as a group, if people would stop making anecdotal claims that cannot be verified. This is important enough to P123 that we need objective data, I think.

Anecdotal stuff and cherry-picking need to stop if P123 is to survive.

Fundamental analysis, until we buy black boxes that do not work, is a business model that can be expanded upon, IMHO. Andreas’ model aside: perhaps it is not a black box. Maybe we can all just use his model. I might even subscribe to help Andreas. I will want out-of-sample results before I put any money in it.

-Jim

Good to know.

With regard to P123, we manage to underperform even value benchmarks when they are used.

One could repeat Georg’s study with a value index used as the benchmark for all models to get more information on this hypothesis. Not sure what it would show.

Georg presented convincing evidence for 5 years. Not a few.

But you present a hypothesis that will be tested with time.

I only suggest that we do a little active testing. If people would prefer to just let Designer Models run for another 5 years and see what comments we get in the forum, that is a plan at least.

BTW, has anyone seen a p-value like Georg gets? I understand we are expected to accept anything we are linked to over at SeekingAlpha, but not the best p-value you have ever seen.

Good luck with that.

-Jim

So here is the evidence that backtesting does not provide a good indication of OOS performance:

I have selected 25 DMs with large + mid caps >= 70%, all with inception dates earlier than 5 years ago, and put them into a book. (25 models is the max allowed in a book.)

Look at the backtest from 2002 to 2014. That is the backtest period which designers considered. From Figure-1 one can see that designers did very well: annualized return = 28.6% with a max D/D = -18%. The 2008 financial crisis is not even visible on the performance curve. Calendar year performance is equally impressive: every year has positive returns, all exceeding that of SPY, as shown in Figure-2.

So why did this great simulated performance not continue over the 5-year out-of-sample period 12/1/2014 to 12/2/2019 (Figure-3)?

Almost immediately the combo starts underperforming SPY: over 2015 by -4.0%. How can that be, when for each of the preceding 13 years it outperformed SPY?

Over the last 5 years the annualized return = 5.0% with a max D/D = -21%. Calendar year performance is equally unimpressive: every year the 25 DMs underperformed SPY, 2015 to 2019: -4.04%, -6.34%, -4.33%, -4.89%, -11.48%.

Performance relative to value is not much better. Calendar year performance relative to IWD is equally unimpressive, 2015 to 2019: 1.17%, -11.60%, 3.93%, -1.01%, -7.20%.




Georg,

Thank you.

Shouldn’t P123 hope this is overfitting? I think they should.

Twice you mentioned AIC, which will stop overfitting in its tracks. I leave it to you to try to explain how you would implement it to the staff at P123 and the members, if you think anyone (other than me) will listen to you the third time. I think they should.

AIC is good. I also like LASSO regression. That would mean using that m x n array I keep mentioning to do a LASSO regression. But once the features are selected, you could move back to a rank performance optimization with the selected features. BTW, LASSO and AIC would not use much in the way of computer resources.

Result: no more overfitting. Gone, nada, none. Zero, zip overfitting. The big goose egg….

Ultimately I would use something related, myself, but I doubt we will get past AIC or LASSO regression in any discussion.
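A minimal sketch of the LASSO feature-selection step, on synthetic data standing in for the m x n factor array (signal only in the first factor, so the L1 penalty should zero out most of the rest):

```python
# Minimal sketch of LASSO feature selection on an m x n factor array
# (m observations, n candidate factors). Synthetic data: only the
# first factor carries signal.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))            # 50 candidate factors
y = 0.5 * X[:, 0] + rng.normal(size=1000)  # signal in factor 0 only

lasso = LassoCV(cv=5).fit(X, y)
kept = np.flatnonzero(lasso.coef_)  # factors surviving the L1 penalty
print("selected factors:", kept)
```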

Anyway, I think P123 should hope the problem is the easily addressed one of overfitting and give us the tools to end it (if that is indeed the problem). P123 should hire a consultant if Georg is not understood by P123 staff this third time.

It is NOT rocket science. But you cannot read a post about this somewhere and become an expert either. Georg has obviously studied this to know this as well as he does. But I am going to stick my neck out and say that despite his obvious training and credentials he might not have been a “Rocket Scientist.”

Okay, it helps to be a Rocket Scientist or Engineer like Georg. People should really listen to his ideas about AIC and maybe some of the other methods of addressing overfitting. If overfitting is the cause of the poor Designer Model performance, it would probably pay to hire a consultant to make sure Georg is understood.

-Jim

Yuval:

In my opinion, your first two points are very plausible, and your final one is an absolute certainty.

From a logical and behavioral perspective, point 3 has to be true: if a model or group of related models did not have “a blue streak … some long periods of underperformance,” then everybody, along with all their siblings and cousins, would eventually be using the model, and it would then stop working; and, this is key, it would stop working for long enough to get the majority of people to stop using it. At that point it would have a good probability of starting to work again.

Put in other words, if a method is going to work over the long term it has to inflict enough pain, from time to time, along the way to get most people to quit. It is a variation of “no pain, no gain”.

Our data only goes back to 1999, but I’ve heard some who have access to data from the 1990s say that value and small caps did not work very well compared to growth and larger caps during that period. From that perspective it is not surprising that the typical P123 portfolio (mine included) backtested very well from 2000-2002 (a general bear market for the large indexes and range-bound for the R2000) and continued to do outstandingly well for 2003-2007. Many of our models did well for 2009-2017. Our methods are currently in a 1.5-year period of painful underperformance.

Will this period of underperformance be over in 3 months or 3 more years? I have no idea of the duration, but I believe it will need to be long enough and painful enough to get the majority of money to stop using the types of value and small-cap models that worked well in the past.

Well that’s my 2 cents.

Brian

You’ve probably heard that a picture is worth a thousand words. Here is one of those pictures. In order to make the right decisions, you need to know the facts.

As you can see, this older ranking system did its job even over the most recent twelve tough months. Yet, it did not beat SPY!

Think: How could SPY beat all the buckets? What does this tell us about what happened? What does this tell us about the power of the ranking systems?

Suggestion: use the correct benchmark. Russell 3000 equal weight is close enough, but an equal weight of the sim’s universe is even better, especially for targeted universes.

A DM might be working. Or it might not. But what do you learn by comparing it to the SPY?


Good thoughts. And ultimately it may take a lot of ideas to express our present situation.

The idea that it takes a series of events for only 8% of Designer Models to beat their benchmark over a 5-year period is a good one. And as Georg said:

It is not wrong to think there may be multiple problems, including the above ideas, I would guess. I am not sure it is even possible for just one thing to do this.

I think posts implying there is just one problem, so no worries, are pure speculation.

BTW, does equal weighting remove the problem of overfitting? I think not. Otherwise, feature selection to remove noise factors would not be the standard for statistical learning. Look to Georg’s suggestion of AIC for an informed idea about removing the noise factors that cause overfitting; a sketch follows after the next paragraph.

Generally, factors are removed rather than given equal weight to prevent overfitting.
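Here is a minimal sketch of AIC-based factor selection with statsmodels, on synthetic data with signal only in the first factor; a factor earns its place only if adding it lowers the AIC:

```python
# Minimal sketch of AIC-based model comparison: AIC = 2k - 2*ln(L),
# so extra factors must improve the likelihood enough to pay for
# themselves. Synthetic data: only the first factor carries signal.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))             # 5 candidate factors
y = 0.4 * X[:, 0] + rng.normal(size=500)  # signal in factor 0 only

for k in range(1, 6):
    fit = sm.OLS(y, sm.add_constant(X[:, :k])).fit()
    print(f"factors 1..{k}: AIC = {fit.aic:.1f}")
```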

Which is not to say overfitting is the only problem. But I see no evidence that it is not one of the factors, not in this thread for sure. I do think P123 should hire a consultant to give us the tools to end overfitting when it is a problem for a Designer.

-Jim