alternatives to optimization?

This question is not at all related to my role as product manager at Portfolio123 but only to the way I use Portfolio123 to create ranking systems to choose stocks to invest in.

If I have 100 factors that I’ve created, that I believe in, and that I have good reason to believe actually work, what is the best way to combine them into a ranking system for superior OUT-OF-SAMPLE performance? (Obviously, some of these factors will get 0 weight.) Do I optimize the ranking system for incredible in-sample performance? That is what I’ve been doing for years, and I wonder if I shouldn’t. I can create a simulation that holds 20 to 25 stocks and gets a CAGR of 60% since 1/1/2004 and 62% over the last ten years (using variable slippage). My worry is that this is over-optimized and will result in low out-of-sample performance, as per my most recent blog post. Should I therefore take a more oblique approach, and if so, what would be a good one? I know that many Portfolio123 users want to optimize their ranking systems, but perhaps some of you have come up with alternatives.

My guiding principle is that I want to use my own discretion in creating factors but I want to use some sort of objective/mechanical method to combine them into a ranking system. As I wrote in another forum discussion, I have been experimenting with a very crude form of bootstrap aggregation, but it’s presenting a large number of problems. I’d be grateful for any suggestions/thoughts.

Yuval - I cannot provide any mathematical basis for what I do, so take it for what it is worth. My experience comes from playing with Neural Nets in the '90s, back when financial NNs were fashionable. The main criticism back then wasn’t that NNs don’t work but that the result is a black box with no visibility into the decision-making process. People have noted a lot of similarities between my optimization process and NN training. All I can say is that my OOS results are generally positive.

I like to optimize the RS multiple times; each optimization run produces a highly optimized RS with different factors and weights. I usually start with ~20 or 30 factors and, through the optimization process, reduce them down to maybe 5-10 for any one optimized RS. I typically generate 5 optimized ranking systems. This is strictly due to time constraints; if I could speed up the optimization process, I would generate more optimized RS’s. If possible, you want to generate optimized RS’s with low correlation with one another. Unfortunately, I don’t have a tool for such an evaluation. In place of such a tool, I try to ensure that each RS is generally different, with no two RS’s having exactly the same factors or similar weights for given factors.

I then combine the optimized Ranking Systems into one super RS. While the optimized RS’s are “highly optimized”, the Super RS is not. The optimized RS’s are operated in parallel as nodes with equal weight in the combined RS. This tends to diffuse the highly tuned output from the individually optimized RS’s.

For each optimized RS, I vary the rebalance frequency, start date, and total lookback period. I don’t extend the lookback period too far, usually in the neighborhood of 5-10 years only. You need to be prepared to re-optimize yearly for a lookback of 5 years, and every 2 years if you use a lookback period of 10 years. The OOS period is only good for ~1/5 of the lookback (in-sample) period before results start degrading.

For each optimized RS, you need to squeeze as hard as possible. By this I mean that you want to eliminate as many factors as possible while retaining as much backtest performance as possible. The buzz word for Neural Nets in the modern world is “Deep Learning”. This is exactly what you want to avoid. The objective is not to memorize history, but to make it impossible to memorize. By pruning as many factors in the RS as possible, you reduce the potential for memorization. What you are left with is potential for discovery of hidden relationships and hence prediction capability. You will still end up with some memorization but you want as much prediction capability as possible with minimum memorization.
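To make the “squeeze” concrete, here is a toy sketch of greedy backward elimination of ranking-system factors. Note the assumptions: `backtest_score` is a hypothetical stand-in for running a full simulation on a given factor set (here it simply hard-codes which factors carry signal), and the factor names and `max_drop` tolerance are invented for illustration.

```python
# A toy sketch of the "squeeze" step: greedy backward elimination of
# ranking-system factors. `backtest_score` is a hypothetical stand-in
# for running a full simulation on a given factor set.

def backtest_score(factors):
    # Pretend only three factors contribute real performance.
    signal = {"value": 0.30, "momentum": 0.25, "quality": 0.20}
    return sum(signal.get(f, 0.01) for f in factors)

def prune_factors(factors, score_fn, max_drop=0.02):
    """Repeatedly remove the factor whose removal costs the least,
    stopping when any further removal would cost more than max_drop."""
    current = list(factors)
    best = score_fn(current)
    while len(current) > 1:
        # Score every one-factor-removed candidate set.
        candidates = [(score_fn([f for f in current if f != g]), g)
                      for g in current]
        new_score, cheapest = max(candidates)
        if best - new_score > max_drop:
            break  # pruning further costs too much backtest performance
        current.remove(cheapest)
        best = new_score
    return current, best

kept, score = prune_factors(
    ["value", "momentum", "quality", "noise1", "noise2"], backtest_score)
print(kept)  # the two near-zero "noise" factors get pruned
```

In practice each call to `backtest_score` would be a full simulation run, which is exactly the time constraint described above.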

Welcome to my dark world.


Hi Yuval,

My current approach is to see how any “optimization” affects multiple sub-periods. My assumption is that if a factor is robust, it should improve results in significantly more sub-periods than it hurts. I also look to make sure the improvements are in recent periods as well as earlier ones, because sometimes changing a factor can improve the overall results and the results for years prior to 2010 but not help recent years.

An example might help. I recently tried adding to my own Trading Universe some of the Universe filters you made public in one of your sample models.

Universe Filter = DaysFromMergerAnn = NA or DaysFromMergerAnn < 0

Results for My Model “SA”,
Overall CAR (21 year test) up 1%.
CAR up 2% or more for 2/3 sub periods (1999-2007, 2008-13, 2014-19)
CAR up 2% or more for 3/7 sub periods (3yrs each). I ignore differences of less than 2% for sub periods because that might just be noise.
CAR up 2% or more for 3/10 sub periods (2yrs each starting with 2000)
The improvements were in sub periods before and after 2010 (2 after, 1 before) for both the 3yr and 2yr sub periods
My conclusion — it appears to help (a bit) and to be somewhat robust
My action — none yet because I want to see that it also works for my other methods. If it helps more than one of my methods my confidence in its robustness will go up.

Results for My Model “P2E”,
Overall CAR (21 year test) up 2%,
CAR up 2% or more for 1/3 large sub periods (1999-2007, 2008-13, 2014-19)
CAR up 2% or more for 3/7 three year sub periods
CAR up 2% or more for 3/10 two year sub periods (1999 not included, 2yr periods starting 2000)
The improvements were in sub periods before and after 2010 (but none before 2008)
My observation — it appears to help (a bit) and to have moderate robustness
My reaction — confidence in robustness is increasing since it helps two of my methods (which are quite different from each other)

Results for My Model “AER”
Overall CAR (21 year test) unchanged (less than 1% so I consider that noise)
CAR up 2% or more for 0/3 large sub periods (1999-2007, 2008-13, 2014-19)
CAR up 2% or more for 1/7 three year sub periods (rest were less than 2% worse or better which might be just noise so I ignore that)
CAR up 2% or more for 1/10 two year sub periods (1999 not included, 2yr periods starting 2000)
The single improvement was in sub periods before 2010 (around 2004-2006)
My observation — it appears to help (a tiny bit) but if this was all I had I’d say it was insignificant and not robust.

Conclusion — I decided to put this filter into my trading Universe. I consider it to be moderately robust since it clearly helped two methods and helped the third method a small amount. Of particular importance to me is that it helped sub periods before and after 2010.

Another Universe Filter idea from one of your public models:
Universe Filter = AltmanZOrig > 0 or AltmanZOrig = NA
This might help exclude stocks that could go bankrupt. But let’s see if it helps my 3 trading methods.

Results for My Model “SA”,
Overall CAR (21 year test) unchanged (ie less than 1% which I treat as noise).
CAR down 2% or more for 2/3 sub periods (1999-2007, 2008-13, 2014-19) BUT up by more than 2% for 1/3
CAR down 2% or more for 2/7 sub periods (3yrs each). And up for 2/7. The other 3 periods differed by less than 2% so noise in my view.
CAR down 2% or more for 4/10 sub periods (2yrs each starting with 2000) and up for 5/10 with the other 1 period being less than 2% so ignored.
The improvements (and worsening) were in sub periods before and after 2010 but most of the better periods were before 2010.
My tentative conclusion — Does not appear to be robust. The number of Up and Down periods are almost equal.

Results for My Model “P2E”,
Overall CAR (21 year test) is DOWN 4%,
CAR down 2% or more for 3/3 large sub periods (1999-2007, 2008-13, 2014-19)
CAR down 2% or more for 5/7 three year sub periods. And up 2% or more for 1/7 (with 1 period less than 2% either way)
CAR down 2% or more for 7/10 two year sub periods (2yr periods starting 2000). And up for 2/10 periods.
The down results appear in sub periods before and after 2010.
My observation — it appears to hurt (a LOT) and to do so consistently (i.e., very robust in its effect on method “P2E”)
My reaction — very different results vs my model “SA”. What about my 3rd model?

Results for My Model “AER”
Overall CAR (21 year test) unchanged (less than 1% so I consider that noise)
CAR up 2% or more for 0/3 large sub periods (1999-2007, 2008-13, 2014-19), And no down periods (2% threshold)
CAR down 2% or more for 1/7 three year sub periods (rest were less than 2% worse or better which might be just noise so I ignore that)
CAR down 2% or more for 2/10 two year sub periods (2yr periods start 2000), and up by 2% or more 1/10 periods.
The down periods (2% or more) are only seen in the 2008-2011 years with the 1 up period in 2010-11. Rest are less than 2% difference so maybe noise.
My observation — Similar to results method “SA” (not robust).

Conclusion (applies only to my models) regarding the bankruptcy filter — Don’t use this filter for my current models. For two of my models it is not robust. For the 3rd (“P2E”) it consistently reduces profits. Although this Universe filter might help models that other people use, it is not robust (and often hurts profits) for my three current trading models.

— My main point in this post is not to argue for or against either of the two Universe filters presented as examples above.
— My main purpose is to illustrate that looking at multiple sub periods is the best way I have found to evaluate whether or not an item is robust.
— There may be other good ways, perhaps better, to determine robustness. I’m eager to see how others tackle the challenging issue of determining robustness.
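For what it’s worth, the sub-period tally described above is easy to mechanize. In this sketch the CAR figures are made up, and the 2-point threshold mirrors the noise cutoff used in the examples:

```python
# Sketch of the sub-period tally described above: compare CAR with and
# without a change across sub-periods, treating differences under a
# 2-point threshold as noise. The CAR numbers here are invented.

def tally(base, variant, threshold=2.0):
    """Count sub-periods where the variant helps, hurts, or is noise."""
    up = down = noise = 0
    for b, v in zip(base, variant):
        diff = v - b
        if diff >= threshold:
            up += 1
        elif diff <= -threshold:
            down += 1
        else:
            noise += 1
    return {"up": up, "down": down, "noise": noise}

base_car    = [12.0, 8.0, 15.0, 20.0, 5.0, 9.0, 11.0]   # 7 three-year periods
variant_car = [14.5, 7.0, 18.0, 19.5, 5.5, 12.0, 10.0]  # same periods, with filter

print(tally(base_car, variant_car))
```

A change that scores mostly “up” across the 3-period, 7-period, and 10-period splits, with wins both before and after 2010, would pass the robustness screen described above.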



Combining multiple pared-down RS’s into a single one gives me something new to think about.



Steve -

Your approach is really interesting and has given me a lot of ideas. I really appreciate your openness in sharing it.

  • Yuval

Hi SteveA,

This has extreme merit, IMHO. Obviously, you have general experience in machine learning (not just neural nets). It is also painfully obvious that we do not have the tools, as you say.

If your nodes are completely uncorrelated then you eliminate the collinearity problem (or multicollinearity problem) and you could even consider using multiple regression.

This is one answer to Yuval’s original question: “alternative to optimization?”

But multiple regression has the problem of the linearity assumption. Fortunately, the whole rest of machine learning is about getting around the linearity assumption (almost).

Jim Simons (Renaissance Technologies) went far using Kernel Regression, which gets around the linearity assumption. A P123 member could go far using a type of Kernel Regression (LOESS) on these uncorrelated nodes—if they had access to the data. LOESS works well; I have used it. But it is older.
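For readers unfamiliar with kernel regression, here is a minimal Nadaraya-Watson smoother, the basic idea underlying LOESS-style methods (LOESS proper adds local linear fits and robustness weights). The rank/return data below is simulated, and the bandwidth is an arbitrary choice:

```python
# A minimal Nadaraya-Watson kernel regression: predict y at a point as
# a locally weighted average of nearby observations, so no linearity
# assumption is needed. Pure-numpy sketch on simulated data.

import numpy as np

def kernel_regress(x_train, y_train, x_query, bandwidth=0.1):
    """Gaussian-kernel weighted average of y_train around each query point."""
    preds = []
    for xq in np.asarray(x_query, dtype=float):
        w = np.exp(-0.5 * ((x_train - xq) / bandwidth) ** 2)
        preds.append(np.sum(w * y_train) / np.sum(w))
    return np.array(preds)

rng = np.random.default_rng(0)
ranks = rng.uniform(0, 1, 200)                       # e.g. a factor rank in [0, 1]
rets = np.sin(3 * ranks) + rng.normal(0, 0.1, 200)   # nonlinear relation plus noise

fit = kernel_regress(ranks, rets, [0.1, 0.5, 0.9])
print(fit)  # roughly sin(0.3), sin(1.5), sin(2.7)
```

A linear regression fit to the same data would miss the hump entirely; the kernel smoother recovers it with no functional-form assumption.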

BTW, if we had the tools, each node could be optimized using Principal Component Analysis (PCA). So yes, there are well-established methods that do not involve manual optimization.

Yuval introduced bagging to the forum. A Random Forest with bagging is non-linear. The bootstrapping also has advantages of its own for reducing overfitting. But this cannot be implemented with P123 tools—it looks like Yuval tried with spreadsheets and is about to give up. Good effort!

SteveA, I am coming around to your view that one should not backtest too far so that the model changes as the market changes. ONE COULD KNOW HOW FAR BACK TO TEST IF ONE USED WALK FORWARD OPTIMIZATION as de Prado discusses extensively: basically, everywhere he writes.

If shorter backtesting methods do not work, then perhaps time series analysis could help adjust the weights of ports. Perhaps something like ARIMA.

There are a bunch of smart people who have worked on the answer to Yuval’s question AND FOUND ANSWERS. One can get a graduate degree in this. One can access all of the tools through Python without having to get a graduate degree.

The people at P123 are truly intelligent people. I did not think we could reinvent the wheel so well. But we cannot keep up with Python and a bunch of people with graduate degrees here. And we cannot hope to compete with the institutions who can hire people with graduate degrees and who have experience with Python.

If we want to get beyond the wheel into the jet age, we will need Python. It is not as hard as it might seem and perhaps the forum could be used to share some script.

Honestly, if we want to see the designer models beat their benchmarks I think this will be necessary. P123 loses members when they lose money.


P123 could attract new members if they could use basic, proven methods without having to get multiple features approved to be able to use these methods. Features—that in truth—will never be approved.


For an anecdotal example, WARREN BUFFETT IS NOT TOP DOG! Jim Simons beats him handsomely: see image.

BUFFETT’S RETURNS ARE DECLINING AS ARE THE RETURNS AT P123 I THINK. For Warren Buffett, it is a reality that he admits to frequently. Some of us are in denial.

But it is fun to reinvent the wheel. A joy to try to replicate—to the extent possible—what the institutions do using our spreadsheets.


What is “a random forest with bagging”? If the obstacles are excel / something with programming / macros, maybe I can help… but I don’t understand what random forest bagging means.

A Random Forest cannot be done in Excel.

I already know how to do it. I think the person who originally wanted to understand methods that do not require optimization was Yuval. Depending on the situation, multiple regression, Kernel Regression or Random Forests MAY be reasonable methods. There are other methods. I have tried them all and I use something else now—for the most part. Random forests can work well. LOESS works VERY WELL but is computer resource intensive. Multiple regression with feature selection (e.g., LASSO) works very well for linear data. And of course, I use nothing if I cannot get the data.
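As an aside, the LASSO feature selection mentioned above can be illustrated in a few lines: fit a penalized regression of forward returns on factor ranks, and drop the factors whose coefficients are shrunk exactly to zero. The data below is simulated, and the factor names and `alpha` value are arbitrary choices for the sketch:

```python
# A minimal sketch of LASSO feature selection: a penalized linear
# regression of forward returns on factor ranks. Factors whose
# coefficients shrink to exactly zero are dropped. Simulated data.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 1000
names = ["value", "momentum", "quality", "noise1", "noise2"]
X = rng.uniform(0, 1, (n, len(names)))               # factor ranks
# Only the first two columns actually drive returns in this toy setup.
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.05, n)

model = Lasso(alpha=0.01).fit(X, y)
selected = [f for f, c in zip(names, model.coef_) if abs(c) > 1e-6]
print(selected)
```

The surviving coefficients could then be renormalized into ranking-system weights; the point is only that the selection step is mechanical rather than manual.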

So, I recommend none of these methods in particular. One has to find the best method for the type of data they have—if they do not want to do manual optimization. The original question was about methods that do not require manual optimization. People who want to can stick with manual optimization; in fact, they are essentially forced to in most situations.

Assuming Yuval does not have access to increased downloads within P123, I suspect he already knows the best methods to download data. I am not sure I need any help in this regard either.

But if Yuval has problems accomplishing this within P123, even with his position as Product Developer at P123, then you can bet anyone wanting to do real bootstrapping is having problems. And potential problems would not be limited to bootstrapping.

I do want to make clear that Yuval has often said we should use the downloads available. I COULD NOT AGREE MORE. The ARIMA method I mentioned mostly uses price data that can be easily downloaded. This can then be easily loaded into Python. No feature request is necessary, and nothing need be done to use Python.
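As a hedged illustration of the ARIMA idea, here is a bare-bones AR(1) fit, the simplest special case of the ARIMA family, done in plain numpy on a simulated return series. In practice one would use a library such as statsmodels on actual downloaded price data; everything below (the series, the coefficient, the forecast horizon) is invented:

```python
# A bare-bones AR(1) model -- the simplest special case of ARIMA --
# fit by least squares on a simulated return series.

import numpy as np

def fit_ar1(returns):
    """Least-squares fit of r[t] = c + phi * r[t-1] + noise."""
    y = returns[1:]
    x = returns[:-1]
    phi, c = np.polyfit(x, y, 1)  # slope = phi, intercept = c
    return c, phi

def forecast_ar1(c, phi, last, steps=3):
    """Iterate the fitted recursion forward from the last observation."""
    out = []
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out

# Simulate 2000 autocorrelated "returns" with true phi = 0.5.
rng = np.random.default_rng(1)
r = [0.0]
for _ in range(2000):
    r.append(0.5 * r[-1] + rng.normal(0, 0.01))
r = np.array(r)

c, phi = fit_ar1(r)
print(round(phi, 2))                       # should recover roughly 0.5
print(forecast_ar1(c, phi, r[-1], steps=2))
```

Whether such a fit has any predictive value for portfolio weights is exactly the open question in this thread; the sketch only shows that the mechanics need nothing beyond downloadable price data.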

Yuval definitely has a point where this applies. I am not sure I would like the way P123 would implement ARIMA anyway.

But anyone who has a real question about methods that do not require optimization knows the options are limited at P123.


For me, everything is whether I can explain why I did it. If I cannot, I don’t do it and just group similar styles into similar sub-folders, and each factor, sub-folder, and folder is equal-weight.

But if I can explain why…

For instance, technical price rules work well but create high turnover. Therefore, I often choose to give price momentum or reversion about half the normal weight. Or else I eliminate it and just use a rule like ‘frank(“pr52wrel%chg”)>50’, depending on circumstances. Some factors work well as entry rules but make lousy exit rules. For instance, I might want to buy a high-volume upwards spike and a move from low volatility to high - but lower volume and a downwards spike with reduced volatility may not be a sell signal.

I really like adding a component of low volatility to my ranking system but my goal is to have a mix of high growth, deep value with the highest volatility stocks removed. Therefore, I will either assign half-weight to low volatility or create a frank rule to cut out just the tail.

There is definitely some data-mining. If the ranking system works poorly in a certain regime or market, I try to come up with the reason why. I might be right, I might be wrong but I don’t feel right tweaking a system until I have a rationale or belief as to why I should change it.

If I cannot come up with an explanation as to why I am adjusting the weight, then I am in fear of creating bizarre correlations that might explain past events with no predictive power. People in purple shirts may have won more than average poker games over the past 12 months but creating a ranking system on shirt color will not be helpful in any context.

That’s my take. Don’t optimize unless you have a reason. Try it in a smaller sample and period first then try it in another and see if it holds or not. Usually it does not. If you can’t explain why, then just try to categorize it really well in folders and sub-folders and equal-weight everything. Try to figure out if a factor is better served as a rule where you are just playing in a tail or whether higher ranks mean better returns across factor loading. That’s my two cents even though we no longer use pennies in Canada.


There is a lot of great stuff there. I like the purple shirts example. Everything you say is a valid method of preventing overfitting.

I would only want to add that much of statistical learning is meant to help with this overfitting. I think they spend more time in machine learning courses discussing methods of preventing overfitting than anything else. I think the perception at P123 is the opposite, and I think that perception is not correct.

I actually have to agree with Yuval the member. You can have your cake and eat it too, if done right. Uhhhh….well in theory anyway;-) But clearly you have a good method no matter how you look at it.



You have referred to these concepts in many, many forum posts without explaining exactly how one could work with them. I can’t see how a method that’s based on decision trees could apply to giving weights to factors or to running multiple backtests. I think there probably is a way to do this, but how? If one could use Python to assign weights to factors and/or run different kinds of backtests, then tell us exactly how that would work and what Python can do in that area that Excel and P123’s current features can’t. And do so without resorting to jargon. Give us step-by-step methods that can be implemented using the right tools (R, Python, Excel, whatever), and explain clearly what these programs are doing. You often include in your posts illustrations of stuff you’ve taken from R or other programs without explaining clearly what they signify. I worry that Portfolio123 users who are not statistically sophisticated will avoid the forums altogether after being repeatedly confronted with things that they don’t understand and that are not clearly explained.

And please don’t throw around names of major investors or machine-learning theorists without considering what they’re doing differently from what we do here. Jim Simons, from all accounts, uses day-trading methods and doesn’t consider fundamentals. P123 is built primarily for fundamentals-based investing. If you want to invest like Jim Simons does, use Trade Ideas, not Portfolio123. George Soros looks for obscure arbitrage opportunities that often have nothing to do with stocks.

If some of the machine-learning tools you write about can be successfully applied to systematic fundamentals-based investment strategies, I’d appreciate a clear explanation of how that can be done. I’ve been reading about this in my spare time, and I’m baffled by the work that OSAM and de Prado (both of whom I admire) are doing in this field, even though they’re doing everything they can to couch things in terms that laymen can understand. I understand decision trees, and I understand random forests to some degree, but I just can’t see how those things can be used to assign weights to factors in a ranking system or to exclude some factors from consideration. If you can explain these things step-by-step, then please do so; if not, your suggestions will be difficult for users to consider.

Walk-forward optimization can be done with Portfolio123, and that’s an example of a suggestion from you that is valuable and implementable. More suggestions like that one may be valuable for users. If you would like to explain exactly how you do walk-forward optimization in Portfolio123, I think some users would really appreciate it.
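For concreteness, the walk-forward idea reduces to generating rolling in-sample/out-of-sample windows, optimizing on each in-sample span, and evaluating on the span immediately after it. The years and window lengths in this sketch are illustrative only:

```python
# Sketch of walk-forward window generation: optimize on a rolling
# in-sample window, test on the period right after it, slide forward.

def walk_forward_windows(start_year, end_year, train_years=5, test_years=1):
    """Yield (train_start, train_end, test_start, test_end) year tuples."""
    windows = []
    year = start_year
    while year + train_years + test_years <= end_year + 1:
        windows.append((year, year + train_years - 1,
                        year + train_years,
                        year + train_years + test_years - 1))
        year += test_years  # slide by one out-of-sample period
    return windows

for w in walk_forward_windows(2004, 2019):
    print("train %d-%d  test %d-%d" % w)
```

Stitching the test periods together gives a pseudo-out-of-sample track record, which is what makes the approach implementable with repeated simulation runs in Portfolio123.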

As for me, I’m trying to wrap my head around how to AVOID optimization as it’s conventionally thought of (i.e. if you define optimization as the attempt to create a system whose performance exceeds that of all other systems using the same inputs) and come up with something that has more out-of-sample potential. My difficulties with crude bootstrap aggregation have to do with finding the right way to implement it. There are a lot of little problems with the way I’ve been doing it and the results I’ve been getting, practical problems having to do with factor classification and randomization and weight assignment. I’m still working on it, but I don’t have a lot of confidence that I’m on the right track.
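As a purely illustrative sketch (not the crude method described above, whose factor classification and randomization details are the hard part), one way to cast bootstrap aggregation of factor weights in code is to resample the history with replacement, fit a simple linear model on each resample, and average the fitted coefficients. Everything below is simulated toy data:

```python
# Toy bootstrap aggregation of factor weights: resample (ranks, returns)
# history with replacement, fit OLS on each resample, average coefficients.

import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(0, 1, (n, 3))                       # three factor ranks
# Only the first two factors drive returns in this toy setup.
y = 0.6 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(0, 0.1, n)

def fit_weights(X, y):
    """OLS with intercept; return the factor coefficients only."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]

boots = []
for _ in range(200):
    idx = rng.integers(0, n, n)                     # resample with replacement
    boots.append(fit_weights(X[idx], y[idx]))
bagged = np.mean(boots, axis=0)
print(np.round(bagged, 2))  # roughly [0.6, 0.4, 0.0]
```

The spread of the bootstrap estimates also gives a rough sense of how stable each factor’s weight is, which a single optimization run cannot tell you.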


  • Yuval


As far as “jargon” goes, bootstrapping and bagging are terms introduced by you in the forum. Sorry if I thought you had some understanding of how they might be used.

My answer as to the 3 simplest alternatives to optimization stands (multiple regression, kernel regression, and Random Forests). In order of difficulty, without skipping any simple solutions, I think.

Was your question a serious question? Do you have any simpler methods?


If you perform optimization as I have suggested, by forcing number of factors to a minimum while maintaining backtest performance, then “overfit” and “over-optimization” are meaningless terms. There is a point at which pruning nodes (factors) becomes counterproductive and that is when you stop, at least for that given RS.

As for the buzz words du jour (random forest, bagging, etc.), which of them will you still be talking about in ten years? My point is that I have lived a long life, and there have been plenty of ideas that have come and gone, most of them not memorable. What is special about these latest terms? As I see it, the market is an Ocean, fluid, constantly changing. Sometimes you can surf the waves, sometimes you get hit by a Tsunami. The important thing is to go with the flow, change your style as the market changes. When an underwater earthquake hits, you head for the hills. I personally won’t be chasing the latest theories because I expect they will pass into history like everything else (unmemorable), and expending a great deal of effort on them will be a waste of my time.


Possibly correct. We may only talk about Deep Learning or what follows it.

But we, undoubtedly, will have to learn how to crawl first. Now we do not even have multiple regression, which has been around forever.

Uh….I am dreaming. We will never have any of these methods if we have to pass them through P123. NONE OF THEM! But if we get Python we can make our own decisions on this.

“If you don’t like it, don’t use it” with Python is what I am suggesting. I don’t really care if you like these methods. I would like you to be able to use what you do like, however.

And actually, if P123 wants to stick with manual optimization and see if the Designer Models can start to beat their benchmarks that is not really my concern.

As far as alternatives to optimization, there are some regardless of whether they will be replaced by better methods or not.


For a RS that can’t beat the benchmark without optimization, I would be surprised if tinkering with weights was enough to make it start beating the benchmark.

Of course, if the RS does not beat the benchmark you will add (or remove) a factor even if you do not adjust the weights. That is a type of optimization, isn’t it?

Do it manually if you want, but everyone, except possibly Marc, uses optimization. Unless they just stick with a RS that cannot beat the benchmark even on a backtest—not likely for most of us.


So you also want the optimization process to add and remove factors?

There are multiple methods that do this. Even multiple regression (the simplest, I believe) has many methods of “feature selection.”

Or more simply, yes I do. Even if I manually optimize it myself, I do. Remember, I am not against manual optimization. Not for myself, and I definitely do not want to stop anyone else from doing it.

The specific question of the post was about alternatives to optimization.

But manual or not, yes I do!


It was definitely a serious question. I can perform multiple regression in Excel but I have no idea how I would use it to optimize or create a ranking system. I know nothing about kernel regression. And I would LOVE to know how to use random forests to optimize or create a ranking system. So if you can walk users through these processes, step by step, I think it would be a great service to the community.

As for bootstrapping and bootstrap aggregation, I attempted to explain my rudimentary methods in a different post. But I have not refined those methods, and I’m not comfortable with them yet. I still have a lot of work to do before I could walk users through those techniques. If you’ve already done the legwork, though, sharing it would be cool. But it does need to be done in a manner that statistical newbies will be comfortable with. Say, given fifty factors, the ones in use in the Core: Comprehensive ranking system, how would any of these techniques–multiple regression, kernel regression, random forests, bagging–tell a user which factors should be retained, which should be discarded, and what weights they should have? That would be a fantastic educational tool. Me, I have no clue.

I love the idea behind this, but I’d also like to get your thinking on why “overfit” and “over-optimization” wouldn’t apply to this technique. It’s very intriguing!