I have a couple of simulation models I am happy to go live with now. They have a lot of overlap but some differences, which make it useful to have two.
As part of that, I have a buy rule whose underlying principle I'm comfortable with and which clearly improves results, but I'm struggling to define exactly what numbers to use for it. For example, a strategy may improve if one adds a buy rule requiring market cap to be between x and y, or liquidity between x and y, but one still needs to decide exactly which numbers to use in the rule.
I can play around with the numbers, but I know that if I just go with the best result, that's really just overfitting.
So my question is - how do others deal with this situation to come to a conclusion?
Personally, I tend to avoid those kinds of buy/sell rules except for, say, minimum liquidity/volume filters.
But one technique I've used and recommend is to perturb your parameter values and see how stable your simulation results are. In particular, strategies with longer holding periods or fewer positions will have fewer trades, so small changes in the selected tickers can produce wider swings in your strategy's CAGR, Sharpe, and drawdowns. If you find those metrics jumping up and down as you perturb your buy/sell parameter values, you're probably overfitting.
On the other hand, if you see more continuous and stable results as you change those liquidity or market cap parameters, then I would be more comfortable with the rule.
If your subscription level allows it, I've found the p123 trading system optimizer helpful for quickly running these tests on a grid search of parameter values.
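If you want to run the same kind of perturbation test outside the platform, here is a minimal sketch of the idea in Python. The run_backtest function and the market-cap bounds are hypothetical placeholders (the synthetic response only exists so the example executes end to end); the point is the grid of perturbed parameter values and the spread of the resulting metrics.

```python
import itertools
import random
import statistics

def run_backtest(min_mktcap, max_mktcap):
    """Hypothetical stand-in for a single simulation run.

    Replace this with your real backtest (e.g. a P123 sim run with these
    buy-rule bounds). The synthetic response below only exists so the
    sketch runs end to end.
    """
    base = 0.10 + 0.00001 * (max_mktcap - min_mktcap)   # made-up smooth response
    return {"cagr": base + random.gauss(0, 0.005)}      # made-up noise

# Perturb each bound around the value you are tempted to lock in.
min_caps = [150, 200, 250, 300, 350]     # $ millions, example values only
max_caps = [1500, 2000, 2500, 3000]      # $ millions, example values only

cagrs = []
for lo, hi in itertools.product(min_caps, max_caps):
    cagrs.append(run_backtest(lo, hi)["cagr"])

print(f"CAGR across the grid: mean={statistics.mean(cagrs):.2%}, "
      f"stdev={statistics.stdev(cagrs):.2%}, "
      f"min={min(cagrs):.2%}, max={max(cagrs):.2%}")

# A wide spread, or a single spike at one corner of the grid, is the
# "jumping around" described above and a warning sign of overfitting;
# a flat, gently varying surface is what you want to see.
```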
In situations like these I like to play around with a lot of different options for the best number to use, graph the results in Excel, and act accordingly. An approximation, even if it's a bit overfitted, is often better than avoiding the issue altogether. A good example in my case is the M-Score. I developed my own formula for this but it's still hard to decide on a fixed limit. What I ended up doing was using different limits for different ranking systems, based on testing various limits. So with one ranking system I may refuse to buy a stock if its M-score is higher than X while in another it might have a limit of Y. Put them together and some stocks will be avoided altogether while others might get only half the buys they would have otherwise (if their M-Score is between X and Y).
When it comes to specific numbers like these, I don't think overfitting is a terrible problem. What you want to avoid is overfitting an entire system, and, as feldy points out, that often happens with strategies with long holding periods and/or few positions.
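For what it's worth, the graph-and-eyeball step is easy to script as well. The sketch below is not the workflow described above, just an illustration of it: it assumes a hypothetical backtest_with_limit(limit) function returning the annualized return for one run with a given score cutoff, and the synthetic curve only exists so the example runs.

```python
import matplotlib.pyplot as plt

def backtest_with_limit(limit):
    """Hypothetical stand-in: annualized return for a run that refuses to
    buy stocks whose score exceeds `limit`. Replace with a real backtest;
    the made-up hump-shaped curve below just lets the sketch execute."""
    return 0.12 - 0.002 * (limit - 5.0) ** 2

limits = [x / 2 for x in range(0, 21)]   # candidate cutoffs 0.0 .. 10.0
returns = [backtest_with_limit(lim) for lim in limits]

plt.plot(limits, returns, marker="o")
plt.xlabel("Score cutoff (buy only if score <= cutoff)")
plt.ylabel("Annualized return")
plt.title("Sweep of candidate cutoffs")
plt.show()

# A broad plateau means the exact number matters little and any value on
# the plateau is fine; a single sharp peak is the overfitting warning sign.
```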
TL;DR: You can get a statistical answer as to whether your buy-rule is likely overfitting or not by using subsampling techniques. And it can be done in a practical manner now using P123's random().
It wouldn't be the worst idea in the world for P123 to add an option for bootstrap validation to P123 classic.
But the next best thing to bootstrapping is subsampling (using a fraction of the sample with each run). Many do this in a different context with mod(). XGBoost also uses subsampling; that variant is formally called stochastic gradient boosting. Bootstrapping and subsampling are among the most commonly used methods in statistics and machine learning, and many at P123 have independently discovered their usefulness (e.g., using mod()), which is a testament both to the method and to P123 members' abilities.
For now, this subsampling method can be implemented on the P123 platform by using random() < .5 in the buy rules. There is no random seed with P123's random(), so you will get a different answer each time you run it.
You might aim for the buy rule to show improvement in 90% or more of the runs (18 out of 20) compared to no buy rule or an alternative rule. That would be strong evidence that the buy rule is robust rather than overfit.
Not so different from what feldy is suggesting, just a little more formal mathematically. And P123 might look at bootstrap validation of any model, but especially as a method for bringing validation to P123 classic.
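To make the 18-out-of-20 criterion concrete, here is a minimal sketch of the tally-and-test step in Python. The CAGR numbers are placeholders (you would record your own from paired runs with and without the buy rule, both using random() < .5), and the sign test via scipy's binomtest is my addition, not anything built into P123.

```python
from scipy.stats import binomtest

# Paired results from 20 subsampled runs (random() < .5 in the buy rules).
# These numbers are placeholders -- record your own from the simulations.
cagr_with_rule    = [0.14, 0.12, 0.15, 0.11, 0.13, 0.16, 0.12, 0.14, 0.13, 0.15,
                     0.12, 0.14, 0.13, 0.11, 0.15, 0.13, 0.14, 0.12, 0.13, 0.14]
cagr_without_rule = [0.11, 0.12, 0.12, 0.10, 0.11, 0.13, 0.11, 0.12, 0.12, 0.13,
                     0.11, 0.12, 0.12, 0.12, 0.13, 0.11, 0.12, 0.11, 0.11, 0.12]

# Count runs where the rule helped (ties count against the rule here).
wins = sum(w > wo for w, wo in zip(cagr_with_rule, cagr_without_rule))
n = len(cagr_with_rule)

# Sign test: under the null that the rule adds nothing, each run is a coin flip.
result = binomtest(wins, n, p=0.5, alternative="greater")
print(f"Rule won {wins}/{n} subsampled runs, one-sided p-value = {result.pvalue:.4f}")
```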
P123 classic is TRULY excellent with many options for optimizing rank weights including many methods described in the forum. An effective, automated validation method might be a welcome addition to P123 classic. Validation methods are already available for the ML portion of P123.
Also: scikit-learn's old sklearn.cross_validation.Bootstrap (since deprecated and removed, but the idea stands). P123 might implement this differently for cross-validation of P123 classic, e.g., block bootstrapping with out-of-bag validation.
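For anyone who wants to see the mechanics, below is a rough sketch of block bootstrapping with out-of-bag evaluation on a series of monthly strategy returns. The return series is synthetic and the whole thing only illustrates the resampling scheme; it is not something P123 currently offers, and the old sklearn class is not needed for it.

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder: monthly strategy returns (replace with real data).
monthly_returns = rng.normal(0.01, 0.04, size=240)

block_len = 12          # resample whole 12-month blocks to respect serial dependence
n_boot = 1000

n = len(monthly_returns)
starts = np.arange(0, n - block_len + 1)

in_bag_means, oob_means = [], []
for _ in range(n_boot):
    # Draw blocks with replacement until we cover roughly n observations.
    chosen = rng.choice(starts, size=n // block_len, replace=True)
    in_idx = np.concatenate([np.arange(s, s + block_len) for s in chosen])

    oob_mask = np.ones(n, dtype=bool)
    oob_mask[np.unique(in_idx)] = False   # out-of-bag = months never drawn

    in_bag_means.append(monthly_returns[in_idx].mean())
    if oob_mask.any():
        oob_means.append(monthly_returns[oob_mask].mean())

print(f"In-bag mean monthly return:     {np.mean(in_bag_means):.4f}")
print(f"Out-of-bag mean monthly return: {np.mean(oob_means):.4f}")
# In a real application you would optimize parameters on the in-bag months
# and evaluate on the out-of-bag months; a large gap between the two is the
# overfitting signal.
```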
Claude 3 said this about my idea:
"Your insights demonstrate a deep understanding of P123's current capabilities and a forward-thinking approach to improving the platform. The suggestion of adding bootstrap validation, particularly block bootstrapping with out-of-bag validation, aligns exceptionally well with the platform's existing strengths in optimization."
This is a really good suggestion, and, by the way, you can also use the optimizer to run multiple randomized tests. For each of my strategies, I always create an empty "Optimizer" buy rule that returns true:
[Optimizer] 1
This placeholder then gives you a place to test new rules when you copy the strategy to the optimizer. For example, in the optimizer you could override the "Optimizer" buy rule to run the original deterministic baseline plus five randomized trials by adding:
Random < 0.5
Random < 0.5
Random < 0.5
Random < 0.5
Random < 0.5
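Once the optimizer runs finish, the comparison itself is trivial to script. The numbers below are placeholders copied by hand, not any actual optimizer export format; this is just one way to summarize the spread.

```python
import statistics

# Placeholder numbers: CAGR of the deterministic baseline run and of the
# five Random < 0.5 trials, copied from the optimizer results by hand.
baseline_cagr = 0.145
randomized_cagrs = [0.131, 0.152, 0.128, 0.140, 0.137]

mean = statistics.mean(randomized_cagrs)
stdev = statistics.stdev(randomized_cagrs)
print(f"Baseline CAGR: {baseline_cagr:.2%}")
print(f"Randomized trials: mean={mean:.2%}, stdev={stdev:.2%}")

# A wide spread across the randomized trials is the same instability signal
# described earlier in the thread; tight clustering near the baseline is
# reassuring.
```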