How do I reconcile these results?

Hi ZGWZ,

Thank you for your response and contributions to this thread. Your results appear promising!

Could you please clarify which data the AI hasn't seen? It seems you might be using data up to April 2024, which may not adequately address the concerns of myself and others.

My initial tests also indicate positive results when using the AI algorithm on data and then testing it in simulations. However, we're facing challenges in aligning out-of-sample results (lift charts, rank buckets, etc.) with out-of-sample simulations.

Could you please confirm the specific data excluded from your analysis?

Thanks,

The models used for simulation were trained using only data up to the end of 2017.

I see. So you used a validation period of 2005-2024.

You then select the best algorithms using that data up until 2024, and at that point you re-train the predictor on data only up until 2017.

Once that is done, you simulate past that date?

I simulated past 2019-01-01.

Thank you for clarifying. Would you mind redoing that analysis using only data up to 2017 in all of your tests, including selecting the algorithms?

This would be the clearest way of ensuring there is no look-ahead bias.
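
Concretely, something like the strict date split below is what I have in mind. This is only a rough sketch with placeholder file and column names, not your actual workflow:

```python
import pandas as pd

# Hypothetical panel of per-stock features and targets; the file name and
# column names are placeholders, not an actual P123 export.
panel = pd.read_csv("factor_panel.csv", parse_dates=["date"])

# Everything used for fitting AND for choosing algorithms/hyperparameters
# stays strictly before the cutoff...
development = panel[panel["date"] < "2018-01-01"]

# ...and the holdout that feeds the simulation starts after 2019-01-01 and
# is never looked at until the final, single evaluation.
holdout = panel[panel["date"] >= "2019-01-01"]
```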

I think this is what myself and others are having a difficult time reconciling.

Thank you very much for contributing to this thread!

Features are from official Old Ranking Systems updated before 2019-01-01 (QVGM, updated in 2014, and Balanced, updated in 2016). I then added basic information (MktCap, Price, Beta, Volatility, Number of Employees, and so on) and normalized them with a logarithmic function, sometimes with an intercept.
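
To illustrate the kind of normalization I mean, here is a toy sketch; the values and the +1 shift are just for illustration, not my exact pipeline:

```python
import numpy as np
import pandas as pd

# Toy illustration of log-normalizing size-like raw features.
df = pd.DataFrame({
    "MktCap": [250.0, 1800.0, 52000.0],   # $ millions
    "Employees": [120, 4300, 160000],
})

# A plain log compresses the heavy right tail of features like market cap.
df["log_mktcap"] = np.log(df["MktCap"])

# "Log with an intercept": shift before taking the log so zero is defined
# and small values are not stretched too aggressively.
df["log_employees"] = np.log(df["Employees"] + 1)
```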

Edit: Replace the wrong text and screenshots

Edit2: Add more screenshots

Edit3: The second strategy's lowest 20% liquidity is $600k+

It is a sad fact, one that many are reluctant to admit, that the non-tradability of Turkish equities reduces the out-of-sample returns of European/North Atlantic strategies. But even after setting tight liquidity constraints and removing the untradeable Turkish stocks, some excess return was still possible.

Edit: Added the result of a further test

Hi, I'm trying to get a grip on this and reconcile the performance between the AI features and the simulation. Can you share the beta and the alpha of this strategy over the simulation period?

Thank you very much for contributing to this thread!

Thanks so much! Can you let me know which simulation this aligns with?

The most recent simulation you posted shows a Sharpe ratio of 0.7 and an annualized return of 15.91%. Sorry if I'm missing something, but I noticed previous ones with higher performance. I'm not entirely sure which ones correspond to which.

I'm a bit concerned that you're the only one achieving such success with this new feature, so I want to ensure we're conducting proper due diligence.

Thanks again!

The simulation you replied to before

For my last two simulations here:

Hey everyone,

I'm struggling a bit and could really use some help. Has anyone had success in reconciling the performance of their tests and validations when using simulations? Like, 7% alpha on thinly traded stocks is great but achievable without opaque AI algorithms. I've been comparing mine to the old rank bucketing system, and I'm not quite getting the results I'd hoped for.

I can't shake the feeling that I might be missing something, unless there's solid empirical evidence that these AI features are truly making a difference. Any examples or insights would be super helpful!

Do you mean that you think a minimum 20% liquidity of $600k is still too low? What liquidity would you expect?

In addition, the 7% alpha is the result of using raw data as features. Considering that the vast majority (70%+) of the stocks have a "Large" market capitalization (and 10% of them are even "Mega"), I don't think it depends on low liquidity. Do you mean you need a Megacap-focused strategy?

In another thread, the best strategy tested in the illiquid stock universe got a 97% annualized return (variable slippage). In a universe whose lowest 20% liquidity is about $700k, the same strategy got a 60%+ annualized return.

Furthermore, here is the result when training AI factors with raw-data features on the same liquid universe:

I'm still a bit confused and can't seem to find anyone who can fully explain the discrepancies in these results.

To help figure out what's going on, I'm sharing another example below.

Here’s the lift chart for the testing period:

And here’s the lift chart for the validation period:

These are the settings I used:

I'm puzzled about how the data shows such successful predictions in each decile, but then it seems like there's little to no out-performance when applied to the same universe.

On top of that, I’m baffled by the bucket rank performance. I would expect it to mirror the lift chart since the predictions closely match the target returns:

I understand the concept of volatility drag, but I haven’t seen any math that ties everything together in a way that makes sense. A couple of people have shown simulations that somewhat match up, but shouldn’t there be an empirical proof to help users reconcile these results? I’ve been testing a lot but feel stuck without some straightforward math to explain how validated and tested actual vs. predicted results can be so accurate, yet not align with bucket returns or portfolio performance. Is there anyone who can walk through this step by step?
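
For what it's worth, the volatility-drag arithmetic I do follow is the toy version below (made-up numbers, just to show how an identical average return can compound very differently). What I still haven't seen is how this connects to the lift chart and bucket returns end to end:

```python
import numpy as np

# Two hypothetical return streams with the same 2% arithmetic mean per period.
steady = np.array([0.02] * 12)            # low volatility
choppy = np.array([0.20, -0.16] * 6)      # same mean, much higher volatility

for name, r in [("steady", steady), ("choppy", choppy)]:
    arith = r.mean()
    geom = np.prod(1 + r) ** (1 / len(r)) - 1   # compounded per-period return
    print(f"{name}: arithmetic={arith:.2%}, geometric={geom:.2%}")

# The choppy stream compounds to roughly 0.4% per period even though its
# average is 2%: geometric is approximately arithmetic minus 0.5 * variance.
```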

I'm no expert, but I have been testing extensively since the beta release.

From what I understand, a beautiful lift chart does not directly translate into a good Performance Quantile chart. If your model predicts an awful, non-linear performance pattern, and the pattern you get from the validated test is just as bad as the prediction, the lift chart will still look good, because all it measures is whether prediction and validation match.

Trimming more than 1%—around 5 to 7.5%—has worked well for me. It seems that the algorithms tend to focus on outliers if you don't trim.
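
Roughly, what I mean by trimming is dropping the extreme tails of the target before fitting. A minimal sketch with made-up data, not the platform's exact implementation:

```python
import numpy as np

# Hypothetical vector of training-target returns.
rng = np.random.default_rng(0)
y = rng.normal(0.01, 0.15, size=10_000)

# Trim 5% on each tail so the fit isn't dominated by a handful of outliers.
lo, hi = np.quantile(y, [0.05, 0.95])
y_trimmed = y[(y >= lo) & (y <= hi)]
```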

With a "Max return" of 500%, you'll get a lot of noise in the Performance chart, especially if you use 100 bars. A standard of 200% or less gives a better chart. In my opinion, 10 to 20 bars are more than enough to see the performance.

Your target and features should somewhat match. Looking at 1-year returns but analyzing quarter-to-quarter fundamental performance will create a lot of noise. From what I’ve seen, 1-month targets will give a very nice performance chart but won’t yield reliable simulated results. Twelve months is too much to find high alpha.

The algorithms always "downgrade" volatility features unless your target has some volatility characteristics like Sharpe or something similar.
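
For example, a Sharpe-like target can be as simple as scaling the forward return by trailing volatility; the column names below are made up for illustration:

```python
import pandas as pd

# Hypothetical per-stock values; the column names are illustrative only.
df = pd.DataFrame({
    "fwd_return_3m": [0.12, 0.04, -0.08],
    "trailing_vol_1y": [0.45, 0.18, 0.30],
})

# Sharpe-like target: reward return per unit of realized volatility, so the
# model can no longer simply "downgrade" the volatility features.
df["target"] = df["fwd_return_3m"] / df["trailing_vol_1y"]
```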

If you're analyzing European stocks, try different currencies—you might end up just analyzing the relationship between currencies, not the performance of the actual feature.

I haven't found any algorithm that manages to use Market Cap as a catalyst. Instead, add a few percent of Market Cap as a factor in your final ranking system.

I've noticed that there's an advantage to training the model on a slightly larger universe than you intend to use in your final simulation.

If you get great validated results with non-linear models but the results from linear models are poor, it's very likely that the simulated results will also be poor (just an observation).

More features aren’t always better; it seems that common sense still applies in machine learning.

The 1% trim is actually better if you handle the outliers carefully in the formula.

He seems to want to pick stocks in the S&P 500 universe (no wonder he thought 70% Large + 10% Mega still counted as "thinly traded stocks"). I don't think that's going to get you good alpha.

Fair simulation results can be obtained with LightGBM using the categorical features, but the results of the linear model are very poor.
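
Roughly what I mean by LightGBM with the categorical features, as a minimal sketch with placeholder data rather than my actual setup:

```python
import lightgbm as lgb
import pandas as pd

# Placeholder data; in practice X would hold the raw-data features.
X = pd.DataFrame({
    "sector": pd.Categorical(["Tech", "Energy", "Tech", "Utilities"]),
    "log_mktcap": [7.1, 8.3, 9.0, 6.5],
})
y = [0.05, -0.02, 0.08, 0.01]

# LightGBM can split on pandas categorical columns directly when they are
# declared via categorical_feature; min_child_samples is lowered only so the
# toy data actually produces splits.
model = lgb.LGBMRegressor(n_estimators=100, learning_rate=0.05, min_child_samples=1)
model.fit(X, y, categorical_feature=["sector"])
```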

What do you mean by larger universe? Wider market capitalization range or less liquidity requirement?

I think it would be better to support weighted machine learning algorithms with customized sample weights, rather than fine-tuning the universe composition.

Yes, wider market capitalization. If you intend to trade sub-$400M stocks, you train on sub-$600M.
This can help because it's not unlikely that I will be holding stocks that gain a lot of market cap while I hold them, so those stocks are basically forced into the universe I'm trading with.

Now that I think about it, it probably makes sense to train on a universe with a lower liquidity requirement too. It's not unlikely that I will be holding many stocks at the lower end of the liquidity range that might drop below my universe's liquidity rule.
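
In code terms the idea is simply to make the training filters looser than the trading filters. A sketch with a hypothetical snapshot file and made-up column names:

```python
import pandas as pd

# Hypothetical universe snapshot; file and column names are placeholders.
# mktcap_mm is in $ millions, med_dollar_vol in $ per day.
stocks = pd.read_csv("universe_snapshot.csv")

# Trade sub-$400M names, but train on a looser cut so holdings that grow in
# market cap (or drift in liquidity) while held are still seen in training.
trading_universe = stocks[(stocks["mktcap_mm"] < 400) & (stocks["med_dollar_vol"] > 600_000)]
training_universe = stocks[(stocks["mktcap_mm"] < 600) & (stocks["med_dollar_vol"] > 400_000)]
```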

"The lift chart is a graphical tool for assessing the performance of AI factors by comparing the actual outcomes with model predictions. The chart is plotted with the target variable values on the y-axis and the percentile of sorted observations on the x-axis.

The chart features two lines:

  • Red Line: Represents the actual outcomes of the target variable, for example the average excess return of a set of stocks.
  • Blue Line: Indicates the model's predictions for those same outcomes.

Interpretation of the chart focuses on the proximity of these two lines:

  • When the Red and Blue lines are close together, it signifies that the model's predictions closely match the actual outcomes, indicating effective predictive performance.
  • A significant gap between the two lines suggests a disparity between predictions and actual outcomes, signaling potential weaknesses in the model’s ability to capture the underlying dynamics of the data."
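
To check that I'm reading that definition correctly, here is roughly how I understand such a chart to be built; this is my own sketch, not necessarily P123's exact procedure:

```python
import numpy as np
import pandas as pd

# Hypothetical predictions and realized excess returns, loosely related.
rng = np.random.default_rng(1)
predicted = rng.normal(0.0, 0.05, 5_000)
actual = predicted + rng.normal(0.0, 0.10, 5_000)

df = pd.DataFrame({"predicted": predicted, "actual": actual})

# Sort by prediction, cut into percentile buckets, then average both series:
# the blue line is the mean prediction per bucket, the red line the mean outcome.
df["bucket"] = pd.qcut(df["predicted"], 100, labels=False)
lift = df.groupby("bucket")[["predicted", "actual"]].mean()
print(lift.head())
```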

Could you provide a mathematical example to help clarify this? I’m having difficulty reconciling the definition of the lift chart provided here. Shouldn't the buckets in each decile align with the lift chart? The x-axis in the bucket chart and the lift chart represents the predicted performance outcomes, while the y-axis represents the actual outcomes. Am I misunderstanding something?

Unfortunately, I’ve noticed quite a few unanswered questions here, which has me a bit concerned. I’m really hoping someone, perhaps @yuvaltaylor, could kindly guide us through the steps to recreate the "high-low" performance test in the performance section. Something like recreating it in a screener, simulation, or book would be reassuring.

Your help would be greatly appreciated!

I’m really hoping someone at P123 can finally help me out with this. I’m struggling to reproduce what we see in the AI feature using any other part of the platform, and it’s been incredibly frustrating to the point where I think we may be wasting our time here. There have been a lot of changes, like the screening and historical testing now being limited to just one year without any explanation, and I keep asking for a way to make sense of it all. So far, I haven’t gotten any response that empirically reconciles what this feature shows, and I’m at a loss.

I really need someone to show me how to do this. I can’t keep guessing.

Thanks.