Small but impactful issues with NA handling and percentile ranking

Thank you, ZGWZ — that’s really helpful.

To be honest, I haven’t read the full paper—especially since I don’t currently have the option to use LightGBM’s native NA handling or experiment with alternative imputation methods in P123. The same goes for implementing NDCG within P123—I don’t think it’s feasible at this point.

Using NDCG properly would require setting up bins (or gain labels), which is more than just a tweak to the JSON. It would take real work, and I’m not asking P123 to implement it.

So as useful as this discussion is (and I’ve genuinely appreciated it), it’s mostly academic in my case. In medicine, we often say: “If a test won’t change the treatment plan, don’t order the test.” That’s kind of where I’m at here—nothing in this thread would change my current setup in P123.

That said, I do have the ability to use ndcg_score via downloads at home, and that’s where my focus is. I’d still like to test LightGBM’s native NA handling with cross-validation at some point—just quietly on my own, without needing to run it through a formal feature request or forum discussion.
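For anyone else working with downloads, here is roughly what that looks like on my end. A minimal sketch only: the file and column names (p123_download.csv, date, y_pred, fwd_return) are placeholders, and binning forward returns into quintile gain labels is just one simple choice.

```python
# Sketch only: scoring weekly model output with sklearn's ndcg_score.
# The file and column names below are placeholders for whatever your
# download actually contains.
import pandas as pd
from sklearn.metrics import ndcg_score

df = pd.read_csv("p123_download.csv")  # hypothetical download

scores = []
for _, week in df.groupby("date"):
    # ndcg_score wants non-negative relevance grades; quintiles of the
    # forward return (0..4) are one simple choice of gain label.
    gains = pd.qcut(week["fwd_return"], 5, labels=False, duplicates="drop")
    scores.append(ndcg_score([gains.to_numpy()], [week["y_pred"].to_numpy()], k=20))

print(f"mean weekly NDCG@20: {sum(scores) / len(scores):.4f}")
```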

So no feature request here. I’m stepping back from that debate and just keeping the idea in the background for now.

The NDCG discussion is really meant for those working with downloads—and I’ll share my cross-validation results with @hbee once they’re ready. They’re running now and may take a couple of hours.
Maybe the results will help Hbee decide how aggressively to pursue this—or whether downloads are worth trying out, since that may be the only way to work with NDCG in the near future.

Thanks again for your thoughtful input!

I’m currently in the middle of cross-validating a different method, so I didn’t want to switch gears and start cross-validating LightGBM at the same time.

That said, NDCG is a very promising metric, and I did take a single shot at using it as part of the strategy I’m currently cross-validating.

One caveat: to really do this properly, you’d want to perform a grid search on the bin sizes (or gain levels) used in the NDCG calculation. I didn’t do that—so this should not be taken as definitive in any way.
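For anyone who wants to attempt that grid search with downloads, the sketch below is roughly what I have in mind. Everything in it is illustrative: the file and column names, the date split, and the candidate bin counts are all assumptions, and using linear label_gain values instead of LightGBM's exponential default is just one choice.

```python
# Sketch of a grid search over the number of gain bins for lambdarank.
# Column names, the date split, and the bin counts are all illustrative;
# assumes fwd_return has no NAs.
import lightgbm as lgb
import pandas as pd

df = pd.read_csv("p123_download.csv").sort_values("date")  # hypothetical
features = [c for c in df.columns if c.startswith("feat_")]  # placeholder
train, valid = df[df["date"] < "2018-01-01"], df[df["date"] >= "2018-01-01"]

for n_bins in (3, 5, 10, 20):
    # Integer gain labels 0..n_bins-1, binned within each week so that
    # every week contributes the same label distribution.
    to_label = lambda g: pd.qcut(g, n_bins, labels=False, duplicates="drop")
    y_tr = train.groupby("date")["fwd_return"].transform(to_label).astype(int)
    y_va = valid.groupby("date")["fwd_return"].transform(to_label).astype(int)

    # Linear label_gain instead of LightGBM's exponential default is a choice.
    model = lgb.LGBMRanker(label_gain=list(range(n_bins)))
    model.fit(
        train[features], y_tr,
        group=train.groupby("date").size().to_numpy(),  # one query per week
        eval_set=[(valid[features], y_va)],
        eval_group=[valid.groupby("date").size().to_numpy()],
        eval_at=[20],
    )
    print(n_bins, model.best_score_)
```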

With the single set of bin sizes I tested, the returns were slightly lower, but the ranking was noticeably more stable week to week.

That wasn’t something I had predicted, but in hindsight it makes sense—a model trained to optimize ranking will naturally produce more stable ranks.

So while I don’t want to draw firm conclusions yet, I do think this approach—especially with proper bin tuning—has real potential and is worth exploring further before being dismissed.

For now, though, it’s not a top priority for me, so I don’t expect to post much more about this in the immediate future.

@hbee BTW: your post is number 10 on Google Search for "Rank NDCG LightGBM and stocks." Someone out there is probably going to run with your ideas!


If you like to buy lottery tickets in the stock market.

Very impressive results, especially considering that's before hyperparameter tuning! Thank you, Jrinne! I was chatting with Andreas about this, and we both agreed that to an institution managing > $1BN AUM, a lower turnover is alpha in and of itself...


TL;DR: Don’t median-fill up front. Instead, rank only the stocks with valid values, then assign the NA stocks a fixed rank (e.g., 50). That preserves missingness as signal (as Hbee suggests), works better with models like LightGBM, and restores meaning to the slope in Rank Performance charts and in the linear regression models in P123's AI/ML implementation.

Hbee,

I think P123 could implement something very close to what you’re advocating—without needing a major overhaul of the ranking system.

Here’s the idea:

  1. When calculating feature ranks, skip the stocks with NAs. Don’t include them in the sorting or percentile math.
  2. After the ranks are assigned, go back and assign the NA stocks a fixed rank, say rank 50 (or whatever percentile makes sense); see the sketch below.
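In pandas terms, the whole proposal is only a few lines. This is a sketch of the logic only, not P123's actual code; the function name and factor values are made up.

```python
# Sketch of the two-step idea above. Not P123's code; names are made up.
import pandas as pd

def rank_with_neutral_na(values: pd.Series, na_rank: float = 50.0) -> pd.Series:
    # Step 1: rank(pct=True) skips NaNs by default, so the percentile math
    # only ever sees stocks with valid values.
    ranks = values.rank(pct=True) * 100
    # Step 2: park every NA stock at the fixed neutral rank.
    return ranks.fillna(na_rank)

factor = pd.Series([0.10, None, 0.02, 0.30, None, 0.07])
print(rank_with_neutral_na(factor))  # NAs come back as exactly 50
```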

This might sound like median-fill, but it’s completely different:

  • You're preserving the information that the feature is NA (instead of pretending it's 0 or average).
  • Models like LightGBM can treat missingness as a useful signal and learn whether it belongs on the left or right of the split.
  • The ranks assigned to actual data aren't distorted by the inclusion of placeholders.

This mimics how XGBoost and LightGBM handle NAs internally: learning at each split whether missing values should go left or right. By assigning a neutral rank like 50 (rather than forcing NAs to one extreme), you're letting the model figure that out naturally.

Note: It’s not a perfect match—there could be cases where LightGBM would want to split at, say, rank 60 and route NAs to the right—but those situations are likely rare. And even with that limitation, this approach is a big improvement over the current system.
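For reference, here is the native behavior the rank-50 idea approximates, on synthetic data. A minimal sketch; use_missing=True is LightGBM's default and is spelled out here only for emphasis.

```python
# Minimal sketch of LightGBM's native NA routing on synthetic data.
# use_missing=True is the default; at each split the tree learns whether
# the NaN rows belong in the left or right child.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=1000)
X[rng.random(1000) < 0.2, 0] = np.nan  # knock out 20% of the signal feature

model = lgb.LGBMRegressor(use_missing=True).fit(X, y)
```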

And a bonus benefit: it fixes the kind of Rank Performance chart described below.

When you zero-fill or median-fill NAs before ranking, you get messy, flat quantiles in the middle. It destroys slope interpretation in Rank Performance tests.

But when you exclude NAs from the rank calculation, the slope becomes much more meaningful. You’re not diluting it with featureless stocks.

This slope issue isn’t limited to the Rank Performance test; it’s also a key reason regressions like Ridge Regression often underperform in P123’s AI/ML module. Including zero-filled or median-filled NAs can flatten patterns that would otherwise help in modeling. In other words, slope is key to linear regressions.

So this change could:

  • Preserve information from missing values
  • Mimic LightGBM’s NA handling in a meaningful way
  • Improve interpretability of core P123 tools like Rank Performance
  • Improve predictions for P123 AI/ML models—especially linear regression models.

P123 could consider just giving us the option to skip NAs during ranking and assign them consistently afterward. That alone would be a powerful upgrade for advanced users and might not represent a significant coding challenge.

Might be something that your enterprise customer would want too.


Research has shown that the improvement of machine learning over linear algorithms is not in the nonlinear part.

We find that the superiority of regression trees and neural networks comes from two points: their strong regularization mechanism and their capacity to capture interaction effects. The non-linear component of the marginal predictions on the other hand has no predictive power.


I think linear algorithms can do well. I funded one for a time so I clearly think this. And in many cases, the appearance of nonlinearity might just be coming from how missing values are handled. At least sometimes, you’d get a pretty clean slope and intercept if you weren’t dealing with whatever’s happening in the middle (e.g., zero or median-filled noise).

And as you point out—tree-based models like LightGBM can’t always capture that kind of pseudo-nonlinearity anyway, especially if the pattern of missingness isn’t stable across time. We’re dealing with 24 years of evolving data here, and that kind of inconsistency makes things even trickier.

That said, I don’t think there’s just one right way to handle missing values. I just happen to like the approach above because it’s conceptually simple, transparent, and might not be a big lift to implement. But I completely agree—it’s not perfect. If someone has a better way, I’d be the first to support it.

I talked with the developers about this, and when we perform the scaling we consider only non-NA values and fill the NAs afterwards with zero. And we adjusted all scaling methods so that zero is always in the middle. So it strikes us that this issue has already been addressed. Let me know if I'm wrong.


This is exactly the kind of thing I think we should reconsider.

Filling with a value (like zero) is very different from assigning a median rank after the ranking is done. The former forces the model to treat missingness as a numeric input—one that likely distorts the distribution and muddles the slope. The latter preserves missingness as signal and keeps the meaningful shape of the rank distribution intact.
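A toy example makes the difference concrete. In this sketch (made-up numbers), filling before ranking displaces the percentiles of the stocks that actually have data, while ranking first leaves them untouched and parks the NAs at 50.

```python
# Toy contrast: fill-then-rank vs. rank-then-fill. Numbers are made up.
import pandas as pd

factor = pd.Series([0.10, None, 0.02, 0.30, None, 0.07])

# A) Fill first: the placeholder zeros take part in the sort and
#    displace the percentiles of the real values.
fill_then_rank = factor.fillna(0).rank(pct=True) * 100

# B) Rank first: real-value percentiles are computed as if the NA rows
#    did not exist, and the NAs are pinned at a neutral 50 afterwards.
rank_then_fill = (factor.rank(pct=True) * 100).fillna(50)

print(pd.DataFrame({"value": factor,
                    "fill_then_rank": fill_then_rank,
                    "rank_then_fill": rank_then_fill}))
```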

Something like this is the result of filling with a value:

This chart is for FCF%AssetsQ. The upper end of the fitted line ends around 3, but the real value is 10.
That factor might still be helpful overall—but it's also misleading when used in a Ridge Regression.

So this isn't just about information loss—it’s about distortion and induced bias. It misrepresents the true distribution, and I believe it muddles the slope, which is key both for interpretation and for linear models like Ridge Regression.

What would be ideal is a cleaner slope with less noise in the middle ranks—and I think handling NAs this way (ranking first, then assigning median rank and return) is a step in that direction.

If someone has a better method, I’d be the first to support it. But at the very least, this seems like a simple, interpretable alternative that could be valuable.

I’m genuinely open to other viewpoints—and I don’t yet know whether this materially affects my own strategies—but right now, I’m not sure why anyone would want to keep the current approach unless I’m missing something.

Just trying to provide a helpful suggestion for charts like that.

In my experience with it, that's exactly what happens.

In some cases, perhaps. But as I said, we've adjusted the values so that zero is always precisely in the middle. So there's no difference in this case.


Thank you for your response, Yuval.

It’s possible I misunderstood the exact mechanism—maybe it’s a sparsity issue within some of the surrounding buckets or something else under the hood. But I think the precise cause may not matter as much as the outcome we’re seeing.

So my core question is this:

Is this behavior something we actually want?

If so, I apologize for suggesting a solution to something that isn't a problem to begin with. But from what I'm seeing, the behavior of the Rank Performance test raises some concerns, especially if this is what P123's AI/ML module is doing with the data for linear regressions and other ML methods. Hbee's original concern was with LightGBM, which is harder to visualize, but this may not be ideal data behavior for tree methods either.

Here’s what I mean:

Example 1 — Flat Line in the Middle with "NAs Set to Neutral":

Zeros stretch from ranks 70 to 130. It's symmetric and not confined to bucket 100 (corresponding to the 50th percentile rank).

Example 2 — "Percentile NAs Negative" setting:

In this case, a large cluster of NAs ends up on the negative side, and again, not just in one quantile.

The slope is clearly affected (0.040 vs. 0.108, respectively, in these two examples). I think a lot of people use slope in some capacity.

So I may not understand the exact cause but...

Can you help me understand why this is desired behavior—even if it’s not directly caused by NA handling in the way I initially thought?
One concern I would have with this approach is that it would be tedious to select the NA handling for each feature to get a reasonable slope in any linear regression, and we may be limited in our ability to create different NA handling for each feature.

I will stop taking up your time if this is desired behavior and leave it to the P123 programmers to make any changes if something does not seem right to them. Thank you again for your thoughtful response.