Small but impactful issues with NA handling and percentile ranking

  1. Currently, NAs are zero-filled and the user has no say, which results in severe information loss. Let the user preserve NAs: LightGBM handles missing values natively, and that typically outperforms zero-filling or dropping all incomplete rows. Instead of forcing an error that the user must work around with IsNA() to zero-fill, change the error into a warning. (A short sketch after this list illustrates the difference.)

LightGBM is designed to handle missing values natively, splitting them into their own branch (Ke et al., 2017). This behavior often outperforms simplistic imputation or discarding rows (Little and Rubin, 2019). Substituting missing values (NA) with zero (0) can be harmful if zero is not a valid domain-specific value—LightGBM will treat these as actual data points, potentially skewing splits. Meanwhile, wholesale removal of rows with NA can lead to a substantial loss of information. In practice, if the missing data is truly random (MCAR) or can be plausibly modeled (MAR), letting LightGBM handle NA values or using a domain-specific imputation approach often yields better results than converting them to zero or removing all incomplete rows. If zero is semantically meaningful, that might be acceptable, but for purely missing data, preserving NA is generally preferred.

  2. P123's percentile ranking in its "stock ranking" module shows a rank between 0 and 1, but the AI Factor documentation says features and targets are percentile-ranked between [-1, 1]. When a user tries to rank their own target, they get the error "invalid formula, ensure you're using future data" even though the input IS future data: FRank("Future%Chg(260)"). Please change this error to a warning so the user can use the correct percentile-ranking function within Portfolio123.

  3. Adding support for LightGBM's XENDCG objective function could meaningfully improve the AI Factor module's prediction accuracy and out-of-sample performance. Feature requested here: Add Support for LightGBM's XENDCG Objective (Better Ranking)
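To illustrate the first point about native NA handling, here is a minimal sketch on synthetic data (hypothetical feature names f0..f4, not P123 data), using LightGBM's scikit-learn API. LightGBM's use_missing parameter defaults to true, so nothing special is needed to preserve NAs; the sketch simply compares cross-validated scores with NAs kept versus zero-filled. It is an illustration under these assumptions, not a claim about any particular dataset.

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

# Synthetic data: one feature with ~30% missing values
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(2000, 5)), columns=[f'f{i}' for i in range(5)])
y = 0.5 * X['f0'] + rng.normal(size=2000)
X.loc[rng.random(2000) < 0.3, 'f0'] = np.nan

model = lgb.LGBMRegressor(n_estimators=200, random_state=0)  # use_missing=True by default

score_nan = cross_val_score(model, X, y, cv=5, scoring='r2').mean()             # NAs preserved
score_zero = cross_val_score(model, X.fillna(0), y, cv=5, scoring='r2').mean()  # NAs zero-filled
print(f"NaN preserved: {score_nan:.3f} | zero-filled: {score_zero:.3f}")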

Ke, Guolin, et al. “LightGBM: A Highly Efficient Gradient Boosting Decision Tree.” Advances in Neural Information Processing Systems 30, 2017.
Little, Roderick, and Donald Rubin. Statistical Analysis with Missing Data. 3rd ed., Wiley, 2019.


It’s just a fact that feature requests like this often go unanswered.

Other examples include early stopping in XGBoost or the ability to adjust the gap size at the front or back of k-fold cross-validation. There's no theoretical or practical reason the front and back gaps have to be treated the same, yet these kinds of requests often don't gain traction.

Historically—and still today—there have been many requests and only limited resources. P123 has had no choice but to prioritize, which is completely understandable.

Just a guess, but I’d wager P123 hasn’t yet looked into LightGBM’s XENDCG ranking method or made any progress on a more advanced handling of missing values (NAs), which has been raised before—let alone decided where either might fall in the long list of priorities. And again—just speculating—it’s probably well behind bigger initiatives like adding support for Asian stocks or refining the feature selection process in the AI/ML module. Honestly, I might make the same call. I’m not here to debate priorities.

Personally, I’ve addressed this by downloading the data. Once I have it, I can do whatever I want—no need to wait or ask permission.

The only real downside is that testing new features this way is slower and more tedious than using P123’s integrated AI/ML tools.

At one point, P123 experimented with Python execution in Colab, allowing users to run code while still honoring FactSet data restrictions. That was a smart and creative solution—limited at the time, but perhaps worth revisiting now with improvements.

Today, with LLMs increasingly able to run code natively—often without even needing Colab—it might be the perfect time to explore API integration with tools like ChatGPT or Gemini 2.5. An LLM could run lightweight code, or securely offload it to AWS or another cloud service in a sandboxed environment.

Just to be clear: I’m not unhappy with the current platform. Far from it—I’m genuinely impressed and finding it incredibly useful. But I’m not using the AI/ML module, and it seems like some who are do run into frustration.

There’s also a related issue: many of the most advanced users have professional or near-professional methods they prefer not to share publicly. They’re understandably reluctant to reveal those methods just to justify a feature request on the forum. A secure, sandboxed coding environment would allow them to implement and test their ideas privately—without needing to explain or defend their methods.

This post is intended purely as a constructive suggestion. In the meantime, I’ll continue downloading data and building models locally. But when it comes to specific feature requests for the AI/ML module, continuing to ask often feels counterproductive—for everyone, including P123.

That said, I do feel I now have a good degree of freedom to pursue my own methods, and I genuinely thank P123 for that. My only point is that maybe this freedom can be expanded even further.

And thank you, Hbee, for sharing your methods in the forum—I might give your approach a try via downloads.

As a final thought: it’s hard to imagine there aren’t a number of hedge fund managers or institutional users with serious coding skills—or dedicated quant teams—who would value the flexibility to build directly within the platform. Instead of relying on P123 to build every feature request, they could implement and test advanced methods themselves. People like Chaikin may have once needed custom development, but with today’s cloud tools and secure APIs, that level of dependency may no longer be necessary.

Offering this kind of power—without the retail discount—could make P123 even more attractive to enterprise clients, while letting individual users opt in at their own pace.

To some degree, I’m simply pointing out that P123 played a key role in building Chaikin’s original strategy—one that went on to generate millions. They could do more of that by enabling this kind of innovation, increasing revenue through volume, or finding new ways to better monetize what they’re already doing for enterprise clients—especially given the increased demand this kind of flexibility would likely create.

Lastly, I know there’s been discussion about licensing algorithms within P123. The obvious challenges—NDAs, IP protection, and privacy—have always made that difficult to implement. But maybe it’s time to revisit the idea in a new way:

Imagine a secure environment where users can privately publish algorithms, and other members can securely integrate them with their own custom features or models—without exposing any proprietary methods.
Think of “Bob’s Random Forest on Steroids” meeting “Jane’s Subtle Feature Interactions.” Jane might have brilliant feature engineering but weak modeling skills. Bob, on the other hand, has exceptional modeling capabilities but no domain expertise. They could collaborate—or simply license and combine elements—while revealing only what they choose to share.
True collaboration: private, secure, and scalable—with complete control over proprietary methods.


Research in this area has shown that even the best imputation algorithms have almost no provable advantage over simple cross-sectional averages in stock data.

https://arxiv.org/pdf/2207.13071

XENDCG is based on rank correlation, which in itself is very likely to lead to worse results than the default regression target. And discounting methods are of very dubious relevance on this issue.

I haven't heard of Chaikin and p123 being so closely related, let alone closely causally related. However, I'll just say that in finance, professional practitioners are actually not as good as professional researchers, on average, just like in medicine.


@ZGWZ. Thx for sharing the paper. Chen & McCoy argue that simple mean imputation performs comparably to more complex EM/factor methods for cross-sectional return predictors primarily because (1) missingness appears in large blocks by time, and (2) correlations among predictors are weak. Under these conditions, there is little cross-sectional information to exploit, so advanced procedures add noise without much benefit.

This actually supports the need for mean imputation rather than zero-fill! In p123, normalization takes place AFTER NA handling. If the data were normalized first (z-score, percentrank) before NA handling, then zero-fill would be equivalent to mean-fill.

I.e. If our features are standardized (mean zero, unit variance), then filling with zero is effectively mean imputation. Hence it achieves similar performance to more advanced methods for data resembling their asset-pricing context (e.g., large block missingness, low cross-predictor correlation). However, if zero is not our actual mean, we risk introducing bias. When zero has real meaning in our domain (like “no activity”), treat it carefully so the model does not conflate “truly zero” with “missing.”

So, to simplify it for the p123 team: the request is simply to use mean-fill (fill with the average value) as the default, in place of zero-fill.
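For anyone who wants to verify the equivalence hbee describes, here is a toy sketch (made-up numbers, not P123's actual pipeline): mean-filling the raw series and then applying the same standardization produces exactly the zero-filled version of the standardized series.

import numpy as np
import pandas as pd

raw = pd.Series([10.0, 12.0, np.nan, 8.0, 14.0])
mu, sigma = raw.mean(), raw.std()          # statistics of the observed (non-NA) values

# Path A: mean-fill first, then standardize with the observed statistics
path_a = (raw.fillna(mu) - mu) / sigma

# Path B: standardize first, then zero-fill (zero-fill after normalization)
path_b = ((raw - mu) / sigma).fillna(0.0)

print(np.allclose(path_a, path_b))         # True: the missing row lands at the mean either way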


Just a quick clarification on how XGBoost and LightGBM handle missing values, since this keeps coming up in discussions about NA treatment in tree-based models.

XGBoost handles missing values natively and intelligently. During training, it learns the optimal default direction (left or right) to send missing values at each split—based on which path improves model performance. This is not a static rule; it’s learned dynamically per split, depending on the data and objective function.

This means that the fact that a value is missing can carry predictive signal, and XGBoost is able to leverage that directly. It treats "missingness" as potentially meaningful, without requiring a separate indicator variable or special encoding.

By contrast, when you impute missing values—for example, by filling with zero or using the mean—you erase that signal. The model no longer knows the value was ever missing. That kind of dynamic, split-specific handling of missingness isn’t possible once you’ve imputed, because the absence of a value becomes indistinguishable from a real one.

LightGBM handles missing values the same way—by learning the optimal direction to route NAs at each split (Ke et al., 2017). It’s a core feature of how both algorithms work.

Anyone with control over NA handling in their models would naturally want the option to preserve this information, especially when it can meaningfully boost model performance.
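As a contrived but concrete illustration, here is a minimal sketch on synthetic data where the missingness itself carries the signal. The learned routing of NaNs shows up directly in the 'Missing' column of XGBoost's tree dump (trees_to_dataframe is part of XGBoost's Python API); the feature name and data are invented for the example.

import numpy as np
import pandas as pd
import xgboost as xgb

# Synthetic feature where being missing is itself predictive of the target
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
is_missing = rng.random(1000) < 0.3
y = np.where(is_missing, 1.0, x) + rng.normal(scale=0.1, size=1000)
X = pd.DataFrame({'f0': np.where(is_missing, np.nan, x)})

model = xgb.XGBRegressor(n_estimators=5, max_depth=2)
model.fit(X, y)

# Each split records which child node ('Yes' or 'No') missing values are sent to
trees = model.get_booster().trees_to_dataframe()
print(trees[['Tree', 'Node', 'Feature', 'Split', 'Yes', 'No', 'Missing']].head(10))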

—

"The presented algorithm... learns the best direction to handle missing values."
— XGBoost: A Scalable Tree Boosting System (Chen & Guestrin, 2016)



As I recall, P123's method is similar to Simple Median, not zero-filling. Since Simple Mean can be distorted in many cases by data whose distribution is grossly non-normal, there is no reason to think that Simple Median is worse than Simple Mean in this respect.
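For what it's worth, a two-line illustration of that robustness point, with made-up P/E values: a single extreme observation drags the mean far from the bulk of the data while leaving the median almost untouched.

import numpy as np

pe = np.array([8.0, 10.0, 12.0, 15.0, 400.0])  # one lottery-like outlier
print(pe.mean())       # 89.0  -- distorted by the outlier
print(np.median(pe))   # 12.0  -- barely affected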

@Jrinne. Very thoughtful response from you. THANK YOU.

APIs and downloading of data would be sufficient for me. Most advanced users like us would have the ability to spin up their own "rent a server" instance.

Personally, I run into an issue with downloading because p123 caps the number of features that can be downloaded at once, which means I need to split my feature set into many smaller ones.

Your suggestion of a "marketplace" is valuable. I'd just caution that this type of business model has been tried by other businesses like Quantopian and, more recently, TradingView, with mixed results. The issue is that advanced users aren't going to need it, and basic users will find it too intimidating. Where this type of business model really works out is in the form of pod shops at Citadel and most other trading firms. That setup wouldn't really apply to a software vendor like p123.

Lmk how your testing of XENDCG goes. I'm very interested to know if it improves OOS performance.


[Screenshot: AI Factor model settings, 12mLnReturn_00-24-ALLCAP-macro-v1.5c (Portfolio123)]

@ZGWZ. I'm aware that p123 uses median fill as a "FALLBACK" to handle NAs in their formulas module. I'm referring to how NAs are handled within the AI Factor module, which is zero-fill, not median or average fill (see the screenshot above).


I hadn't seen this hint when I used it during the test period, and it actually does require median imputation.

TL;DR: Constructing long-short stock portfolio with a new listwise learn-to-rank algorithm
"...out-of-sample annual return of 38% with the Sharpe ratio being 2."

So putting this paper into the context of previous P123 discussions: @yuvaltaylor has mentioned that he likes the ranks P123 Classic provides. In that same spirit, I’ve long advocated for Spearman Rank Correlation as a possibly useful metric, which we now have in the Rank Performance test.

Yuval has a point, doesn’t he? At the end of the day, we want to know which stocks are best (or ranked first). Right now, we typically calculate predicted returns—then sort those to generate ranks.

But is that the best approach? Is it better to predict ranks directly instead of returns? That seems like a fundamental question, especially for a machine-learning-oriented site that has historically placed a lot of emphasis on ranks, and that still sorts and ranks the ML predictions when running them through P123 Classic, which itself relies heavily on ranks. I've done some minor exploration of this idea. It's far from definitive, and I may have made errors—so I encourage others to check this out and rerun or adapt the code as needed.

Thanks again, hbee—your post is highly relevant to the work many of us are doing! My apologies for the rough/incomplete nature of what follows.

Python output for the code below: Overall Average NDCG: 0.8790

I'm still learning so I asked ChatGPT what it thought of the score:

Yes, 0.879 is a very encouraging result. It suggests your ranking model is quite strong out-of-sample — especially if you’re using GroupKFold and haven’t peeked into the future.

My understanding is that NDCG is a bit like Spearman's Rank Correlation, but with extra emphasis on correctly ranking the top items—which seems ideal for portfolio selection when we care most about the top-ranked stocks.
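To put a number on that intuition, here is a tiny sketch using sklearn's ndcg_score with made-up relevance labels (nothing from P123): mixing up the top of the list costs far more NDCG than mixing up the bottom.

import numpy as np
from sklearn.metrics import ndcg_score

true_rel = np.asarray([[3, 2, 1, 1, 0]])    # true relevance of 5 stocks

perfect   = np.asarray([[5, 4, 3, 2, 1]])   # scores that rank them perfectly
swap_top  = np.asarray([[4, 5, 3, 2, 1]])   # swap the two best stocks
swap_tail = np.asarray([[5, 4, 3, 1, 2]])   # swap the two worst stocks

print(ndcg_score(true_rel, perfect))    # 1.00
print(ndcg_score(true_rel, swap_top))   # ~0.93  -- heavy penalty at the top
print(ndcg_score(true_rel, swap_tail))  # ~0.99  -- mild penalty at the bottom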

Here is some code that ran for me, as a start for those interested:


import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import GroupKFold
from sklearn.metrics import ndcg_score

# --------------------
# Load & Concatenate Data
# --------------------
csv_paths = [f'~/Desktop/DataMiner/xs/DM{i}xs.csv' for i in range(1, 9)]
dfs = [pd.read_csv(path) for path in csv_paths]
df = pd.concat(dfs, ignore_index=True)

# --------------------
# Clean Target & Parse Dates
# --------------------
df['ExcessReturn'] = df['ExcessReturn'].fillna(0)  # zero-fill missing target values
df['Date'] = pd.to_datetime(df['Date'])  # <-- your actual date column

# --------------------
# Create Weekly Group IDs
# --------------------
df['group_id'] = df['Date'].dt.to_period('W').astype(str)  # group by week

# Sort so each week's rows are contiguous (LGBMRanker's group sizes assume this ordering)
df = df.sort_values('group_id', kind='stable').reset_index(drop=True)

# Rank within each group by ExcessReturn (descending = better)
df['rank_within_group'] = df.groupby('group_id')['ExcessReturn'].rank(method='first', ascending=False)

# --------------------
# Convert Rank to Relevance Labels
# --------------------
def assign_relevance(group):
    q = group['rank_within_group'].rank(pct=True)
    bins = [-0.01, 0.2, 0.4, 0.6, 1.0]
    labels = [3, 2, 1, 0]  # 3 = top 20%, 0 = bottom 40%
    group['Relevance'] = pd.cut(q, bins=bins, labels=labels).astype(int)
    return group

df = df.groupby('group_id', group_keys=False).apply(assign_relevance)

# --------------------
# Select Features & Labels
# --------------------
features = []  # <-- list your feature column names here

X = df[features]
y = df['Relevance']
groups = df['group_id']

# --------------------
# Initialize LightGBM Ranker
# --------------------
ranker = lgb.LGBMRanker(
    objective='xendcg',
    n_estimators=100,
    min_child_samples=2000,
    colsample_bytree=0.3,
    random_state=1,
    n_jobs=8,
    verbose=1,
    label_gain=[0, 1, 3, 7]
)

# --------------------
# GroupKFold Cross-Validation by Week
# --------------------
gkf = GroupKFold(n_splits=5)  # folds split by week; note GroupKFold does not preserve chronological order
ndcg_scores = []

for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups=groups)):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    # Group sizes must follow the row order of X_train (hence the sort by group_id above)
    train_groups = df.iloc[train_idx].groupby('group_id').size().tolist()
    
    ranker.fit(X_train, y_train, group=train_groups)
    y_pred = ranker.predict(X_val)

    # Evaluate NDCG
    val_df = df.iloc[val_idx].copy()
    val_df['y_pred'] = y_pred
    group_ndcgs = []

    for _, group_data in val_df.groupby('group_id'):
        true_rel = group_data['Relevance'].values.reshape(1, -1)
        pred_scores = group_data['y_pred'].values.reshape(1, -1)
        ndcg = ndcg_score(true_rel, pred_scores)
        group_ndcgs.append(ndcg)

    fold_ndcg = np.mean(group_ndcgs)
    ndcg_scores.append(fold_ndcg)
    print(f"[Fold {fold+1}] NDCG: {fold_ndcg:.4f}")

print(f"\n✅ Overall Average NDCG: {np.mean(ndcg_scores):.4f}")

BTW, you are right: you cannot download everything at once. I have downloaded 3-year blocks and concatenated them when loading into the Jupyter Notebook, as you can see in the code. I have also created a single file on my desktop, but that cannot be done with Excel, as you note. It did not seem to make anything easier for me, so I saw no need.

This was a fun challenge, but I will probably stop here. If anyone uses the ideas from this post and continues to explore this using P123 downloads: please share your results. These are not really my ideas, and I do not claim to have fully explored them. Clearly, Hbee and the first reference deserve any credit for originating this idea, so negative findings are more than welcome to better understand its potential value.

A few references:

Constructing long-short stock portfolio with a new listwise learn-to-rank algorithm

Learning-to-rank with LightGBM (Code example in python)

Investment Ranking Challenge : Identifying the best performing stocks based on their semi-annual returns

The math on NDCG (technical): Introduction to Learning to Rank

Stock Ranking Analysis using AI

A Novel Interpretable Stock Selection Algorithm for Quantitative Trading

Replace " Predicting Road Cycling Race Outcomes" with stock returns and seems highly relevant: A Learn-to-Rank Approach for Predicting Road Cycling Race Outcomes

A Practical Guide to Normalized Discounted Cumulative Gain (NDCG)


This will cause the algorithm to love lottery-like stocks.

Edit:

That's why it's applicable to search-based application scenarios and not others: you only want links on the first page of search results that apply specifically to you, but you shouldn't be chasing stocks that will end up at the top of the month's gainers list.

@ZGWZ, upon a closer second reading of the paper you shared:

There is one setting in which mean imputation can lead to significant underperformance. Using GBRT, mean imputation can lead to noticeably lower returns compared to EM, perhaps because regression trees deal poorly with the degenerate distributions that result from mean imputation. One should not overinterpret this result, however, since tree-based algorithms often have their own specialized methods for handling missing values, which we do not employ because of our goal of evaluating “general-purpose” imputations.

So it's clear that it might be valuable to have the option to retain NAs, rather than median-fill or zero-fill.

@Jrinne great work running the test. How does the result compare to your own tests? Is it better or worse, and by how much percentage-wise?

In my testing of the AI system, LightGBM did not produce particularly poor results like those in this paper, but rather the opposite. Therefore, the data about LightGBM here is not representative.

As to what conclusions the authors are attempting to draw, it's irrelevant.

Edit:

Honestly, I don't know why people even consider the authors' conclusions more important than the data that contradicts them.

Exactly. There’s no reason to discard the potential information carried by missingness.

Both XGBoost and LightGBM handle NAs natively by trying missing values on both sides of each split and selecting the direction that improves the objective function the most. This decision is made dynamically at each split, based on the training data.

In other words, being NA can be informative.
For example, if a stock has no analyst coverage, the absence of earnings estimates might suggest it's under-followed or illiquid. That could be a good or bad sign—but you want the model to have access to that signal, learn from it, and use it when it’s predictive.

XGBoost and LightGBM can do exactly that—but only if you let the NAs remain. If you zero-fill or median-fill, you overwrite that signal, and the model loses the opportunity to learn from it.


Chance does not mean effect, much less positive effect.

Just because it's theoretically possible or there's some sort of theoretical narrative, it's far from being practically feasible or effective.

What you're saying is really just an expanded and reworded "but Boosting can be done with NA as input," plus an "if so, it's better" narrative constructed on that basis. But that doesn't mean anything.

Edit: Since Boosting methods still require implicitly inferring the meaning of missing values to utilize NA inputs, the empirical conclusions of this paper are still valid, regardless of what the authors say. In fact, this author even claimed that the factor approach made no practical investment sense in a paper where the data implied that the factors were valid.

True—but is it always just chance?

It can be random, but missingness can also carry meaningful information.

Take the example of earnings estimates: if a stock has no analyst coverage, that absence isn’t just noise—it’s a signal. It might indicate the stock is a micro-cap, illiquid, or flying under the radar. That could be a good thing or a bad thing, but it’s still informative, not just random.

Even worse would be assuming that a micro-cap stock with no coverage has the same earnings estimate as the median in an all-cap universe. That's not just noise—it introduces systematic bias, which is more dangerous than randomness.

Let the model determine whether the missingness is meaningful—through cross-validation. XGBoost and LightGBM are designed to do exactly that—but only if you allow the NAs to remain.

If I had full control, I’d likely use a hybrid approach: impute where it makes sense, and preserve NAs where the missingness itself might carry signal.
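To make that hybrid idea concrete, here is a minimal sketch with hypothetical column names (PE, EPSEstimate) over a per-date cross-section: impute the columns where a fill makes sense and leave NaN in the ones where the missingness itself may carry signal for LightGBM to use. It's only an illustration of the policy, not a recommendation for any specific feature.

import numpy as np
import pandas as pd

# Hypothetical cross-sectional data for one date
df = pd.DataFrame({
    'Date': ['2024-01-05'] * 4,
    'PE': [12.0, np.nan, 25.0, 8.0],            # fill: NA here is mostly a data gap
    'EPSEstimate': [1.5, np.nan, 2.1, np.nan],  # preserve: NA = no analyst coverage
})

impute_cols = ['PE']              # filled with the cross-sectional median per date
preserve_cols = ['EPSEstimate']   # left as NaN for LightGBM to route natively

df[impute_cols] = df.groupby('Date')[impute_cols].transform(lambda s: s.fillna(s.median()))
print(df)   # 'PE' NA becomes 12.0 (median of 8, 12, 25); 'EPSEstimate' NAs remain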

In general, you should train your strategy according to the size of the market capitalization you are trading.

Beyond that, what you're talking about is your recurring theoretical narrative, which requires empirical research to show what it means. And the specific example you gave can be addressed with an analyst-count factor or an NA-count factor.


I agree—you could definitely incorporate features like analyst count, market cap, or an NA indicator to help the model.

But just so I understand: if we choose to include earnings estimate features, and a micro-cap stock has no analyst coverage, what value would you recommend imputing for that missing estimate?

If we’re using simple imputation, what do you and Hbee think works best in that kind of situation? I believe Hbee is at least asking for the option to use the median—but is median always the best choice, or would you prefer more flexibility in selecting the method?

Personally, I’d be in favor of having a range of options. I’d probably choose to impute in some cases and let LightGBM handle NAs natively in others—depending on the feature and the context.

And for sure, in the context of a feature request, I wouldn’t want anyone locked into my approach.
If I were to make a request, it would simply be for more flexibility in NA handling—and greater control over the modeling process overall. That could include Hbee’s request for median imputation, along with the ability to retain NAs or apply other strategies depending on the nature of the feature and the goals of the model.

Let users decide what works best for their workflow—and let cross-validation sort out what actually performs.

And I agree: people often spend a lot of time debating theory or papers—sometimes using data that doesn’t really match their own use case—when they could simply cross-validate the method on their own data and strategy to get a practical answer.

Flexibility in the code would allow us to explore different approaches, discover what truly works for our specific use cases—and reduce the need for theoretical debates. That would probably be a good thing.

Now, if you're an enterprise customer and pay something like $100,000, you can bypass all of this—get exactly what you want, custom-built, no forum debates needed. That’s what Marc Chaikin did a while ago. And he went on to build a successful brand and become widely known in the investment world: Chaikin Analytics.

That was probably good for Chaikin—and probably good for P123, too—but I do sense a bit of lingering frustration on P123’s side whenever this is brought up.

The thing is, I suspect what Chaikin did back then would be considered fairly trivial by today’s standards. Users like @hbee and @pitmaster seem to be operating at a much more advanced level, building better models on home machines using downloads—without any need for proprietary licensing.

So is P123 still relying on the best model? Maybe. If charging $100,000 for access to features like NDCG or custom NA handling strikes the right supply-demand balance, maybe that's the right call.

But it makes me wonder—could a different kind of enterprise model work even better? One that gives power users more flexibility without needing to go full-custom or enterprise-tier just to access basic improvements?

The factors used in the paper include analyst-related factors like EPS Forecast Revision.

1 Like