AlgoMan HFRE v2.4 — Hierarchical Factor Ranking Engine (Download)

TL;DR: You may want to handle NAs differently for different applications, and with P123 downloads you can.

Right. There are different ways to handle the NAs once they are identified, but which imputation or proxy you want to use depends on what you are doing.

Assigning NAs a value of 0.5 is a fine way to do it with the usual regression, I think. Better is zero-centering, as is done with the AI now, because you no longer have to worry about the y-intercept (everything passes through the origin). Maybe this is a fine point that is not too important, but I added it to be complete. I agree with @marco on this, and he said it well, so no need for me to expand:

Maybe to clarify: this only works because using z-scores for both the returns and the features zero-centers everything. Marco makes a good point that the y-intercept is now 0 and does not need to be calculated.
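A quick numeric check of that point, as a sketch with made-up data: z-score both the feature and the target, fit an ordinary least-squares line with an explicit intercept column, and the fitted intercept comes out at machine-precision zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500) + 3.0  # raw target has a nonzero mean

# z-score both the feature and the target
xz = (x - x.mean()) / x.std()
yz = (y - y.mean()) / y.std()

# fit OLS with an explicit intercept column
X = np.column_stack([np.ones_like(xz), xz])
coef, *_ = np.linalg.lstsq(X, yz, rcond=None)
intercept, slope = coef
print(intercept)  # effectively 0: the fitted line passes through the origin
```

This is just the textbook fact that an OLS line passes through the point of means; once both variables have mean 0, that point is the origin.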

But on the other hand, if you are doing hierarchical clustering then you are right about this being a problem:

There is more than one way to handle this for correlation studies and/or hierarchical clustering, I think. I don't want to focus too much on the positives and negatives of each method; maybe KNN is better than my method. But assigning 0.5 to NAs has exactly the problem you describe, I think.
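A small simulation of that problem (purely synthetic data, nothing P123-specific): two genuinely independent features share the same missing rows, and every missing value gets the constant 0.5. Because the fill sits away from the features' observed means, the shared block pulls the measured correlation well above the true value of roughly zero, which is exactly what can mislead a clustering algorithm.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# two genuinely independent features, observed values centered near 1.5
f1 = rng.normal(loc=1.5, size=n)
f2 = rng.normal(loc=1.5, size=n)
print(np.corrcoef(f1, f2)[0, 1])  # close to 0: truly uncorrelated

# the same 1000 rows are "missing" for both features; fill all with 0.5
f1[:1000] = 0.5
f2[:1000] = 0.5
print(np.corrcoef(f1, f2)[0, 1])  # inflated well above 0 by the shared block
```

Note the inflation depends on the fill value differing from the features' means; a shared mean-valued fill would not inflate Pearson correlation in this idealized case, but real factors rarely have their mean at exactly 0.5.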

But here is what will help if you want to use any imputation method other than P123's method of assigning NAs 0 or 0.5 (depending on the download).

Let me illustrate with zero-centering, where an NA becomes 0. In a P123 download there is typically only one true value of exactly 0; all the other zeros are just NAs that P123 imputed to 0. Therefore, your Python program can safely find the zero values and re-impute them any way you want, because all but (roughly) one of those zeros were NAs to begin with.

True, there may be one ticker whose true value is 0, and reassigning it would be an error for that ticker. But every other zero in P123's downloads is an NA. The fraction of errors introduced is one divided by the number of NAs plus one, which can be quite small for features with a lot of NAs (e.g., with 500 NAs the error rate is 1/501, about 0.2%).

Edit: I tend to use ranks, and this is exactly true for ranks: there will be exactly one 0.0 or 0.50 value per factor. For z-scores there could be more than one ticker with a true value of exactly 0.0, but that count will usually be small, finite, and quantifiable.
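In code, the "reverse the imputation" step for a rank download might look like the sketch below (column names and data are made up; P123's actual file layout may differ). The median fill shown at the end is just one option; you could swap in KNN, a sector proxy, or anything else once the NAs are recovered.

```python
import numpy as np
import pandas as pd

# Hypothetical rank download: P123 stores NAs as 0.50 (columns are made up)
df = pd.DataFrame({
    "EarningsYieldRank": [0.90, 0.50, 0.10, 0.50, 0.70],
    "SalesGrowthRank":   [0.50, 0.30, 0.50, 0.80, 0.20],
})

# Treat every 0.50 as a recovered NA (at most one ticker per factor
# truly ranks exactly 0.50, so the error rate is tiny)
recovered = df.replace(0.50, np.nan)

# Re-impute however you wish -- per-column median shown as one option
imputed = recovered.fillna(recovered.median())
print(imputed)
```

For a zero-centered download the idea is identical: replace 0.0 (instead of 0.50) with NaN, then impute.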

You could even do this with LightGBM and use its native handling of NAs (using downloads).

Anyway, with downloads you can handle NAs however you wish, I think. That is huge.

__________________________________________________________________________________

I get that there is a lot to unpack there, enough that I could have made a mistake. So I had Gemini fact-check it. I made one labelled edit above based on this:

The last post by Jrinne on the Portfolio123 Community discusses technical nuances of handling Missing Values (NAs) in financial modeling, and it is largely correct within the context of Portfolio123 (P123) and general data science practices.

Here is a breakdown of why the points made are considered accurate:

1. Imputation Methods (0.5 vs. Zero-Centering)

  • The Post's Claim: Assigning NAs a value of 0.5 is standard for some regressions, but zero-centering (where NAs are assigned 0 after Z-score normalization) is superior because it eliminates the need for a y-intercept (the model passes through the origin).

  • Correctness: High. In machine learning, if you normalize your features (mean = 0, std dev = 1), assigning missing values to 0 (the mean) is a standard "mean imputation" technique. As the post notes, if both your target and features are zero-centered, the intercept in a linear regression should theoretically be zero.

2. The Problem with Correlation and NAs

  • The Post's Claim: Assigning a constant value (like 0.5) to NAs creates "shared blocks" that artificially inflate correlations between features.

  • Correctness: High. If two different factors both have NAs for the same 100 stocks and you assign them all "0.5," those factors now perfectly correlate for those 100 observations. This can mislead hierarchical clustering algorithms into thinking factors are more similar than they actually are.

3. Identifying NAs in P123 Downloads

  • The Post's Claim: In P123 downloads, you can identify NAs even if they have been imputed to 0, because it is statistically unlikely for a true factor value to be exactly 0 (except for one possible ticker).

  • Correctness: Contextually High. For many financial ratios (like Yield or Sales Growth), a value of exactly 0.00000 is rare. This "hack" allows users to "reverse" the imputation in Python and apply more advanced techniques like K-Nearest Neighbors (KNN) or LightGBM's native NA handling.

4. Application-Specific Handling

  • The Post's Claim: The "best" way to handle NAs depends on what you are doing (e.g., Regression vs. Clustering vs. Tree-based models).

  • Correctness: High. This is a fundamental principle in data science. There is no "one size fits all" for missing data; for example, XGBoost/LightGBM can learn which direction to send NAs, while linear models require explicit numerical imputation.

Summary

The post is technically sound. It correctly identifies that while P123's default handling (0.5 or 0) is "fine" for simple ranking, it introduces biases in advanced AI workflows—biases that can be mitigated by taking advantage of the data provided in P123 downloads.