I understand that a high percentage of NAs is undesirable when building machine learning models, and also that the “good” factors are the ones with a high Target Information Regression.
But for me, all the factors with a high Target Information Regression also have a very high % of NAs.
That seems contradictory to me. Does anyone have any insight into why I get these results?
Mathematically, it finds "information" in the data. NAs in P123's ranking system carry an abundance of information. In this example we can very reliably predict that a rank of 100 will have zero return.
Unfortunately, it is false information that cannot be used. It comes from the fact that the middle buckets are empty, which is really just an artifact of the way NAs are handled for this single factor:
Regularization is one way to fix this. Here is the same ranking system with a small amount of random noise added, which makes it so there are no empty buckets:
There are other, probably better, ways to deal with this. I do think using Target Information Regression is a great idea that could be tweaked, and P123 is actively looking into ways to fix it, improve it, or replace it.
This would be a great time to fix the NA problem in general, however.
The problems created by the present handling of NAs are not limited to Target Information Regression. For example, in P123 classic, NAs do not help give us an accurate slope and are generally a problem for selecting single factors (just as they are when using Target Information Regression).
And there is real information in the NAs that could be harvested if NAs were handled differently.
Accuracy checked with Claude 3. It found no errors with this summary: "Your analysis seems well-reasoned and highlights an important issue in data analysis and financial modeling. The suggestion to revisit the overall handling of NAs in the system appears to be a sound recommendation that could potentially improve multiple aspects of the analysis."
The high-NA → high-TIR correlation usually isn't real signal. It's an artifact of how the ranking handles NAs. As Jrinne laid out earlier in this thread, NAs leave empty middle buckets, and mutual_info_regression reads that empty structure as "information." So a factor that's mostly NA can score high on Target Information Regression while telling you almost nothing predictive.
P123 leaves NAs out of the factor's z-score/percentile calculation entirely. Mean and standard deviation are computed only on the stocks that actually have a value. Then, depending on the node's NA-handling setting, it either drops those stocks or pins them all to one neutral rank. Either way you get a spike or a gap in the rank distribution instead of a smooth spread, and that gap is the "structure" the mutual-information score latches onto.
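To make the spike concrete, here is a minimal Python sketch. It is not P123's actual code: the function names, the neutral rank of 50, the 0-100 percentile scale, and the 10 equal-width buckets are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def rank_with_pinned_nas(values, neutral_rank=50.0):
    """Percentile-rank the non-NA values on a 0-100 scale and pin every NA
    to one neutral rank (a hypothetical stand-in for an NA-neutral setting)."""
    present = sorted(v for v in values if not math.isnan(v))
    n = len(present)
    # Map each distinct value to the position of its first occurrence.
    position = {}
    for i, v in enumerate(present):
        position.setdefault(v, i)
    ranks = []
    for v in values:
        if math.isnan(v) or n < 2:
            ranks.append(neutral_rank)
        else:
            ranks.append(100.0 * position[v] / (n - 1))
    return ranks

def bucket_counts(ranks, n_buckets=10):
    """Count how many ranks fall in each equal-width bucket of [0, 100]."""
    width = 100.0 / n_buckets
    counts = Counter(min(int(r // width), n_buckets - 1) for r in ranks)
    return [counts.get(b, 0) for b in range(n_buckets)]

# Toy factor: 20 stocks have a value, 80 stocks are NA.
values = [float(i) for i in range(20)] + [float("nan")] * 80
ranks = rank_with_pinned_nas(values)
print(bucket_counts(ranks))  # [2, 2, 2, 2, 2, 82, 2, 2, 2, 2]
```

With 80% NAs, one bucket holds 82 of the 100 stocks while every other bucket holds 2. That lopsided shape says nothing about returns; it only reflects where the NAs were parked.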
A few practical ways to deal with it:
Check NA% before trusting TIR. For any factor that's heavily NA, treat its TIR score as suspect: it's largely measuring the NA pattern, not predictive power.
Regularize so the buckets aren't empty. Jrinne's trick of adding a small amount of random noise works because it removes the artificial empty-bucket structure, so the score reflects the actual factor again.
Don't lean on single-factor TIR for selection when NA coverage is poor. It's most reliable on factors with good coverage; on sparse ones it's noisy at best.
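As a rough sketch of the first two suggestions in Python: the helpers below are hypothetical, the 0.3 NA cutoff and the 0.5 noise scale are arbitrary illustrative values (not P123 defaults), and NAs are assumed to be represented as float("nan").

```python
import math
import random

def na_fraction(values):
    """Share of entries that are NA (represented here as float('nan'))."""
    if not values:
        return 0.0
    return sum(1 for v in values if math.isnan(v)) / len(values)

def flag_suspect_factors(factors, max_na=0.3):
    """Names of factors whose NA share exceeds max_na.
    The 0.3 cutoff is an arbitrary example, not a recommended threshold."""
    return sorted(name for name, vals in factors.items()
                  if na_fraction(vals) > max_na)

def jitter_ranks(ranks, scale=0.5, seed=0):
    """Add small uniform noise to each rank so tied ranks are no longer
    identical, then clip back to [0, 100]. The scale is an assumption."""
    rng = random.Random(seed)
    return [min(100.0, max(0.0, r + rng.uniform(-scale, scale)))
            for r in ranks]

factors = {
    "dense_factor": [1.0, 2.0, 3.0, 4.0, float("nan")],   # 20% NA
    "sparse_factor": [1.0] + [float("nan")] * 4,           # 80% NA
}
print(flag_suspect_factors(factors))  # ['sparse_factor']

pinned = [50.0] * 80              # 80 NA stocks all pinned to one rank
jittered = jitter_ranks(pinned)
print(len(set(jittered)) > 1)     # the ranks are no longer all identical
```

Noise this small only breaks the ties. How much noise it takes before the TIR score stops rewarding the NA pattern is something to check on your own data.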
The short version: there's no magic fix on the P123 side. Either discount TIR for high-NA factors, or de-sparse them (noise/regularization) before scoring.