I have understood that a high percentage of NAs is unwanted when creating machine learning models, and also that the “good” factors are the ones with a high Target Information Regression.
But for me, all the factors with a high Target Information Regression have a very high % of NAs.
That seems contradictory to me. Does anyone have any insight into why I get these results?
Mathematically, it finds "information" in the data. NAs in P123's ranking system carry an abundance of information. In this example we can very reliably predict that a rank of 100 will have zero return.
Unfortunately, it is false information that cannot be used. It comes from the fact that the middle buckets are empty, which is really just an artifact of the way NAs are handled for this single factor:
Regularization is one way to fix this. Here is the same ranking system with a small amount of random noise added, which makes it so there are no empty buckets:
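To make the noise idea concrete, here is a minimal sketch in plain Python. It is not P123's code, and the NA handling is an assumption for illustration: every NA is given one shared rank in the middle, with half of the known values ranked below and half above. The function names (`neutral_na_ranks`, `rerank_with_noise`, `occupied_buckets`), the 80% NA rate, and the 20 buckets are all made up for the example. It only shows how clumped NAs leave buckets empty and how a tiny amount of noise followed by re-ranking fills them.

```python
import random

def neutral_na_ranks(values):
    """Percentile ranks (0-100) where every NA (None) shares one middle rank.

    Assumed, simplified stand-in for "neutral" NA handling; not P123's code.
    """
    n = len(values)
    known = sorted((v, i) for i, v in enumerate(values) if v is not None)
    n_na = n - len(known)
    half = len(known) // 2
    ranks = [0.0] * n
    # Lower half of the known values takes the bottom positions.
    for pos, (_, i) in enumerate(known[:half]):
        ranks[i] = 100.0 * pos / (n - 1)
    # Upper half of the known values sits above the block of NAs.
    for pos, (_, i) in enumerate(known[half:]):
        ranks[i] = 100.0 * (half + n_na + pos) / (n - 1)
    # Every NA gets the midpoint of the NA block, so they are all tied.
    mid = 100.0 * (half + (n_na - 1) / 2) / (n - 1)
    for i, v in enumerate(values):
        if v is None:
            ranks[i] = mid
    return ranks

def rerank_with_noise(ranks, rng, scale=1e-6):
    """Add tiny random noise to break ties, then recompute percentile ranks."""
    n = len(ranks)
    noisy = [r + rng.uniform(-scale, scale) for r in ranks]
    order = sorted(range(n), key=lambda i: noisy[i])
    out = [0.0] * n
    for pos, i in enumerate(order):
        out[i] = 100.0 * pos / (n - 1)
    return out

def occupied_buckets(ranks, n_buckets=20):
    """Count how many equal-width rank buckets contain at least one stock."""
    return len({min(int(r / 100.0 * n_buckets), n_buckets - 1) for r in ranks})

rng = random.Random(0)
# Hypothetical factor: 1000 stocks, roughly 80% of them NA.
values = [rng.gauss(0, 1) if rng.random() > 0.8 else None for _ in range(1000)]

clumped = neutral_na_ranks(values)
spread = rerank_with_noise(clumped, rng)

occupied_before = occupied_buckets(clumped)
occupied_after = occupied_buckets(spread)
print("occupied buckets with tied NAs:", occupied_before)
print("occupied buckets after noise:  ", occupied_after)
```

The noise is far smaller than the gap between neighbouring ranks, so it only reorders the tied NAs and leaves the ordering of the known values unchanged.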
There are other, probably better, ways to deal with this. I do think using Target Information Regression is a great idea that could be tweaked. And P123 is actively looking into ways to fix this, improve it or replace it.
This would be a great time to fix the NA problem in general, however.
The problems created by the present handling of NAs are not limited to Target Information Regression. For example, in P123 classic the NAs do not help give us an accurate slope, and they are generally a problem when selecting single factors (just as they are when using Target Information Regression).
And there is real information in the NAs that could be harvested if NAs were handled differently.
Accuracy checked with Claude 3. It found no errors with this summary: "Your analysis seems well-reasoned and highlights an important issue in data analysis and financial modeling. The suggestion to revisit the overall handling of NAs in the system appears to be a sound recommendation that could potentially improve multiple aspects of the analysis."