This is a wonderful idea founded on solid research, with a nice implementation. It is a well-established idea elsewhere: e.g., SelectKBest.
But I find that some variables that are clearly and consistently helpful (like earnings estimate revisions), with top buckets that are very significantly elevated on a rank performance test, get no credit. These factors are significant in both a practical and a statistical sense, yet they score zero with this metric.
Why?
- I am fooled, and features I thought were important are useless in reality.
- The NAs create enough noise that the importance of the top buckets is masked for the mutual_information_regression metric.
I am sure it is the latter. NAs affecting the rank performance test have been a persistent problem that has been difficult to solve. But with tree-based models you have a trivial solution.
The link to the solution is here: Suggestions for improvements - #5
I note @pitmaster gave the idea a thumbs up. Without involving me you might contact Pitmaster and get his frank ideas. I am sure he would share his ideas on the topic with you. Think about it in greater depth.
To summarize: making NAs into a negative rank completely separates them from the rest of the data. For tree models, this preserves meaningful information about NAs. A tree model can easily split out the NA data, or include it, when "deciding" whether to do a split.
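Here is a minimal sketch of what I mean, on synthetic data. Everything in it is an assumption for illustration: the helper name `rank_with_na_sentinel`, the sentinel value of -1, the 30% NA rate, and the made-up relationship between rank and return. It is not how the platform ranks things internally.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def rank_with_na_sentinel(values: pd.Series, na_value: float = -1.0) -> pd.Series:
    """Percentile-rank non-missing values into (0, 1]; map NAs to a negative sentinel."""
    ranks = values.rank(pct=True)  # NaN stays NaN with the default na_option="keep"
    return ranks.fillna(na_value)  # NAs now sit apart from every real rank


# Synthetic factor with roughly 30% missing values (assumed rate).
rng = np.random.default_rng(0)
n = 2000
values = rng.normal(size=n)
is_na = rng.random(n) < 0.3
raw = pd.Series(np.where(is_na, np.nan, values))

feature = rank_with_na_sentinel(raw)

# Made-up target: higher rank means higher return, and NA names do clearly worse.
target = np.where(is_na, -1.0, feature.to_numpy()) + rng.normal(scale=0.1, size=n)

X = feature.to_numpy().reshape(-1, 1)
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, target)

# A negative threshold at the root means the first split isolates the NA group.
root_threshold = tree.tree_.threshold[0]
print("root split threshold:", root_threshold)
```

Because every real rank is positive and every NA is negative, a single split below zero is enough for the tree to treat the NAs as their own group. If the NAs carried no information, the tree would simply not spend a split on them.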
If Pitmaster really likes the idea after thinking about it for a while, it could solve a lot of recurring problems that will otherwise keep popping up. Or maybe he will find a problem with my idea that I have not thought of. I would be happy to join any discussion with your AI expert, Pitmaster, bobmc and others if you pursue this idea further.
As one example of a feature that would benefit: implementing SelectKBest is absolutely brilliant! Of course, it would be nice if it worked to separate out meaningless features with no information (as it was designed to do). I think the NAs are the problem.
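For reference, this is roughly the behavior I would expect when there are no NAs at all. It is a sketch using scikit-learn's `SelectKBest` with `mutual_info_regression`, which I am assuming is close to what sits behind the metric here. The data is synthetic and the wrapper `mi_score` is just a hypothetical helper to fix the random seed.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.default_rng(0)
n = 1000

informative = rng.random(n)  # stands in for a ranked factor that really matters
noise = rng.random(n)        # a feature with no information about returns
returns = informative + rng.normal(scale=0.1, size=n)  # made-up relationship

X = np.column_stack([informative, noise])


def mi_score(features, y):
    """Mutual information per column, with a fixed seed so runs are repeatable."""
    return mutual_info_regression(features, y, random_state=0)


selector = SelectKBest(score_func=mi_score, k=1).fit(X, returns)

kept = selector.get_support()  # boolean mask over the columns of X
print("scores:", selector.scores_)
print("kept:", kept)
```

On clean data like this, the informative column should score well above the noise column and be the one that is kept. That is the separation I am not seeing for factors I know are useful.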
Of course, I might be missing something about mutual_information_regression, and if so I would love to understand it better.
Jim