I'm wondering if other users can say what range of values they are getting for Gini, etc. When I ask chatGPT:
I'm generally getting values < 0.4 for Gini with ~100 features models.
Are you using Gini to assess feature importance? It is mostly used for classification, say win/loss, rather than the magnitude of the win or loss. I would not use Gini alone.
Yes, feature importance. So far I've been using AI exclusively on large-cap models and have a few with significant excess returns. Before P123 launched AI, my AI knowledge was about zero. When I ask one of the AIs what the ideal number of features is (20 years of fundamental data on 5000 stocks), they typically say 100 +/- 50. From what I've read, I need to be concerned about having too many features (over-fitting), so I'm spending a lot of time on feature selection. This seems to be going well based on my results, but when I look at the Gini and Moro Index values they are not good. I really have no context for comparison with other P123 users, and for all I know chatGPT, Grok, etc., might be wrong to begin with, both on the number of features and on how to measure their importance. Thanks
I would be curious to know as well what coefficients people are seeing
Some color as the discussion unfolds:
The Gini coefficient is a measure of how good the SOLO feature is at splitting winners and losers at each split across all trees (win/loss composition; the Gini coefficient is typically used for inequality studies in economics).
The goal of the algo is to reduce MSE for the final model. A feature with a low Gini can still improve MSE if that solo feature interacts well with others, so if I ever remove one based on Gini I would want to compare MSE or results before and after.
The other useful measure is mean squared error reduction, or variance reduction (also SOLO), across splits and trees for each feature. This is separate from the final MSE.
I have been trying to learn more as well. If anyone has any corrections, they would help improve my understanding. Basically, we cannot disregard low-Gini features without further review.
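For intuition on the impurity a tree measures at each node: it is one minus the sum of squared class proportions. A minimal sketch (my own illustration, not P123's code):

```python
# Gini impurity of a node, given the class proportions (e.g. winners vs losers) in it.
def gini_impurity(proportions):
    return 1.0 - sum(p ** 2 for p in proportions)

# A pure node (all winners) has zero impurity; a 50/50 mix is maximally impure.
print(gini_impurity([1.0, 0.0]))  # 0.0
print(gini_impurity([0.5, 0.5]))  # 0.5
```

A feature's impurity-based importance is just the total reduction in this quantity (or in variance, for regression trees) attributable to its splits.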
To add some detail to this discussion, here is an example. The following results were obtained with this setup:
universe = SP 1500
period = max
AI Factors used = Core Combo
Core Combo has 65 features. After running the Target Information Regression it showed Gini = 0.3701 and Moro = 0.3833
I then used Extra Trees II with K-Fold (5 folds) and got:
Next I saved as new model, only imported the top 2/3rds of the features (so now 44 features) as defined by the regression and repeated everything.
Now both Gini and Moro have declined significantly to 0.2276 and 0.2114.
However, the returns have improved slightly:
What does all this mean? I'm really not sure.....
When removing features, I would remove them one by one; otherwise you really would not know if you let a good one go by deleting in bulk, at least until we have SHAP numbers. In other words, you would have no proper case-by-case comparison in your experiment. Another thing that can complicate matters is that each training differs by chance, so models in theory might not be exactly the same even with the same settings, unlike traditional models without many equal ranks. Right now I want to make sure I understand everything perfectly before I start running trainings, but in theory these things should hold.
Portfoliologic,
Have you looked into P123’s “Importance” metric for features? As SZ pointed out, P123’s method for measuring importance aligns with those typically used in regression models. Gini is more common for classification models. And unless I missed something, all of P123’s machine learning models are regression-based—there hasn’t been any announcement about classification models, as far as I know.
That’s a common and solid approach, I believe. One practical method is to remove the feature with the lowest importance score, then continue removing features one at a time, checking for improvements after each step.
If you’re working with downloaded data, this can be automated. Here’s an example using recursive feature elimination with cross-validation (RFECV) in scikit-learn:
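A minimal sketch on synthetic data (the sizes, seeds, and scoring choice here are illustrative; with real P123 downloads you would substitute your own X and y):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

# Synthetic stand-in for downloaded factor data: 20 features, 5 informative
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=0.5, random_state=42)

# RFECV drops the least important feature at each step and keeps the
# feature count that maximizes the cross-validated score
selector = RFECV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    step=1,
    cv=5,
    scoring="neg_mean_squared_error",
    min_features_to_select=5,
)
selector.fit(X, y)
print("features kept:", selector.n_features_)
```

`selector.support_` is a boolean mask over the columns, so you can map the kept features back to your factor names.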
Thanks for sharing, Jrinne. Do you also agree that each run is unique and might give different results even with the same settings? I have not tested this yet, but it makes me wonder whether randomness could turn a marginally additive feature into a marginally subtractive one if one goes by overall results.
SZ,
Yes, if you’re running a Random Forest in P123, there’s inherent randomness. Each tree is built on a random bootstrap sample of stocks (rows), and if max_features is set to less than 1.0, a random subset of features is used at each split. So even with the same settings, different runs can yield slightly different results.
When using scikit-learn with downloaded data, you can ensure reproducibility by setting random_state to a fixed integer—e.g., random_state=42.
More importantly, the impact of randomness tends to diminish as you increase the number of trees. The default is n_estimators=100, but if your compute resources allow, bumping this up to something like n_estimators=3000 will cause variability to asymptotically approach zero.
Personally, I run n_estimators high enough that I don’t worry about randomness—usually 500 or 1000 does the trick. For final runs, I set it as high as my machine can handle. There’s no risk of overfitting from doing this.
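Here's a quick sketch of both points with scikit-learn on synthetic data (the data and seeds are made up for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400, n_features=10, noise=1.0, random_state=0)

# Fixing random_state makes two runs identical
imp_a = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y).feature_importances_
imp_b = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y).feature_importances_
assert np.allclose(imp_a, imp_b)

# With different seeds, run-to-run spread in importances shrinks as trees increase
def max_spread(n_trees):
    runs = [RandomForestRegressor(n_estimators=n_trees, random_state=s).fit(X, y).feature_importances_
            for s in range(3)]
    return float(np.std(runs, axis=0).max())

print(max_spread(10), max_spread(300))
```

On this toy data the spread at 300 trees is a fraction of the spread at 10 trees, which is the asymptotic behavior described above.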
-Jim
I agree on the number of trees. I figured some might try “quick” runs with less intensity which could be ill advised based on the randomness (light gbm too given the subsampling hyperparameters). Thanks for confirming. Goes to show one needs to be careful when ruling out features.
Jrinne,
The Importance metric being the bar graph for each feature after running the regression? If so, yes. Thanks
Just a few comments and questions. I'm just starting with AI, but I understood that with decision-tree-based models, the feature importance doesn't reflect what might be going on under the hood, as it were, and that eliminating factors based on feature importance might not be as good as eliminating overly correlated factors. From the little I understood, decision-tree-based models can frequently focus in on little-used factors to make decisions in certain instances. Of course, I might be mistaken.
To eliminate highly correlated factors, one method is to simply take your ranking system, set all nodes equal (e.g. assign them 0 weight), and then use the "Ranks" tab with the "columns" box set to "Include Composite & Factor Ranks." Then download the spreadsheet and, in Excel, use the "Data Analysis" tool and select "Correlation." In a new sheet it'll give you a correlation table. You can then highlight the cells with the highest values and eliminate highly correlated nodes that way.
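If you'd rather skip Excel, the same correlation table can be built in pandas from the downloaded rank spreadsheet. A sketch with made-up placeholder columns (real data would come from the exported CSV):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the downloaded factor-rank spreadsheet
df = pd.DataFrame({
    "value_rank":    [10, 20, 30, 40, 50],
    "quality_rank":  [12, 18, 33, 41, 47],
    "momentum_rank": [50, 10, 40, 20, 30],
})

corr = df.corr()  # pairwise Pearson correlation of all rank columns

# Keep only the upper triangle (each pair once, no self-correlations),
# then list the pairs above a chosen threshold
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
print(pairs[pairs.abs() > 0.9])
```

The printed pairs are the candidates for elimination, the same information the highlighted Excel cells would give you.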
What I've been doing so far with AI systems is much easier, though maybe it won't work that well. I create AI factors for five different random universes using Mod(StockID, 5) = 0, 1, 2, 3, or 4 in the rules of those universes. For each AI factor I choose a random set of 170 factors out of the 300-odd that I could use. At the end, I combine the five predictors into a ranking system, giving each one 20%. I figure I'm avoiding overfitting that way. Maybe this is a terrible idea, though. Someone with more knowledge of machine learning might correct me. I use the Extra Trees algorithm, BTW.
It was my understanding that decision trees are not regression-based. Again, though, I could be totally wrong.
Hi Yuval,
Thanks for sharing your approach!
To address a few of your questions:
Random Forests, including RandomForestRegressor, can be used for either regression or classification tasks. From what I understand, Portfolio123 is currently using scikit-learn’s Random Forest Regressor under the hood.
I believe P123’s implementation of XGBoost is also used in regression mode (though the algorithm itself supports classification as well). As far as I know, P123 isn’t currently using classification models.
Regarding feature importance, there are two common approaches: impurity-based importance (mean decrease in impurity, computed during training from how much each feature's splits reduce impurity across the forest) and permutation importance (the drop in model performance when a feature's values are randomly shuffled).
Both methods help illuminate what the model is doing, though impurity-based importance is generally considered more informative.
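In scikit-learn terms, both can be computed in a few lines. A sketch on synthetic data (not P123's internal code):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=400, n_features=8, n_informative=3,
                       noise=0.5, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# 1. Impurity-based importance: computed during training, normalized to sum to 1
mdi = rf.feature_importances_

# 2. Permutation importance: drop in score when each feature is shuffled
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)

print(mdi.round(3))
print(perm.importances_mean.round(3))
```

If the two rankings disagree sharply for a feature, that's often a sign of correlated features sharing credit, which ties into the collinearity point below.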
It’s worth noting that Random Forests are relatively robust to collinearity (correlated features). Highly correlated variables may divide importance or be used interchangeably, but this typically doesn’t harm model performance — unlike in linear models, where multicollinearity can be more problematic.
In those cases, regularization methods like Ridge, LASSO, or Elastic Net are helpful, as they shrink or remove correlated features automatically.
That approach does make intuitive sense — partitioning your data like that can help reduce overfitting. That said, if you’re using tree-based models like Random Forest or Extra Trees, I’d recommend relying on bootstrapping, which is already built into these algorithms and specifically designed to reduce variance and combat overfitting.
Bootstrapping trains each tree on a random subset of the data. It essentially accomplishes what Mod() is intended to do — in a statistically grounded and far more scalable way.
In fact, you can bootstrap 500 times relatively quickly, a significant increase in the number of random universes compared to using Mod(StockID,5)!
Thanks.
Best,
Jim
I was learning about hyperparameters a few weeks back and learned that LightGBM has subsampling hyperparameters, notably "subsample" (the fraction of rows drawn for each tree) and "subsample_freq" (how often that resampling happens):
Your idea of combining models is something I will attempt later as well, since there is evidence that model ensembles work better than a single model. Usually researchers vary, say, the number of leaves or other parameters between the models.
There is also a method where you can use random forest (or in this case extra randomized trees) to pick robust (general) features and then use those in a gradient boosting algo.
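That two-stage idea can be sketched with scikit-learn (using GradientBoostingRegressor as a stand-in for LightGBM; the threshold and data sizes are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

X, y = make_regression(n_samples=400, n_features=30, n_informative=5,
                       noise=0.5, random_state=0)

# Stage 1: rank features with Extra Trees; keep those above median importance
selector = SelectFromModel(
    ExtraTreesRegressor(n_estimators=200, random_state=0),
    threshold="median",
)
X_sel = selector.fit_transform(X, y)

# Stage 2: fit a gradient-boosting model on the reduced feature set
gbm = GradientBoostingRegressor(random_state=0).fit(X_sel, y)
print("features kept:", X_sel.shape[1])
```

The randomized trees in stage 1 tend to spread importance across generally useful features, which is the robustness property this method relies on.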
If you then set "subsample_freq": 1, each tree will have its own randomized universe.
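In P123's JSON parameter format, that might look something like the following (which keys P123 currently accepts is an assumption worth confirming):

```json
{
  "n_estimators": 500,
  "subsample": 0.5,
  "subsample_freq": 1
}
```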
Thanks, SZ — I think that’s a great point and one that’s clearly supported in the literature.
When using gradient boosting with subsampling, it’s referred to as Stochastic Gradient Boosting. Here’s a link to the original 1999 paper by Friedman:
Yuval,
I hadn’t realized you were using ExtraTreesRegressor—somehow I missed that! It’s one of my favorite models as well, and I think I can replicate much of what you’re doing using the built-in parameters of the algorithm.
You mentioned creating AI factors for five random universes with Mod(StockID, 5) and choosing a random subset of features for each.
You can achieve something very similar within ExtraTreesRegressor by setting bootstrap=True (each tree trains on a random bootstrap sample of the rows) and max_features < 1.0 (a random subset of features is considered at each split).
So with just those two parameters, you’re essentially replicating your multi-universe + randomized-feature setup in a scalable and automated way. The only meaningful difference is that ExtraTrees samples features at the split level, rather than fixing a set of features for the entire model as you do. But this design allows you to scale the process to thousands of randomized trees without sacrificing performance—and, as far as I can tell, without any real downside.
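A concrete sketch of those two settings (the values here are illustrative, not a recommendation):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=500, n_features=20, noise=1.0, random_state=0)

et = ExtraTreesRegressor(
    n_estimators=500,
    bootstrap=True,    # each tree trains on a random bootstrap sample of rows
    max_features=0.5,  # each split draws from a random half of the features
    random_state=0,
).fit(X, y)

print(et.predict(X[:3]))
```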
As I said ExtraTreesRegressor is one of my favorite models. It is far more lightweight to tune than models like XGBoost or LightGBM. Fewer hyperparameters not only makes it easier to use, but also reduces the chance of overfitting.
For reference, the full set of tunable parameters is listed in the ExtraTreesRegressor documentation.
Note: Since Portfolio123 uses JSON to define model parameters, there may be subtle syntax differences—for example, using true instead of Python's True (case-sensitive). If any hyperparameter doesn't seem to be accepted, it may be worth checking with Dan to confirm the correct formatting or whether that specific parameter is currently supported.
I really like what you are doing. Maybe some of it can be automated and scaled up.
-Jim
Wow. Thanks so much, Jim. This is incredibly helpful.