Gini Index, Moro Index, Shannon Entropy

Two quick observations:

  1. Subsampling will speed things up significantly. Using max_samples=0.2 means each tree is trained on just 20% of the data. That can cut runtime by roughly a factor of 5 compared to using the full dataset.
  2. You can likely reduce max_features even further. For regression trees, max_features=0.3 is a common default (and works well in Random Forests). For classification trees, max_features="sqrt" is standard — and in my experience, "sqrt" also works well for regression in Extra Trees.

You can think of it this way:

Even though "sqrt" evaluates to just 17 features (out of 300), you’ll be training at least 100 trees. Across the full ensemble, you’ll still be using a large portion of the feature set. And if you increase the number of trees to 300 or more, you’ll likely end up using more features overall than you would by using your current method.

Also, keep in mind that max_features only limits the number of features per split, not per tree. So you’re still building trees that explore a wide range of features, just more efficiently. You’re almost certainly using more features more often than with a static manual subset.

So:

  • Reducing from 300 to 17 features per split gives roughly an 18× speedup at the split level.
  • Combined with subsampling, you could see a huge boost in training speed — possibly 100× faster, depending on your settings.
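As a back-of-envelope estimate only (real training time also depends on tree depth and how runtime scales with sample count), the two speedups multiply:

```python
# Rough, illustrative arithmetic for the combined speedup
n_features = 300
features_per_split = int(n_features ** 0.5)   # "sqrt" -> 17 candidate features
split_speedup = n_features / features_per_split  # ~17.6x fewer features per split
sample_speedup = 1 / 0.2                         # max_samples=0.2 -> ~5x fewer rows
print(round(split_speedup * sample_speedup))     # ~88x, order of magnitude only
```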

Literally, you could run this and come back in an hour instead of waiting 5 days!

In practice, you could run a grid search to tune max_samples and max_features. But if that’s too time-consuming, I’d suggest starting with:

  • max_features="sqrt"
  • A small grid search on max_samples, such as: max_samples=0.2, max_samples=0.5, and max_samples=0.7

This setup should run quickly and still perform just as well.

And when using a large number of estimators (n_estimators=300+), these changes may be essential to finishing your runs in a reasonable time.


Thanks, Jim. I'll work on this this weekend and see what happens. It sounds like a good idea!

Thanks, Yuval. I’ll be very interested in any findings you’re able to share.

Extra Trees Regressor has the promise of not needing much feature selection, since it chooses the feature that provides the most improvement at each split — and in theory, it shouldn’t use weak features very often.

It’s a really interesting experiment, especially with a large set of features. I don’t pretend to know what you’ll find, but I’m genuinely curious to see how it plays out.

Let me know if you’d like suggestions on any of the other parameters.


That failed with an error. ValueError: `max_sample` cannot be set if `bootstrap=False`. Either switch to `bootstrap=True` or set `max_sample=None`.

Sorry Yuval,

I relied too much on ChatGPT here.

Just to clarify: are you opposed to bootstrapping (i.e., sampling with replacement)? On average, a bootstrap draw includes about 63.2% (that is, 1 − 1/e) of the original samples per tree, with some observations repeated.

Setting max_samples allows further reduction of that percentage, while still using replacement — so duplicates still occur.
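A quick simulation (a minimal NumPy sketch, not P123 code) shows the expected share of unique samples in one bootstrap draw, which approaches 1 − 1/e ≈ 63.2% for large n:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
# One bootstrap sample: n draws with replacement from n observations
idx = rng.integers(0, n, size=n)
unique_frac = np.unique(idx).size / n
print(f"{unique_frac:.3f}")  # close to 1 - 1/e ≈ 0.632
```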

Historically, bootstrapping came first, supported by elegant proofs showing it can increase the effective sample size (Bootstrap Methods: Another Look at the Jackknife). Subsampling without replacement came later, mainly as a computational shortcut — not necessarily a theoretically superior method.

As for ExtraTreesRegressor (short for Extremely Randomized Trees), its “extra” randomness comes from selecting split points at random, rather than optimizing for MSE as Random Forests do — and this proves quite effective. Most users don’t enable bootstrapping or subsampling to introduce additional randomness, and for good reason: it can significantly slow down the algorithm (which has also been my experience). In my own testing (independent of ChatGPT), enabling bootstrapping hasn’t made any meaningful difference in performance. The default setting (bootstrap=False) reflects the general consensus that it’s unnecessary.
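The default is easy to verify directly in scikit-learn:

```python
from sklearn.ensemble import ExtraTreesRegressor

model = ExtraTreesRegressor()
print(model.bootstrap)     # False: by default each tree trains on the full dataset
print(model.max_features)  # 1.0 (all features) in recent scikit-learn versions
```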

That said, I double-checked and don't see an easy way to implement true subsampling (rather than bootstrapping) within P123's AI. It would be a nice feature request. Apologies again for the confusion.

To be complete without misleading anyone, I don't want to give the impression that either method is ideal for time-series or stock data: we really should be using block bootstrapping (or blocked subsampling). Here is a recent post, with a video, that is largely about block bootstrapping and some of the reasons for its superiority for stock data: Number of stocks? - #10 by Whycliffes
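For readers curious what blocked resampling looks like in code, here is a minimal sketch of a moving-block bootstrap (the block length of 20 is an arbitrary illustrative choice, and this is not a P123 feature):

```python
import numpy as np

def moving_block_bootstrap(n, block_len, rng):
    """Resample contiguous blocks with replacement until n indices are drawn."""
    n_blocks = -(-n // block_len)  # ceiling division
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    idx = np.concatenate([np.arange(s, s + block_len) for s in starts])
    return idx[:n]  # trim to exactly n observations

rng = np.random.default_rng(0)
idx = moving_block_bootstrap(250, 20, rng)  # e.g., 250 trading days, 20-day blocks
print(idx.shape)  # (250,)
```

Keeping contiguous blocks preserves the serial dependence in returns that plain (i.i.d.) bootstrapping destroys.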

-Jim

Well, I compared out-of-sample results between the two approaches.

The first approach was to divide the universe manually into five subuniverses using Mod(StockID,5) and to feed the machine a randomly selected subset of 170 out of the 370 factors in my list. I came up with five different AI factors, all using the Extra Trees II parameters. I put those in a ranking system, equally weighted.

The second approach was to run all 370 factors on the entire universe but use different parameters: Hyperparams: {"n_estimators": 400, "criterion": "squared_error", "max_depth": 8, "min_samples_split": 2, "bootstrap": true, "max_samples": 0.2, "max_features": 0.3}.

I used an unusual target for these because I was trying to come up with a system that would be good for choosing stocks to buy put options on. It is eval(future%chg_d(71)>-15,-15,future%chg_d(71)). This is the same as min(future%chg_d(71),-15), except the latter didn't work because of its N/A handling.
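In NumPy terms, with made-up return values, the target simply caps everything above −15 at −15:

```python
import numpy as np

returns = np.array([20.0, -5.0, -15.0, -30.0])  # hypothetical 71-day % changes
target = np.minimum(returns, -15.0)             # same as eval(r > -15, -15, r)
print(target)  # [-15. -15. -15. -30.]
```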

I created predictors based on 2006 to 2020 and then ran rolling screens for rank > 98 (lower is better) with a 100-day holding period over the last five years, screening every 4 weeks. In only 26 of the 62 time periods did the old five-model system have lower returns (lower returns are what I'm looking for). The average return of the five-model system was 0.31%, while the average return of the new system was –0.81%. The number of weeks with a return under –15% was 21 with the old system versus 19 with the new. The alpha of the old system was –10.75%, while that of the new one was –11.40%. So on three out of four measures the new system performed better than the old, but not by a whole lot.

I like the new system better because it’s a lot less work and a lot faster to run. So thanks, Jim, for the tips. If you think my hyperparameters are a bit off, please let me know. I don’t believe in running a whole bunch of hyperparameters to see which perform better. I would rather use the hyperparameters that make the most sense in terms of what I’m trying to do.


Thank you for presenting your approach.
With "min_samples_split": 2, your model allows a node to be split (and a predicted return to be issued) based on only 2 data points (two stocks).

Imagine you developed ET model, with {"max_depth": 3, "min_samples_split": 2} and one of your tree (estimator) has this structure:

If ROE > 10, (60 stocks)

  • FCFYield < 2% (2 stocks),
    • P/E > 10 (1 stock) pred. ret = 250% (leaf 1.1)
    • P/E <= 10 (1 stock) pred. ret = -50% (leaf 1.2)
  • FCFYield >= 2% (58 stocks), pred. ret = 10% (leaf 2)

if ROE <= 10 (40 stocks)

  • Sales Growth > 10 (20 stocks), pred. ret = 4% (leaf 3)
  • Sales Growth <= 10 (20 stocks), pred. ret = -2% (leaf 4)

By chance, the model created a nonsensical node (ROE > 10 & FCFYield < 2% & P/E > 10) that predicts a high return based on outliers in the data. This is why setting a higher min_samples_split is crucial: it acts as a guardrail that forces the model to learn from statistically meaningful groups of stocks, not random noise.


I agree with Piotr — setting min_samples_split=2 can make your model more sensitive to outliers, since a single outlier can heavily influence a split when only two data points are involved.

To help reduce the impact of outliers, I think it's worth experimenting with larger values for min_samples_split.

The cool thing about ExtraTreesRegressor is that, in my experience, it’s quite robust to a wide range of parameter settings. You’ll often find that there’s a large plateau — say, min_samples_split values anywhere from 80 to 200 — where performance doesn’t change much. So, you’re not getting good results just because you’re hyper-tuning every possible setting. I share your concern about that!

That said, I do recommend doing a simple grid search on min_samples_split. If you're using the full dataset for each tree, values between 400 and 1000 can work well. But if you're using max_samples=0.2, each tree sees only 20% of your data, so it makes sense to also try proportionally smaller values like 80 or 200. Including both lower and higher values helps ensure stability across different data scenarios.

In my experience, tuning min_samples_split gives meaningful control over how outliers affect the model, and it's worked well for me.

I’d also recommend not setting max_depth. Instead, control tree growth through a larger min_samples_split. If you leave max_depth=None (which is the default in Python), trees will continue to grow until further splits are blocked by min_samples_split. This gives you more direct and transparent control over model complexity.

Setting both max_depth and min_samples_split creates redundant constraints — and worse, can lead to unpredictable interactions. In one run, min_samples_split might halt the split; in another, max_depth might do it. That introduces too much uncontrolled randomness, in my view.
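A minimal sketch of this setup in scikit-learn (the parameter values and toy dataset are illustrative, not tuned recommendations):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=5000, n_features=20, random_state=0)

model = ExtraTreesRegressor(
    n_estimators=20,
    max_depth=None,        # default: grow until the sample-size limits stop splits
    min_samples_split=100, # a node with fewer than 100 samples is never split
    min_samples_leaf=20,   # guarantees no leaf smaller than 20 samples
    random_state=0,
    n_jobs=-1,
).fit(X, y)

# Confirm the guarantee by inspecting the smallest leaf across all trees
smallest_leaf = min(
    t.tree_.n_node_samples[t.tree_.children_left == -1].min()
    for t in model.estimators_
)
print(smallest_leaf)  # at least 20 by construction
```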


Thanks for sharing. Something worth testing would be an ensemble of, say, 3 models built using your new framework. The idea is that if one model learns a weird quirk, the other two perhaps did not, and this repeats itself across different situations.

The ensemble acts like "wisdom of crowds": combined predictions smooth out individual biases and errors, since the odds are lower that all three will make the same mistake. This would also be a more apples-to-apples comparison with the old approach, since by combining five systems you were already using an ensemble.

In other words, ensembling is potentially an additive edge on its own and therefore your improved results might be even better than initially apparent.
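As a toy sketch of the averaging idea (here three random seeds stand in for three separately built models; this is not the actual P123 workflow):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr = X[:800], X[800:], y[:800]

# Three models differing only by seed, standing in for three model variants
preds = [
    ExtraTreesRegressor(n_estimators=30, random_state=seed, n_jobs=-1)
    .fit(X_tr, y_tr)
    .predict(X_te)
    for seed in (0, 1, 2)
]
ensemble_pred = np.mean(preds, axis=0)  # simple average smooths individual quirks
print(ensemble_pred.shape)  # (200,)
```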


Very true. And if you think about it, Extra Trees Regressor is itself an ensemble of trees: 400 trees with Yuval's hyperparameters (n_estimators=400). Ensembling is one of the reasons it works so well and is universally used with tree models (and other models).

True on a very fundamental level!

@yuvaltaylor Extra Trees Regressor being an ensemble model (the usefulness SZ highlights), ALONG WITH your consistent principle of using a subset of the universe and a subset of the features, is what makes this all come together as a powerful model. It being automated is a nice bonus.


If that's the case, why is the default for Extra Trees I and Extra Trees II "min_samples_split": 2? The only reason I chose it was that I like the out-of-sample performance of Extra Trees II, so I wanted to make this new one similar. ChatGPT says this isn't a problem since the max_depth is only 8. If I'm using a universe of close to 3,000 stocks, I don't think I'd get close to that limit of 2 unless I were to increase the max_depth quite a bit, right?

I haven't tried the "grid search" but the validation results in the P123 setup don't work for me at all. I don't really care much about accuracy--what interests me is performance at the very end of the curve. The performance of the bottom bucket reflects a one-week holding period, which I couldn't care less about. I would need to run out-of-sample rolling screens on each model, which takes about an hour each. That said, your other suggestions make a lot of sense to me. Thank you.


I've noticed that ExtraTreesRegressor models often create leaf nodes containing only a single sample, even with a large dataset and a limited max_depth.

I ran a simulation to demonstrate this, using 1,000,000 random samples and the following parameters: n_estimators=50, max_depth=8, min_samples_split=2.

Summary Statistics:
Minimum of minimums across all trees: 1
Maximum of minimums across all trees: 12
Average minimum samples in leaves: 1.52

In other words, across all 50 trees, the average size of the smallest leaf was only 1.52. This confirms the model often creates leaves with just one sample.

The correct way to control this behavior is to set the min_samples_leaf parameter to your desired minimum size.

Here is the code so you can run the simulation yourself in Jupyter Lab, Google Colab, etc. It also prints the structure of the first few trees, but you can disable this by commenting out the print_tree_structure(...) line.

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.datasets import make_regression
from sklearn.tree import _tree # We need this to access the tree structure

# ==============================================================================
# NEW FUNCTION: To print the tree structure
# ==============================================================================
def print_tree_structure(decision_tree, feature_names, node_id=0, depth=0):
    """
    Recursively prints the structure of a single decision tree.
    """
    # Get the underlying tree structure
    tree = decision_tree.tree_

    # Indentation for readability
    indent = "  " * depth

    # Check if the current node is a leaf node
    if tree.children_left[node_id] == _tree.TREE_LEAF:
        # A leaf node has no further splits
        # For regression, the 'value' is the predicted output at this leaf
        predicted_value = tree.value[node_id][0][0]
        samples = tree.n_node_samples[node_id]
        print(f"{indent}Leaf: predict = {predicted_value:.2f} (samples = {samples})")
    else:
        # This is a split node (an internal node)
        # Get the feature and threshold for the split
        feature_index = tree.feature[node_id]
        feature_name = feature_names[feature_index]
        threshold = tree.threshold[node_id]
        samples = tree.n_node_samples[node_id]

        # Print the decision rule
        print(f"{indent}if {feature_name} <= {threshold:.2f}: (samples = {samples})")
        
        # Recurse down the left branch (condition is True)
        left_child_id = tree.children_left[node_id]
        print_tree_structure(decision_tree, feature_names, node_id=left_child_id, depth=depth + 1)
        
        # Print the 'else' part of the rule
        print(f"{indent}else:")
        
        # Recurse down the right branch (condition is False)
        right_child_id = tree.children_right[node_id]
        print_tree_structure(decision_tree, feature_names, node_id=right_child_id, depth=depth + 1)


# ==============================================================================
# MAIN SCRIPT
# ==============================================================================

# 1. Generate sample regression data
X, y = make_regression(n_samples=1000000, n_features=50, noise=0.1, random_state=42)

# Create human-readable feature names
feature_names = [f"Feature_{i}" for i in range(X.shape[1])]

# 2. Fit Extra Trees Regressor
print("Fitting ExtraTreesRegressor (with min_samples_leaf=1)...")
model = ExtraTreesRegressor(
    n_estimators=50,  # 50 trees, matching the summary statistics described above
    max_depth=8,
    #min_samples_leaf=8, # <-- Intentionally commented out to show the default behavior
    min_samples_split=2,
    random_state=42,
    n_jobs=-1)

model.fit(X, y)

# 3. Analyze individual trees
min_samples_per_leaf_list = []
print("\n" + "="*50)
print("ANALYZING INDIVIDUAL TREES")
print("="*50)

for i, tree in enumerate(model.estimators_):
    # Print the structure for the first 2 trees
    if i < 2:
        print(f"\n--- Structure of Tree {i} ---\n")
        print_tree_structure(tree, feature_names)

    # Get the number of samples in each leaf node
    leaf_indices = np.where(tree.tree_.children_left == -1)[0]
    samples_in_leaf = tree.tree_.n_node_samples[leaf_indices]
    min_samples = samples_in_leaf.min()
    min_samples_per_leaf_list.append(min_samples)

# 4. Summary statistics
print("\n\n" + "="*50)
print("SUMMARY STATISTICS OF LEAF SIZES")
print("="*50)
print(f"Minimum of minimums across all trees: {min(min_samples_per_leaf_list)}")
print(f"Maximum of minimums across all trees: {max(min_samples_per_leaf_list)}")
print(f"Average minimum samples in leaves: {np.mean(min_samples_per_leaf_list):.2f}")


I took your advice and ran the following parameters: {"n_estimators": 600, "criterion": "squared_error", "max_depth": null, "min_samples_leaf": 12, "min_samples_split": 24, "bootstrap": true, "max_samples": 0.2, "max_features": 0.2}. The result scored higher than the previous time. I haven't done my usual 5-year rolling-screen out-of-sample test, but I feel pretty sure that your advice was good, and I'm just going to use this model instead. So thank you!
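For anyone replicating this outside P123, that JSON maps one-to-one onto scikit-learn keyword arguments (JSON null becomes Python None, true becomes True):

```python
import json
from sklearn.ensemble import ExtraTreesRegressor

hyperparams = json.loads(
    '{"n_estimators": 600, "criterion": "squared_error", "max_depth": null, '
    '"min_samples_leaf": 12, "min_samples_split": 24, "bootstrap": true, '
    '"max_samples": 0.2, "max_features": 0.2}'
)
model = ExtraTreesRegressor(**hyperparams)
print(model.max_depth, model.bootstrap)  # None True
```

Note that max_samples is only valid here because bootstrap is true, per the earlier ValueError.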


Hi everyone, when you are setting up parameters in Extra Trees (bootstrap, for example), can you configure them through the Portfolio123 tool, or only through Python? In Portfolio123's trees I only saw these variables: estimators, criterion, squared error, max depth, min samples split.

I ran what I say below through ChatGPT; here is a link where you can see what it thinks and ask it to expand if you wish: ChatGPT - ExtraTreesRegressor tuning. I believe it got a few things wrong (e.g., monotonic_cst is available in the most recent version), but it is generally a pretty good confirmation.

Those are the main things to look at, I think. And this is one of the great advantages of ExtraTreesRegressor: relative to XGBoost, for example, there are not that many parameters to be concerned about.

Of the ones you mentioned, I would set max_depth=None in Python (in P123's JSON it is null). Instead of controlling max_depth, it is better in my opinion to set min_samples_split and also min_samples_leaf. Rather than stopping at a set number of splits, let the program keep splitting until it reaches your desired min_samples_split. Once it makes its last split (determined by min_samples_split), you do not want a super-small leaf. You can control that by setting min_samples_leaf to a reasonable number.

I believe Pitmaster said something similar about turning off max_depth above.

I believe P123 has grid search now. I have not used it in P123, but I have used it with P123 downloads and ExtraTreesRegressor. Try it yourself, perhaps, but I would suggest putting pretty large numbers into the search for min_samples_split and min_samples_leaf. The theory, at least, is that even relatively large outliers have less impact when the final leaf size is large.
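Outside of P123, a minimal grid search along those lines might look like this (the grid values and toy dataset are placeholders, not recommendations):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=2000, n_features=10, random_state=0)

# Fairly large leaf/split sizes, per the advice above; tune to your data
grid = {"min_samples_split": [50, 200], "min_samples_leaf": [10, 50]}
search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=20, random_state=0, n_jobs=-1),
    grid,
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

Note that GridSearchCV's default time-agnostic CV folds are not ideal for overlapping stock-return data; a time-aware split would be better in practice.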

You will probably have to keep criterion="squared_error" or criterion="friedman_mse". Friedman MSE is very similar to squared error in theory (it is squared error with a minor correction); in practice it will make no difference which you use.

criterion="absolute_error" would be nice to try. It almost certainly would help with outliers, but it is so slow that I have never finished a run; I just turned it off. If you have a supercomputer, you should try it!

If you set max_features=0.3, ExtraTreesRegressor will take about 70% less time to run.

If you want the program to give similar answers with each run, it is best to increase n_estimators, but this will make it run longer. n_estimators=300 is a nice sweet spot, perhaps. Maybe set n_estimators=500 or 1,000 for a final run before funding (larger numbers are always better but take more time to run).

If you have monotonically increasing features, then monotonic_cst can be tried.

n_jobs=-1 is useful if you are using your own computer. ExtraTreesRegressor can run each tree in parallel; to a large extent, how fast it runs will depend on the number of parallel processors you can dedicate to the task.

bootstrap=False, which is the default, is good. bootstrap=True is okay, but it slows things down and makes little difference.

oob_score is interesting as a cross-validation method, but it suffers from data leakage; it probably is not helpful and, worse, can give a wrong impression.

The full set of hyperparameters and defaults is here: ExtraTreesRegressor. ChatGPT will also have a good opinion on this.


Good point. Also, just to add: Extra Trees and Random Forest do not learn with sequential trees, and the randomness and lack of sequential learning in Extra Trees, for example, "lets" you get away with a much smaller minimum child size than anything you would use in LightGBM or other boosting libraries. Quite small.
