Sorry Yuval,
I relied too much on ChatGPT here.
Just to clarify — are you opposed to bootstrapping (i.e., sampling with replacement)? On average, each tree sees about 63.2% of the distinct original samples (the 1 − 1/e limit), with some observations repeated.
Setting max_samples reduces that percentage further, but the sampling is still with replacement, so duplicates still occur.
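The ~63.2% figure is easy to verify by simulation. Here is a minimal sketch (standard library only, with a made-up helper name); the limit comes from the probability that a given row is never drawn in n draws, (1 − 1/n)^n → 1/e:

```python
import random

def unique_fraction(n, trials=200, seed=0):
    """Average fraction of distinct rows in a bootstrap sample of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        draw = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        total += len(set(draw)) / n                  # fraction of rows seen at least once
    return total / trials

# For large n this approaches 1 - 1/e ≈ 0.632.
print(round(unique_fraction(10_000), 3))
```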
Historically, bootstrapping came first, backed by elegant theory (Efron's "Bootstrap Methods: Another Look at the Jackknife"). Subsampling without replacement came later, mainly as a computational shortcut, not necessarily a theoretically superior method.
As for ExtraTreesRegressor (short for Extremely Randomized Trees), its "extra" randomness comes from choosing split thresholds at random rather than searching for the best threshold (e.g., minimizing MSE) as Random Forests do — and this proves quite effective. Most users don't enable bootstrapping or subsampling to introduce additional randomness, and for good reason: it can significantly slow down the algorithm (which has also been my experience). In my own testing (independent of ChatGPT), enabling bootstrapping hasn't made any meaningful difference in performance. The default setting (bootstrap=False) reflects the general consensus that it's unnecessary.
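For reference, here is how those options look in scikit-learn (a sketch on synthetic data, not a recommendation; note that max_samples only takes effect when bootstrap=True):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 0.1 * rng.normal(size=500)  # simple learnable target

# Default: bootstrap=False, so every tree sees the full training set.
default_model = ExtraTreesRegressor(n_estimators=100, random_state=0)

# Opting in: bootstrap=True samples with replacement, and
# max_samples=0.5 draws only 50% of the rows per tree.
subsampled_model = ExtraTreesRegressor(
    n_estimators=100, bootstrap=True, max_samples=0.5, random_state=0
)

for model in (default_model, subsampled_model):
    model.fit(X, y)
    print(round(model.score(X, y), 3))  # in-sample R^2
```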
That said, I double-checked and don't see an easy way to implement true subsampling (rather than bootstrapping) within P123's AI. It would be a nice feature request. Apologies again for the confusion.
To be complete without misleading anyone: I don't want to give the impression that either method is ideal for time-series or stock data. We really should be using block bootstrapping (or blocked subsampling), which resamples contiguous stretches of data and so preserves serial correlation. Here is a recent post with a video that covers block bootstrapping and some of the reasons it is better suited to stock data: Number of stocks? - #10 by Whycliffes
-Jim