I understand that not everyone in the P123 community is drawn to technical topics like block bootstrapping. Still, I’ve noticed that some of these discussions may be reaching beyond the forum.
One of my posts now ranks third on Google for the phrase “block bootstrap sklearn random forest” — right behind Stack Exchange and Stack Overflow, and even ahead of major data science blogs and scikit-learn’s own documentation. That suggests it’s being read, bookmarked, and possibly implemented — even if engagement here is minimal.
It makes me wonder who’s really paying attention. Feedback has slowed, perhaps because the ideas have moved beyond the usual comfort zone. But that doesn’t mean they’re not being absorbed — possibly by quants, funds, or serious data-driven investors.
At this point, I’m weighing the balance between open sharing (which has helped me tremendously) and safeguarding a method that’s become sophisticated enough to carry real-world value.
My method has clearly grown beyond a few disconnected feature suggestions. And P123, thank you for adopting so many of those suggestions over the years!
To everyone who’s contributed along the way: thank you. I wouldn’t have progressed to this point without you.
The low level of feedback may simply reflect that there is, as yet, no evidence that this approach offers a significant advantage.
Also, in my experience the best nonlinear method is LightGBM, and LightGBM's "bagging" is not typical bootstrap aggregation: it samples a fraction of the rows without replacement each iteration rather than resampling with replacement.
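To make the distinction concrete, here is a minimal sketch of LightGBM's row "bagging" (synthetic data; the parameter values are illustrative, not recommendations):

```python
import numpy as np
import lightgbm as lgb

# Synthetic data purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] * 0.5 + rng.normal(size=1000)

model = lgb.LGBMRegressor(
    n_estimators=200,
    subsample=0.6,        # LightGBM alias: bagging_fraction; rows drawn WITHOUT replacement
    subsample_freq=1,     # LightGBM alias: bagging_freq; resample every iteration
    colsample_bytree=0.8, # LightGBM alias: feature_fraction; also subsample columns
)
model.fit(X, y)
```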
For those who’d like to explore further: XGBoost implements subsampling, a related idea, as part of Stochastic Gradient Boosting. Here’s a link to Friedman’s original paper, which includes cross-validation with different subsample proportions:
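In that spirit, here is a minimal sketch of cross-validating XGBoost's `subsample` proportion (synthetic data; the grid values are my own illustrative choices, not taken from the paper):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Synthetic data purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] * 0.5 + rng.normal(size=1000)

# Cross-validate over a few subsample proportions, as Friedman does in the paper.
grid = GridSearchCV(
    xgb.XGBRegressor(n_estimators=200, learning_rate=0.05),
    param_grid={"subsample": [0.3, 0.5, 0.7, 1.0]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```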
Note that this paper doesn’t cover block bootstrapping, as it assumes i.i.d. (independent and identically distributed) data — not time series or stock data, which is typically dependent. That distinction matters: most wouldn’t recommend simple bootstrapping on stock data for that reason.
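For dependent data, a block bootstrap resamples contiguous runs of observations instead of individual rows, preserving the serial dependence that a simple i.i.d. bootstrap destroys. Here is a minimal sketch of a moving-block index sampler (the function name and block length are my own illustrative choices):

```python
import numpy as np

def moving_block_bootstrap_indices(n_samples, block_size, rng):
    """Draw contiguous blocks with replacement until n_samples indices are collected."""
    n_blocks = int(np.ceil(n_samples / block_size))
    starts = rng.integers(0, n_samples - block_size + 1, size=n_blocks)
    idx = np.concatenate([np.arange(s, s + block_size) for s in starts])
    return idx[:n_samples]  # trim the overshoot from the last block

rng = np.random.default_rng(0)
idx = moving_block_bootstrap_indices(n_samples=1000, block_size=50, rng=rng)
# X_boot, y_boot = X[idx], y[idx]  # then fit any estimator on the resample
```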
It’s also worth noting that Random Forests use simple bootstrapping and bagging by default, and that bootstrapping is optional with estimators like Extra Trees Regressor.
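Concretely, in scikit-learn:

```python
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

rf = RandomForestRegressor(n_estimators=100)                       # bootstrap=True by default
et_default = ExtraTreesRegressor(n_estimators=100)                 # bootstrap=False by default
et_bagged = ExtraTreesRegressor(n_estimators=100, bootstrap=True)  # opt in explicitly
```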
One possibility is that bootstrapping (and subsampling in XGBoost) doesn’t help some users because it assumes independent data — an assumption that doesn’t hold for stock time series.
The more fundamental issue at stake here is that the bootstrap method may not be inherently superior.
Edit: My personal opinion is that the cost of data acquisition is too high in almost all existing cases, which makes it not cost-effective to go beyond basic linear approaches to the more complex methods developed in recent decades in data science and advanced applied statistics. And predictability in real-world problems often goes hand in hand with a lack of exploitability.
Even the overly complex linear methods developed after WWII (e.g., GLM, SVM, GMM, 2SLS, IV) largely amounted to overfitting and exaggerated effect sizes. And note that seemingly rigorous mathematical arguments in statistics are often used for post hoc rationalization, much like seemingly rigorous economic models that are later contradicted by empirical data.
The few exceptions are mainly in the areas of text processing (large language modelling) and image recognition (autopilot technology). In these domains, it is easy to capture large amounts of data at low cost.
Edit2: One example of this lack of practical effectiveness: in polygenic scores, the advantage of complex machine-learning algorithms over simple threshold-screening methods shows up almost exclusively between families rather than within families. That is, the complex statistical methods mostly capture correlations between genetic patterns and family background rather than genuinely useful causal relationships.
Serious question. If someone were to code what Yuval has done, taking portions of universes and then aggregating the results for the final model, what would you add or do differently?
If you wanted to automate that, what would you add or do differently that block bootstrapping does not already do?
I would like to know so I can consider using it myself. And what would you do to make it faster and parallelize it if you went for the automated version?
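For what it's worth, here is one way I could imagine automating it. This is only my reading of the idea, not Yuval's actual procedure; all names and the subsetting scheme are illustrative, and a more faithful version would draw subsets of tickers rather than raw rows. The sketch splits the data into random subsets, fits one model per subset in parallel with joblib, and averages the predictions:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.ensemble import RandomForestRegressor

def fit_on_subset(X, y, idx):
    # One model per universe subset; n_jobs=1 so joblib handles the parallelism.
    return RandomForestRegressor(n_estimators=200, n_jobs=1).fit(X[idx], y[idx])

def universe_subset_ensemble(X, y, n_subsets=5, frac=0.5, seed=0, n_jobs=-1):
    # Draw n_subsets random subsets (without replacement) and fit them in parallel.
    rng = np.random.default_rng(seed)
    n = len(y)
    subsets = [rng.choice(n, size=int(frac * n), replace=False)
               for _ in range(n_subsets)]
    return Parallel(n_jobs=n_jobs)(
        delayed(fit_on_subset)(X, y, idx) for idx in subsets)

def predict_ensemble(models, X_new):
    # Aggregate by averaging each sub-model's predictions.
    return np.mean([m.predict(X_new) for m in models], axis=0)
```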
I think some of the rest of Yuval’s good ideas could be automated too. It has already been done elsewhere in some cases.
Kind of like an independent discovery: Newton and Leibniz each arrived at calculus on their own. Not sure who was first here. And if it is a good idea, it does not matter. It's not like you can patent it.
Even if they are different things, I think Yuval is on to something. Some interesting things can be done with P123 classic, and some have been done already, whichever the case may be.
I’ll spare everyone the feature request and use the downloads. Thank you P123 for the downloads of rank data!