Improving Sklearn’s Random Forest for Stock Data with Block Bootstrapping

Random forests rely on bootstrapping to reduce variance. However, traditional bootstrapping assumes the data is stationary and independent, assumptions that don't hold for stock market data, which is inherently non-stationary and time-dependent.

The Problem with Standard Bootstrapping

Think of bootstrapping as creating multiple alternative market histories. Unfortunately, simple bootstrapping resamples individual observations at random, which destroys the sequential structure of financial data.

The Solution: Block Bootstrapping

Block bootstrapping preserves temporal dependencies by sampling contiguous blocks of data rather than individual points. This helps random forests generalize better when applied to financial data.

How to Use It

To see whether this improves your investment model, you could integrate block bootstrapping into a grid-search cross-validation for random forests (a sketch follows the implementation below).

Here’s a Python implementation of a Random Forest with Block Bootstrapping, which runs as-is:


import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def block_bootstrap(X, y, block_size=10, n_samples=None):
    """
    Perform block bootstrapping on time-series data.

    Parameters:
    - X: Feature matrix (pandas DataFrame or numpy array)
    - y: Target variable (pandas Series or numpy array)
    - block_size: Size of each bootstrap block
    - n_samples: Number of total samples to generate (default is len(y))

    Returns:
    - X_boot: Bootstrapped feature matrix
    - y_boot: Bootstrapped target variable
    """
    if n_samples is None:
        n_samples = len(y)

    # Number of complete blocks available (any remainder at the end is dropped)
    num_blocks = len(y) // block_size
    # Draw enough blocks (with replacement) to cover n_samples, then trim
    n_chosen = int(np.ceil(n_samples / block_size))
    chosen_blocks = np.random.choice(num_blocks, size=n_chosen, replace=True)

    X_boot = np.vstack([X[i * block_size:(i + 1) * block_size] for i in chosen_blocks])[:n_samples]
    y_boot = np.hstack([y[i * block_size:(i + 1) * block_size] for i in chosen_blocks])[:n_samples]

    return X_boot, y_boot

# Simulated stock data (Replace this with real stock features)
np.random.seed(42)
n_samples = 500  # Total data points (e.g., 500 trading days)
X = np.random.randn(n_samples, 5)  # 5 Features (e.g., technical indicators)
y = np.random.randn(n_samples)  # Target (e.g., stock returns)

# Perform block bootstrapping
X_boot, y_boot = block_bootstrap(X, y, block_size=10)

# Split bootstrapped data into train/test sets (note: resampled blocks can repeat across the two splits)
X_train, X_test, y_train, y_test = train_test_split(X_boot, y_boot, test_size=0.2, shuffle=False)

# Train Random Forest model
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predictions & performance
y_pred = rf.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print(f"Mean Squared Error: {mse:.4f}")

Hi Pitmaster,

I’m sure you already thought of this, but out-of-bag (OOB) validation usually doesn’t work well on financial data because the withheld sample is too similar to what’s being trained on: most implementations resample individual points, so each OOB observation sits right next to observations the trees were trained on.

With block bootstrapping, contiguous blocks of data can be held out, making it more like k-fold validation with shuffle=False in that regard.
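
Here's a minimal sketch of that block-level OOB idea, assuming the imports and the simulated X, y from the first post are in scope. The helper block_oob_score and its defaults are hypothetical, not an sklearn API:

def block_oob_score(X, y, block_size=10, n_rounds=20, seed=42):
    """Fit on block-bootstrapped samples; score on blocks never drawn (OOB)."""
    rng = np.random.default_rng(seed)
    num_blocks = len(y) // block_size
    mses = []
    for _ in range(n_rounds):
        chosen = rng.choice(num_blocks, size=num_blocks, replace=True)
        oob_blocks = np.setdiff1d(np.arange(num_blocks), chosen)  # contiguous held-out blocks
        if oob_blocks.size == 0:
            continue
        train_idx = np.concatenate([np.arange(b * block_size, (b + 1) * block_size) for b in chosen])
        oob_idx = np.concatenate([np.arange(b * block_size, (b + 1) * block_size) for b in oob_blocks])
        model = RandomForestRegressor(n_estimators=100, random_state=42)
        model.fit(X[train_idx], y[train_idx])
        mses.append(mean_squared_error(y[oob_idx], model.predict(X[oob_idx])))
    return float(np.mean(mses))

print(f"Block-OOB MSE: {block_oob_score(X, y):.4f}")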

Using many randomly drawn validation sets, rather than a fixed partition of the data, has some clear advantages over standard k-fold validation (see the technical note at the end).

I coded this without gaps for now (i.e., no embargo period separating the held-out blocks from the training blocks). I haven’t found gaps to be crucial with short target periods, but it’s a potential limitation to keep in mind. Maybe I’ll add them later.

Either way, I think OOB performance could be much better with block bootstrapping than with the usual OOB method, possibly to the point of being genuinely useful, perhaps as part of a nested time-series validation when being rigorous.

Beyond validation, I think the training data itself is better this way: resampling whole blocks gives higher variance across the bootstrap samples, which may be the real advantage, while still providing a cross-validation method that potentially equals or surpasses k-fold validation.

Technical note: each bootstrap resample draws ~63.2% of the distinct observations for training on average, leaving ~36.8% out of bag for validation, compared with the typical 20% held out in 5-fold validation. Since each iteration samples randomly from the entire dataset, it also gets broader exposure to different market conditions. The result is a larger validation set with no reduction in training-set size, which might be another key advantage.
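
A quick simulation (block count illustrative) to sanity-check those figures at the block level:

import numpy as np

# Fraction of distinct blocks that end up in a block-bootstrap resample
rng = np.random.default_rng(0)
num_blocks, trials = 50, 10_000
draws = rng.integers(0, num_blocks, size=(trials, num_blocks))
unique_frac = np.mean([np.unique(row).size for row in draws]) / num_blocks
print(f"Average fraction of unique blocks: {unique_frac:.3f}")  # ~0.636 here; approaches 1 - 1/e ~ 0.632 as the block count grows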

Jim