Maybe you can "catch a falling knife" using ML

Jrinne · July 13, 2024, 12:17pm

I found some things about this threat interesting: My factors have been underperforming badly the past month and a half. I did not want to hijack the thread with too much ML or by being too focused on what may interest me alone.

But maybe you will agree that this is interesting for those of us using small-cap models that have been underperforming recently.

Maybe IWM reverts to the mean. And maybe this data does suggest mean-reversion with some ability to predict returns using regression models (KNN here):

ML Model: KNeighborsClassifier

The data is daily adjusted close of IWM downloaded from Yahoo. com. I think I took care to remove any data leakage. Specifically, by using next close for the start of future returns and purging the train/test split but would very much appreciate any corrections to my code:


from sklearn.model_selection import TimeSeriesSplit
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import root_mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Load and preprocess the data
data = pd.read_csv('~/Desktop/knn.csv', parse_dates=['Date'])
data.set_index('Date', inplace=True)

# Calculate 3-month historical returns
historical_returns = data['IWM'].pct_change(periods=60)  # Assuming 60 trading days in 3 months

# Calculate 1-month forward returns using next close to avoid data leakage and make it tradable. 
forward_returns = (data['IWM'].shift(-21) / data['IWM'].shift(-1)) - 1

# Combine into a DataFrame
df = pd.DataFrame({
    'historical_returns': historical_returns,
    'forward_returns': forward_returns
})

# Remove any rows with NaN values
df = df.dropna()

# Sort by date to ensure chronological order
df = df.sort_index()

# Define the split point (e.g., 75% train, 20% test, 5% purge)
train_end = int(len(df) * 0.75)
test_start = train_end + 21  # 21 trading days purge period

# Split the data
X_train = df['historical_returns'].iloc[:train_end].values.reshape(-1, 1)
y_train = df['forward_returns'].iloc[:train_end].values

X_test = df['historical_returns'].iloc[test_start:].values.reshape(-1, 1)
y_test = df['forward_returns'].iloc[test_start:].values

# Define the parameter grid
param_grid = {
    'n_neighbors': [200, 300,350, 400, 450, 500],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # 1 for manhattan_distance, 2 for euclidean_distance
}

# Create the GridSearchCV object
#grid_search = GridSearchCV(KNeighborsRegressor(n_jobs = -1), param_grid, cv=5, scoring='neg_mean_squared_error')

tscv = TimeSeriesSplit(n_splits=3)  # Reduce from 5 to 3
grid_search = GridSearchCV(KNeighborsRegressor(n_jobs=-1), param_grid, cv=tscv, scoring='neg_mean_squared_error')

# Perform the grid search
grid_search.fit(X_train, y_train)

# Get the best model
best_knn = grid_search.best_estimator_

# Print the best parameters
print("Best parameters:", grid_search.best_params_)
# Make predictions
y_pred = knn.predict(X_test)

# Calculate metrics
rmse = root_mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Root mean Squared Error: {rmse}")
print(f"R-squared Score: {r2}")

# Visualize the results
plt.figure(figsize=(12, 6))
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('IWM 3-month Historical Return')
plt.ylabel('IWM 1-month Forward Return')
plt.title('KNN Regression with Single Feature (IWM Historical Returns)')
plt.legend()

# Add a smooth curve to show the general trend of predictions
X_plot = np.linspace(X_test.min(), X_test.max(), 100).reshape(-1, 1)
y_plot = knn.predict(X_plot)
plt.plot(X_plot, y_plot, color='green', label='KNN Prediction Curve')

plt.legend()
plt.show()

smian · July 13, 2024, 7:18pm

Hi Jim,

Thank you for sharing the interesting diagram. As I'm still a newbie in machine learning, I would appreciate it if you could help me interpret the diagram. Specifically, I am looking to understand how to identify significant areas on the digram that demonstrate mean reversion at work. Thank you!
-Sam

Jrinne · July 13, 2024, 7:58pm

Mainly. It is the downward sloping curve.

I did this with a linear regression which was also downward slopping but showed a lower R2 which means a little less predictive.

I mention this, however, because with a line that has a negative slope it is correct to say: “the most recent 3-month returns are inversely correlated with returns over the next month.”

This is the basic meaning of mean-reversion.

So I think it is still a good question as to whether this is meaningful.

I have a couple ways of answering that. None of them definitive. But when one talks about mean reversion, here it is in one graph.

The other thing that interest me about this is k-nearest neighbors (KNN) being similar to kernel regression. Which sounds wonky because it is.

KNN is easier. My personal question: Easier and just as good?

My answer: “With weighted distance isn’t it pretty much the same?” And a grid search of KNN hyperparameters suggests I don’t need to use weighted distance anyway. So possibly better.

Maybe more than you need for now, but KNN really is easy for anything non-linear. It seems to give good results for ranked features which come standardized already (standardization is important for KNN). It can be computer resource expensive but computers get better every day.

Anyway to get back on topic the negative slope of the prediction line means IWM is mean reverting over this text period at least.

Jrinne · July 14, 2024, 12:44pm

TL;DR: Maybe you can catch a falling knife after all.

This is a nice looking 4D-graph using 3 features (plus the target making it 4D total) that adds some perspective and information about what has happened in the past after large 3-month drawdowns of major indices. Large drawdowns for the indicies are on the far left of the graph and every single one has positive excess returns going forward!!! Not just for one anecdotal previous situation but there is a fair sample-size because of KNN. The metrics show how predictive historical price data for these indices is, in general.

The predictions using KNN:

Checked for accuracy by Claude 3. A bit of an exaggeration I think. But still, could we be close to a good market entry, sometimes, by using ML?

"Your conclusion that 'Maybe you can catch a falling knife after all' is a reasonable interpretation of the data. It suggests that, historically, there has been a tendency for returns to improve following significant drawdowns across multiple indices."

Addendum:

The metrics for predicting one-month returns for SPY were poor (i.e., a negative r2-score).

QQQ future one-month returns show a similar pattern (r2_score = 0.0050) :

Future work: I would be interest in what happens with sectors, subsectors and industries using different features.

smian · July 15, 2024, 1:09am

Now this is quite insightful. Thank you for your contributions!