Feature Selection for a linear model with a Genetic Algorithm vs Ridge Regression

My very first trial of a Genetic Algorithm (GA). I ran some code that generates sample data and compares GA feature selection against a plain regularized regression (the code below actually uses ElasticNet as the baseline):

BTW, you may need to run this first (in the Anaconda terminal, for example), since the GA feature selector is not part of scikit-learn itself: pip install sklearn-genetic-opt


import random
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn_genetic import GAFeatureSelectionCV
from sklearn.preprocessing import StandardScaler

# Generate sample data
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ElasticNet approach
elasticnet = ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=1000, random_state=42)
elasticnet.fit(X_train_scaled, y_train)
y_pred_elasticnet = elasticnet.predict(X_test_scaled)
mse_elasticnet = mean_squared_error(y_test, y_pred_elasticnet)

# Genetic Algorithm approach
evolved_estimator = GAFeatureSelectionCV(
    estimator=ElasticNet(random_state=42),
    cv=5,
    scoring="neg_mean_squared_error",
    population_size=50,
    generations=20,
    n_jobs=-1,
    verbose=True
)

# Seed both RNGs for reproducibility. Note: sklearn-genetic-opt evolves via
# DEAP, which draws from Python's random module, so np.random.seed alone
# does not pin down the GA's randomness.
random.seed(42)
np.random.seed(42)

evolved_estimator.fit(X_train_scaled, y_train)
y_pred_genetic = evolved_estimator.predict(X_test_scaled)
mse_genetic = mean_squared_error(y_test, y_pred_genetic)

print(f"ElasticNet MSE: {mse_elasticnet}")
print(f"Genetic Algorithm MSE: {mse_genetic}")
print(f"Selected features: {evolved_estimator.support_}")

Here is the printout (screenshot not reproduced here):

Conclusions: IDK. I will try P123 downloaded data later. There is clearly room for improvement, including a grid search for the ElasticNet baseline to make the comparison fair, as sketched below.
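
Something like this might do for the baseline grid search (a minimal sketch; the alpha and l1_ratio values are illustrative guesses, not tuned recommendations, and it reuses X_train_scaled and y_train from the code above):

from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Illustrative grid; these values are assumptions, not tuned recommendations
param_grid = {
    "alpha": [0.01, 0.1, 1.0, 10.0],
    "l1_ratio": [0.1, 0.5, 0.9],
}
grid = GridSearchCV(
    ElasticNet(max_iter=10000, random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train_scaled, y_train)
print("Best params:", grid.best_params_)
print("Best CV MSE:", -grid.best_score_)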

But how fast it runs may be the most important question.


Interesting. I can see that the scoring is MSE... computed over the whole sample.

Did you consider creating a custom scorer that selects features which maximize the Sortino/Sharpe ratio, based on picking the X stocks with the highest predicted return for each period...?
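
Something along these lines, perhaps (a rough sketch, with my own assumptions: X is a pandas DataFrame indexed by the rebalance date, y holds the realized forward return for each (date, stock) row, and top_n is the number of stocks held):

import numpy as np
import pandas as pd

def make_top_n_sharpe_scorer(top_n=15):
    # Callable scorer with the (estimator, X, y) signature scikit-learn accepts.
    # Assumes X is a DataFrame whose index is the rebalance date and y holds
    # the realized forward return for each (date, stock) row.
    def scorer(estimator, X, y):
        df = pd.DataFrame(
            {"pred": estimator.predict(X), "ret": np.asarray(y)}, index=X.index
        )
        # Mean realized return of the top-N predicted stocks, per date
        weekly = df.groupby(level=0, group_keys=False).apply(
            lambda g: g.nlargest(top_n, "pred")["ret"].mean()
        )
        # Sharpe-style fitness over the fold's weeks
        return weekly.mean() / (weekly.std() + 1e-12)
    return scorer

# e.g. GAFeatureSelectionCV(..., scoring=make_top_n_sharpe_scorer(15), ...)

Whether the feature-selection CV hands the scorer a DataFrame or a bare array depends on the library's internals, so the index assumption may need adapting.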

Yes. I will definitely do that. Or plan to.

I agree the result would be interesting (important). I already have the code to find "sharpe based on selecting X stocks" as well as the returns (for ML models). I can adapt the code (scorer) for this method.

Thanks!

Pitmaster,

Thank you for suggesting a custom scorer. The scorer here is the average weekly return for 15 stocks, rebalanced weekly, using a P123 download.

Here is how it worked. It converges fairly quickly, but the ML model probably still needs to be fast. Ridge regression here.

Fitness is the average weekly return over the validation period. Validation set: tscv = TimeSeriesSplit(n_splits=5, gap=1).
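
For anyone curious, the wiring is roughly like this (a sketch, not my exact notebook code; the scorer makes the same DataFrame-indexed-by-date assumption as the earlier sketch, and X_features / y_returns stand in for the P123 download):

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit
from sklearn_genetic import GAFeatureSelectionCV

def mean_weekly_return_scorer(estimator, X, y):
    # Fitness: average weekly return of the 15 highest-predicted stocks.
    # Assumes X is indexed by rebalance date and y holds realized weekly returns.
    df = pd.DataFrame(
        {"pred": estimator.predict(X), "ret": np.asarray(y)}, index=X.index
    )
    weekly = df.groupby(level=0, group_keys=False).apply(
        lambda g: g.nlargest(15, "pred")["ret"].mean()
    )
    return weekly.mean()

tscv = TimeSeriesSplit(n_splits=5, gap=1)

evolved_estimator = GAFeatureSelectionCV(
    estimator=Ridge(),
    cv=tscv,
    scoring=mean_weekly_return_scorer,
    population_size=50,
    generations=20,
    n_jobs=-1,
)
# evolved_estimator.fit(X_features, y_returns)  # hypothetical P123 data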

Jim


I think this custom scorer is too unstable on its own and would recommend using it in conjunction with other scorers.

Question to Claude 3: So, one of two things will happen with multiple runs. Either the results will be stable, with the genetic algorithm selecting the same features each time, or there will be some variance, with different features selected each time.

If it is the latter, model averaging could be used, and it may actually be a good thing that reduces variance?

Answer: "This strategy aligns with ensemble methods in machine learning, which often outperform individual models. ...This approach could indeed lead to a more robust and generalizable model, especially in a domain as complex as stock market prediction…."

It is what it is, but I am not sure which to hope for: completely deterministic solutions or some variance in the feature selection that might be utilized for model averaging.
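
If the runs do vary, one way to test the averaging idea is to repeat the GA with different seeds and pool the results. A sketch under my assumptions, reusing the synthetic data from the first post (note the library's GA draws from Python's random module via DEAP, and parallel evaluation can still introduce nondeterminism, hence n_jobs=1 here):

import random
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn_genetic import GAFeatureSelectionCV

masks, models = [], []
for seed in range(5):
    random.seed(seed)    # DEAP's GA uses Python's random module
    np.random.seed(seed)
    ga = GAFeatureSelectionCV(
        estimator=ElasticNet(random_state=42),
        cv=5,
        scoring="neg_mean_squared_error",
        population_size=50,
        generations=20,
        n_jobs=1,  # parallel workers can reintroduce run-to-run variance
        verbose=False,
    )
    ga.fit(X_train_scaled, y_train)
    mask = np.asarray(ga.support_, dtype=bool)
    masks.append(mask)
    # Refit a model on just the features this run selected
    models.append(ElasticNet(random_state=42).fit(X_train_scaled[:, mask], y_train))

# Selection frequency per feature: a quick stability check across runs
print("Selection frequency:", np.mean(masks, axis=0))

# Model averaging: mean of the per-run predictions
y_pred_avg = np.mean(
    [m.predict(X_test_scaled[:, k]) for m, k in zip(models, masks)], axis=0
)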

I continue to explore this with a Jupyter notebook running a genetic algorithm as I write this. So far there is some variance, but not a lot (with one particular set of features). And I am still not sure which to wish for, or what might be useful going forward. For all I know, I might end up adding features to create more variance across the genetic algorithm runs, to better utilize the ensemble-method approach.