ML Script Performance vs. Genetic Algorithms & CMA-ES

I've been experimenting extensively with various approaches to build a machine learning model using offline data from the "factor download" feature, but for some reason, I'm struggling to achieve strong results.

Below is my latest script, and I'm not quite sure what I'm doing that is preventing it from working well. So far, I'm getting decidedly better results, both in out-of-sample and sub-universe tests, by using a hybrid approach: Genetic Algorithms for factor selection and CMA-ES for factor weighting.

Does anyone have tips on how to use this downloaded data effectively? I also tried another script in which I ran 10,000 random simulations, collected the quick stats and year-by-year returns into a JSON file, and then applied XGBoost to that data; the formulas were selected from a database of about 3,000 criteria. Again, the returns were decent but not impressive. I feel a bit stuck, as none of the ML algorithms I've tested on the offline data have yielded compelling results yet.
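For reference, that second experiment was structured roughly as sketched below. The file name, the indicator columns, and the 'ann_return' target are placeholders for illustration, not my exact JSON layout:

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# One row per random simulation: indicator columns for which of the ~3,000
# criteria were used, plus the quick stats / year-by-year returns from P123.
sims = pd.read_json('sim_results.json')          # placeholder file name
X = sims.drop(columns=['ann_return'])            # features describing each simulation
y = sims['ann_return']                           # placeholder target: annualized return

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4)
model.fit(X_train, y_train)
print(f"R-squared on held-out simulations: {r2_score(y_test, model.predict(X_test)):.3f}")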

My experience so far has actually been that the best solution is consistently to run as many live simulations as possible directly on Portfolio123, using Genetic Algorithms combined with CMA-ES for weighting.
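For context, the GA + CMA-ES hybrid is roughly the loop sketched below. This is only an outline: backtest_score() is a hypothetical stand-in for whatever evaluation you run against Portfolio123, ALL_FACTORS is a placeholder candidate pool, and the population sizes and evaluation budgets are purely illustrative.

import random
import numpy as np
import cma  # pip install cma

# Placeholder candidate pool: swap in your real list of factors/formulas.
ALL_FACTORS = [f"factor_{i}" for i in range(3000)]
N_SELECTED = 50

def backtest_score(factors, weights):
    # Hypothetical stand-in: replace with your own Portfolio123 evaluation,
    # e.g. the out-of-sample annualized return of the weighted ranking system.
    return float(-np.sum((np.asarray(weights) - 1.0 / len(factors)) ** 2))

def optimize_weights(subset):
    """Inner loop: CMA-ES searches for the best weights for a fixed factor subset."""
    x0 = np.full(len(subset), 1.0 / len(subset))
    best_w, _ = cma.fmin2(lambda w: -backtest_score(subset, w), x0, 0.2,
                          {'maxfevals': 200, 'verbose': -9})
    return best_w, backtest_score(subset, best_w)

def mutate(parent):
    """Swap one factor for a random factor not already in the subset."""
    child = list(parent)
    pool = [f for f in ALL_FACTORS if f not in child]
    child[random.randrange(len(child))] = random.choice(pool)
    return child

# Outer loop: a tiny GA over factor subsets, each scored with its CMA-ES weights.
population = [random.sample(ALL_FACTORS, N_SELECTED) for _ in range(20)]
for generation in range(10):
    scored = sorted(population, key=lambda s: optimize_weights(s)[1], reverse=True)
    parents = scored[:5]
    population = parents + [mutate(random.choice(parents)) for _ in range(15)]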

Here is my latest ML script:

Part 1: Setup and Data Preparation

  1. Configuration: Defines key settings: the data file path, the target variable to predict (SPYFUTURE), the prediction horizon (12 weeks), and the number of factors to include in the final ranking system.

  2. Load Factor Definitions: Reads a built-in 'dictionary' that translates cryptic data column names (like Pr52W%ChgInd) into the human-readable names and full formulas used by Portfolio123.

  3. Parse Factor Definitions: Processes the factor dictionary to create efficient lookup tables. This makes it fast to convert between data names, formulas, and tags when building the final XML file.

  4. Load and Clean Data: Loads the main historical stock data from a CSV file. It immediately cleans all column names, removing special characters to make them compatible with the machine learning model.

  5. Handle Missing Values (NaN): Intelligently fills missing data gaps. It first forward-fills data for each stock, then uses the median value for that specific date across all stocks to fill any remaining holes.

  6. Reduce Noise: Improves model robustness by "clipping" the most extreme 1% of returns. This prevents rare outlier events from disproportionately influencing the model's learning process.

Part 2: The Machine Learning Process

  1. Split Data into Three Sets: Chronologically divides the data into a training set (for learning), a validation set (to prevent overfitting), and a final test set for an unbiased performance evaluation on unseen data.

  2. Define the Model: Configures a LightGBM machine learning model. The parameters are chosen to create a robust model that generalizes well, rather than one that is perfectly tuned only to past data.

  3. Train the Model: The model analyzes the training data to learn the complex patterns that link financial factors to future stock returns, using 'early stopping' to find the optimal training duration.

  4. Evaluate Performance: After training, the model's predictions on the unseen test set are evaluated. It calculates the R-squared score to measure how accurately it predicted the actual stock returns.

  5. Quantile Analysis (Reality Check): This is the most crucial test. It groups stocks into five buckets (quintiles) based on predictions and checks if the highest-predicted group actually had the highest returns.

Part 3: Building the Ranking System

  6. Identify Key Factors: Extracts the 'feature importances' from the trained model. This creates a ranked list showing which of the hundreds of financial factors were most influential in predicting performance.

  7. Select Top Factors: Based on your setting in step 1 (e.g., 50), it selects the most important factors from the ranked list. These will form the foundation of the new ranking system.

  8. Assign Weights and Direction: Calculates a weight for each selected factor based on its importance and determines the ranking direction (Higher is better vs. Lower is better) by checking its correlation with returns.

  9. Generate P123 XML File: Assembles the selected factors, their calculated weights, and their ranking directions into a perfectly formatted XML file, ready for you to copy and paste directly into Portfolio123.

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.metrics import r2_score
from pandas.tseries.offsets import DateOffset
import io
import re

# ==============================================================================
# SECTION 1: CONFIGURATION (ROBUST HIGH-YIELD MODE)
# ==============================================================================
# CHANGE THIS TO FALSE FOR A FULL, FINAL RUN
QUICK_TEST_MODE = False
QUICK_TEST_FRACTION = 0.1 

# Paths and names
FILE_PATH = 'C:/Users/mywag/Documents/YT/ML2025/200-BESTE-US-TIL-ML-MED-FUTURE-RETURN.csv'
TARGET_COLUMN = 'SPYFUTURE'
PREDICTION_HORIZON_WEEKS = 12

# ★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★
# ★  HERE YOU CAN ADJUST THE NUMBER OF FACTORS IN THE FINAL RANKING SYSTEM  ★
# ★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★
NUM_FACTORS_IN_RANKING_SYSTEM = 50 # Recommended starting point: 50-75


# ==============================================================================
# SECTION 2: FACTOR DEFINITIONS FROM PORTFOLIO123 (UNCHANGED)
# ==============================================================================
# Paste the tab-separated factor list from the "factor download" page between the
# triple quotes (the parser expects a header row with tag, name and formula columns).
factor_definitions_text = """COPY STOCK FACTOR FROM FACTOR DOWNLOAD
"""

# ==============================================================================
# SECTION 3: SCRIPT EXECUTION STARTS
# ==============================================================================
# --- STEP 3.1: PREPARING FACTOR METADATA ---
def clean_string_for_column(s):
    """Cleans strings to be used as column names."""
    return str(s).strip().replace('"', '').replace('\t', ' ')

def clean_string_for_formula(s):
    """Cleans formula strings from CSV input."""
    s = str(s).strip()
    if s.startswith('"') and s.endswith('"'):
        s = s[1:-1]
    s = s.replace('""', '"')
    return s

# --- NEW, ROBUST PARSER FOR FACTOR DEFINITIONS ---
print("Parsing factor definitions manually to avoid errors...")
lines = factor_definitions_text.strip().split('\n')
header = [h.strip() for h in lines[0].split('\t')]
parsed_data = []
for i, line in enumerate(lines[1:]):
    parts = line.split('\t', 2)
    if len(parts) == 3:
        parsed_data.append(parts)
    else:
        print(f"WARNING: Skipping line {i+2} in factor definitions, found {len(parts)} fields instead of 3.")

name_map_df = pd.DataFrame(parsed_data, columns=header)
# ----------------------------------------------------

name_mapping, original_names, formula_map = {}, {}, {}
for _, row in name_map_df.iterrows():
    tag = clean_string_for_column(row.get('tag', ''))
    name = clean_string_for_column(row.get('name', ''))
    formula = clean_string_for_formula(row.get('formula', ''))
    
    lgbm_safe_tag = re.sub(r'[^A-Za-z0-9_]+', '_', tag)
    
    name_mapping[lgbm_safe_tag] = name
    original_names[lgbm_safe_tag] = tag
    
    if formula != tag:
        formula_map[lgbm_safe_tag] = formula

print(f"{len(name_mapping)} factor names were mapped successfully.")


# --- STEP 3.2: LOADING AND PREPARING MAIN DATA ---
print(f"\nLoading data from: {FILE_PATH}")
df = pd.read_csv(FILE_PATH, header=0)
df.columns = [clean_string_for_column(col) for col in df.columns]
column_rename_map = {col: re.sub(r'[^A-Za-z0-9_]+', '_', col) for col in df.columns}
df = df.rename(columns=column_rename_map)
TARGET_COLUMN = column_rename_map.get(TARGET_COLUMN, TARGET_COLUMN)
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(by=['Ticker', 'Date']).reset_index(drop=True)
print(f"Data loaded. Original number of rows: {len(df)}")

if QUICK_TEST_MODE:
    print("\n--- QUICK TEST MODE ACTIVATED ---")
    unique_tickers = df['Ticker'].unique()
    sample_size = max(1, int(len(unique_tickers) * QUICK_TEST_FRACTION))
    sampled_tickers = np.random.choice(unique_tickers, size=sample_size, replace=False)
    df = df[df['Ticker'].isin(sampled_tickers)]
    print(f"Using a sample of {len(sampled_tickers)} stocks ({len(df)} rows).")
    
if TARGET_COLUMN not in df.columns:
    print(f"ERROR: Cannot find the target column '{TARGET_COLUMN}'. Exiting.")
    exit()
df = df.dropna(subset=[TARGET_COLUMN])

# --- Handling missing values ---
print("Handling missing values (NaN)...")
columns_to_exclude_original = ['Date', 'P123 ID', 'Ticker', 'Price', '64_1_Day_Returns']
columns_to_exclude_cleaned = [column_rename_map.get(clean_string_for_column(c), re.sub(r'[^A-Za-z0-9_]+', '_', clean_string_for_column(c))) for c in columns_to_exclude_original]
feature_columns = [col for col in df.columns if col not in columns_to_exclude_cleaned and col != TARGET_COLUMN]
df[feature_columns] = df.groupby('Ticker')[feature_columns].ffill()
date_medians = df.groupby('Date')[feature_columns].transform('median')
df[feature_columns] = df[feature_columns].fillna(date_medians)
df[feature_columns] = df[feature_columns].fillna(0)
print("Missing values have been handled.")

# --- BACK TO ROBUST: Reintroducing clipping of the target variable ---
print("Reintroducing clipping of the target variable to reduce noise from extreme outcomes.")
lower_bound = df[TARGET_COLUMN].quantile(0.01)
upper_bound = df[TARGET_COLUMN].quantile(0.99)
df[TARGET_COLUMN] = df[TARGET_COLUMN].clip(lower=lower_bound, upper=upper_bound)

X = df[feature_columns]
y = df[TARGET_COLUMN]

# --- STEP 3.3: SPLITTING DATA (BACK TO ROBUST: Reintroducing validation set) ---
print("\n--- Data Splitting (with validation set for robustness) ---")
test_split_ratio = 0.8
unique_dates = sorted(df['Date'].unique())
test_split_index = int(len(unique_dates) * test_split_ratio)
split_date = unique_dates[test_split_index]

train_val_df = df[df['Date'] < split_date]
test_df = df[df['Date'] >= split_date]

val_split_ratio = 0.8
unique_train_val_dates = sorted(train_val_df['Date'].unique())
val_split_index = int(len(unique_train_val_dates) * val_split_ratio)
val_split_date = unique_train_val_dates[val_split_index]

train_df = train_val_df[train_val_df['Date'] < val_split_date]
val_df = train_val_df[train_val_df['Date'] >= val_split_date]

X_train, y_train = train_df[feature_columns], train_df[TARGET_COLUMN]
X_val, y_val = val_df[feature_columns], val_df[TARGET_COLUMN]
X_test, y_test = test_df[feature_columns], test_df[TARGET_COLUMN]
test_info = test_df.loc[:, ['Ticker', 'Date']]

print(f"  Training set:    {len(X_train):>7} rows (before {val_split_date.date()})")
print(f"  Validation set:  {len(X_val):>7} rows (between {val_split_date.date()} and {split_date.date()})")
print(f"  Test set:         {len(X_test):>7} rows (after {split_date.date()})")

# --- STEP 3.4: TRAINING THE MODEL (BACK TO ROBUST: Balanced hyperparameters) ---
lgbm_params = {
    'objective': 'regression_l1',
    'metric': 'rmse',
    'n_estimators': 5000,
    'learning_rate': 0.02,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'lambda_l1': 0.1,
    'lambda_l2': 0.1,
    'num_leaves': 31,
    'verbose': -1,
    'n_jobs': -1,
    'seed': 42,
    'boosting_type': 'gbdt',
}
model = lgb.LGBMRegressor(**lgbm_params)
print("\nTraining a robust LightGBM model with early stopping...")
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],
          eval_metric='rmse',
          callbacks=[lgb.early_stopping(100, verbose=True)])
print("Model training complete.")

# --- STEP 3.5: EVALUATING THE MODEL ---
predictions = model.predict(X_test)
results_df = test_info.copy()
results_df['actual_return'] = y_test.values
results_df['predicted_return'] = predictions
r2 = r2_score(y_test, predictions)
print(f"\nR-squared score on the test set: {r2:.4f}")
print("\n--- Quantile Analysis (results on the test set) ---")
results_df['quantile'] = results_df.groupby('Date')['predicted_return'].transform(
    lambda x: pd.qcut(x, 5, labels=False, duplicates='drop') if len(x.dropna()) >= 5 else pd.Series([0]*len(x), index=x.index)
)
quantile_mean_returns = results_df.groupby('quantile')['actual_return'].mean().sort_index(ascending=False)
print(f"\nAverage actual return over {PREDICTION_HORIZON_WEEKS} weeks per quintile (4=best, 0=worst):")
for quantile, avg_return in quantile_mean_returns.items():
    print(f"  Quintile {int(quantile)}: {avg_return:.2f}%")
if len(quantile_mean_returns) > 1:
    spread = quantile_mean_returns.iloc[0] - quantile_mean_returns.iloc[-1]
    print(f"  Spread (best - worst): {spread:.2f}%")

# --- STEP 3.6: GENERATING THE PORTFOLIO123 RANKING SYSTEM ---
print(f"\n--- Generating robust P123 Ranking System ---")
feature_importances = pd.Series(model.feature_importances_, index=X_train.columns).sort_values(ascending=False)

top_features = feature_importances.nlargest(NUM_FACTORS_IN_RANKING_SYSTEM)
print(f"Building the system on the {len(top_features)} strongest factors.")

correlations = X_train[top_features.index].corrwith(y_train, method='spearman')
results_for_xml = pd.DataFrame({
    'importance': top_features,
    'correlation': correlations
}).dropna()

# RETAINING "BOOSTING" TO FOCUS ON RETURNS
print("Boosting the weight of momentum and sentiment factors...")
results_for_xml['boost_factor'] = 1.0
momentum_like_keywords = ['MOM', 'M-', 'PRICE', 'RET', 'RSI', 'ESTIMATE', 'REVISION', 'REC', 'SENTIMENT', 'SHORT', 'INSIDER', 'SURPRISE']
for idx in results_for_xml.index:
    p123_tag = original_names.get(idx, idx)
    if any(kw in p123_tag.upper() for kw in momentum_like_keywords):
        results_for_xml.loc[idx, 'boost_factor'] = 1.5 # Reduced boost from 2.0 to 1.5

results_for_xml['importance_boosted'] = results_for_xml['importance'] * results_for_xml['boost_factor']
results_for_xml['weight'] = (results_for_xml['importance_boosted'] / results_for_xml['importance_boosted'].sum() * 100).round(2)
results_for_xml['rank_type'] = np.where(results_for_xml['correlation'] > 0, 'Higher', 'Lower')
results_for_xml = results_for_xml.sort_values('weight', ascending=False)

# --- CORRECTED, SIMPLE, AND ROBUST XML FORMATTING ---
def format_formula_for_xml(formula_str):
    """Replaces only the necessary characters for XML to avoid errors."""
    return formula_str.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')

xml_output = '<RankingSystem RankType="Higher">\n'
xml_output += f'\t<Composite Name="ML Robust Top {len(results_for_xml)} Factors" Weight="100" RankType="Higher">\n'
for lgbm_name, row in results_for_xml.iterrows():
    p123_tag = original_names.get(lgbm_name, lgbm_name)
    formula = formula_map.get(lgbm_name, None)
    weight = row['weight']
    rank_type = row['rank_type']
    
    if weight <= 0: continue
    
    p123_tag_xml_safe = format_formula_for_xml(p123_tag)
    
    if formula:
        formula_xml_safe = format_formula_for_xml(formula)
        xml_output += f'\t\t<StockFormula Weight="{weight}" RankType="{rank_type}" Name="{p123_tag_xml_safe}" Scope="Universe">\n'
        xml_output += f'\t\t\t<Formula>{formula_xml_safe}</Formula>\n'
        xml_output += '\t\t</StockFormula>\n'
    else:
        xml_output += f'\t\t<StockFactor Weight="{weight}" RankType="{rank_type}" Scope="Universe">\n'
        xml_output += f'\t\t\t<Factor>{p123_tag_xml_safe}</Factor>\n'
        xml_output += '\t\t</StockFactor>\n'
xml_output += '\t</Composite>\n'
xml_output += '</RankingSystem>'

file_name = "ml_robust_ranking_system.xml"
with open(file_name, "w", encoding='utf-8') as f:
    f.write(xml_output)

# --- SECTION 4: PRINTING THE FINAL RESULT ---
print("\n----------------------------------------------------------------------")
print("DONE! A new, ROBUST ranking system has been generated.")
print("\nHere is the complete XML code for the ranking system:\n")
print(xml_output)
print("\n----------------------------------------------------------------------")
print(f"The file has also been saved as: '{file_name}'")
print("Good luck with your testing!")
print("----------------------------------------------------------------------")

Thank you for sharing your expansion on my genetic algorithm idea and for introducing CMA-ES to the forum. I think this is both sophisticated and powerful!

You might also consider integrating a couple of other ideas to see if they help:

  1. @benhhorvath’s thoughts on evaluation metrics
  2. @AlgoMan’s code for feature selection, which could be useful if you’d like to prune more features after running the genetic algorithm.

Reading through the description and code, what sticks out is Section 1.6, which defines outliers as 1% on each end. We may think of this as small, but true outliers are measured in single-digit basis points at best. Assuming the universe is 8,000 stocks, you are clipping the returns of 160 of them. Try 0.1% or less and see what happens.
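Concretely, that would just mean changing the quantile bounds in the clipping step of the script above, something like:

# Clip only the most extreme 0.1% of target returns on each end instead of 1%.
lower_bound = df[TARGET_COLUMN].quantile(0.001)
upper_bound = df[TARGET_COLUMN].quantile(0.999)
df[TARGET_COLUMN] = df[TARGET_COLUMN].clip(lower=lower_bound, upper=upper_bound)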

Happy Hunting,

Rich
