I think Marco is right about this: A literal validation of P123's direction

I believe the concern that many individuals are considering, and which you are rightly bringing to attention, revolves around the challenge of anticipating the performance of a particular strategy beforehand. Relying solely on pure randomness may yield outcomes similar to the one you mentioned here.

This topic has been widely discussed, encompassing considerations such as steering clear of data-mining and weighing beta against alpha. Consequently, I consistently recommend an analysis of the designer models, as they provide an ideal platform to examine the effectiveness of P123 as applied by its members. In this context, I find the evidence to be quite conclusive.

Korr and all,

First, I really do think @marco is on the right track. He understands this and is working on letting us get started on this while working on VALIDATION. Awesomely cool!!! Full stop.

I say that because I do not want the below to be misinterpreted as being critical. Rather it is an expansion (I believe) on why Marco finds validation to be a meaningful topic.

Also, Korr has a point. One could look at the designer models.

So I don’t know if people are going to think this question is overly complex or too simple. Maybe both. But here it is: “If we are going to look at the designer models then what are they, exactly, in machine learning terms?”

I submit that they constitute a validation set. THE DESIGNER MODELS CAN BE CONSIDERED A VALIDATION SET. That keeps this on topic, because what I presented above is also A VALIDATION SET.

Also on topic in that I think Marco is looking (rightly so in my opinion) to provide VALIDATION SETS within P123’s ML/AI.

So just to continue on Korr’s point about how you would use P123’s designer models: validation sets (which is what I believe the designer models’ out-of-sample results are) are meant to be used to SELECT A MODEL.

That is what you will probably do with P123’s ML/AI. You will try what P123 offers: for now, regressions, support vector machines, random forests, XGBoost and neural nets. You will try all of them and, everything else being equal (equally easy to implement, equally transparent or not so much of a black box, equally complex, etc.), you pick the model that does best on the validation set. Simple, right?

Then in machine learning you may have a HOLD-OUT SET. That may mean paper-trading for some. Or maybe you trained 2000 - 2010, validated 2010 - 2017 and 2017 - now is a hold-out test set.
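To make the three periods concrete, here is a minimal sketch of a chronological split using the example years from the paragraph above. The function name `split_by_date`, the `Date` column and the toy frame are all assumptions for illustration, not anything P123 provides.

```python
import pandas as pd

def split_by_date(df, train_end, valid_end, date_col="Date"):
    """Split rows chronologically into train / validation / hold-out sets."""
    dates = pd.to_datetime(df[date_col])
    train_end, valid_end = pd.Timestamp(train_end), pd.Timestamp(valid_end)
    train = df[dates < train_end]                           # fit models here
    valid = df[(dates >= train_end) & (dates < valid_end)]  # select a model here
    holdout = df[dates >= valid_end]                        # look at this once, at the end
    return train, valid, holdout

# Toy frame (hypothetical): one row per year, 2000 through 2023
toy = pd.DataFrame({"Date": [f"{year}-06-30" for year in range(2000, 2024)]})
toy["ret"] = 0.0  # placeholder column standing in for real features/targets

train, valid, holdout = split_by_date(toy, "2010-01-01", "2017-01-01")
print(len(train), len(valid), len(holdout))
```

The only design point is that the hold-out rows are never touched while choosing between models.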

So this is probably clearer put in terms of the designer models. You have some out-of-sample results that I have called a validation set. If you are a hardcore machine learner (and everything else is equal with the models like no opinion on the designer etc) you then select the 5 best models. The 5 models with the best validation set. Again, all else being equal that is what validation sets are for.

Then stick with just those 5 models—never finding some retrospective reason to change those 5. And see how those do. This is now your hold-out test set. Meant to represent how you might do after selecting the 5 best designer models going forward.

Without a doubt Yuval’s discussion of regression toward the mean (mine too as I agree) will be important. The 5 will not—in general—do as well in the hold-out test as they did in the validation set. This is also an example of the multiple comparison problem. It is a statistical law and hard to fight. And I believe it is a law of nature and therefore impossible to fight.
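A small simulation can illustrate the point. This is only a sketch under a deliberately pessimistic assumption: 100 hypothetical models with no skill at all, whose weekly returns are pure noise. Picking the 5 best on the validation period still makes them look good there, and the hold-out period shows the regression toward the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 100 models, 52 validation weeks, 52 hold-out weeks,
# and NO true edge for any model (mean weekly return of zero).
n_models, n_valid, n_hold = 100, 52, 52
valid = rng.normal(0.0, 0.02, size=(n_models, n_valid))
hold = rng.normal(0.0, 0.02, size=(n_models, n_hold))

# Select the 5 models with the best average validation return.
top5 = np.argsort(valid.mean(axis=1))[-5:]

valid_top5 = valid[top5].mean()  # looks impressive because of selection
hold_top5 = hold[top5].mean()    # the same 5 models, going forward

print(f"top-5 mean weekly return, validation: {valid_top5:.4f}")
print(f"top-5 mean weekly return, hold-out:   {hold_top5:.4f}")
```

With real designer models there may be genuine skill as well, so the drop would be partial rather than total; the simulation only shows how much of a validation result selection alone can manufacture.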

But after you do this you will have some idea of what to expect if you invest in the top 5 designer models. A real idea of what your retirement might look like.

Maybe you run a Monte Carlo simulation on the hold-out test set. Plan on the 2-bedroom beach condo based on the lower bound of the Monte Carlo simulation, and dream of the 10-bedroom mansion that the upper bound suggests is at least possible.
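One simple way to do that is a bootstrap: resample the hold-out weekly returns with replacement and compound them over the planning horizon. This is a sketch only. The function name, the 10-year horizon and the made-up return series are assumptions, and an i.i.d. bootstrap ignores any autocorrelation or regime changes in the real returns.

```python
import numpy as np

def bootstrap_terminal_wealth(weekly_returns, n_weeks, n_sims=2000, seed=0):
    """Resample weekly returns with replacement and compound each path."""
    rng = np.random.default_rng(seed)
    r = np.asarray(weekly_returns, dtype=float)
    draws = rng.choice(r, size=(n_sims, n_weeks), replace=True)
    return np.prod(1.0 + draws, axis=1)  # growth of 1 unit of capital

# Hypothetical hold-out record: one year of weekly returns (made-up numbers)
holdout_returns = np.random.default_rng(1).normal(0.003, 0.03, size=52)

wealth = bootstrap_terminal_wealth(holdout_returns, n_weeks=52 * 10)
low, mid, high = np.percentile(wealth, [5, 50, 95])
print(f"5th pct: {low:.2f}x  median: {mid:.2f}x  95th pct: {high:.2f}x")
```

The 5th percentile is the "condo" number and the 95th is the "mansion" number.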

Anyway, I apologize for the length. But I really do believe @Marco gets this and he has been “validated” in every sense of the word. I am being supportive and I think this is an important concept that everyone using ML/AI with P123 will want to understand.

In my case at least: now for the hold-out test set. I am doing pretty well one year out, BTW. These are the only stock models I have run for the past year, so there is no survivorship bias. Clearly not statistically significant, but positive evidence. Median alpha for these is 17. The median annualized return is slightly higher at 17.9 (no coincidence, I would guess).

Note, my ports have had multiple changes, starting at 30 stocks, then 15, now 20, and soon to go to 15 based on some of the validation sets above. They have also used different (but similar) ML models. So if what I have done with my ports is a hold-out test set, it does not accurately reflect any single validation set.

TL;DR: The evidence I have does suggest that Marco is indeed on the right track, and there is no reason to think a skilled programmer could not do better than I have with this. My second point is that the designer models could be considered a validation set. And finally, a hold-out test set is the only way to get a true idea of how a model is likely to perform.



A very detailed response to Korr.

I also noticed that a lot of the designer models are not performing very well. Maybe they can also be validated with the AI/ML rollout.

I hope we can see better-performing designer models in the future (and without them being deleted and replaced with new ones within a short time due to underperformance).


Better-performing DMs would be a nice test of the ML approach. Hopefully, in a couple of years we’ll know the answer. It could become a powerful selling point for P123. My concern is that w/o a methodology, these ML tools could be less effective than expected.

It’s also a bit unfortunate that DMs are just about the only public-facing displays of p123 capabilities. My private models are doing much better. Unfortunately, b/c of liquidity constraints they can’t be shared.

Walter, that reminds me I sent you a question re: your designer models via the forum’s direct messaging a couple weeks ago, but I don’t think you saw it. Please check your forum inbox when you get a chance. Thanks!

I’m eagerly anticipating any updates on the expected release date. The prospect of employing AI/ML to assist in model building is very exciting. A big thank you to Marco and the team for facilitating the download of normalized features, which is greatly appreciated.

Hi Jim,

I’m trying to repeat your test, but I’m struggling a bit as I’m just starting with AI/ML models.
I’m using LightGBM with regression and then directly selecting the stocks from the predicted target.
My question: could you explain a bit which kind of targets you are using?
It looks like 5-, 10- and 20-day returns do not work; Sharpe ratio seems a bit better.
Second, how many days into the future are you using for the target? 5-10 days, or much longer?


Hi Carsten,

I use one-week excess returns, meaning excess returns relative to the universe. Nothing else has worked for me. For example, raw returns without subtracting out the universe returns have not worked. I get no results with them, possibly because there is just too much market noise to find the signal. Excess returns relative to a benchmark do not work for me either (even with a pretty correlated benchmark). I have my doubts about using a cap-weighted benchmark like the Russell 3000 for a Russell 3000 universe. I only tried that once, without definitive results.

If you download using DataMiner you may have this column-head: “Future 1wkRet”

[I apologize for the length, but I am not sure of Carsten’s Python level.]

This code will give you a column with excess returns:

import pandas as pd

# Read the CSV file
df = pd.read_csv('~/Desktop/DataMiner/DM8.csv')

# Ensure that the "Date" column is a datetime object
df['Date'] = pd.to_datetime(df['Date'])

# Calculate the mean returns for each date
mean_returns = df.groupby('Date')['Future 1wkRet'].mean()

# Subtract the mean returns for each date from the individual returns
df['ExcessReturn'] = df.groupby('Date')['Future 1wkRet'].transform(lambda x: x - x.mean())

# Now, df['ExcessReturn'] contains the excess returns for each ticker and date

It is possible you are experiencing other problems but if you are not using excess returns relative to the universe I would not expect it to work for you based on my experience.

You can also email me.



On a separate topic, I would have expected boosting (e.g., LightGBM) to be one of the best models, and I can get it to work.

But surprisingly (or at least surprisingly to me), ExtraTreesRegressor has performed better for me.

The Extra-Trees regressor is more sensitive to feature selection. You have to have good features. But if you do, it runs much faster, and there are probably only two hyperparameters to adjust (min_samples_split and max_features).

For boosting or an Extra-Trees regressor you should keep “min_samples_split” way higher than you are used to (like 3000), and just use max_features='sqrt' if you try the Extra-Trees regressor.

I do think boosting will work for you if you want to stick with it, and it will select features at each split, so it might be better for you (depending on the features you are using). But you might consider cross-validating a few methods to see what works for you and your features. Extra-Trees or a random forest may take a change in just one line of code to try: for example, commenting out the LightGBM regressor line and replacing it with regressor = ExtraTreesRegressor(min_samples_split=2000, max_features='sqrt').
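As a rough sketch of that one-line swap with scikit-learn: the data below is synthetic (a weak signal buried in noise, which is an assumption standing in for real factor data), and `n_estimators=50` is only there to keep the example quick.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Synthetic stand-in for a feature matrix and an excess-return target
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8))
y = 0.1 * X[:, 0] + rng.normal(size=5000)

# Swap models by changing which of these two lines is commented out.
# regressor = RandomForestRegressor(n_estimators=50, min_samples_split=2000,
#                                   max_features="sqrt", random_state=0)
regressor = ExtraTreesRegressor(n_estimators=50, min_samples_split=2000,
                                max_features="sqrt", random_state=0)

regressor.fit(X, y)
pred = regressor.predict(X)
print(pred[:5])
```

Everything downstream (fit, predict, scoring) stays the same, which is what makes it cheap to cross-validate a few methods against each other.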


My approach is similar to Jrinne.
The only difference is that I use median to avoid outliers, and then I zscore excess returns for each period. I also use 4weeks returns for training and shorter/longer period for validation. I use only linear models with non-linear transformations of predictors to allow for non-linearity between predictors and the target but keeping full interpretability of the final model.
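For anyone who wants to try this, here is one possible reading of the description above, not Pitmaster's actual code: centre each period's returns on the cross-sectional median, then divide by that period's cross-sectional standard deviation. The function name and column names are assumptions (the column names follow the DataMiner example earlier in the thread).

```python
import pandas as pd

def scaled_median_excess(df, ret_col="Future 1wkRet", date_col="Date"):
    """Median-centred excess return, scaled by its per-date standard deviation."""
    median = df.groupby(date_col)[ret_col].transform("median")
    excess = df[ret_col] - median
    std = excess.groupby(df[date_col]).transform("std")
    return excess / std

# Toy data (made-up numbers): two dates, three tickers each
toy = pd.DataFrame({
    "Date": ["2023-01-06"] * 3 + ["2023-01-13"] * 3,
    "Ticker": ["A", "B", "C"] * 2,
    "Future 1wkRet": [0.01, 0.02, 0.06, -0.03, 0.00, 0.01],
})
toy["Target"] = scaled_median_excess(toy)
print(toy)
```

Note that if the z-score step also subtracted the per-date mean, the median centring would cancel out, which is why this sketch only rescales.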


Edit: I meant to say I like linear models or just give Pitmaster a heart. And perhaps that I would be interested in his “non-linear transformations.”

Hi Jim,

Great, thanks. The excess returns work much better, but I’m still not there.
I’m just programming more factors (with the help of ChatGPT :wink:) to see if it improves. It takes a bit more time than anticipated; I will let you know.

@Jrinne @pitmaster you are surely aware that you are taking the mean or median of the universe’s returns? Theoretically it should be: add 1 to all returns, take the product of all of them, and then subtract 1 to get the universe return. Why do you use:
mean_returns = df.groupby('Date')['Future 1wkRet'].mean() ???
In my case I have a multi-index of date and ticker, so this is what I’m using:
combined_data['1d_uni_ret'] = combined_data.groupby('date')['close'].transform(
    lambda x: (1 + x.pct_change(1)).prod() - 1
)

@pitmaster interesting: z-scoring excess returns for each period.

Is there some way to share my notebooks with you?


Indeed, you would do something like that to get the log returns. For one week the difference is not great but it is theoretically more correct to use log returns (or excess log returns) as you say, I think. And really no disadvantages other than a little math.

I agree. Good point.
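For readers following along, here is a small sketch of how the two operations relate, using made-up numbers and assuming returns are stored as fractions (0.02 = 2%). Averaging across tickers within one date gives an equal-weighted universe return for that week; compounding (or summing log returns) is what links several weeks together.

```python
import numpy as np
import pandas as pd

# Toy panel (hypothetical): two weeks, three tickers
weekly = pd.DataFrame({
    "Date": ["2023-01-06"] * 3 + ["2023-01-13"] * 3,
    "Ticker": list("ABCABC"),
    "Future 1wkRet": [0.02, -0.01, 0.05, 0.00, 0.03, -0.02],
})

# Within a single date: equal-weighted universe return across tickers
universe = weekly.groupby("Date")["Future 1wkRet"].mean()

# Across dates: compound the weekly universe returns...
compounded = (1.0 + universe).prod() - 1.0

# ...which is the same as summing log returns and converting back
log_sum = np.log1p(universe).sum()

print(universe)
print(compounded, np.expm1(log_sum))
```

For a single week the log return and the simple return are close, which matches the comment above that the difference is not great at a one-week horizon.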



You mentioned that you were selecting the best X factors to go forward. Maybe you are using something like this to select the most important parameters:

model = lgb.train(best_params, lgb.Dataset(X_train, label=y_train), 100)

def get_feature_importance(model, importance_type='split'):
    fi = pd.Series(model.feature_importance(importance_type=importance_type), index=model.feature_name())
    return fi / fi.sum()

feature_importance = (get_feature_importance(model).to_frame('Split')
                      .join(get_feature_importance(model, 'gain').to_frame('Gain')))

and then use Split or Gain, or a sum of them.

But should it be the same to use the final predicted target?
Or is there too much noise in the prediction, so that the top X most important factors are cleaner?



I am not sure I fully understand your question. But for a random forest and I think for boosting I would use recursive feature elimination sklearn.feature_selection.RFE if I had the computing power (which I think I do but it would not run quickly).

So Python (for sure with random forests, as I have done it, and ChatGPT says for LightGBM also) will compute feature importances and remove one feature at a time (starting with the least important feature from the feature importances output). It will do this for linear models also (removing the feature with the smallest coefficient).
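A minimal sketch of that procedure with scikit-learn's RFE, on synthetic data where only two of six features matter (the data and the choice of `LinearRegression` are assumptions to keep it fast; a random forest can be passed as the estimator in the same way, since RFE works with any estimator that exposes `coef_` or `feature_importances_`).

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: only features 0 and 3 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=500)

# step=1 removes one feature per round, least important first
selector = RFE(LinearRegression(), n_features_to_select=2, step=1)
selector.fit(X, y)

kept = np.flatnonzero(selector.support_)
print("kept feature indices:", kept)
print("elimination ranking (1 = kept):", selector.ranking_)
```

With real factor data the number of features to keep is itself something to cross-validate (scikit-learn's `RFECV` does that), which is where the computing time goes.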

I am usually pretty selective in which features I use (screening them ahead of time). I have only done this recently with linear models, and I have not found that any of my features needed to be eliminated (using RFE).

One could also use LASSO regression for linear models.
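A short sketch of the LASSO alternative, again on synthetic data where only two of six features matter. The `alpha=0.1` penalty is an arbitrary illustrative value; in practice it would be tuned by cross-validation (for example with `LassoCV`).

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only features 0 and 3 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=500)

# The L1 penalty shrinks weak coefficients all the way to zero
lasso = Lasso(alpha=0.1).fit(X, y)

selected = np.flatnonzero(lasso.coef_)
print("coefficients:", np.round(lasso.coef_, 3))
print("features with non-zero coefficients:", selected)
```

Unlike RFE, the selection happens in a single fit, at the cost of also shrinking the coefficients of the features that are kept.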

Sorry, if this is not a direct answer to your question. But before funding a random forest I am sure I would see if it runs on my computer and even pay for faster Colab to do RFE.


Hi Jim,

Yes, that answers my question, so you are doing the sorting totally differently.
I need to read that tomorrow; it is a bit late now.

So here is what I’m doing; maybe you can understand it if I post the code:
a) Build a model with BayesianOptimization, which is quite fast…
b) Then train and predict.
c) Finally, I use the stored predicted values as a single factor to select stocks.
Here is that part of the code (hopefully it’s OK to post it here).

My Spearman correlation between target and prediction is really bad, around 0.3…

Best Carsten

import numpy as np
import pandas as pd
import lightgbm as lgb
from bayes_opt import BayesianOptimization
from scipy.stats import spearmanr
from sklearn.model_selection import TimeSeriesSplit

def lgb_optimization(X_train, y_train):
    def lgb_crossval(max_depth, num_leaves, min_data_in_leaf, learning_rate, bagging_fraction):
        params = {
            'objective': 'regression',
            'metric': 'rmse',
            'verbosity': -1,
            'boosting_type': 'gbdt',
            'max_depth': int(max_depth),
            'num_leaves': int(num_leaves),  # Convert to integer
            'min_data_in_leaf': int(min_data_in_leaf),  # Convert to integer
            'learning_rate': learning_rate,
            'bagging_fraction': bagging_fraction
        }
        # Create a Dataset object inside this function
        cv_result = lgb.cv(params, lgb.Dataset(X_train, label=y_train), nfold=5, seed=42, stratified=False, metrics=['rmse'])
        # Check if 'valid rmse-mean' exists in cv_result
        if 'valid rmse-mean' not in cv_result:
            print("valid rmse-mean not found in cv_result. cv_result keys:", cv_result.keys())
            return float('-inf')

        return -np.min(cv_result['valid rmse-mean'])  # Minimize rmse

    optimizer = BayesianOptimization(lgb_crossval, {
        'max_depth': (1, 7),
        'num_leaves': (2, 2 ** 7),  # LightGBM requires num_leaves > 1; cast to int before use
        'min_data_in_leaf': (250, 1000),  # Make sure the range is in integers
        'learning_rate': (.01, .3),
        'bagging_fraction': (0.8, 1)
    }, random_state=42)

    optimizer.maximize(init_points=2, n_iter=5)
    return optimizer.max['params']

def optimize_hyperparameters(data, feature, target):
    X_optimize = data[feature]
    y_optimize = data[target]

    # Optimize hyperparameters
    best_params = lgb_optimization(X_optimize, y_optimize)

    # Convert necessary parameters to integer
    best_params['num_leaves'] = int(best_params['num_leaves'])
    best_params['max_depth'] = int(best_params['max_depth'])
    best_params['min_data_in_leaf'] = int(best_params['min_data_in_leaf'])

    return best_params

# Select the features to use for training

# Usage
optimization_size = int(len(factors_data) * 0.8)  # Use the first 80% of rows for optimization
optimization_data = factors_data.iloc[:optimization_size]
evaluation_data = factors_data.iloc[optimization_size:]

# Optimize hyperparameters
best_params = optimize_hyperparameters(optimization_data, feature, target)

def train_and_evaluate(data, feature, target, best_params, iterations=10):
    time_splits = TimeSeriesSplit(n_splits=iterations)
    predictions_list = []  # List to store predictions with indices
    spearmanr_list = []  # List to store Spearman correlations

    for train_index, test_index in time_splits.split(data):
        train = data.iloc[train_index]
        test = data.iloc[test_index]

        # Define train and test sets
        X_train = train[feature]
        y_train = train[target]
        X_test = test[feature]
        y_test = test[target]
        # Train model with best parameters
        model = lgb.train(best_params, lgb.Dataset(X_train, label=y_train), 100)

        # Make predictions and store them with the test index
        y_pred = pd.Series(model.predict(X_test), index=test.index)
        predictions_list.append(y_pred)

        # Calculate Spearman correlation for this split and store it
        correlation, _ = spearmanr(y_test, y_pred)
        spearmanr_list.append(correlation)
        print(correlation, 'Spearman correlation for this split')

    # The average Spearman correlation across all time splits
    average_spearmanr = sum(spearmanr_list) / len(spearmanr_list)
    print(round(average_spearmanr, 5), 'The average Spearman correlation across all time splits')

    # Combine all predictions into a single DataFrame
    all_predictions = pd.concat(predictions_list)
    return all_predictions

# Example usage

# Get the predictions
predictions = train_and_evaluate(factors_data, feature, target, best_params)

# Append the predictions as a new column
target_pred = target[0] + '_pred'
factors_data[target_pred] = predictions

# Save the updated DataFrame back to the HDF5 file
factors_data.to_hdf(file_path, key='factors', mode='a')


So maybe a couple of preliminaries and then one thing I would do:

  1. This is very advanced. Nice work. Full stop.

  2. I am not sure this is a bad Spearman’s rank correlation. What counts as a good number depends on the context. For rank performance you want a higher Spearman’s rank correlation, but rank performance aggregates all of the data into however many buckets you use. If you drop the buckets and look at the ranks of individual stocks (for the week) against their returns for the week, this number will go down.

I am not sure what a good Spearman’s rank correlation is for what you are doing. But I think, probably, 0.3 is not bad and maybe good. Maybe very good.
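One way to look at the stock-level number is to compute Spearman's rank correlation separately for each week and then average it, rather than pooling all weeks together. Below is a sketch with made-up data; the function and column names are assumptions, and the correlation is computed as the Pearson correlation of ranks, which is what Spearman's coefficient is.

```python
import pandas as pd

def spearman_by_date(df, pred_col, target_col, date_col="Date"):
    """Spearman rank correlation between prediction and target, per date."""
    def one_date(g):
        return g[pred_col].rank().corr(g[target_col].rank())
    return df.groupby(date_col)[[pred_col, target_col]].apply(one_date)

# Toy data: the predictions rank week 1 perfectly and week 2 exactly backwards
toy = pd.DataFrame({
    "Date": ["2023-01-06"] * 4 + ["2023-01-13"] * 4,
    "pred": [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4],
    "target": [-0.02, 0.00, 0.01, 0.05, 0.05, 0.01, 0.00, -0.02],
})

ic = spearman_by_date(toy, "pred", "target")
print(ic)
print("mean across weeks:", ic.mean())
```

The spread of the weekly values is often as informative as the mean, since it shows how consistent the ranking is from week to week.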

I think Spearman’s rank correlation can be useful for comparing and selecting models. But consider getting a sense of what your method would do in live trading with the long code that I have sent you in an email (very long).

So, if I remember correctly, this will give you Spearman’s rank correlation, correlation AND RETURNS for a k-fold validation or a test set. So, out-of-sample returns if you are using a test set.

Additionally, this does that with an embargo, so it gets pretty close to telling you how the model will do if you are using a true test set. De Prado generally gets credit for the idea of an embargo. He thinks it gets rid of (or at least helps with) the problem of information leaking from training with future data. At least it is future data where any trend (reversion to the mean, etc.) is long gone (because of the embargo period) and therefore is not “autocorrelated.”
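This is not the code from the email, just a simplified sketch of the idea: a walk-forward split over unique dates with a gap between the end of training and the start of testing, using the `gap` argument of scikit-learn's `TimeSeriesSplit`. The function name and the 2-week embargo are assumptions, and De Prado's purged k-fold version also embargoes training data that comes after a test fold, which this sketch does not cover.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

def embargoed_date_splits(dates, n_splits=3, embargo_periods=4):
    """Yield (train_dates, test_dates) with a gap of skipped dates in between."""
    unique_dates = pd.DatetimeIndex(pd.to_datetime(dates)).unique().sort_values()
    splitter = TimeSeriesSplit(n_splits=n_splits, gap=embargo_periods)
    # Split on dates, not rows, so every ticker for a date lands on the same side
    for train_idx, test_idx in splitter.split(np.arange(len(unique_dates))):
        yield unique_dates[train_idx], unique_dates[test_idx]

# Toy panel dates: 20 weekly dates, two tickers per date
panel_dates = pd.date_range("2023-01-06", periods=20, freq="W-FRI").repeat(2)

splits = list(embargoed_date_splits(panel_dates, n_splits=3, embargo_periods=2))
for train_dates, test_dates in splits:
    print(train_dates.max().date(), "->", test_dates.min().date())
```

To use the splits, select rows with `df[df["Date"].isin(train_dates)]` and `df[df["Date"].isin(test_dates)]`; the skipped weeks in between are simply never used for that fold.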

You could probably make a for-loop to make this code much shorter but with the embargo even ChatGPT had trouble with that. So I said “What the heck” and just wrote it out.

This has multiple machine learning methods. All but one are commented-out in this code. But you should have no trouble putting your own boosting method in or trying a random forest.

Finally, I think I removed my features and you will want to include a string of your own features.

I am sure there are mistakes in the code that you will have no trouble correcting. But also, considering your programming skill, you could add a few for-loops. And recursive feature elimination (if you like it). Run it for about a week on whatever system you use and have the perfect, fully optimized (but cross-validated) system and start considering your early retirement. Absolutely no joking in any of that. But it could also take a couple of weeks of playing with this before you are ready for that. Easier written than done. I have absolutely zero (zero Kelvin, absolute absence of) programming skill and have managed to get this far, so I know you can do something that works well for you pretty quickly.

Oh, and BTW, this code does use the Geometric mean (as you suggested above). You are right about that—especially compounded over multiple periods (as done in the code I emailed you).

Thank you for your interest and comments. :slightly_smiling_face: