AI Factor functions added to the reference (sorry for the delay)

Dear All,

The two main functions for accessing predictions from an AI Factor model have been added to the reference. You will find them in the reference for the screener, buy/sell rules, and ranking system rules, under Advanced Functions→AI Factor.

Click on ▶Full Description to read the full documentation, or go to the reference in screens, rankings, or buy/sell rules to see the latest version.

Sorry it took so long. Let us know what other improvements are needed to help users get started with AI Factors.

Cheers

1 Like

Hello,

I do not quite understand the difference between the two.
Can AIFactorValidation be used for a live model?

No. This function simply accesses the predictions that were stored during a cross-validation. It does nothing more than open a file in the AI backend, which is why it is very fast.

AIFactor loads a trained model from the Predictor tab, sends it data, and runs inference. Live models must use AIFactor. However, AIFactor can also be used for a limited backtest as further validation.

1 Like

Thank you for the quick answer.
Is the Predictor fixed (meaning it gives a weight to each of the factors and does not move from there), or is it moving (e.g., does it look at what happened previously and change the weights depending on conditions)?
I am at level -1 and trying to get a sense of what it does.

1 Like

The Predictor (a.k.a. a trained model that can be used for inference) does not change; its weights are fixed.

Please note that some algorithms have a random nature. They can generate models that evaluate differently even when trained on the same dataset.

In other words, a trained model is just one of many possibilities that can come out of the same training data. When cross-validating, you should make duplicates of the model to get a range of results.

P.S. We also recently discovered that the mere order in which the dataset is stored can produce different models. We're still investigating this for LightGBM. More on this soon.

1 Like

When you say to make a duplicate, do you mean re-running everything? Or just the Predictor? Or the model validation?

Either. If you make a duplicate of a model and, using the exact same dataset, you

  1. re-run validation, you might end up with different results
  2. train a new Predictor, you might get different inferences

That's just how it is for some models. We tagged them with #random so you are aware. For example, Extra Trees has the #random tag.

P.S. Currently LightGBM is not tagged as #random. Looks like we will have to add the tag.

1 Like

It will not help if the data is in a different order (specifically, when the index is different), but have you tried setting "seed" in LightGBM or "random_state" in sklearn's Random Forest to an integer value when you do not want this behavior?

[Screenshot from sklearn's Random Forest documentation]

Maybe you already know this.

BTW, many coders use 42 as the random seed placeholder (it can be any integer) because of the significance of 42 in The Hitchhiker's Guide to the Galaxy.
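For what it's worth, here is a minimal sketch of what that looks like with scikit-learn (my own toy example, not P123's code):

from sklearn.ensemble import RandomForestRegressor

# Fixing random_state makes repeated fits on the same, identically ordered
# dataset reproducible; 42 is just the conventional placeholder and any
# integer works, as long as you reuse the same one.
model = RandomForestRegressor(n_estimators=100, random_state=42)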

1 Like

LightGBM has multiple seeds to control if you want the same results on the same dataset (in the same order). Code to set all of the seeds:

import lightgbm as lgb

params = {
    'data_random_seed': 42,       # Fix the seed for data randomization
    'bagging_seed': 42,           # Seed for bagging (row subsampling)
    'feature_fraction_seed': 42,  # Seed for feature subsampling
    'seed': 42,                   # Global seed used to derive any seeds not set explicitly
}
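
As a quick sanity check (my own sketch with synthetic data, reusing the params dict above; not P123's dataset), training twice on identically ordered data should then give identical predictions:

import numpy as np

# Hypothetical synthetic data, just to illustrate reproducibility.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] + rng.normal(scale=0.1, size=1000)

# Two trainings from the same, identically ordered data with the seeds above.
booster_a = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
booster_b = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
print(np.allclose(booster_a.predict(X), booster_b.predict(X)))  # expected True with fixed seeds and identical data order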

You can also use deterministic mode to accomplish the same thing:


params = {
    'deterministic': True,  # Forces determinism
    'seed': 42,             # Sets global seed
}

Note: These settings ensure reproducibility only if the dataset order remains unchanged. LightGBM’s histogram-based methods aggregate feature values into bins. If the dataset order changes, the sequence of aggregation can lead to slight differences in split calculations due to floating-point precision.
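
The floating-point part is easy to see in isolation (a toy illustration, nothing LightGBM-specific): adding the same numbers in a different order usually produces a very slightly different total.

import random

random.seed(0)
vals = [random.random() for _ in range(100_000)]
total_a = sum(vals)
random.shuffle(vals)          # same values, different order
total_b = sum(vals)
print(total_a - total_b)      # typically a tiny nonzero difference from rounding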

Reduce Sensitivity to Dataset Order by Turning Off Bagging:

Bagging introduces randomness by sampling subsets of rows during training. Even when a bagging seed is set, there will be different randomization if the row order is not the same. While bagging improves generalization, it is not an essential feature for boosting. To disable bagging:


params = {
    'bagging_fraction': 1.0,  # Use 100% of the data
    'bagging_freq': 0,       # Do not perform bagging
}

Combined Configuration: For reproducibility with no bagging:

params = {
    'deterministic': True,  # Forces determinism
    'seed': 42,             # Sets global seed
    'bagging_fraction': 1.0,  # Use 100% of the data
    'bagging_freq': 0,       # Disable bagging
}

Note: Disabling bagging may lead to overfitting on some small datasets, as it removes a form of regularization. However, this tradeoff may be acceptable when consistency is the priority, such as during a grid search or hyperparameter tuning. Reproducibility ensures consistent results when evaluating different parameter combinations, making it easier to identify the optimal configuration.

1 Like

Thanks @Jrinne. Not sure we should use seeds to hide the fact that LightGBM is in essence random. The model it produces is but one of many, and should be treated as such.

Currently we don't have a proper way to report results of random models. We did plan to group, for example, 5 copies of the same model and then report them together so you can easily see the variation, and use the median when ranking vs. others.

1 Like

It is easiest for me to just reduce the variance rather than spend a lot of time measuring it:

For Extra Trees Regressor or Random Forests, you can reduce the variance among models by simply increasing n_estimators. The standard error can be lowered to any level you want, limited only by the amount of computing time you’re willing to invest. As n_estimators approaches infinity, the variance across multiple runs effectively goes to zero.

On a modern Mac, it’s practical to set n_estimators = 5,000 before finalizing and funding a model.

Another way to think about this: Extra Trees Regressor with the default n_estimators=100 is just averaging the results of 100 trees within its algorithm. You could run 50 separate models with n_estimators=100 and calculate the mean and variance of their outputs. However, running a single model with n_estimators=5,000 will produce the same mean in expectation, with significantly reduced variance. By increasing n_estimators further on a faster machine, you can reduce the variance to any level you desire.

The mathematical proof of this last statement is airtight. Bootstrapping, as used in Random Forests, is specifically designed to control variance. The more bootstrapping rounds you run, the better the variance is controlled. Increasing n_estimators effectively applies this principle by incorporating more bootstrapped results into the overall average. It's not a new concept that I can take credit for, but it is far simpler for me to just set n_estimators to a high number than to spend time measuring the variance.
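
A rough way to check this empirically (my own sketch with synthetic data and scikit-learn's ExtraTreesRegressor, not P123's pipeline): fit the same model several times with different seeds and see how much the predictions disagree from run to run as n_estimators grows.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=2000)

def run_to_run_spread(n_estimators, n_runs=5):
    # Train the same model n_runs times with different seeds and measure the
    # average per-sample standard deviation of the predictions across runs.
    preds = [ExtraTreesRegressor(n_estimators=n_estimators, random_state=s)
             .fit(X, y).predict(X) for s in range(n_runs)]
    return np.std(preds, axis=0).mean()

for n in (100, 1000):
    print(n, run_to_run_spread(n))  # the spread should shrink roughly like 1/sqrt(n)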

Clarification on Variance Reduction Across Models

While I believe what I said about Extra Trees Regressor and Random Forests is correct, it does not extend to LightGBM. Because boosting in LightGBM is sequential, the associative property of addition (and the reduction in the standard error of the mean with larger sample sizes) does not hold as it does for Random Forests.
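
To spell out that step (my own sketch of the standard argument, with notation not used elsewhere in the thread): for a fixed training set, the trees in a Random Forest or Extra Trees ensemble are i.i.d. given their random seeds, so the run-to-run variance of the ensemble average at any point $x$ falls like $1/n$:

$$\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} T_i(x)\right) = \frac{\sigma^2(x)}{n} \to 0 \quad \text{as } n \to \infty,$$

where $T_i(x)$ is the prediction of tree $i$ and $\sigma^2(x)$ is the per-tree variance. In boosting, each tree is fit to the residuals of the previous ones, so the $T_i$ are neither independent nor identically distributed and this bound does not carry over.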

Here’s how Claude 3 summarizes this:

"The mathematical proof using associative properties and variance of means is elegant and correct for Random Forests, as each tree is independent and identically distributed."

And the complexity of trying to apply the same logic to LightGBM:

"For LightGBM - yes, you can achieve similar variance reduction by increasing n_estimators while decreasing the learning rate proportionally. This keeps the effective learning rate constant while averaging over more trees. However, unlike Random Forests where trees are independent, LightGBM trees are sequential and dependent, so the variance reduction properties aren't identical."

In summary, I was using the associative property of addition for Extra Trees Regressor and Random Forests, which cannot be applied to LightGBM. So there is no comparable proof that increasing n_estimators reduces variance for LightGBM; at least, I am unaware of one.

Key Clarifications

  1. No Change to P123 Needed:
  • If anyone finds my earlier points about increasing n_estimators in Random Forests or Extra Trees useful, they can apply that on their own. P123 doesn't need to change a thing.
  2. Variance Measures Already Available in P123:
  • P123 already provides meaningful ways to measure variance, particularly in cross-validation results. For instance, the mean and standard deviation of validation results capture the largest source of variance: changes in market regimes.
  • These statistics are readily accessible in P123's simulation statistics and other features. For most users, simply reviewing these is likely sufficient.
  3. Optional Visual Enhancements:
  • While a box-and-whiskers plot could be a nice addition for visualizing the variance of cross-validation results and other sources of variance, it is entirely unnecessary in my view. P123 already provides excellent tools for understanding model variability.

Conclusion

P123 has an excellent product, in my opinion. My post was just one perspective on how to use its features effectively. While exploring ideas like the median and variance of LightGBM results might be interesting, P123 already offers innovative and reliable tools for dealing with variance. No feature request is necessary, in my opinion.

TL;DR: Increasing n_estimators reliably reduces variance for Random Forests and Extra Trees Regressor, which can be proven using the associative property of addition. This approach is already easily implemented with P123's AI. However, this logic does not apply to LightGBM due to its sequential nature. For LightGBM, P123 provides effective tools for measuring variability, making additional features unnecessary, in my opinion. P123's AI is already excellent and continues to evolve.

What I have seen is that turnover increases the more n_estimators you use with LightGBM.

1 Like

Interesting!

I would just add that anyone new to this with LightGBM may need to decrease the learning rate when increasing n_estimators. Maybe put both into a grid search if you try it. I have not found that more is necessarily better for LightGBM either.
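
If you want to try that, here is a minimal sketch of putting both into one grid search (my own toy example, assuming LightGBM's scikit-learn wrapper LGBMRegressor and synthetic data rather than P123's exported features):

import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = X[:, 0] + rng.normal(scale=0.5, size=2000)

# Search n_estimators and learning_rate together; more trees usually call for a smaller rate.
param_grid = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.1, 0.02, 0.01],
}
search = GridSearchCV(LGBMRegressor(random_state=42), param_grid,
                      cv=TimeSeriesSplit(n_splits=3),
                      scoring='neg_mean_squared_error')
search.fit(X, y)
print(search.best_params_)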

A new AI backend has been rolled out that injects a value for "random_state" into LightGBM models when no seeds are present in the hyperparameters. This includes the predefined LightGBM models. Prior results can be reproduced by specifying these hyperparameters: "data_random_seed": 1, "feature_fraction_seed": 2. Additionally, datasets are now sorted by StockID to eliminate the order of observations as a variable affecting training results.
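
In LightGBM params-dict form (matching the snippets earlier in this thread), the seeds needed to reproduce prior results would be:

params = {
    'data_random_seed': 1,       # reproduces results from before the update
    'feature_fraction_seed': 2,  # reproduces results from before the update
}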

All my AI models show an updated date of 1/30/25. It's making it difficult to find models, as I typically think in terms of when I worked on something. Can you roll back to the original, correct dates?

1 Like

We have rolled back the update date changes. Thanks for the heads up.

1 Like