AI Factor functions added to the reference (sorry for the delay)

Dear All,

The two main functions for accessing predictions from an AI Factor model have been added to the reference. You will find them in the reference for the screener, buy/sell rules, and ranking system rules, under Advanced Functions→AI Factor.

Click on ▶Full Description to read the full documentation, or go to the reference in screens, rankings, or buy/sell rules to see the latest version.

Sorry it took so long. Let us know what other improvements are needed to help users get started with AI Factors.

Cheers

1 Like

Hello,

I do not quite understand the difference between the two.
Can AIFactorValidation be used for a live model?

No. This function simply accesses the predictions that were stored during a cross-validation. It does nothing more than open a file in the AI backend, which is why it is very fast.

AIFactor loads a trained model from the Predictor tab, sends it data, and runs inference. Live models must use AIFactor. However, AIFactor can also be used for a limited backtest as further validation.

1 Like

Thank you for the quick answer.
Is the Predictor fixed (meaning it gives a weight to each of the factors and does not move from there), or is it moving (e.g., does it look at what happened previously and change the weights depending on conditions)?
I am at level -1 and trying to get a sense of what it does.

1 Like

The Predictor (a.k.a. a trained model that can be used for inference) does not change; its weights are fixed.

Please note that some algorithms have a random nature. They can generate models that evaluate differently even when trained on the same dataset.

In other words, a trained model is just one of many possibilities that can come out of the same training data. When cross-validating, you should make duplicates of the model to get a range of results.

P.S. We also recently discovered that the mere order in which the dataset is stored can produce different models. We're still investigating this for LightGBM. More on this soon.

1 Like

When you say to make a duplicate, do you mean re-running everything? Or just the Predictor? Or the model validation?

Either. If you make a duplicate of a model and, using the exact same dataset, you

  1. re-run validation, you might end up with different results
  2. train a new Predictor, you might get different inferences

That's just how it is for some models. We tagged them with #random so you are aware. For example, Extra Trees has the #random tag.

P.S. Currently LightGBM is not tagged as #random. Looks like we will have to add the tag.

1 Like

It will not help if the data is in a different order (specifically, when the index is different), but have you tried setting "seed" in LightGBM or "random_state" in sklearn's Random Forest to an integer value when you do not want this behavior?

[Screenshot from sklearn's Random Forest documentation]

Maybe you already know this.

BTW, many coders use 42 as the random seed placeholder (it can be any integer) because of the significance of 42 in The Hitchhiker's Guide to the Galaxy.
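For what it's worth, here is a minimal sketch of what that looks like with scikit-learn (my own toy example, not P123's code):

from sklearn.ensemble import RandomForestRegressor

# Fixing random_state makes repeated fits on the same, identically ordered
# dataset reproducible; 42 is just the conventional placeholder and any
# integer works, as long as you reuse the same one.
model = RandomForestRegressor(n_estimators=100, random_state=42)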

1 Like

LightGBM has multiple seeds to control if you want the same results on the same dataset (in the same order). Code to set all of the seeds:

import lightgbm as lgb

params = {
    'data_random_seed': 42,       # Fix the seed for data randomization
    'bagging_seed': 42,           # Seed for bagging (row subsampling)
    'feature_fraction_seed': 42,  # Seed for feature subsampling
    'seed': 42,                   # Global seed used to derive any seeds not set explicitly
}
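
As a quick sanity check (my own sketch with synthetic data, reusing the params dict above; not P123's dataset), training twice on identically ordered data should then give identical predictions:

import numpy as np

# Hypothetical synthetic data, just to illustrate reproducibility.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] + rng.normal(scale=0.1, size=1000)

# Two trainings from the same, identically ordered data with the seeds above.
booster_a = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
booster_b = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
print(np.allclose(booster_a.predict(X), booster_b.predict(X)))  # expected True with fixed seeds and identical data order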

You can also use deterministic mode to accomplish the same thing:


params = {
    'deterministic': True,  # Forces determinism
    'seed': 42,             # Sets global seed
}

Note: These settings ensure reproducibility only if the dataset order remains unchanged. LightGBM’s histogram-based methods aggregate feature values into bins. If the dataset order changes, the sequence of aggregation can lead to slight differences in split calculations due to floating-point precision.
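
The floating-point part is easy to see in isolation (a toy illustration, nothing LightGBM-specific): adding the same numbers in a different order usually produces a very slightly different total.

import random

random.seed(0)
vals = [random.random() for _ in range(100_000)]
total_a = sum(vals)
random.shuffle(vals)          # same values, different order
total_b = sum(vals)
print(total_a - total_b)      # typically a tiny nonzero difference from rounding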

Reduce Sensitivity to Dataset Order by Turning Off Bagging:

Bagging introduces randomness by sampling subsets of rows during training. Even when a bagging seed is set, there will be different randomization if the row order is not the same. While bagging improves generalization, it is not an essential feature for boosting. To disable bagging:


params = {
    'bagging_fraction': 1.0,  # Use 100% of the data
    'bagging_freq': 0,       # Do not perform bagging
}

Combined Configuration: For reproducibility with no bagging:

params = {
    'deterministic': True,  # Forces determinism
    'seed': 42,             # Sets global seed
    'bagging_fraction': 1.0,  # Use 100% of the data
    'bagging_freq': 0,       # Disable bagging
}

Note: Disabling bagging may lead to overfitting on some small datasets, as it removes a form of regularization. However, this tradeoff may be acceptable when consistency is the priority, such as during a grid search or hyperparameter tuning. Reproducibility ensures consistent results when evaluating different parameter combinations, making it easier to identify the optimal configuration.

1 Like

Thanks @Jrinne. Not sure we should use seeds to hide the fact that LightGBM is in essence random. The model it produces is but one of many, and should be treated as such.

Currently we don't have a proper way to report results of random models. We did plan to group, for example, 5 copies of the same model and then report them together so you can easily see the variation, and use the median when ranking vs. others.

1 Like

It is easiest for me to just reduce the variance rather than spend a lot of time measuring it:

For Extra Trees Regressor or Random Forests, you can reduce the variance among models by simply increasing n_estimators. The standard error can be lowered to any level you want, limited only by the amount of computing time you’re willing to invest. As n_estimators approaches infinity, the variance across multiple runs effectively goes to zero.

On a modern Mac, it’s practical to set n_estimators = 5,000 before finalizing and funding a model.

Another way to think about this: Extra Trees Regressor with the default n_estimators=100 is just averaging the results of 100 trees within its algorithm. You could run 50 separate models with n_estimators=100 and calculate the mean and variance of their outputs. However, running a single model with n_estimators=5,000 will produce the same mean in expectation, with significantly reduced variance. By increasing n_estimators further on a faster machine, you can reduce the variance to any level you desire.

The mathematical proof of this last statement is airtight. Bootstrapping, as used in Random Forests, is specifically designed to control variance. The more bootstrapping rounds you run, the better the variance is controlled. Increasing n_estimators effectively applies this principle by incorporating more bootstrapped results into the overall average. It's not a new concept that I can take credit for, but it is far simpler for me to just set n_estimators to a high number than to spend time measuring the variance.
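
A rough way to check this empirically (my own sketch with synthetic data and scikit-learn's ExtraTreesRegressor, not P123's pipeline): fit the same model several times with different seeds and see how much the predictions disagree from run to run as n_estimators grows.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=2000)

def run_to_run_spread(n_estimators, n_runs=5):
    # Train the same model n_runs times with different seeds and measure the
    # average per-sample standard deviation of the predictions across runs.
    preds = [ExtraTreesRegressor(n_estimators=n_estimators, random_state=s)
             .fit(X, y).predict(X) for s in range(n_runs)]
    return np.std(preds, axis=0).mean()

for n in (100, 1000):
    print(n, run_to_run_spread(n))  # the spread should shrink roughly like 1/sqrt(n)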

Clarification on Variance Reduction Across Models

While I believe what I said about Extra Trees Regressor and Random Forests is correct, it does not extend to LightGBM. Because boosting in LightGBM is sequential, the associative property of addition (and the reduction in the standard error of the mean with larger sample sizes) does not hold as it does for Random Forests.
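
To spell out that step (my own sketch of the standard argument, with notation not used elsewhere in the thread): for a fixed training set, the trees in a Random Forest or Extra Trees ensemble are i.i.d. given their random seeds, so the run-to-run variance of the ensemble average at any point $x$ falls like $1/n$:

$$\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} T_i(x)\right) = \frac{\sigma^2(x)}{n} \to 0 \quad \text{as } n \to \infty,$$

where $T_i(x)$ is the prediction of tree $i$ and $\sigma^2(x)$ is the per-tree variance. In boosting, each tree is fit to the residuals of the previous ones, so the $T_i$ are neither independent nor identically distributed and this bound does not carry over.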

Here’s how Claude 3 summarizes this:

"The mathematical proof using associative properties and variance of means is elegant and correct for Random Forests, as each tree is independent and identically distributed."

And the complexity of trying to apply the same logic to LightGBM:

"For LightGBM - yes, you can achieve similar variance reduction by increasing n_estimators while decreasing the learning rate proportionally. This keeps the effective learning rate constant while averaging over more trees. However, unlike Random Forests where trees are independent, LightGBM trees are sequential and dependent, so the variance reduction properties aren't identical."

In summary, I was using the associative property of addition for Extra Trees Regressor and Random Forests, which cannot be applied to LightGBM. So there is no comparable proof that increasing n_estimators reduces variance for LightGBM; at least, I am unaware of one.

Key Clarifications

  1. No Change to P123 Needed:
  • If anyone finds my earlier points about increasing n_estimators in Random Forests or Extra Trees useful, they can apply that on their own. P123 doesn't need to change a thing.
  2. Variance Measures Already Available in P123:
  • P123 already provides meaningful ways to measure variance, particularly in cross-validation results. For instance, the mean and standard deviation of validation results capture the largest source of variance: changes in market regimes.
  • These statistics are readily accessible in P123's simulation statistics and other features. For most users, simply reviewing these is likely sufficient.
  3. Optional Visual Enhancements:
  • While a box-and-whiskers plot could be a nice addition for visualizing the variance of cross-validation results and other sources of variance, it is entirely unnecessary in my view. P123 already provides excellent tools for understanding model variability.

Conclusion

P123 has an excellent product, in my opinion. My post was just one perspective on how to use its features effectively. While exploring ideas like the median and variance of LightGBM results might be interesting, P123 already offers innovative and reliable tools for dealing with variance. No feature request is necessary, in my opinion.

TL;DR: Increasing n_estimators reliably reduces variance for Random Forests and Extra Trees Regressor, which can be proven using the associative property of addition. This approach is already easily implemented with P123's AI. However, this logic does not apply to LightGBM due to its sequential nature. For LightGBM, P123 provides effective tools for measuring variability, making additional features unnecessary, in my opinion. P123's AI is already excellent and continues to evolve.

What I have seen is that turnover increases the more n_estimators you use with LightGBM.

1 Like

Interesting!

I would just add that anyone new to this with LightGBM may need to decrease the learning rate when increasing n_estimators. Maybe put both into a grid search if you try it. I have not found that more is necessarily better for LightGBM either.
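
If you want to try that, here is a minimal sketch of putting both into one grid search (my own toy example, assuming LightGBM's scikit-learn wrapper LGBMRegressor and synthetic data rather than P123's exported features):

import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = X[:, 0] + rng.normal(scale=0.5, size=2000)

# Search n_estimators and learning_rate together; more trees usually call for a smaller rate.
param_grid = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.1, 0.02, 0.01],
}
search = GridSearchCV(LGBMRegressor(random_state=42), param_grid,
                      cv=TimeSeriesSplit(n_splits=3),
                      scoring='neg_mean_squared_error')
search.fit(X, y)
print(search.best_params_)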

A new AI backend has been rolled out that injects a value for "random_state" into LightGBM models when no seeds are present in the hyperparameters. This includes the predefined LightGBM models. Prior results can be reproduced by specifying these hyperparameters: "data_random_seed": 1, "feature_fraction_seed": 2. Additionally, datasets are now sorted by StockID to eliminate the order of observations as a variable affecting training results.
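
In LightGBM params-dict form (matching the snippets earlier in this thread), the seeds needed to reproduce prior results would be:

params = {
    'data_random_seed': 1,       # reproduces results from before the update
    'feature_fraction_seed': 2,  # reproduces results from before the update
}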

All my AI models show an updated date of 1/30/25. It's making it difficult to find models, as I typically think in terms of when I worked on something. Can you roll back to the original, correct dates?

1 Like

We have rolled back the update date changes. Thanks for the heads up.

1 Like