Preview of v1.1 of our AI Factor: intelligent features, grid search, categorical & macro

Dear All,

Here's a preview of what's coming for our AI Factors. We hope to release it all next week.

Intelligent Features

Our AI Factors right now use Machine Learning (ML) algorithms with tabular data. Not exactly revolutionary, but I think we packaged it well. However, our normalization options are limited: we only do cross-sectional normalizations vs. either the entire dataset or by date.

In this new release, you will be able to add more sophisticated types of normalizations that are entirely orthogonal to the existing ones. These normalizations do additional processing before sending everything to the AI backend. In essence, we are shifting more intelligence into the features. Since our backend is relatively dumb (compared to AGI, LLMs, etc.), it makes perfect sense to give it a little help.

Here are the kinds of normalizations you will be able to do, and what we are calling them:

  1. Default Normalization: The current method, a.k.a. the Preprocessor.
  2. Normalized Ratio: Take the current value of a stock's ratio (like Pr2Sales) and compare it with historical values for the same stock. The normalization used depends on the Preprocessor setting (e.g. z-score).
  3. Regression Growth
  4. Regression Surprise
  5. Regression R²: Calculate a regression over the stock's previous values and use one, or all, of these statistics (see the sketch after this list). See the LinReg() function for more info.
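For intuition only, here is a rough Python sketch of what regression statistics computed over a stock's own trailing values could look like. The exact definitions behind the new regression features and the platform's LinReg() may differ; the scaling choices and the sample numbers below are assumptions made purely for illustration.

import numpy as np
import pandas as pd

def regression_stats(values: pd.Series):
    """Fit a least-squares line through a stock's trailing values and return
    three illustrative statistics: growth (slope scaled by the mean level),
    surprise (distance of the latest value from the fitted line, in residual
    standard deviations), and R-squared (quality of the fit)."""
    y = values.dropna().to_numpy(dtype=float)
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = slope * x + intercept
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else np.nan
    growth = slope / abs(y.mean()) if y.mean() != 0 else np.nan
    resid_sd = np.std(y - y_hat)
    surprise = (y[-1] - y_hat[-1]) / resid_sd if resid_sd > 0 else np.nan
    return growth, surprise, r2

# Example: 12 trailing quarterly sales figures for one stock (made-up numbers)
sales = pd.Series([100, 104, 103, 110, 115, 113, 120, 126, 124, 131, 138, 150])
print(regression_stats(sales))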

Below is a screenshot of the new Features page, all generated from pre-defined features without writing a single formula.

PS. These more intelligent features were always there for anyone to use, but you needed to write very complex formulas, which was error-prone and frustrating.

Grid Search

A way to easily test hundreds of hyper-parameter combinations. The models are generated automatically and are easily distinguishable. Further refinements to the Reports section will enable you to identify the best hyper-parameter combinations.

We're also adding a very helpful reference for the hyper-parameters used when creating your own models.

Screenshot of the grid search setup (we also added several pre-defined ones).
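Conceptually, a grid search simply enumerates every combination of the hyper-parameter values you list and builds one model per combination. Below is a minimal scikit-learn sketch of that enumeration; the parameter names and values are illustrative, not the product's defaults.

from sklearn.model_selection import ParameterGrid

# Illustrative hyper-parameter grid: 3 x 3 x 2 = 18 model configurations
grid = ParameterGrid({
    "n_estimators": [100, 300, 500],
    "max_depth": [4, 6, None],
    "min_samples_leaf": [50, 200],
})

# One model is trained per combination; in practice each would then be
# validated, and the results compared in a report to pick the best settings.
for model_id, params in enumerate(grid, start=1):
    print(f"model_{model_id:02d}", params)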

Categorical & Macro

We're adding support for categorical features by allowing you to turn off preprocessing. For macro series we're adding the normalizations mentioned above, which might help for stock predictions.
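For intuition on why skipping preprocessing matters for categoricals: a z-score of a sector code is meaningless, whereas a tree-based model can split on the raw code directly, or you can one-hot encode it for algorithms that expect purely numeric inputs. A tiny pandas sketch with made-up column names:

import pandas as pd

df = pd.DataFrame({
    "sector_code": [10, 10, 40, 45, 60],       # categorical: GICS-style sector codes
    "pr2sales":    [1.2, 3.4, 0.8, 5.1, 2.0],  # numeric: normalize as usual
})

# Numeric feature: cross-sectional z-score
df["pr2sales_z"] = (df["pr2sales"] - df["pr2sales"].mean()) / df["pr2sales"].std()

# Categorical feature: either pass the raw codes straight to a tree model,
# or one-hot encode them for models that need purely numeric inputs
df = pd.get_dummies(df, columns=["sector_code"], prefix="sector")
print(df)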

That's all for now.

The only other major area we have not addressed for AI Factor is better feature engineering tools.

Cheers


I'm excited about the upcoming release.

Would it be possible for part of the infrastructure to help identify and detect orthogonal signals? It seems we currently lack a way to empirically determine signal correlation—is that correct?

I'll just note that these are completely different uses of the term "orthogonal".

Perhaps Korr is thinking of something like Principal Component Regression, or Orthogonal Factor Analysis Regression, in a purely machine learning sense. I wouldn't discourage anyone from looking at those, or from using principal component analysis (which creates orthogonal vectors) to reduce dimensionality.

So I now understand that orthogonal has a software meaning too. Nice that these methods are orthogonal in that sense, if I understand correctly.

Orthogonal meaning zero correlation, which is an exaggeration since the market is like the tide. For example, if AMZN's Pr2Sales is at its 52-week low, it probably means a lot of other stocks are also at their lows.

We've never really explored ranking ratios purely longitudinally. We have a few pre-defined ones, like PERelative or PctFromHi, that are essentially longitudinal ranks. But when placed in a ranking system, they are then ranked again. So it's a double ranking: longitudinal, then cross-sectional.

Not sure why longitudinal ranks never came up. After all, isn't that what analysts always talk about? They either say XYZ is cheap compared to ZZZ and ABC, or it's a good time to buy XYZ because it's cheap historically.

What we're introducing are simple ways to add: 1) purely longitudinally ranked features, 2) features ranked vs. industry or sector, and 3) regression stats vs. prior values. An AI system may find these useful. Some limited empirical tests combining these methods show promise.

PS. Also something we can take to the ranking systems.


Is it possible to get normalized values for normal rankings as well? I just tested PERelative and it improved my ranking system, but I would like to have it for EV/EBITDATTM. High-quality stocks especially never really trade cheaply on an absolute basis, but relative to themselves they do. I would also be interested in testing longer or shorter timeframes than 5 years.

Here's one way to do this "manually." I'm using yield as an example, but you can use any factor here.

(Yield - FHistMin("Yield",65,4))/(FHistMax("Yield",65,4)-FHistMin("Yield",65,4))
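For anyone prototyping this outside the platform, here is a rough pandas approximation of the same idea (where the current value sits between its own historical min and max). The 65-sample / 4-week spacing mirrors the arguments above, although a plain rolling window over all weeks is used here instead of the sampled look-back, and the series name is hypothetical.

import pandas as pd

def hist_minmax_position(weekly: pd.Series, samples: int = 65, step_weeks: int = 4) -> pd.Series:
    """Scale the current value to 0..1 between its own historical min and max.
    65 samples spaced 4 weeks apart is roughly a 5-year look-back; the rolling
    window here simply covers every week in that span (an approximation)."""
    window = samples * step_weeks
    lo = weekly.rolling(window, min_periods=samples).min()
    hi = weekly.rolling(window, min_periods=samples).max()
    return (weekly - lo) / (hi - lo)

# Usage with a hypothetical weekly dividend-yield series indexed by date:
# yield_scaled = hist_minmax_position(yield_series)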

@yuvaltaylor Next week we're releasing several new functions that do all of that while typing the formula only once ("Yield" in your example).

These are the formula signatures. More details once we launch them.

Loop Functions

LoopRel("formula(CTR)",iterations, start=0, increment=1, NA_value=NA, NAPct=20)
LoopZScore("formula(CTR)",iterations, start=0, increment=1, clip=3.5, NA_value=NA, NAPct=20)
LoopRank("formula(CTR)", iterations, start=0, increment=1, sort=#DESC, NA_value=NA, NAPct=20)
FRel("formula", scope=#All, outlier_trim=7.5, clip=TRUE, NA_value=NA)

FHist functions

FHistRel("formula", samples, weeks_increment=1, NA_value=NA, NAPct=20)
FHistZScore("formula", samples, weeks_increment=1, clip=3.5, NA_value=NA, NAPct=20)
FHistRank("formula", samples, weeks_increment=1, sort=#DESC, NA_value=NA, NAPct=20)


Any chance you can/will add nested cross-validation for this? I think that if we make it easy to run that many variations, we will end up fitting to our validation data. While we can always hold out a few final years for test data, I personally want about 10 years of test data to get a better feel for sustained performance. I suspect, but have not tested, that running hundreds of runs against the validation data has the potential to fit to it, even with label trimming.

The nested cross-validation method I use is to train/validate on, say, 2003-2013 using a k-fold, then test on 2014. Then I train/validate on 2003-2014, test on 2015... and so on. That gives me 10 years of test data that I run only once per revision/idea, instead of the hundreds or thousands of runs used for choosing my label, hyper-parameters, preprocessing, and so on.
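Here is a rough Python sketch of that scheme, with a k-fold inner loop for choosing settings and a one-year walk-forward outer loop that touches each test year only once. make_model, candidate_params, and the column names are placeholders, not anything the platform provides.

import pandas as pd
from sklearn.model_selection import KFold

def walk_forward_nested_cv(df, feature_cols, candidate_params, make_model,
                           first_test_year=2014, last_test_year=2023):
    """Expanding-window walk-forward evaluation with an inner k-fold.

    For each test year Y: tune on all data before Y using 5-fold
    cross-validation, refit the winning configuration on that data,
    then score it exactly once on year Y."""
    results = []
    for test_year in range(first_test_year, last_test_year + 1):
        train = df[df["year"] < test_year]
        test = df[df["year"] == test_year]

        # Inner loop: pick label/hyper-parameters/preprocessing on training data only
        best_params, best_score = None, float("-inf")
        for params in candidate_params:
            fold_scores = []
            for tr_idx, va_idx in KFold(n_splits=5).split(train):
                model = make_model(params)
                model.fit(train.iloc[tr_idx][feature_cols], train.iloc[tr_idx]["ExcessReturn"])
                fold_scores.append(model.score(train.iloc[va_idx][feature_cols],
                                               train.iloc[va_idx]["ExcessReturn"]))
            mean_score = sum(fold_scores) / len(fold_scores)
            if mean_score > best_score:
                best_score, best_params = mean_score, params

        # Outer loop: refit on the full training window, score once on the held-out year
        model = make_model(best_params)
        model.fit(train[feature_cols], train["ExcessReturn"])
        results.append((test_year, model.score(test[feature_cols], test["ExcessReturn"])))
    return results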


This is especially important for those of us who happen to already know what features have been working for the last few years. I.e., all of us.

If you can incorporate some sort of feature selection into this, one can, in theory, begin to get a true idea of how a method of feature selection might perform out-of-sample.

That does not really work now. If we have selected factors that we know have worked over the last 10 years, I am not convinced it matters what ML model (or P123 classic optimization method) we use.

Yuval made this point in a post about one of my Python models that had fixed features. He was right. I cannot speak to what Yuval was thinking exactly, but for sure my results were suspect. I agree with him on that, and I did not mind him pointing out the obvious.

Yuval makes this point here. I hope he does not mind me quoting it, considering I think it is a great point and the context is the same:

I think Yuval was saying that the way I did it, my results were not really out-of-sample in the least, and that I was wrong in saying they were. Yuval is absolutely correct about that, I believe.

JLittleton has a fix for that once this feature is available with k-fold cross-validation. P123 would just have to add the walk-forward test set, and we would have a solution to what Yuval, too, suggests is a weakness in what we are doing now.

Selecting features for each sub-training period is something that would ideally be included in the nested cross-validation, and it is actually the next step I am taking with my code before I consider funding an ML model. I am curious whether anyone knows how much this affects the results; if not, perhaps I can make a post on it if I find out.

Short answer is no, I have not finished a nested cross-validation using k-fold validation for feature selection as the inner loop. How far have you gotten? I did one 5-fold validation (RFECV) with an Extra Trees Regressor for feature selection that took about 8 hours to run, and I did not go any further.

In part, as I recall, because the final validation was not that much better than a k-fold validation with the initial factors (for this single trial). But also because of a limited number of factors, limited by the fact that factors basically cost API credits.

In short, if the validation was not what I wanted, I was not too hopeful that the test set would give me what I was looking for.

In any case, this is as far as I got using RFECV with the Extra Trees Regressor for feature selection. Here is the code (with the factors removed). I am sure you have done this or something like it already, but others reading this may want to try it first (adapting the upload and any concatenation):

from sklearn.feature_selection import RFECV
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import pandas as pd

# Hyperparameters for ExtraTreesRegressor
criterion = 'friedman_mse'
max_depth = None
min_samples_leaf = 100
min_samples_split = 2000
max_features = 'log2'
n_estimators = 500
bootstrap = True
subsample = None
max_leaf_nodes = None

# Reading and concatenating data
file_paths = [
    '~/Desktop/DataMiner/xs/DM1xs.csv',
    '~/Desktop/DataMiner/xs/DM2xs.csv',
    '~/Desktop/DataMiner/xs/DM3xs.csv',
    '~/Desktop/DataMiner/xs/DM4xs.csv',
    '~/Desktop/DataMiner/xs/DM5xs.csv',
    '~/Desktop/DataMiner/xs/DM6xs.csv',
    '~/Desktop/DataMiner/xs/DM7xs.csv'
]

df_list = [pd.read_csv(file_path) for file_path in file_paths]
df = pd.concat(df_list, ignore_index=True)

# Features and target variable
feature_cols = []  # factor column names removed

X = df[feature_cols]
df['ExcessReturn'] = df['ExcessReturn'].fillna(0)
y = df['ExcessReturn']

# Standardizing features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Initializing the regressor
regressor = ExtraTreesRegressor(
    max_leaf_nodes=max_leaf_nodes,
    bootstrap=bootstrap,
    max_samples=subsample,
    n_estimators=n_estimators,
    max_depth=max_depth,
    min_samples_split=min_samples_split,
    min_samples_leaf=min_samples_leaf,
    max_features=max_features,
    criterion=criterion,
    n_jobs=-1,
    random_state=42,
    verbose=1
)

# RFECV for feature selection with a minimum number of features to select
min_features_to_select = 5  # Set this to your desired minimum number of features
selector = RFECV(regressor, step=1, cv=5, scoring='neg_mean_squared_error', min_features_to_select=min_features_to_select)
selector.fit(X_scaled, y)

# Display selected features and their rankings
selected_features = [feature for feature, selected in zip(feature_cols, selector.support_) if selected]
removed_features = [feature for feature, selected in zip(feature_cols, selector.support_) if not selected]
rankings = {feature: rank for feature, rank in zip(feature_cols, selector.ranking_)}

print(f'Selected Features: {selected_features}')
print(f'Removed Features: {removed_features}')
print(f'Feature Rankings: {rankings}')

# Model performance evaluation
y_pred = selector.predict(X_scaled)
mse = mean_squared_error(y, y_pred)
print(f'Mean Squared Error: {mse}')

I am running a nested loop now (inner k-fold, outer time series), but only with Recursive Feature Elimination, label selection, hyper-parameters... I run about 150 iterations right now, and it takes a bit over 2 hours per inner loop. I have not implemented the feature "discovery" year filter yet, as I need to go back and review my features for that information.

At first I thought my inner loop score would be a lot different from the outer, but I think the market is volatile enough that my scoring function sometimes returns better results and sometimes worse. I think I will take this discussion to another post, though, so as not to take over the preview too much more!

I have implemented one, but not with the feature discovery year. But I will start a new post so I don't take this one over more!


Hi, just a quick check to see when this feature will be released. Looking forward to it.

Thank you.

Currently being tested. There were last-minute changes because it was not easy to use with the many new options. We also revamped the predefined features, focusing on what makes sense to feed to an ML algorithm. There were also some performance challenges since it requires a lot more CPU resources. We hope to release this week.

Here is the updated documentation of the new normalization types. Currently only the first one is available.

Normalization Types (coming soon)

Feature normalization is a process that transforms feature values to a similar range and reduces the skewing effect of outliers. Normalizations can be one of the following (a short sketch follows the list):

  1. global: cross-sectional vs. other stocks in the entire dataset.
  2. by-date: cross-sectional vs. other stocks on the same date.
  3. local: longitudinal vs. historical values for the same stock.
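To make the three scopes concrete, here is a minimal pandas sketch of a z-score applied globally, by date, and locally per stock; the dataframe and column names are made up.

import pandas as pd

def zscore(x: pd.Series) -> pd.Series:
    return (x - x.mean()) / x.std()

# One row per (date, ticker) with a raw feature value (made-up numbers)
df = pd.DataFrame({
    "date":     ["2024-01-05"] * 3 + ["2024-01-12"] * 3,
    "ticker":   ["AAA", "BBB", "CCC"] * 2,
    "pr2sales": [1.2, 3.4, 0.8, 1.3, 3.1, 0.9],
})

df["z_global"]  = zscore(df["pr2sales"])                              # global: vs. the entire dataset
df["z_by_date"] = df.groupby("date")["pr2sales"].transform(zscore)    # by-date: vs. other stocks on that date
df["z_local"]   = df.groupby("ticker")["pr2sales"].transform(zscore)  # local: vs. the same stock's history
print(df)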

Only the recommended normalizations are shown for a particular predefined feature. The complete list of normalization options is:

  1. Default Normalization (global or by-date): The default normalization for the AI Factor, which uses either a z-score, min-max or rank scaler. It's a cross-sectional normalization, either by date or vs. the entire dataset.
  2. Categorical: Values are passed directly to the algorithm.
  3. Min/Max (global, no trim): Normalize from 0 to 1 globally. Outliers are not trimmed.
  4. Normalized vs. Sector/Industry (by-date): Normalize cross-sectionally against the sector, sub-sector or industry for a particular date.
  5. PIT Normalization (local): Normalize locally with historical Point In Time values using FHist() functions.
  6. Loop Normalization (local): Normalize locally using Loop() functions.
  7. Series Regression Growth (global): Calculate the regression growth of a series formula, then normalize it using the Default Normalization.
  8. Series Regression Surprise (global): Calculate the regression surprise of a series formula, then normalize it using the Default Normalization.
  9. Series Regression R² (local): Calculate the regression R² of a series formula, then feed it directly to the algorithm.

NOTE: See the LinReg() reference for details on the regression function.


Hello everyone!

I am interested in incorporating macro variables and dummies into an AI Factor. I see that the option to skip normalization of macro variables has not launched yet. Do you know roughly when it will be released? Do you know whether “Ranking” will be added for normalization, or only “MinMax”?

What kind of dummies can be chosen? Thank you very much in advance.

Best regards

I will post an update now.