Optimal time for training on factors based on fundamentals

I'm looking for some guidance: how long should the training period be for an algo (an RF, for example) trained on fundamental data?

I'm not sure if less than a year makes sense, as there are only 4 new data publications per year.

At the other extreme, if I trained on around 10 years or longer, I might be training on factors that are going out of fashion.

Ideally the training window would capture the changes in factor performance.

Any suggestions, or publications/sources?

Best
Carsten

Hi Carsten,

Please check AiFML, chapter 4.7 TIME DECAY and use it for each 'Date'.
My implementation uses C=0, which means that the observations in the oldest 'Date' have weight zero and the observations in the second-oldest 'Date' have a weight slightly above zero.
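
For illustration only (my toy example, not from the book): with C=0 and five quarterly 'Date' values, the weights rise linearly from zero for the oldest date to one for the newest.

import numpy as np

# Toy example: linear time decay with C=0 across five quarterly dates.
# The oldest date gets weight 0, the newest gets weight 1.
dates = ['2023Q1', '2023Q2', '2023Q3', '2023Q4', '2024Q1']
weights = np.linspace(0.0, 1.0, len(dates)).tolist()
print(dict(zip(dates, weights)))
# {'2023Q1': 0.0, '2023Q2': 0.25, '2023Q3': 0.5, '2023Q4': 0.75, '2024Q1': 1.0}
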
Let me know if you need any help with this.


Hi Piotr,

Thanks.
Conceptually it's very logical, but operationally it's not so clear. Do I apply it to the features and the label, or only to the label? The whole chapter only talks about labels.
So I guess it's the label, right?

Second question: assuming C=0 as well, over what duration would the weight decay to zero?
I was testing, and without decay a good training period could be around 3 years? 1 year is definitely too short, and 10 years doesn't look too good either.
So, using decay, I guess something around 4-5 years for the training period?

Best
Carsten

This is very tricky because the underlying distribution of stock returns is constantly changing and there are no guarantees. This is what makes investing very difficult. Generally, more data is better, but more recent data captures the current regime better. However, shifts can happen very quickly, such as with Covid, and models do break. I might even remove that period from the data. I try to stay on top of the game, run thorough tests with different periods, markets, factors, etc., and keep doing so as long as I invest.

Chapter 4.7 talks about things that are not always relevant for our panel data, but my understanding of this approach is as follows.
There are two alternatives:
- implicitly, by passing an array of weights for each observation to the sample_weight parameter of the fit method in RF; the weights will be used when calculating split gains (after bootstrapping);
- explicitly, by randomly undersampling your whole dataset with time decay (for example, group df by Date and use df.sample with a fraction that is smaller for older dates) - see the sketch below.
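
For completeness, here is a minimal sketch of the second (undersampling) alternative. It is only an illustration assuming a single-index DataFrame with a 'Date' column; the function name and the min_frac parameter are mine, not from the book.

import numpy as np
import pandas as pd

def undersample_with_time_decay(df, min_frac=0.1, random_state=42):
    # Map each unique date to a sampling fraction that grows linearly
    # from min_frac (oldest date) to 1.0 (newest date)
    dates = np.sort(df['Date'].unique())
    frac = pd.Series(np.linspace(min_frac, 1.0, len(dates)), index=dates)
    # Sample each date's cross-section with its own fraction
    return (df.groupby('Date', group_keys=False)
              .apply(lambda g: g.sample(frac=frac[g.name], random_state=random_state)))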

@pitmaster

I feel the implicit approach is less error-prone. What do you think? See below:
(If I start manipulating the features/labels, I'm not sure what the outcome would be.)

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def getTimeDecay(date_index, clfLastW=1.):
    # Normalize the (sorted) dates to the range [0, 1]
    clfW = (date_index - date_index.min()) / (date_index.max() - date_index.min())
    if clfLastW >= 0:
        # Linear decay: the oldest observation gets weight clfLastW, the newest gets 1
        slope = (1. - clfLastW) / clfW.iloc[-1]
    else:
        # Negative clfLastW: the oldest portion of the history gets weight zero
        slope = 1. / ((clfLastW + 1) * clfW.iloc[-1])
    const = 1. - slope * clfW.iloc[-1]
    clfW = const + slope * clfW
    clfW[clfW < 0] = 0  # Clip any negative weights to zero
    return clfW

# Assuming you have the following data
X_train = ...  # Feature matrix as a pandas DataFrame with a multi-index (ticker, date)
y_train = ...  # Target variable

# Get the unique dates from the date index and sort them
unique_dates = sorted(X_train.index.get_level_values(1).unique())

# Calculate the time decay weights for the sorted unique dates
clfLastW = 0.1  # Weight of the oldest observation (adjust as needed)
date_weights = getTimeDecay(pd.Series(unique_dates), clfLastW)

# Create a dictionary to map dates to weights
date_to_weight = dict(zip(unique_dates, date_weights))

# Create a sample weight array based on the date index of X_train
sample_weight = X_train.index.get_level_values(1).map(date_to_weight)

# Create a Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model with sample weights
rf_model.fit(X_train, y_train, sample_weight=sample_weight)

Just checked, but computation time goes up with the weights...


This is great code, @Carsten1234.
I modified this code for users who (like me) do not use a multi-index.

You can see the charts showing how the observations are weighted.
The first shows all observations, the second shows the first 5% of observations (sorted by date). You can see that there is a step for each new date, which indicates that the weight for observations in each successive date increases linearly.

Modified code for a single-index df:

def getTimeDecay(date_index, clfLastW=1.):
    clfW = (date_index - date_index.min()) / (date_index.max() - date_index.min())
    if clfLastW >= 0:
        slope = (1. - clfLastW) / clfW.iloc[-1]
    else:
        slope = 1. / ((clfLastW + 1) * clfW.iloc[-1])
    const = 1. - slope * clfW.iloc[-1]
    clfW = const + slope * clfW
    clfW[clfW < 0] = 0
    return clfW

# Assuming you have the following data
df = ...       # Training DataFrame with a 'Date' column, sorted by date
X_train = ...  # Feature matrix as a pandas DataFrame with a single index (ticker), same row order as df
y_train = ...  # Target variable

# Rank dates (1 = oldest date, increasing with time)
date_ranks = df['Date'].rank(method='dense').astype(int)

# Calculate the time decay weights for the date ranks
clfLastW = 0.1  # Weight of the oldest observation (adjust as needed)
date_weights = getTimeDecay(date_ranks, clfLastW)

# Create a dictionary to map date ranks to weights
rank_to_weight = dict(zip(date_ranks, date_weights))

# Create a sample weight array aligned with the rows of X_train
sample_weight = date_ranks.map(rank_to_weight)
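
To reproduce a chart like the ones described above, a quick plot of the per-row weights should show the same staircase (assuming matplotlib is installed and the rows are sorted by date):

import matplotlib.pyplot as plt

# Each flat step is one 'Date'; the weights rise toward the newest date
plt.plot(sample_weight.to_numpy())
plt.xlabel('Observation (sorted by date)')
plt.ylabel('Sample weight')
plt.show()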


@pitmaster, still the looming question: what's the length of the training data? :grinning:
3-5 years?

Best
Carsten

Nobody knows :slight_smile:
If your model wants to pick up some factor momentum, use a short training period (1-2 years).
If your model should be an 'all weather' type, then the training period should cover both good and bad times multiple times: use the longest possible period with time decay (10-15 years), and maybe use an expanding window.
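
If it helps, here is a minimal sketch of the expanding-window idea (my own illustration; df, the column names, and min_train_dates are assumptions): fit on all history up to each date, predict the next cross-section, and keep growing the training set.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def expanding_window_predictions(df, feature_cols, target_col='target', min_train_dates=12):
    dates = sorted(df['Date'].unique())
    out = []
    for i in range(min_train_dates, len(dates)):
        train = df[df['Date'] < dates[i]]   # expanding window: all history so far
        test = df[df['Date'] == dates[i]]   # the next cross-section to predict
        model = RandomForestRegressor(n_estimators=100, random_state=42)
        model.fit(train[feature_cols], train[target_col])
        # the time-decay sample_weight from the code above could be passed to fit() here
        preds = test[['Date']].copy()
        preds['prediction'] = model.predict(test[feature_cols])
        out.append(preds)
    return pd.concat(out)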


I would like to go for the first option, but I'm wondering what the minimum training period I need would be.

As we have only 4 datapoints per stock per year, 2 years is really limited.
Cross-sectionally we have around 4,000 stocks to fit the model.
In the end there are only 4,000 x 8 datapoints for the fit, which doesn't sound like much...
If one is restricted to the S&P 500 and one year, there are only 2,000 data points.
I'm not sure if this is enough to get a good prediction.

Are there any guidelines on how to check how many datapoints I need and how the model performs?

Just for a very simple RF
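
One rough way to check this is a learning curve: plot the validation score against the number of training samples and see whether it is still improving when you run out of data. A minimal sketch with scikit-learn (X_train, y_train as in the earlier code; note that the default CV split is not time-aware, so treat it only as a sanity check):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Score the model on growing fractions of the training data
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(n_estimators=100, random_state=42),
    X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring='neg_mean_squared_error')

# If the validation score is still improving at the largest size, more data would likely help
print(train_sizes)
print(val_scores.mean(axis=1))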

Why only 4 datapoints per stock per year?
You can sample data every 4 weeks to have 12 datapoints per stock per year.
In a perfect world where all stocks release earnings on the same day, you could sample data every quarter, the day after the release, to be sure that all stocks have fundamentals related to the same period. But in reality stocks (especially in the US) release earnings every day.
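
For example (just a sketch; the date range and the df with a 'Date' column are placeholders, and real data would need the snapshot dates aligned to actual trading days):

import pandas as pd

# Build a 4-weekly snapshot calendar and keep only those dates from the panel
snapshot_dates = pd.date_range('2014-01-03', '2023-12-29', freq='4W-FRI')
df_4weekly = df[df['Date'].isin(snapshot_dates)]  # roughly 13 snapshots per stock per year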


Value ratios are affected by price changes every day. That includes the enterprise value calculation, as well as market cap if you use those, I think. Any unexpected value ratio after an earnings report would get corrected in days or weeks perhaps, but before the next quarter, if the ratio was unexpectedly low or high and reflected a real change (not just an accounting gimmick). So the target is also changing more quickly than once per quarter, and that could be picked up with a shorter target period, too.

Analysts' consensus estimates also change more frequently than once per quarter.


@pitmaster @Jrinne

Yes and no.

Basically, we get only 4 datapoints per stock per year, which are the quarterly publications.

Then we do some mumbo jumbo and apply a little noise to them, a.k.a. price. (I'm exaggerating.)

But at the end of the day we only get 4 new pieces of information per year on the state of the company.

I may have oversimplified it, but compared to, for example, RSI, it's very little information per year per stock.
(Just imagine using 4 datapoints for price momentum.)

Of course, with the 4,000 stocks we might have in the universe, we can multiply this.
But if we only select 20 stocks, each of them is again selected based on 4 datapoints.

That's why I'm struggling a bit and questioning whether this is enough to create a good decision tree, or whether I'm just fitting noise.

That was my point.

Best
Carsten


Hi Carsten,

(Just a note. As you know, we have shared some code previously, and like @pitmaster I have been impressed. We have discussed your use of classes and your preference for definitions, for example. I have trouble with both.)

I have a theory, though it may not be entirely correct. I believe that some of the changes we see in value metrics are simply due to regression toward the mean. This phenomenon, I think, is related to information entropy and is therefore akin to the second law of thermodynamics (an inescapable law of nature). In fact, every stock I've ever encountered with a rank of 100 has eventually reverted toward the mean: by definition, it has not maintained that extreme valuation indefinitely.

Knowing that a stock has an extreme value metric should not be ignored, according to my theory. The metric will change over time. One way for it to change is if the price, enterprise value, or market cap increases.

For me, this may be the whole point of value investing and analyzing these metrics.

I would add that by incorporating growth metrics, you are potentially increasing the odds that the value metric will revert to the mean through a correction in the stock price.

So, if a company has a very low price-to-free-cash-flow ratio, making it rank highly for that metric, I know that this extreme valuation will not persist forever (it will eventually revert). If every analyst is predicting improved cash flow and there is a blowout earnings surprise, then you might expect a price change that could cause that metric to revert to the mean. You might even consider a small investment in the stock, betting on this reversion with a price change as a probable cause.

I may be wrong, but that is my theory. I rebalance my portfolio weekly and use the upcoming week as a target for machine learning and for the P123 classic strategy.

By the way, I use the P123 classic strategy and machine learning (sometimes combining them). I optimize it differently than most members when using P123 classic, but I love P123 classic and think there is more than one good way to optimize this strategy. So for me, it's not just about machine learning.

Jim


@Jrinne

Jim,

I totally agree with everything you wrote.

I'm one step before that, at the point of selecting which factor (or several) I should use.
After that, the second step is the mechanics of how the factors create profit, which is what you described. That would be my understanding as well.

So let's go back to the first step: which factors to use.

I can do it manually, looking at the buckets to see if I have good monotonicity over 10 buckets and over the last 20 years, then selecting the best 30 factors, investing, and not changing anything.
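
As an illustration of that manual check (my sketch; 'factor' and 'fwd_return' are placeholder column names), one can bucket a factor into deciles per date and test whether the mean forward return changes monotonically across buckets:

import pandas as pd
from scipy.stats import spearmanr

# Assign decile buckets within each date's cross-section
df['bucket'] = df.groupby('Date')['factor'].transform(
    lambda s: pd.qcut(s, 10, labels=False, duplicates='drop'))

# Mean forward return per bucket; a Spearman rho near +1 or -1 means good monotonicity
bucket_returns = df.groupby('bucket')['fwd_return'].mean()
rho, _ = spearmanr(bucket_returns.index, bucket_returns.values)
print(bucket_returns)
print(rho)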

But if I want to ride the best 10-20 factors per quarter, the approach would be different.

I was screening a lot of factors with alphalens.
It also gives a good plot of single-factor performance over 20 years.
Interestingly, a factor's performance goes consistently up and down, a bit like a drunk sine curve.
But not very rapidly; the cycle is very long.
If one could follow the ups and downs and only select a factor when it's near its top performance, it would be a great strategy.

Now the question would be: how do you jump into the right factors at the right time?

If I assume a factor's optimum might be in play for a period of 1-2 years, it should be selectable by an RF.

In order to get that working, the decision tree needs enough samples to pick the correct factors out of the noise.

Now, the point I was making before: I can only base the performance of the factor on very few newly arriving samples (4 per year).

Is there a chance that this could work at all, or is the time (the available new datapoints) not enough, so that I'm just training on too small a sample size and predicting noise?

Analogy: if I want to measure a very turbulent flow, I can compute from the estimated rms how many samples I need in order to achieve the confidence level I want. If I want to measure the mean velocity with x% accuracy, I need at least y samples if the flow has an rms of z.

Now this should somehow apply to an RF as well.
It would be good to know how one could estimate that.
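
As a back-of-the-envelope version of that analogy (all numbers here are made up for illustration), the classic sample-size formula for estimating a mean is n ≈ (z * sigma / e)^2:

import math

# Rough sample-size estimate for a mean: n ≈ (z * sigma / e)^2
sigma = 0.15   # assumed cross-sectional rms (noise) of the quarterly returns
e = 0.01       # tolerated error on the estimated mean (1 percentage point)
z = 1.96       # normal quantile for ~95% confidence
n = math.ceil((z * sigma / e) ** 2)
print(n)  # 865 observations for these made-up numbers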

Best
Carsten

Maybe @WalterW can help you with some of that. He uses PyMC and Bayesian methods. I don't think you will get statistical significance with 2 years of data, and the confidence interval will be wide.

But you can get an odds ratio using a longer period of data, or even multiple factors, as the prior, with hierarchical Bayesian methods as a more advanced approach.
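
Not tested advice, just a rough sketch of what a hierarchical (partial-pooling) model over factor mean returns could look like in PyMC (v5-style API; the data array and all priors here are placeholders):

import numpy as np
import pymc as pm

# Placeholder data: monthly returns of 5 factors over 10 years
factor_returns = np.random.normal(0.0, 1.0, size=(120, 5))

with pm.Model() as model:
    # A common prior pulls each factor's mean toward the group-level mean
    mu_global = pm.Normal('mu_global', mu=0.0, sigma=0.5)
    tau = pm.HalfNormal('tau', sigma=0.5)
    mu_factor = pm.Normal('mu_factor', mu=mu_global, sigma=tau,
                          shape=factor_returns.shape[1])
    sigma = pm.HalfNormal('sigma', sigma=1.0)
    pm.Normal('obs', mu=mu_factor, sigma=sigma, observed=factor_returns)
    idata = pm.sample(1000, tune=1000, chains=2)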

Just an idea, and I cannot say I have made Bayesian methods work for that particular purpose myself. Also, I have limited experience with PyMC, but I have an environment set up in Anaconda and it seems to give good answers. I cannot say it is easy for me, but it works. I think @Walter uses it a lot and may have more practical or tested advice.

Jim


Just a quick note: there has been a lot of debate about whether factor momentum even exists. I don't believe it does--or if so, not much. Factors work for a while, then they don't, then they work again, and there's no predicting when they'll work and when they won't. That belief would favor a long training period. I would use about ten years if I were you.