NEW 'Factor List' tool for downloading data for AI/ML

Yes, just add Future%Chg(5) and turn on ‘Skip Normalization’ to see the actual value.


Add → Predefined

These are a mix of pre-built factors (the most common ones) and formulas in an easier-to-use format. More will be added. We will also use them in other parts of the site, like the Fundamental Chart and Ranking System nodes.


Ok, I had normalization enabled. When disabled, the returned field is blank. That should be OK.

Thank you for the explanation. I checked and it works.

I was wondering if it would be possible to add a new parameter to the Future%Chg_D factor, called e.g. ‘price type’, to be able to select Open, Close, or Average of Next High, Low, and 2X Close.

Many thanks again for this release.

I’d recommend a more standard, high-performance file format like Parquet.

If anyone wants to perform normalization using a method not currently covered in Factor List, use Scaling Z-Score, N/A Handling None, Trim % 0, and Outliers Retain. This will yield the original distribution scaled to standard normal, allowing one to do any sort of transformation.
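For example, a minimal sketch of layering your own re-normalization on top of such a download, assuming it is loaded as a pandas DataFrame with hypothetical 'date' and 'p123_uid' columns plus one column per factor (here the custom method is a cross-sectional percentile rank per date):

```python
import pandas as pd

# Hypothetical layout: 'date', 'p123_uid', and one z-scored column per factor,
# downloaded with Scaling=Z-Score, N/A Handling=None, Trim%=0, Outliers=Retain.
df = pd.read_csv("factor_list_download.csv", parse_dates=["date"])
factor_cols = [c for c in df.columns if c not in ("date", "p123_uid")]

# Those settings preserve the shape of the original distribution, so any
# monotonic re-normalization can be applied afterwards. Here: percentile
# ranks within each date.
ranked = df.copy()
ranked[factor_cols] = df.groupby("date")[factor_cols].rank(pct=True)
```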

The mean and standard deviation can also be preserved in subsequent downloads by using a fixed ML Training End date. (While this is generally true, any change in the universe or data would affect the result.) This unfortunately has a growing API cost since one must download the same data multiple times, but it’s the solution we currently have.

Some of this was mentioned here: No need for raw scores - #4 by marco

I was thinking about this, and instead of downloading the entire dataset again, all you need is one week of overlap. I think this works because the z-scores are a linear transformation of the original data, and therefore the z-scores from different downloads are just linear transformations of each other.

Using ChatGPT, I have a method for two datasets where there is one week of overlap with all the same stocks. I feel like there should be a way that only requires two or so identical stock IDs, but I have not figured it out yet. I did use ChatGPT to check the method, so take it with a grain of salt!

For two sets: set1 and set2 with one week of complete overlap

  1. Determine the scale and shift factors from the overlap week:
  • Scale (a): the ratio of the standard deviations of set1 and set2 for the overlap week, a = set1_wstd / set2_wstd.
  • Shift (b): b = set1_wmean - a * set2_wmean (the difference in means once set2 has been rescaled).
  2. Calculate the new z-scores using this equation:
    z_set2,transformed = a * z_set2 + b

All together:
z_set2,transformed = (set1_wstd / set2_wstd) * (z_set2 - set2_wmean) + set1_wmean

“By applying this transformation, the z-score from set2 is adjusted to be on the same scale as set1, making it comparable across the two datasets for the overlapping period. This is particularly useful when you want to compare or combine z-scores from different sources that were standardized differently.”
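A minimal sketch of that rescaling for a single factor, assuming you have NumPy arrays of z-scores for the same stocks in the overlap week (all names are illustrative):

```python
import numpy as np

def rescale_to_set1(z_set2, z1_overlap, z2_overlap):
    """Map set2's z-scores onto set1's scale using one overlap week.

    z1_overlap, z2_overlap: z-scores of the same stocks for the same overlap
    week, taken from set1 and set2 respectively.
    z_set2: any z-scores from set2 to express on set1's scale.
    """
    a = z1_overlap.std() / z2_overlap.std()        # scale
    b = z1_overlap.mean() - a * z2_overlap.mean()  # shift
    return a * z_set2 + b
```

Because both downloads are linear transformations of the same underlying values, the same a and b apply to every week of set2 for that factor.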

Jonpaul,

I think you could get away with that using just one independent variable.

I do not think it would work for a multivariate regression with multiple independent factors.

If you think about it, for one week the value factors might have done great while growth factors underperformed significantly. You would suddenly be putting a huge amount of weight on the growth factors in your regression (because you would be subtracting an excessively large mean for the week from the value factors).

Different, for sure, than if you had used the mean and standard deviation over the entire training set.

Also, if you train on 20+ years of data, re-downloading those 20+ years for weekly rebalances is cost-prohibitive.

I think Aschiff and Marco got it right. I am glad they understand this and are working on a solution (or something that uses fewer API credits).

Jim

There is a good chance I missed what you mean (I am tired today), but here is what I was thinking:

  1. Download your dataset from, say, 2010-01-01 to 2023-12-03 using z-scores scaled for the entire time period. Scaling over the entire period is very important so that each week uses the same mean and std.
  2. Then, when you want to download the next dataset, you would download from 2023-11-26 until now, once again scaling over the entire dataset.
  3. Then, for each factor, you would re-scale the second dataset using the overlapping week to get the scale and offset values. You would need to do so for each factor independently, not all together (see the sketch after this list).
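A minimal sketch of step 3, assuming both downloads are pandas DataFrames with 'date', 'p123_uid', and one z-scored column per factor (names are illustrative):

```python
import pandas as pd

def align_datasets(set1, set2, overlap_date, factor_cols):
    """Re-scale set2's z-scores onto set1's scale, one factor at a time."""
    # Restrict the overlap week to stocks present in both downloads.
    w1 = set1[set1["date"] == overlap_date].set_index("p123_uid")
    w2 = set2[set2["date"] == overlap_date].set_index("p123_uid")
    common = w1.index.intersection(w2.index)

    aligned = set2.copy()
    for col in factor_cols:                                   # each factor independently
        a = w1.loc[common, col].std() / w2.loc[common, col].std()
        b = w1.loc[common, col].mean() - a * w2.loc[common, col].mean()
        aligned[col] = a * set2[col] + b
    return aligned
```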

Let me know if I am missing something in my logic; I have not tested it well enough to be sure it works.

I just think that won’t work. Try approaching this in multiple ways with ChatGPT, making sure to let it know you are talking about multivariate regressions.

AND about stock data that is not stationary, where the means can be different for long periods of time. A sample of the means will not be enough.

I have the flu or maybe a mild Covid myself, so I suspect ChatGPT might be clearer than I would be. Plus I don’t want to monopolize the discussion, especially since Marco and P123 are already working on it.

I think ChatGPT will be clear. I’ll do the same to make sure I am not missing anything.

Jim

If I’m reading this right, a recursive definition for variance and means should work w/ data updates. Right? From that, one could build new z-scores. The original d/l would need to include period mean/variance.
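A minimal sketch of such a recursive update (Welford's online algorithm); the running count, mean, and M2 for the original period are assumed to be available, which the current download does not provide:

```python
def update_stats(count, mean, m2, new_value):
    """Welford's online update: fold one new observation into running stats."""
    count += 1
    delta = new_value - mean
    mean += delta / count
    m2 += delta * (new_value - mean)      # running sum of squared deviations
    return count, mean, m2

# variance = m2 / (count - 1); new z-scores would then be (x - mean) / variance ** 0.5
```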

EDIT: seems like only a data license holder can do that trick

Walter,

From the paper: “You can estimate the variance by randomly sampling the dataset.”

That is why I specifically mentioned the non-stationary nature of stock data above.

You cannot take a small sample with non-stationary data I think.

If you were to sample a portion of the data for volatility, would you be okay leaving out 2008? Maybe, I think.

You might be comfortable going back ten years and just including the Covid crash to get an adequate sample of volatility.

Ten years of data is still a lot of data to download every week. Not really efficient use of P123 resources.

I think means are probably a greater problem, more of a problem than volatility. A factor can underperform for long periods, with value’s underperformance in 2018 lasting at least a couple of years as a possible example.

But that would work for the i.i.d. data they are probably talking about here. Stock data is not i.i.d., however.

I hope you understand that I wish there were an easier way with z-scores and would love to find out that I am wrong about that.

There could be some satisfactory workarounds. Maybe I am too much of a perfectionist.

That having been said, the established way to do this is to calculate the z-score of the data you want to predict using the mean and standard deviation of the data you trained with.

This is called a standardized regression.
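A minimal, self-contained sketch of that convention using scikit-learn (synthetic data; the only point is that the training mean/std is reused on the prediction data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 3))                         # stand-in for raw factor values
y_train = X_train @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=500)
X_predict = rng.normal(loc=0.3, size=(50, 3))               # later data with a shifted mean

# Standardized regression: learn mean/std on the TRAINING data only...
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

# ...and z-score the prediction data with those same training statistics,
# not with its own weekly (or period) mean/std.
predictions = model.predict(scaler.transform(X_predict))
```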

Jim

Jim,

I was referring to the highest scoring answer. The one that has the recursive definitions for mean and variance.

But more importantly, I just realized that I don’t understand what ‘Download Factors’ is offering. I need a basic tutorial.

With Factor=AvgRec, Scaling=Z-Score, Scope=Entire Dataset, and ML Training End=12/16/2023, what range of data is used to compute the AvgRec mean and variance?

When Scope=By Date, the mean is simply the mean of all the AvgRec values for a particular date. Easy-peasy. I can look at all the S&P 500 stocks’ AvgRec for today, compute the set mean/variance, and then normalize to a Z-Score. The normalization for every period is independent.
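To illustrate the two scopes as I understand them (just the usual definitions, not a statement of what Download Factors actually computes):

```python
import pandas as pd

# df: an illustrative download with columns 'date', 'p123_uid', 'AvgRec'
def zscore_by_date(df, col="AvgRec"):
    # Scope=By Date: each date is normalized with that date's own cross-sectional mean/std.
    return df.groupby("date")[col].transform(lambda s: (s - s.mean()) / s.std())

def zscore_entire_dataset(df, col="AvgRec"):
    # Scope=Entire Dataset: one mean/std over every row up to ML Training End, applied everywhere.
    return (df[col] - df[col].mean()) / df[col].std()
```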

Maybe someone with more experience w/ what ML tools need can help me out.

Walter,

Thank you. I will have to think about some of that. I am open to the idea that there may be some workarounds and I hope there are.

Marco’s idea for saving the means and standard deviation for later use will certainly be a nice addition.

In the meantime, ranks are perfectly standardized each week. So 5 stars for what P123 has done already.

I will think about what you said as well as my own ideas for workarounds.

Best,

Jim

I’m having my doubts about the recursive stuff.

I need to get back to basics and understand what ‘Download Factors’ is doing.


I am making an assumption that the z-score using the entire dataset as the scope uses the same mean and std for every week when calculating the z-score. That is why I think you can use one overlapping week to scale one dataset to the other. If all weeks in each dataset have the same mean and std, then it does not matter how the data varies over time. You just need to scale the data such that the one week lines up. Then, since all weeks have the same scaling, you should have the same scaling across the datasets.

But I need to verify this. I have not had time yet… My plan is to download a few months for each dataset with one month of overlap, scale the second dataset using one week, and then check the other weeks to make sure the z-scores are identical.
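A self-contained sketch of that check with simulated data (the two "downloads" below are just different linear transforms of the same raw values, which is exactly the assumption being tested):

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 100))                 # 4 overlap weeks x 100 stocks of raw values

# Simulate two downloads z-scored with different "entire dataset" stats.
z1 = (raw - 1.3) / 2.0
z2 = (raw + 0.4) / 0.7

anchor = 0                                      # fit a and b on the first overlap week only
a = z1[anchor].std() / z2[anchor].std()
b = z1[anchor].mean() - a * z2[anchor].mean()

# If each download really uses one mean/std for every week, the same a and b
# should reproduce set1's z-scores in the other overlap weeks as well.
print(np.allclose(a * z2 + b, z1))              # True under that assumption
```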

I agree that if you use the By Date z-score scope, this will not work at all. Also, my goal is not to recover the original mean and std, just to make the two datasets line up.

EDIT: made a new thread so I can avoid hijacking this one any more than I have…

Jonpaul,

I do not want to dominate this discussion. Walter may have a method. I am not sure I fully understood what he was saying. There might be methods none of us have thought of.

But you also put a lot of work in on this, and I hope we can help prevent each other from making mistakes.

I asked ChatGPT 4.0 about this with an open-ended question that I hope was not too biased and I hope was an accurate summary of all of our problems now: “Okay. Let me expand and get a clear answer from you about whether you would recommend anyone investing with this method. A person has 24 years of data. Suppose he has 30 factors and can get a z-score on each of those factors but does not have access to the raw data. They do a multivariate regression knowing about standardized regression. They want to then make predictions based on the z-score but have no knowledge of the mean and standard deviation that was used in the calculation of the z-score in the training data. AND it cost money to repeat the process with 24 years of data at each weekly rebalance. What method do you recommend, if any, to get a z-score that can be used for prediction.”

A: “In such a scenario, without knowing the mean and standard deviation used to calculate the z-scores in the training data, accurately predicting future values becomes highly uncertain. It’s generally not recommended to proceed with investing based on this method unless the original scaling parameters can be obtained or reliably estimated, as the lack of this crucial information significantly undermines the model’s reliability and potential investment decisions.”

I then tried to lead it toward your solution for estimating “the scaling parameters”, again hopefully without too much bias in my question.

Q: “Do you think overlapping the training data and the data to be predicted by a week or more would be beneficial?”

A: “Overlapping the training data and prediction data by a week might help in aligning the z-scores to some extent, but it’s not a guaranteed solution, especially if the statistical properties of the data change over time. It’s a method that could potentially be beneficial, but it comes with uncertainties and should be approached with caution.”

Maybe it is wrong for me to just copy and paste ChatGPT, but it does write pretty well and I was already committed to its answer.

I would have just said to remember:

  1. That the z-score is calculated independently for each factor

  2. One week of data is not a very good sample. And a year might not be very good for some markets either (e.g., 2008)

  3. The mean and standard deviation get recalculated each time you do this with new data WITH NO MEMORY OF ANY PREVIOUS CALCULATIONS.

BTW, P123 is probably being very careful to make it so you can never calculate, know, or accurately estimate the mean and standard deviation. Otherwise, freshman high school algebra would allow you to calculate with 100% accuracy (or high accuracy in the case of a great estimate) the raw values for all of the data on P123’s site.
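That algebra is just the inversion of the z-score definition; with a (hypothetical) mean and standard deviation in hand, the raw value falls out immediately:

```python
# z = (raw - mean) / std  =>  raw = z * std + mean
z, mean, std = 1.25, 3.0, 0.8    # hypothetical values
raw = z * std + mean             # 1.25 * 0.8 + 3.0 = 4.0
```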

I think that is part of the problem you are having. I respect P123’s need to make it hard on you to calculate the raw values while working to make it possible to “decode” the data internally and use it (with P123’s help). They will never, knowingly, give you a way to decode it on your own unless their contract with FactSet changes.

Now, at least I understand why they have to save the mean and standard deviation internally and not give that to us. It does seem P123 understands the difficulties and has considered some solutions.

I am long. But this is not a trivial thing for us if we want to use it to invest money. We probably need to get it right. And I know P123 has spent a great deal of time and effort working on this without accidentally violating its contract with FactSet.

Jim


For those that want to use Z-Scores calculated with means and stddevs that span multiple dates:

Yes, it will be expensive in terms of API credits because you have to download the entire period using the same ML Training End Date so that the latest data uses the same distribution stats.

Yes, we’re not giving you the mean & stddev because it would make it trivial to get the raw data, thereby violating our vendor terms (technical data is OK).

We might offer a way to remember the statistics for each feature so that you only need to download the latest data. But it’s not a solution for all cases, and it's probably a waste of time. What if you are doing k-fold cross-validation, for example?

Therefore, just use Min/Max with a very small Trim% and lots of digits of precision, and do the Z-Score yourself. You will have total control.
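A minimal sketch of that do-it-yourself route, assuming the downloads are pandas DataFrames of Min/Max-scaled factor columns (names illustrative); note that this only stays consistent across downloads if the Min/Max bounds themselves do not shift, which a follow-up post below touches on:

```python
import pandas as pd

def fit_zscore_params(train, factor_cols):
    """Compute and save your own per-factor mean/std from the training download."""
    return {col: (train[col].mean(), train[col].std()) for col in factor_cols}

def apply_zscore(df, params):
    """Z-score any download with the saved training statistics."""
    out = df.copy()
    for col, (mean, std) in params.items():
        out[col] = (df[col] - mean) / std
    return out
```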

Lastly, the FactSet license is only $12K/year, and with it you can download raw data.

@marco Thank you. This requires all of the data, and you are thinking you can use this for k-fold validation, I think.

It works for that, I believe.

It still requires a full download of all of the data for every rebalance to keep the linear scaling coefficients constant, if I am correct. Almost no one is going to do that, I would guess.

You might need to look at the ease (and cost) of rebalancing as a separate issue if you want this to take off.

Sorry guys, I said something stupid. There’s currently no way to do a consistent normalization (using the same distribution stats) for the prediction data without downloading more and more data. The prediction data will not have the same min/max values as the training data, but it will still range from 0 to 1, which is incorrect.

So, if normalization spanning multiple dates is what you want, then it can be costly in terms of API credits to keep normalization consistent unless we add some way of remembering the stats.

The “Factor List” is a component of our upcoming, built-in AI factors, where you will not need to download anything, and we wanted to kickstart some discussions.
