NEW 'Factor List' tool for downloading data for AI/ML

WalterW · December 23, 2023, 1:57pm

If I’m reading this right, a recursive definition for variance and means should work w/ data updates. Right? From that, one could build new z-scores. The original d/l would need to include period mean/variance.
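For readers unfamiliar with the idea, here is a minimal sketch of one standard recursive update: merging the count, mean and variance of an old batch with those of a new batch, without revisiting the old rows. Whether this is the exact form in the answer Walter refers to is an assumption, and the function name `merge_stats` and the sample numbers are made up for illustration. It only works if the old period's count, mean and variance were saved with the original download.

```python
from statistics import fmean, pvariance

def merge_stats(n_a, mean_a, var_a, n_b, mean_b, var_b):
    """Combine count, mean and population variance of two disjoint batches."""
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # Sums of squared deviations add up, plus a term for the gap between batch means.
    m2 = var_a * n_a + var_b * n_b + delta * delta * n_a * n_b / n
    return n, mean, m2 / n

# Hypothetical data: an earlier download and one new week.
old = [1.0, 2.0, 4.0, 7.0]
new = [3.0, 8.0]

n, mean, var = merge_stats(
    len(old), fmean(old), pvariance(old),
    len(new), fmean(new), pvariance(new),
)
```

The merged values should agree with computing the mean and variance over `old + new` directly, which is an easy check to run on real data.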

EDIT: seems like only a data license holder can do that trick

Jrinne · December 23, 2023, 2:30pm

Walter,

From the paper: “You can estimate the variance by randomly sampling the dataset.”

That is why I specifically mentioned the non-stationary nature of stock data above.

You cannot take a small sample with non-stationary data I think.

If you were to sample a portion of the data for volatility, would you be okay leaving out 2008? Maybe, I think.

You might be comfortable going back ten years and just including the Covid crash to get an adequate sample of volatility.

Ten years of data is still a lot of data to download every week. Not really efficient use of P123 resources.

I think means are probably a greater problem than volatility. A factor can underperform for long periods; value’s underperformance in 2018, lasting at least a couple of years, is a possible example.

But that would work for i.i.d. data, which is probably what they are talking about here. Stock data is not i.i.d., however.

I hope you understand that I wish there were an easier way with z-score and would love to find I am wrong about that.

There could be some satisfactory workarounds. Maybe I am too much of a perfectionist.

That having been said, the established way to do this is to calculate the z-score of the data you want to predict using the mean and standard deviation of the data you trained with.

This is called a standardized regression.
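A minimal sketch of that approach, using only the Python standard library. The function names `fit_scaler` and `transform` and the numbers are illustrative assumptions, not anything P123 provides. The point is that the mean and standard deviation are computed once on the training data and then reused unchanged on the data to be predicted.

```python
from statistics import fmean, pstdev

def fit_scaler(train_values):
    """Return the mean and population standard deviation of the training data."""
    return fmean(train_values), pstdev(train_values)

def transform(values, mean, std):
    """Z-score values using statistics that were fitted elsewhere."""
    return [(v - mean) / std for v in values]

# Hypothetical factor values.
train = [10.0, 12.0, 14.0, 16.0, 18.0]
new_week = [14.0, 20.0]

mu, sigma = fit_scaler(train)            # fitted on training data only
z_new = transform(new_week, mu, sigma)   # new data reuses the training stats
```

Recomputing `mu` and `sigma` from `new_week` alone would put the new z-scores on a different scale from the one the model was trained on.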

Jim

WalterW · December 23, 2023, 3:07pm

Jim,

I was referring to the highest scoring answer. The one that has the recursive definitions for mean and variance.

But more importantly, I just realized that I don’t understand what ‘Download Factors’ is offering. I need a basic tutorial.

With Factor=AvgRec, Scaling=Z-Score, Scope=Entire Dataset, and ML Training End=12/16/2023, what range of data is used to compute the AvgRec mean and variance?

When Scope=By Date, the mean is simply the mean of all the AvgRec values for a particular date. Easy-peasy. I can look at AvgRec for all the S&P 500 stocks today, compute the set mean/variance, and then normalize to Z-Score. The normalization for every period is independent.
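The By Date case can be sketched as follows. This is an illustration of per-date normalization in general, not P123's implementation; the tickers, dates and values are made up, and it assumes every date has some variation in the factor (otherwise the standard deviation is zero).

```python
from collections import defaultdict
from statistics import fmean, pstdev

def zscore_by_date(rows):
    """rows: iterable of (date, ticker, value). Returns {(date, ticker): z}."""
    by_date = defaultdict(list)
    for date, ticker, value in rows:
        by_date[date].append((ticker, value))

    out = {}
    for date, items in by_date.items():
        values = [v for _, v in items]
        # Each date gets its own mean and standard deviation.
        mu, sigma = fmean(values), pstdev(values)
        for ticker, v in items:
            out[(date, ticker)] = (v - mu) / sigma
    return out

# Hypothetical AvgRec-style values for two dates.
rows = [
    ("2023-12-16", "AAA", 1.0), ("2023-12-16", "BBB", 2.0), ("2023-12-16", "CCC", 3.0),
    ("2023-12-23", "AAA", 2.0), ("2023-12-23", "BBB", 2.0), ("2023-12-23", "CCC", 5.0),
]
z = zscore_by_date(rows)
```

Because each date is normalized on its own, no statistics need to be carried from one download to the next in this mode.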

Maybe someone with more experience w/ what ML tools need can help me out.

Jrinne · December 23, 2023, 3:14pm

Walter,

Thank you. I will have to think about some of that. I am open to the idea that there may be some workarounds and I hope there are.

Marco’s idea for saving the means and standard deviation for later use will certainly be a nice addition.

In the meantime ranks are perfectly standardized each week. So 5 stars on what P123 has done already.

I will think about what you said as well as my own ideas for workarounds.

Best,

Jim

WalterW · December 23, 2023, 3:20pm

I’m having my doubts about the recursive stuff.

I need to get back to basics and understand what ‘Download Factors’ is doing.

jlittleton · December 24, 2023, 1:14am

I am making an assumption that the zscore using the entire dataset as the scope is using the same mean and std for every week when calculating the zscore. That is why I think you can use one overlapping week to scale one dataset to the other. If all weeks in each dataset have the same mean and std then it does not matter how the data varies over time. You just need to scale the data such that the one week lines up. Then since all weeks have the same scaling you should now have the same scaling across the datasets.

But I need to verify this. I have not had time yet… My plan is to download a few months for each dataset with one month of overlap. Scale the second dataset using one week and then check the other weeks to make sure the zscores are identical.

I agree that if you use the zscore scope of per date this will not work at all. Also my goal is not to get the original mean and std. Just make the two datasets line up.
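The plan above can be tried on toy numbers. The sketch below is a check of the arithmetic only, under the stated assumption that each download uses one mean and one std for its whole span and that no trimming or clipping is applied. The raw values, which a real user would not see, are made up, and `fit_line` is a hypothetical helper.

```python
from statistics import fmean, pstdev

def zscores(values, mu, sigma):
    return [(v - mu) / sigma for v in values]

def fit_line(x, y):
    """Least-squares slope and intercept mapping x onto y."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up raw factor values for three weeks.
week1 = [1.0, 3.0, 4.0, 8.0]
week2 = [2.0, 2.5, 6.0, 7.0]
week3 = [0.5, 4.0, 5.0, 9.0]

# Download A spans weeks 1-2, download B spans weeks 2-3; each has its own stats.
mu_a, sd_a = fmean(week1 + week2), pstdev(week1 + week2)
mu_b, sd_b = fmean(week2 + week3), pstdev(week2 + week3)

# Week 2 is the overlap: the same stocks appear in both downloads.
overlap_a = zscores(week2, mu_a, sd_a)
overlap_b = zscores(week2, mu_b, sd_b)
slope, intercept = fit_line(overlap_b, overlap_a)

# Rescale B's week 3 onto A's scale and compare with what A's stats would give.
week3_on_a_scale = [slope * z + intercept for z in zscores(week3, mu_b, sd_b)]
expected = zscores(week3, mu_a, sd_a)
```

This lines the two downloads up on dataset A's scale. It does not recover the original mean and std, and it says nothing about whether A's statistics are still a good description of later data.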

EDIT: made a new thread so I can minimize hi-jacking this one any more than I have…

Jrinne · December 24, 2023, 11:12am

Jonpaul,

I do not want to dominate this discussion. Walter may have a method. I am not sure I fully understood what he was saying. There might be methods none of us have thought of.

But also you put a lot of work in on this and I hope we can help prevent each other from making mistakes.

I asked ChatGPT 4.0 about this with an open-ended question that I hope was not too biased and I hope was an accurate summary of all of our problems now: “Okay. Let me expand and get a clear answer from you about whether you would recommend anyone investing with this method. A person has 24 years of data. Suppose he has 30 factors and can get a z-score on each of those factors but does not have access to the raw data. They do a multivariate regression knowing about standardized regression. They want to then make predictions based on the z-score but have no knowledge of the mean and standard deviation that was used in the calculation of the z-score in the training data. AND it costs money to repeat the process with 24 years of data at each weekly rebalance. What method do you recommend, if any, to get a z-score that can be used for prediction.”

A: “In such a scenario, without knowing the mean and standard deviation used to calculate the z-scores in the training data, accurately predicting future values becomes highly uncertain. It’s generally not recommended to proceed with investing based on this method unless the original scaling parameters can be obtained or reliably estimated, as the lack of this crucial information significantly undermines the model’s reliability and potential investment decisions.”

I then tried to lead it toward your solution for estimating “the scaling parameters”, again hopefully without too much bias in my question.

Q: “Do you think overlapping the training data and the data to be predicted by a week or more would be beneficial?”

A: “Overlapping the training data and prediction data by a week might help in aligning the z-scores to some extent, but it’s not a guaranteed solution, especially if the statistical properties of the data change over time. It’s a method that could potentially be beneficial, but it comes with uncertainties and should be approached with caution.”

Maybe it is wrong for me to just copy and paste ChatGPT, but it does write pretty well and I was already committed to its answer.

I would have just said remember:

The z-score is calculated independently for each factor.
One week of data is not a very good sample, and a year might not be very good for some markets either (e.g., 2008).
The mean and standard deviation get recalculated each time you do this with new data WITH NO MEMORY OF ANY PREVIOUS CALCULATIONS.

BTW. P123 is probably being very careful to make it so you can never calculate, know or accurately estimate the mean and standard deviation. Otherwise, freshman high school algebra would allow you to calculate with 100% accuracy (or high accuracy in the case of a great estimate) the raw values for all of the data on P123’s site.

I think that is part of the problem you are having. I respect P123’s need to make it hard on you to calculate the raw values while working to make it possible to “decode” the data internally and use it (with P123’s help). They will never, knowingly, give you a way to decode it on your own unless their contract with FactSet changes.

Now, at least I understand why they have to save the mean and standard deviation internally and not give that to us. It does seem P123 understands the difficulties and has considered some solutions.

I am long. But this is not a trivial thing for us if we want to use it to invest money. We probably need to get it right. And I know P123 has spent a great deal of time and effort working on this without accidentally violating its contract with FactSet.

Jim

marco · December 24, 2023, 3:07pm

For those that want to use Z-Scores calculated with means and stddevs that span multiple dates:

Yes, it will be expensive in terms of API credits because you have to download the entire period using the same ML Training End Date so that the latest data uses the same distribution stats.

Yes, we’re not giving you the mean & stddev because it would make it trivial to get the raw data, thereby violating our vendor terms (technical data is OK).

We might offer a way to remember the statistics for each feature so that you only need to download the latest data. But it’s not a solution for all cases, and probably a waste of time. What if you are doing k-fold cross validation for example?

Therefore just use Min/Max with a very small Trim% with lots of digits in the precision and do the Z-Score yourself. You will have total control.
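Within a single download, this works because the z-score is unchanged by a linear rescaling such as Min/Max. The sketch below illustrates that property on made-up numbers; it assumes no value is clipped by trimming, which is why the Trim% needs to be very small. The helper names are hypothetical.

```python
from statistics import fmean, pstdev

def min_max(values):
    """Scale values linearly onto the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def zscore(values):
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

raw = [3.0, 5.0, 8.0, 13.0, 21.0]    # hypothetical raw values (not downloadable)
scaled = min_max(raw)                # what a Min/Max download would contain

z_from_scaled = zscore(scaled)       # computed by the user from the download
z_from_raw = zscore(raw)             # what a z-score on raw data would give
```

The two z-score lists agree up to floating-point error, which is also why high precision in the downloaded Min/Max values matters.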

Lastly the FactSet license is only $12K/year and you can download raw data.

Jrinne · December 24, 2023, 3:26pm

@marco Thank you. This requires all of the data, and you are thinking you can use this for k-fold validation, I think.

It works for that I believe.

It still requires a full download of all of the data for every rebalance to keep the linear scaling coefficient constant, if I am correct. Almost no one is going to do that I would guess.

You might need to look at ease (and cost) of rebalance as a separate issue if you want this to take off.

marco · December 26, 2023, 7:55pm

Sorry guys I said something stupid. There’s currently no way to do a consistent normalization (using the same distribution stats) for the prediction data without downloading more and more data. The prediction data will not have the same min/max values as the training data but it will still range from 0 to 1 which is incorrect.
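A toy illustration of the problem, with made-up numbers: when training and prediction data are each Min/Max scaled using their own minimum and maximum, the same raw value lands on different scaled values, even though both outputs run from 0 to 1.

```python
def min_max(values):
    """Scale values linearly onto the range 0 to 1 using their own min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw values; 5.0 and 10.0 appear in both sets.
train_raw = [0.0, 5.0, 10.0, 20.0]
predict_raw = [5.0, 10.0, 25.0]

train_scaled = min_max(train_raw)      # [0.0, 0.25, 0.5, 1.0]
predict_scaled = min_max(predict_raw)  # [0.0, 0.25, 1.0]
```

Here a raw value of 10.0 is 0.5 in the training download but 0.25 in the prediction download, so a model fitted on the first scale would misread the second.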

So, if normalization spanning multiple dates is what you want, then it can be costly in terms of API credits to keep normalization consistent unless we add some way of remembering the stats.

The “Factor List” is a component of our upcoming, built-in AI factors, where you will not need to download anything, and we wanted to kickstart some discussions.

jlittleton · December 26, 2023, 11:55pm

In another thread I did an algebraic proof to show that with just one week of overlap zscores can be scaled to match each other. It will also apply across all of the zscores. This makes the very important assumption that the zscores were calculated using the entire dataset!
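The proof itself is in the other thread and is not reproduced here. One way the algebra could go, assuming download A uses a single mean and std $(\mu_A, \sigma_A)$ for every week, download B uses $(\mu_B, \sigma_B)$, and no values are trimmed or clipped: for the same raw value $x$,

$$
z_A = \frac{x - \mu_A}{\sigma_A}, \qquad z_B = \frac{x - \mu_B}{\sigma_B}.
$$

Solving the second equation for $x$ gives $x = \mu_B + \sigma_B z_B$, and substituting into the first gives

$$
z_A = \frac{\sigma_B}{\sigma_A}\, z_B + \frac{\mu_B - \mu_A}{\sigma_A}.
$$

So the two sets of z-scores differ by a slope and an intercept that do not depend on the week. Any two stocks in the overlapping week with different factor values are enough to solve for both, and the same pair of coefficients then applies to every other week of download B. The coefficients are different for each factor.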

Unless my proof has a flaw, I think this is an acceptable solution as the added cost of rebalance is fairly low with the new tool doing total points across dates and not using API credits per date.

pitmaster · December 28, 2023, 1:52pm

Hi,
Is there any way to normalise raw data using Rank but with a scope, e.g., Sector?
There is no option to select the scope for normalisation (all, sector, industry, etc.).

marco · December 28, 2023, 5:02pm

The ability to set the scope in the front end is a future enhancement. For Rank normalization you can just transform the data with FRank. For example, if your factor is Pr2BookQ, then rewrite it like this:

FRank("Pr2BookQ", #sector)

Notes
The sort parameter (#asc or #desc) is not really necessary for machine learning

For scoping z-scores that span multiple dates this workaround probably won’t give you the results you want

pitmaster · December 28, 2023, 7:58pm

I tried to follow your advice @marco.

My ratio of interest is:

Eval(FRank(GMgn%TTM ,#sector, #desc, #ExclNA)=NA, 50, FRank(GMgn%TTM, #sector, #desc, #ExclNA))

This is my target data I want to download to csv file.

My setup is: 1st tab: ‘Skip Normalization’ is OFF and in 2nd tab ‘Normalization’ is OFF.
I received this message:

Download preparation failed.
Invalid formula Eval(FRank(GMgn%TTM ,#sector, #desc, #ExclNA)=NA, 50, FRank(GMgn%TTM, #sector, #desc, #ExclNA)): A data license is required for this operation.

The only way I can run this download is to have in 1st tab: ‘Skip Normalization’ is OFF and in 2nd tab ‘Normalization’ is ON (Rank).

Then the final data to be downloaded to csv file is:

FRank("Eval(FRank(GMgn%TTM, #sector, #desc, #ExclNA)=NA, 50, FRank(GMgn%TTM, #sector, #desc, #ExclNA))", #all, #DESC)

Quite convoluted, but it works. Maybe there is a simpler approach I’m not aware of.

Another suggestion is to remove the constraint of 100 factors per download (increase it to 1,000 or so). Sometimes a user wants to download many factors for a small universe. If I have 1,000 factors I need to prepare a download 10 times.

jlittleton · December 29, 2023, 4:51am

I just completed my downloads for zscores to match the API downloads I did for rank. Overall I really like the implementation!

A few comments and questions:

I have 125-ish factors and I had to split them into 6 sections to download 20 years at a time for my universe. I could have split into more time periods instead, but since I need overlap to scale my normalization I went this way. It would be nice if there was a better way to manage this, or a more efficient download format to allow larger data downloads.
Will the download expand to include ETFs?
When I try to download Macro factors I am getting this error:

[Two screenshots of the error; images not preserved.]

I have tried a few dates and a range of dates.

aschiff · December 29, 2023, 4:37pm

That’s one way to do it. As you’ve seen, a limitation in the backend support for this requires FRank/ZScore to be the topmost element or it will complain.

If your normalization is Rank, FRank("GMgn%TTM", #Sector, #DESC, #ExclNA) should be enough to do this. If it’s N/A Handling Middle, it will place them in the middle for you. Otherwise, one could fill N/A values with 0.5 after the fact. And of course if you want 0-100, the output can be multiplied out.
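The after-the-fact fill and rescale could look like the sketch below. It assumes the downloaded ranks are on a 0 to 1 scale and that N/A values arrive as `None` or NaN after loading the CSV; both are assumptions about the file, and `fill_and_scale` is a hypothetical helper.

```python
import math

def fill_and_scale(ranks, fill=0.5, scale=100.0):
    """Replace missing ranks (None or NaN) with `fill`, then multiply by `scale`."""
    out = []
    for r in ranks:
        missing = r is None or (isinstance(r, float) and math.isnan(r))
        out.append((fill if missing else r) * scale)
    return out

# Hypothetical downloaded ranks with two missing values.
ranks = [0.0, 0.25, None, float("nan"), 1.0]
scaled = fill_and_scale(ranks)   # [0.0, 25.0, 50.0, 50.0, 100.0]
```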

If Skip Normalization is still necessary or preferred, this formula can be shortened this way: FRank("IsNA(FRank(`GMgn%TTM`, #Sector, #DESC, #ExclNA), 50)", #All, #DESC).

marco · December 29, 2023, 5:28pm

@jlittleton ,

We upped the limits to 300 factors from 100. We’ll also up the total number of data points from 100M to 300M (requires a build so it will be the next release)
Adding ETFs should be easy enough. If others want it please chime in.
You are using Close_D in some series that do not support it, like ##RGDP (quarterly) and ##CPI (monthly). Be sure you know the frequency before using these. For example, you seem to be using ##UNRATE as a quarterly series, but it’s monthly. The easiest way to test them is in the screener; then double click on ##CPI to find the reference, which tells you the frequency of the data.

Thanks

PS you can also use the fundamental chart to see the macro series

jlittleton · December 30, 2023, 2:47am

This fixed the issue! As a note those came from the Predefined Macro factors. Might be good to update them to work right out of the “box” for the downloads.

korr123 · April 17, 2024, 4:08pm

@marco When will we see the release of the AI / ML work you've been plugging away on? I've read a lot about it over the last few years and even got an email about how exciting it is (as my subscription is to renew shortly). Yet, I have actually seen nothing.

Can you provide a realistic eta for this?

Thank you,

marco · April 17, 2024, 5:03pm

Hi, it's real, it's working. Did you see this? PREVIEW: Screenshots of upcoming AI Factors

We're testing it now. We are going to open it up soon (next week?) to about a dozen users since we have not yet purchased more hardware to support, for example, a validation study of 100 models all at once. We wanted to get a feel for real-world usage from a sample of users before deciding how much we need to scale.

Thanks