NEW 'Factor List' tool for downloading data for AI/ML


I was referring to the highest scoring answer. The one that has the recursive definitions for mean and variance.

But more importantly, I just realized that I don’t understand what ‘Download Factors’ is offering. I need a basic tutorial.

With Factor=AvgRec, Scaling=Z-Score, Scope=Entire Dataset, and ML Training End=12/16/2023, what range of data is used to compute the AvgRec mean and variance?

When Scope=By Date, the mean is simply the mean of all the AvgRec values for a particular date. Easy-peasy. I can look at the AvgRec for all the S&P 500 stocks today, compute the set's mean and variance, and then normalize to a Z-Score. The normalization for every period is independent.
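If that reading is right, Scope=By Date can be sketched in a few lines of pandas. This is just a toy illustration; the column names are my own, not the actual Download Factors schema:

```python
import pandas as pd

# Toy panel: one AvgRec value per (date, ticker). Column names are
# illustrative, not the actual Download Factors schema.
df = pd.DataFrame({
    "date":   ["2023-12-08"] * 3 + ["2023-12-15"] * 3,
    "ticker": ["AAA", "BBB", "CCC"] * 2,
    "avgrec": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Scope=By Date: each date gets its own mean and std, so every
# period's normalization is independent of every other period's.
g = df.groupby("date")["avgrec"]
df["z"] = (df["avgrec"] - g.transform("mean")) / g.transform("std")
```

Note that the two dates end up with identical z-score patterns even though the raw levels differ by 10x, which is exactly the "every period is independent" behavior.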

Maybe someone with more experience w/ what ML tools need can help me out.


Thank you. I will have to think about some of that. I am open to the idea that there may be some workarounds and I hope there are.

Marco’s idea of saving the means and standard deviations for later use will certainly be a nice addition.

In the meantime ranks are perfectly standardized each week. So 5 stars on what P123 has done already.

I will think about what you said as well as my own ideas for workarounds.



I’m having my doubts about the recursive stuff.

I need to get back to basics and understand what ‘Download Factors’ is doing.


I am assuming that the z-score with the entire dataset as the scope uses the same mean and std for every week when calculating the z-score. That is why I think you can use one overlapping week to scale one dataset to the other. If all weeks in each dataset share the same mean and std, then it does not matter how the data varies over time. You just need to scale the data so that the one week lines up. Then, since all weeks share the same scaling, you should have the same scaling across the datasets.

But I need to verify this; I have not had time yet. My plan is to download a few months for each dataset with one month of overlap, scale the second dataset using one week, and then check the other weeks to make sure the z-scores are identical.

I agree that if you use the z-score scope of By Date, this will not work at all. Also, my goal is not to recover the original mean and std, just to make the two datasets line up.
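Here is a numeric sketch of that verification plan, on synthetic data and assuming Scope=Entire Dataset really does mean one global mean/std per download:

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(50.0, 8.0, size=(10, 20))  # 10 weeks x 20 stocks, raw factor values

# Two overlapping downloads, each z-scored with its own entire-dataset stats
a = raw[:6]          # weeks 0-5
b = raw[4:]          # weeks 4-9 (weeks 4 and 5 overlap)
za = (a - a.mean()) / a.std()
zb = (b - b.mean()) / b.std()

# Fit the linear map zb -> za using ONE overlapping week (week 4)
slope, intercept = np.polyfit(zb[0], za[4], 1)
zb_scaled = slope * zb + intercept

# Check the OTHER overlapping week (week 5): the same map should line up
err = np.abs(zb_scaled[1] - za[5]).max()
```

Because both z-scores are linear functions of the same raw values, the map fitted on week 4 matches week 5 to within floating-point error. On real downloads with trimming or N/A handling, the match would only be approximate.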

EDIT: made a new thread so I don’t hijack this one any more than I already have…


I do not want to dominate this discussion. Walter may have a method. I am not sure I fully understood what he was saying. There might be methods none of us have thought of.

But also you put a lot of work in on this and I hope we can help prevent each other from making mistakes.

I asked ChatGPT 4.0 about this with an open-ended question that I hope was not too biased and was an accurate summary of all of our problems now: “Okay. Let me expand and get a clear answer from you about whether you would recommend anyone investing with this method. A person has 24 years of data. Suppose he has 30 factors and can get a z-score on each of those factors but does not have access to the raw data. They do a multivariate regression knowing about standardized regression. They want to then make predictions based on the z-score but have no knowledge of the mean and standard deviation that was used in the calculation of the z-score in the training data. AND it costs money to repeat the process with 24 years of data at each weekly rebalance. What method do you recommend, if any, to get a z-score that can be used for prediction?”

A: “In such a scenario, without knowing the mean and standard deviation used to calculate the z-scores in the training data, accurately predicting future values becomes highly uncertain. It’s generally not recommended to proceed with investing based on this method unless the original scaling parameters can be obtained or reliably estimated, as the lack of this crucial information significantly undermines the model’s reliability and potential investment decisions.”

I then tried to lead it toward your solution for estimating “the scaling parameters”—again, hopefully without too much bias in my question.

Q: “Do you think overlapping the training data and the data to be predicted by a week or more would be beneficial?”

A: “Overlapping the training data and prediction data by a week might help in aligning the z-scores to some extent, but it’s not a guaranteed solution, especially if the statistical properties of the data change over time. It’s a method that could potentially be beneficial, but it comes with uncertainties and should be approached with caution.”

Maybe it is wrong for me to just copy and paste ChatGPT, but it does write pretty well and I was already committed to its answer.

I would have just said remember:

  1. That the z-score is calculated independently for each factor

  2. One week of data is not a very good sample. And a year might not be very good for some markets either (e.g., 2008)

  3. The mean and standard deviation get recalculated each time you do this with new data WITH NO MEMORY OF ANY PREVIOUS CALCULATIONS.

BTW. P123 is probably being very careful to make it so you can never calculate, know or accurately estimate the mean and standard deviation. Otherwise, freshman high school algebra would allow you to calculate with 100% accuracy (or high accuracy in the case of a great estimate) the raw values for all of the data on P123’s site.
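To illustrate that algebra point with made-up numbers: if the mean and standard deviation were ever exposed, undoing the z-score is a one-liner.

```python
import numpy as np

raw = np.array([12.0, 15.0, 9.0, 21.0, 18.0])   # made-up raw factor values
mu, sigma = raw.mean(), raw.std()
z = (raw - mu) / sigma                # what the download would hand you

# If the mean and std ever leaked, inverting is one line of algebra:
recovered = z * sigma + mu
```

The recovered values equal the raw values exactly, which is presumably why the stats must stay hidden.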

I think that is part of the problem you are having. I respect P123’s need to make it hard on you to calculate the raw values while working to make it possible to “decode” the data internally and use it (with P123’s help). They will never, knowingly, give you a way to decode it on your own unless their contract with FactSet changes.

Now, at least I understand why they have to save the mean and standard deviation internally and not give that to us. It does seem P123 understands the difficulties and has considered some solutions.

I am long. But this is not a trivial thing for us if we want to use it to invest money. We probably need to get it right. And I know P123 has spent a great deal of time and effort working on this without accidentally violating its contract with FactSet.



For those that want to use Z-Scores calculated with means and stddevs that span multiple dates:

Yes, it will be expensive in terms of API credits because you have to download the entire period using the same ML Training End Date so that the latest data uses the same distribution stats.

Yes, we’re not giving you the mean & stddev because it would make it trivial to get the raw data, thereby violating our vendor terms (technical data is OK)

We might offer a way to remember the statistics for each feature so that you only need to download the latest data. But it’s not a solution for all cases, and probably a waste of time. What if you are doing k-fold cross validation for example?

Therefore, just use Min/Max with a very small Trim% and lots of digits of precision, and do the Z-Score yourself. You will have total control.
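A sketch of that suggestion, assuming the Min/Max download is a plain linear rescale to [0, 1] with negligible trimming: because z-scores are invariant under positive linear maps, z-scoring the min/max output yourself gives the same z-scores you would have gotten from the raw data, and you can save the stats for reuse.

```python
import numpy as np

raw = np.array([3.0, 7.0, 11.0, 19.0, 15.0])    # made-up raw factor values

# Roughly what a Min/Max download with Trim% ~ 0 gives you: values in [0, 1]
mm = (raw - raw.min()) / (raw.max() - raw.min())

# Z-score it yourself, and keep mu/sigma so later downloads can be
# normalized with the SAME stats instead of re-downloading history
mu, sigma = mm.mean(), mm.std()
z_self = (mm - mu) / sigma

# Min/max is a positive linear map, so these match z-scores of the raw data
z_raw = (raw - raw.mean()) / raw.std()
```

With any nonzero Trim% the equivalence is only approximate, since trimming clips the tails before scaling.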

Lastly the FactSet license is only $12K/year and you can download raw data.

@marco, thank you. This requires all of the data, and you are thinking you can use it for k-fold validation, I think.

It works for that I believe.

It still requires a full download of all of the data for every rebalance to keep the linear scaling coefficient constant, if I am correct. Almost no one is going to do that I would guess.

You might need to look at ease (and cost) of rebalance as a separate issue if you want this to take off.

Sorry guys, I said something stupid. There’s currently no way to do a consistent normalization (using the same distribution stats) for the prediction data without downloading more and more data. The prediction data will not have the same min/max values as the training data, but it will still range from 0 to 1, which is incorrect.

So, if normalization spanning multiple dates is what you want, then it can be costly in terms of API credits to keep normalization consistent unless we add some way of remembering the stats.
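A toy demonstration of the inconsistency Marco describes: prediction data scaled by its own min/max still lands in [0, 1], but the same normalized value no longer corresponds to the same raw value.

```python
import numpy as np

train = np.array([10.0, 20.0, 30.0, 40.0])      # training-period raw values (toy)
pred  = np.array([25.0, 35.0, 45.0, 55.0])      # later prediction-period values

train_mm = (train - train.min()) / (train.max() - train.min())
pred_mm  = (pred - pred.min())  / (pred.max() - pred.min())

# pred_mm still spans [0, 1], but 0.0 now means raw 25 instead of raw 10.
# Scaling pred with the TRAINING min/max is what consistency would require:
consistent = (pred - train.min()) / (train.max() - train.min())
same_scale = np.allclose(consistent, pred_mm)    # the scales disagree
```

A model trained on `train_mm` would silently misread `pred_mm`, which is why remembering the stats (or re-downloading the full history) matters.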

The “Factor List” is a component of our upcoming, built-in AI factors, where you will not need to download anything, and we wanted to kick-start some discussions.


In another thread I did an algebraic proof showing that with just one week of overlap, z-scores can be scaled to match each other. It also applies across all of the z-scores. This makes the very important assumption that the z-scores were calculated using the entire dataset!

Unless my proof has a flaw, I think this is an acceptable solution, as the added cost of rebalance is fairly low with the new tool charging for total data points across dates rather than API credits per date.
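For reference, the argument can be written out in a couple of lines (my own restatement, not the original proof). If datasets A and B are each z-scored with their own entire-dataset stats, then for the same underlying raw value $x$:

```latex
z_A = \frac{x - \mu_A}{\sigma_A}, \qquad
z_B = \frac{x - \mu_B}{\sigma_B}
\quad\Longrightarrow\quad
z_A = \underbrace{\frac{\sigma_B}{\sigma_A}}_{a}\, z_B
    + \underbrace{\frac{\mu_B - \mu_A}{\sigma_A}}_{b}
```

Since $a$ and $b$ are constants (one global mean/std per dataset), a single overlapping week pins them down and the same map applies to every other week. Under Scope=By Date the stats change weekly, so no such constant map exists.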

Is there any way to normalise raw data using Rank but with a scope, e.g., Sector?
There is no option to select the scope for normalisation (all, sector, industry, etc.).

The ability to set the scope in the front end is a future enhancement. For Rank normalization you can just transform the data with FRank. For example, if your factor is Pr2BookQ then rewrite it like this:

FRank("Pr2BookQ", #Sector, #DESC)
The sort parameter (#asc or #desc) is not really necessary for machine learning

For scoping z-scores that span multiple dates this workaround probably won’t give you the results you want

I tried to follow your advice @marco.

My ratio of interest is:

Eval(FRank(GMgn%TTM ,#sector, #desc, #ExclNA)=NA, 50, FRank(GMgn%TTM, #sector, #desc, #ExclNA))

This is my target data I want to download to csv file.

My setup is: 1st tab: ‘Skip Normalization’ is OFF and in 2nd tab ‘Normalization’ is OFF.
I have received this information.

Download preparation failed.
Invalid formula Eval(FRank(GMgn%TTM ,#sector, #desc, #ExclNA)=NA, 50, FRank(GMgn%TTM, #sector, #desc, #ExclNA)): A data license is required for this operation.

The only way I can run this download is to have in 1st tab: ‘Skip Normalization’ is OFF and in 2nd tab ‘Normalization’ is ON (Rank).

Then the final data to be downloaded to csv file is:

FRank(“Eval(FRank(GMgn%TTM ,#sector, #desc, #ExclNA)=NA, 50, FRank(GMgn%TTM, #sector, #desc, #ExclNA))” , #all, #DESC)

Quite convoluted but it works :slight_smile: Maybe there is a simple approach I’m not aware of.

Another suggestion is to remove the constraint of 100 factors per download (increase it to 1,000 or so). Sometimes a user wants to download many factors for a small universe. If I have 1,000 factors, I need to prepare the download 10 times.

I just completed my downloads for zscores to match the API downloads I did for rank. Overall I really like the implementation!

A few comments and questions:

  1. I have 125-ish factors and I had to split them into 6 sections to download 20 years at a time for my universe. I could have split into more time periods instead, but since I need overlap to scale my normalization I went this way. It would be nice if there were a better way to manage this, or a more efficient download format to allow larger data downloads
  2. Will the download expand to include ETFs?
  3. When I try to download Macro factors I am getting this error:

    I have tried a few dates and a period of dates

That’s one way to do it. As you’ve seen, a limitation in the backend support for this requires FRank/ZScore to be the topmost element or it will complain.

If your normalization is Rank, FRank("GMgn%TTM", #Sector, #DESC, #ExclNA) should be enough to do this. If N/A Handling is set to Middle, it will place them in the middle for you. Otherwise, you could fill N/A values with 0.5 after the fact. And of course, if you want 0–100, the output can be multiplied out.

If Skip Normalization is still necessary or preferred, this formula can be shortened this way: FRank("IsNA(FRank(`GMgn%TTM`, #Sector, #DESC, #ExclNA), 50)", #All, #DESC).

@jlittleton ,

  1. We upped the limits to 300 factors from 100. We’ll also up the total number of data points from 100M to 300M (requires a build so it will be the next release)
  2. Adding ETFs should be easy enough. If others want it please chime in.
  3. You are using Close_D in some series that do not support it, like ##RGDP (quarterly) and ##CPI (monthly). Be sure you know the frequency before using these. For example, you seem to be using ##UNRATE as a quarterly series, but it’s monthly. The easiest way to test them is in the screener; double-click on ##CPI to find the reference, which tells you the frequency of the data.


PS you can also use the fundamental chart to see the macro series


This fixed the issue! As a note, those came from the Predefined Macro factors. Might be good to update them to work right out of the box for the downloads.

@marco When will we see the release of the AI/ML work you've been plugging away on? I've read a lot about it over the last few years and even got an email about how exciting it is (as my subscription is up for renewal shortly). Yet I have actually seen nothing.

Can you provide a realistic ETA for this?

Thank you,

Hi, it's real, it's working. Did you see this? PREVIEW: Screenshots of upcoming AI Factors

We're testing it now. We are going to open it up soon (next week?) to about a dozen users since we have not yet purchased more hardware to support, for example, a validation study of 100 models all at once. We wanted to get a feel of real world usage of a sample of users before deciding how much we need to scale.



Marco, I just caught up on your post. Thanks a lot for sharing those screenshots; they were super informative. I'm glad this is real and I would like to be invited to test it out.

I did notice, though, that we're missing some key statistical data needed for analysis. Specifically, we need the t-stat of the signals in ranking (alpha / standard error of alpha) to assess their significance. While the other tools are great, they don't pass this most basic test. It's crucial to determine the statistical significance of our alpha predictions, and API usage doesn't enable us to solve this on our own; the P123 design prevents it.
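For what it's worth, the t-stat being asked for is straightforward once you have a bucket's return series. Here is a toy sketch with synthetic weekly returns and plain OLS in numpy; all the numbers are made up:

```python
import numpy as np

# Toy data: 5 years of weekly benchmark returns and a ranking bucket's
# returns with a built-in weekly alpha of 0.0015 (all numbers made up)
rng = np.random.default_rng(1)
bench = rng.normal(0.001, 0.02, 260)
bucket = 0.0015 + 1.1 * bench + rng.normal(0.0, 0.01, 260)

# OLS of bucket on benchmark: the intercept is the weekly alpha
X = np.column_stack([np.ones_like(bench), bench])
coef = np.linalg.lstsq(X, bucket, rcond=None)[0]
alpha, beta = coef

# Standard error of the intercept, then t-stat = alpha / se(alpha)
resid = bucket - X @ coef
s2 = resid @ resid / (len(bucket) - X.shape[1])
cov = s2 * np.linalg.inv(X.T @ X)
t_alpha = alpha / np.sqrt(cov[0, 0])
```

The point of the request is that P123 has the per-bucket return series to do this server-side, whereas users can't reconstruct it from the normalized downloads alone.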

The second thing we cannot do, due to P123 design limitations, is correlation matrices. If a user like me wants to create their own risk model without any help from P123, they cannot. It's such a simple addition, yet years pass; even the simplest ETFs, like USMV, use these matrices, and here we are without such data.
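The correlation-matrix step itself is trivial once return series are available; here's a sketch with toy data (tickers and numbers invented) showing the kind of homemade risk-model calculation being asked for:

```python
import numpy as np
import pandas as pd

# Toy weekly returns for four tickers (names and numbers made up)
rng = np.random.default_rng(2)
rets = pd.DataFrame(rng.normal(0.0, 0.02, size=(104, 4)),
                    columns=["AAA", "BBB", "CCC", "DDD"])

corr = rets.corr()   # the correlation matrix in question
cov = rets.cov()     # covariance, the input to a risk model

# Unconstrained minimum-variance weights (sum to 1), the kind of step a
# homemade USMV-style risk model would build on
ones = np.ones(len(cov))
inv = np.linalg.inv(cov.values)
w = inv @ ones / (ones @ inv @ ones)
```

Again, the missing piece is the data, not the math: without raw or consistently normalized return series, the inputs to `corr` can't be assembled from the platform.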

Making decisions with less data is not smarter.

There is a concern with using t- or F-statistics for hypothesis testing, since they rely on the normality assumption and stock returns are nothing like normal. I'd rather use non-parametric tests. Yet the whole concept of testing is questionable, since return distributions are highly non-stationary and samples are not independent.
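As one concrete non-parametric option, a sign test on weekly active returns drops the normality assumption entirely. This is a toy sketch; it still assumes independent observations, which, as noted above, is itself questionable for market data:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(3)
weekly_alpha = rng.normal(0.001, 0.01, 260)   # toy weekly active returns

# Sign test: under H0 (median active return = 0) the count of positive
# weeks is Binomial(n, 1/2) -- no normality assumption on returns
x = weekly_alpha[weekly_alpha != 0.0]
n, k = len(x), int((x > 0.0).sum())

# Exact two-sided p-value from the binomial tail
k_min = min(k, n - k)
tail = sum(comb(n, i) for i in range(k_min + 1)) / 2 ** n
p_value = min(1.0, 2.0 * tail)
```

The rank-based Wilcoxon signed-rank test is a more powerful alternative in the same spirit, but neither fixes the non-stationarity problem.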