AI Factor Dataset Download Parity

Hey. I have downloaded my datasets for AI factors a few times from different factors. All have the same date range, normalization, filters, target, etc., but each download differs. Is there a reason for this?

Is there a guide on how I can replicate the results I get on the P123 website in a local environment using the dataset download? Right now I am spending a good bit of time on this type of research, but there seems to be little to no parity.

@marco @judgetrade @AlgoMan

Have you done the downloads the same day or over several days?

Over multiple months (due to the large number of factors and the large universe). I have each of them as a separate Parquet file and have to join them.

Can you describe what's different between the downloads? For example, are you seeing different tickers, different values, or different row counts, and roughly how large are the differences? Are we talking about minor rounding variation or something more significant?

The primary issue is missing tickers, even though none of the filters have changed. Steps to reproduce:

  1. Create an AI factor, download its dataset.

  2. Create a second AI factor with different features but the same universe and filters.

  3. Train the second AI factor and download the predictions.

  4. Intersect their tickers (the symbol names in the predictions vs. the symbol names in the downloaded dataset) → they do not match, and neither set is a subset or superset of the other → I take this to mean that the universes are not the same.

    I would download the second AI factor's dataset as well, but I’m out of credits lol, as I have been trying to figure this out for a while.
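The ticker comparison in step 4 could be sketched roughly as below, date by date, so you can see which weeks diverge and by which symbols. This is only a minimal sketch in plain Python: `compare_universes` is a hypothetical helper, the sample rows are made up, and it assumes you have already pulled `(date, ticker)` pairs out of the downloaded dataset and the predictions file (the actual column names in the P123 downloads are not assumed here).

```python
from collections import defaultdict


def compare_universes(rows_a, rows_b):
    """Compare two collections of (date, ticker) pairs, date by date.

    Returns a dict keyed by date, listing the tickers that appear in only
    one of the two inputs on that date. Dates with identical ticker sets
    are left out of the report.
    """
    by_date_a, by_date_b = defaultdict(set), defaultdict(set)
    for date, ticker in rows_a:
        by_date_a[date].add(ticker)
    for date, ticker in rows_b:
        by_date_b[date].add(ticker)

    report = {}
    for date in sorted(set(by_date_a) | set(by_date_b)):
        a = by_date_a.get(date, set())
        b = by_date_b.get(date, set())
        only_a, only_b = a - b, b - a
        if only_a or only_b:
            report[date] = {
                "only_in_a": sorted(only_a),
                "only_in_b": sorted(only_b),
            }
    return report


# Made-up sample data: the two sources agree on the first week
# and differ by one ticker each on the second week.
dataset_rows = [
    ("2024-01-06", "AAA"), ("2024-01-06", "BBB"),
    ("2024-01-13", "AAA"), ("2024-01-13", "BBB"),
]
prediction_rows = [
    ("2024-01-06", "AAA"), ("2024-01-06", "BBB"),
    ("2024-01-13", "AAA"), ("2024-01-13", "CCC"),
]

diff = compare_universes(dataset_rows, prediction_rows)
print(diff)
```

A per-date report like this makes it easier to tell whether the mismatch is a handful of stocks at the edge of the filters or something larger.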

When was the last AI Factor pair created?

I have seen this as well when the datasets are not downloaded on the same day.
I did a deep dive the other day and found that only one stock had entered the universe and another had left, because the market cap of one of the companies had been altered. That's all, but it had a massive butterfly effect over the years.
It adds a lot of extra data-handling work when it happens, which is annoying and time-consuming.
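One way to see why a single membership change can ripple through every row: with cross-sectional normalization, each stock's normalized value depends on everyone else in that week's universe. The sketch below uses a plain z-score on made-up numbers; it is an assumption for illustration only and not necessarily the normalization P123 applies (a rank-based scheme shifts in the same way).

```python
from statistics import mean, pstdev


def zscores(values):
    """Z-score each value against the cross section it belongs to."""
    xs = list(values.values())
    mu = mean(xs)
    sigma = pstdev(xs)
    return {name: (x - mu) / sigma for name, x in values.items()}


# Same week, two downloads. Raw values for AAA and BBB are identical;
# the only difference is that CCC left the universe and DDD entered.
week_v1 = {"AAA": 1.0, "BBB": 2.0, "CCC": 3.0}
week_v2 = {"AAA": 1.0, "BBB": 2.0, "DDD": 9.0}

z1 = zscores(week_v1)
z2 = zscores(week_v2)

for name in ("AAA", "BBB"):
    print(name, round(z1[name], 3), round(z2[name], 3))
```

In the first download BBB sits exactly at the cross-sectional mean; in the second it falls below the mean, even though its raw value never changed. Any feature or target normalized this way will differ for every stock in the affected weeks.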

Last 4 days all side by side.

Why would it matter if it's cross-sectional normalization by week? Any suggestions for getting better parity in local research, given this type of behavior?

I’ve encountered this as well, but for me it’s been stocks on the edge of my filters entering or exiting the universe. I assume there are some updates to the historical data.
