Dataminer fields for ML

Excluding ranking factors, would anyone care to comment on what other fields they download for ML? I’m looking at:

Additional Data:
- stockID
- asofdate
- 1WkFutureRet: Future%Chg(5)
- 4WkFutureRet: Future%Chg(20)
- 13WkFutureRet: Future%Chg(60)
- AV20D: Vol10DAvg
- ADT20D: AvgDailyTot(20)
- UST10Y: Close(0,##UST10YR)
- UST3M: Close(0,##UST1MO)
- ClaimsNew: Close(0,##CLAIMSNEW)
- SpreadPct: 100*LoopAvg("Spread(CTR)/Close(CTR)",5)
- VolSlip: pow(0.25

Industry code would be helpful, but it’s not directly available. Any ideas on mapping stockID to Industry code would be super helpful.

This is good food for thought. I’m also pulling down several of the volatility factors (TRSD1YD, TRSD30D, etc). Your post inspired me to see if I can include the VIX as well, though it seems inefficient to have global series like VIX or US10YR repeated across the entire stockID-date index.

Very true. They should be broken out into a separate d/l and then merged back when necessary.
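Here is a minimal pandas sketch of that split-and-merge idea. It assumes the per-stock download and the macro download are already loaded as DataFrames; the column names (`FutureRet1W`, `UST10Y`, `VIX`) and all values are made-up placeholders, not actual DataMiner output.

```python
import pandas as pd

# Per-stock panel: one row per (stockID, asofdate). Values are made up.
stocks = pd.DataFrame({
    "stockID": [101, 102, 101, 102],
    "asofdate": pd.to_datetime(
        ["2023-01-06", "2023-01-06", "2023-01-13", "2023-01-13"]
    ),
    "FutureRet1W": [0.8, -1.2, 0.3, 2.1],
})

# Macro series: one row per asofdate only, downloaded once.
macro = pd.DataFrame({
    "asofdate": pd.to_datetime(["2023-01-06", "2023-01-13"]),
    "UST10Y": [3.55, 3.49],
    "VIX": [21.1, 18.4],
})

# A left join broadcasts each date's macro values onto every stock row.
# validate="many_to_one" raises if the macro table has duplicate dates.
merged = stocks.merge(macro, on="asofdate", how="left", validate="many_to_one")
print(merged)
```

The macro table stays small on disk (one row per date), and the repetition across stockIDs only exists in memory after the merge.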

I suspect there could be synergy between macroeconomic indicators and stock industry returns. I ran into that recently when a model wanted to buy home builders while the US10Y was peaking. Since the model is macro blind, it didn’t know any better.

You can get an industry rank, but the codes are not constant, so mapping them would take a fairly large amount of effort.

You mentioned it in another post, but you can create universes for each industry and download them separately. I am not personally doing this though, as it would potentially use a lot more credits over a 10-20 year period: part of each credit goes unused, because a request with 1 data point uses the same number of credits as one with 24,999 data points. But if you are credit rich, why not! Maybe when I have more credits I will download just the universe data for the sectors, which I can use as a “mask” with my other data.
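A rough sketch of how the per-sector universe downloads could be turned into a label and a mask with pandas. The sector names, column names, and values are hypothetical, and it assumes each universe download yields at least `stockID` and `asofdate`.

```python
import pandas as pd

# Hypothetical per-sector universe downloads: just the membership rows.
sector_frames = {
    "ENERGY": pd.DataFrame({"stockID": [101], "asofdate": ["2023-01-06"]}),
    "TECH": pd.DataFrame(
        {"stockID": [102, 103], "asofdate": ["2023-01-06", "2023-01-06"]}
    ),
}

# Stack them into one lookup table with a sector label column.
membership = pd.concat(
    [df.assign(sector=name) for name, df in sector_frames.items()],
    ignore_index=True,
)

# Main factor download (made-up values).
factors = pd.DataFrame({
    "stockID": [101, 102, 103, 104],
    "asofdate": ["2023-01-06"] * 4,
    "TRSD30D": [22.0, 35.5, 41.2, 18.9],
})

# Attach the label; stocks in no downloaded sector get NaN.
labeled = factors.merge(membership, on=["stockID", "asofdate"], how="left")

# Use the label as a boolean mask to select one sector.
tech_only = labeled[labeled["sector"] == "TECH"]
print(tech_only)
```

Joining on both `stockID` and `asofdate` means a stock that changes sector over time picks up the label that was current on each date.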

Otherwise I think your extras download list covers everything I thought of as initially useful. I did also download open(5) and close(0) for all stocks in the US primary universe so I can construct price histories, but that has been sidelined because I need to do some data wrangling to get a single-ticker history out of it.
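For the wrangling step, one possible approach is a pandas pivot from the long stockID-date panel to a wide table with one column per stock. This is only a sketch: the column name `close` and the prices are made up for illustration.

```python
import pandas as pd

# Long panel as downloaded: one row per (stockID, asofdate). Made-up prices.
panel = pd.DataFrame({
    "stockID": [101, 102, 101, 102, 101],
    "asofdate": pd.to_datetime(
        ["2023-01-06", "2023-01-06", "2023-01-13", "2023-01-13", "2023-01-20"]
    ),
    "close": [10.0, 50.0, 10.5, 49.0, 11.0],
})

# Wide table: dates as the index, one column per stockID.
# Missing (stock, date) pairs become NaN.
wide = panel.pivot(index="asofdate", columns="stockID", values="close").sort_index()

# Price history of a single stock, with gaps dropped.
history_101 = wide[101].dropna()
print(history_101)
```

Note that `pivot` raises an error if the same (asofdate, stockID) pair appears twice, so overlapping downloads would need de-duplicating first.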

Thanks for reminding me about the credit schedule. I may have to rethink this approach.

It will be interesting to see what (if any) changes in the API credits happen along with the initial implementation of the first AI/ML.

It does seem that DataMiner and the API will not always be necessary going forward. I am not a programmer, but how much does a data download cost? Again, I am not a programmer, but maybe there are fewer bits in the P123 downloads than in the last movie I watched on Netflix. Quite a bit fewer than that 4K movie with Dolby Surround. Not to mention the multiple-season 4K series (e.g., Ozark). Isn’t the array already produced (and in memory cache) when we do a rank performance test (at no additional cost now)?

Anyway, for the last time here, I am not a programmer. But it seems the cost structure could be changed if it makes business sense, including that supply and demand curve they made me learn in economics 101. Happy to have what is available now, but it will be interesting to see if increased volume brings down the cost (with some of the costs being sunk or fixed costs). It would seem that it should work no matter the cost structure (totally supportive).

Also I wonder what the average or median Kaggle competitor looks like. Young and just out of school, or in school trying to become known, with a MacBook Pro and drinking a lot of energy drinks is my stereotypical image. But more pertinent, I wonder what their median income is.


I wonder if the new download tools Marco mentioned really need to wait for the ML deployment.

From the lack of p123 forum participation, I’m guessing they’re busy with the release.

I thought he implied more like week(s) than months.

I’ve worked on many engineering projects and the effort required to finally close out a design is enormous. My rule of thumb was: reasonably account for the time you think is required and then double it!

Correct. And honestly just a question. I work in a monopoly (medicine, with controlled admissions to medical school). Certificates of need are required for expansion of physical facilities (often depending on donations to politicians for approval). So, I own a business but not one that depends on supply and demand… Medications are a monopoly until they become generic.

I joked that Xalatan (an eyedrop) was more expensive per ounce than silver when it was released as a new revolutionary glaucoma treatment. It is $8 for a 2.5 ounce bottle now that it is generic, with competition from other medications with the same mechanism of action. The cost of development is not reflected in the price now.

But once you are done with the project, doesn’t that become a fixed or sunk cost having little to do with the cost structure when there is competition? Maybe that is in the weeds and you probably were not directly involved in this. I am just interested, without commenting on (or knowing) what P123 will do with API credits.

I do think it is not an easy thing and I wish P123 the best in considering this.


The issue of buying credits, either for so-called resources or API calls, has always perturbed me a bit. I feel like I’m paying a lot of money to start. It’s like flying first-class and then getting charged for a bag of peanuts.

Anyway, Happy Thanksgiving to all!

So great news and just in time for Thanksgiving! The API credit usage for data points seems to have been changed:
“Each 100,000 data points will count as 1 API request credit.”

Thank you Portfolio123!!!

Now to find more info to download hahaha (or just clean-up my old downloads)

Edit: I will note that this does not significantly help with the issue of one API credit per week of rank or universe data (at least in the case of the Python implementation). So if you want to download per industry or sector you likely still need to think about credits, but you don’t need to be as picky, because the 100k points makes it easier to keep the API credit count per industry low.

Is it possible to remove the restriction of 100 factors per API request for data downloads?
For small universes (Germany or Poland, e.g., 200 stocks), 100k data points per request will never be reached.
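To put rough numbers on this, here is a back-of-the-envelope sketch in Python. The 100,000 points per credit and the 100-factor cap come from the posts above; the billing model itself (one request per date per batch of factors, each request rounded up to a whole credit) is my assumption, not confirmed P123 behavior, and `credits_needed` is a made-up helper.

```python
import math

POINTS_PER_CREDIT = 100_000    # from the announcement quoted above
MAX_FACTORS_PER_REQUEST = 100  # the limit asked about above


def credits_needed(n_stocks, n_factors, n_dates):
    """Estimate credits, ASSUMING one request per date per factor batch
    and that each request is rounded up to a whole credit."""
    batches = math.ceil(n_factors / MAX_FACTORS_PER_REQUEST)
    total = 0
    for b in range(batches):
        factors_in_batch = min(
            MAX_FACTORS_PER_REQUEST, n_factors - b * MAX_FACTORS_PER_REQUEST
        )
        points = n_stocks * factors_in_batch
        total += n_dates * math.ceil(points / POINTS_PER_CREDIT)
    return total


# 200 stocks x 100 factors = 20,000 points per request: 1 credit each,
# with 80% of the 100k allowance unused.
small = credits_needed(n_stocks=200, n_factors=300, n_dates=52)

# Without the factor cap, 200 x 300 = 60,000 points would fit in one request.
uncapped = 52 * math.ceil(200 * 300 / POINTS_PER_CREDIT)
print(small, uncapped)
```

Under those assumptions, a year of weekly data for 300 factors on a 200-stock universe costs three times as many credits with the cap as it would without it.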