NEW: Data Miner App & P123 API -- v1.0 (beta)

Quantonomics · May 20, 2020, 6:18pm

1. Request Limit
2020-05-20 14:16:22,460: API request failed: ERROR: You are limited to 200 requests per hour. Please wait until 1:57 PM (41 minutes from now) before making additional requests.

I thought it was increased to 5k per month? My ID is 37.

2. Universe

Seems like you have to write the full universe name now, it doesn’t recognize just the symbol anymore.

3. Usage
Also I’m not sure if the number of requests are calculated correctly. I says I made a usage of 224 units. Can you explain how these are calculated? 1 backtest of 10 deciles = 10 units? 1 backtest of 4 quartiles = 4 units? Shouldn’t it be 1 backtest = 1 unit? Otherwise it will increase super fast to the limit.

Thanks!

Jrinne · May 20, 2020, 6:24pm

So, I did not want to bring this up but P123 already knows this or should now. I thought I might clarify what dnevin123 is saying.

This data is not limited by the data provider and there are more than a few who are already getting this data using Web Scraping.

There are already people running Deep Nets and Random Forests off of P123 data. I helped develop one of the deep nets (but the data had already been scraped) so this is just a fact.

I do not have an economics degree but it seems P123 would have to make a relatively easy method available at a modest price-point to get a large market.

For example, there may be a limit to what dnevin123 is willing to pay for what he can get for free. But he looks interested in what could end up being an easier method with substantial data. Please, clarify if I am missing something.

From P123’s point of view it should consider extracting some revenue from the data it is already making available. And even attract those who do not want to Web Scrape (like me). I have already mentioned a label and a row index could help with regard to the last.

There seems to be pretty good interest judging form this thread.

Best,

Jim

marco · May 21, 2020, 3:33am

We are in the process of figuring this out At the moment there are two limits in play. They were added quickly to get API out asap, and protect the resources. They are : 1) a hard limit of 5K API requests/mo and 2) the existing “quant engine request/hour” that depends on your membership (ex. 200/h for Ultimate).

When you do a request on the website, like a rank, it only counts towards “quant engine request/hour”. When you do an API requests it checks both, the total per month and the “quant engine request/hour”.

The goal is to separate the limits for website and API. Also running the API should not interfere with your website requests (right now it does).

The “quant engine request/hour” was introduced long time ago to protect the website from too much scripting. It is however becoming an issue again with noticeable website slowdowns at times . So we expect to reduce the “quant engine request/hour” to further to protect the responsiveness of the website.

For scripters there will be a number of API requests included in your memberships for “normal” use, and two add-on plans for heavy use, and ludicrous use. Some of the numbers we kicked around is 500 requests/mo included, 5000/mo for heavy use , and 50000/mo ludicrous. We have not decided on costs for heavy & ludicrous. Another option is to have a time limit since some requests can take 5 min , others a few seconds.

Let us know your thoughts. Thanks

philjoe · May 21, 2020, 12:36pm

I don’t see a use for the API unless its >50k per month (if not more). Anything less than 50k datapoints can be easily handled with excel, no need to get complicated.

Jrinne · May 21, 2020, 1:06pm

Philip knows at ton more than I do about this. So if I say something that differs I need to go back to my O’Reilly books and learn about Programming/Python. I appreciate any corrections and maybe I can refresh my memory on the difference between append and concatenate. Does Python use a zero indexing?

Whatever language one uses wouldn’t you always convert it to a CSV file and upload it? Who wants to always use Python on a windows machine?

Who likes C ++? I know many do. I am amazed by this but there are a lot of true experts in machine learning who use Macs. Probably the preferred. The administrator for the second best Machine learning method in the world (XGBoost second to TensorFlow) uses a MacBook Pro: https://discuss.xgboost.ai

And of course Mac or Windows some still like R.

But why run it on your own machine? I like Colab for TensorFlow. How do I upload it to that? How do I upload it with ANY of the above methods?

Excel.csv file. Honestly (probably because I am not a programmer) I do not get it.

Maybe Philip can explain it to me if he does not agree. If I am not saying what he just said.

I do get that it is really cool to Web Scrape it and then put it into a csv file without telling anyone you are doing it. That I truly get without kidding. It’s like you are a white-hat hacker or something. But I think it could get old when I really just need the csv file.

Best,

Jim

Quantonomics · May 21, 2020, 1:27pm

1 factor = iteration of 21 backtests of 20 buckets (default) + 1 benchmark = 21 requests

23 factors < 500 requests
238 factors < 5000 requests
2380 factors < 50,000 requests

I don’t know if you guys remember DanParquette’s file where he compiled backtests for many factors 10 years ago.
IIRC he was using 50 buckets. Back then there were no limits or the limits were much higher than they are right now.

1 factor = iteration of 51 backtests of 50 buckets (default) + 1 benchmark = 51 requests

9 factors < 500 requests
98 factors < 5000 requests
980 factors < 50,000 requests

My conclusion from this is to avoid using too many buckets with limited amount of requests. Maybe 10 buckets max.
I kinda agree that Data Miner should allow much higher requests than on the website or it defeats its purpose IMHO.
I hope the plans for Ludicrous are available ASAP before we lose access to S&P data to compare outputs with FactSet.

Other points I want to bring

Is the output saved anywhere as the Data Miner is working? What if it crashes or computer reboots, is the output lost?
Also allow some extra requests to “forgive” wrong setups or errors from the Data Miner if it fails / error out.

philjoe · May 21, 2020, 1:36pm

Is ithe data request cap due to a bottleneck in hardware? Or is S&P in the background placing caps?

marco · May 21, 2020, 1:44pm

It’s not 50k “datapoints”, it’s 50k API endpoint requests. So for example a bucket backtest is one request. We came up with 50k using the median time we see in current operations from the website which is 1min. So 50K 1min requests is basically 30+ days of uninterrupted use. But I think the API will be more used for simple , single or dual factor tests that are much faster (5-10sec) so 50k is not enough.

Also, S&P has been doing deals at around $10K. Not a crazy amount if you are doing a data mining project. They will continue to supply us data to service users that have license for at least 1 more year. That’s what they said two weeks ago, and are working on the contract. But they have changed their mind in the past. If they get more requests for licenses it will help finalize the deal.

marco · May 21, 2020, 1:46pm

philjoe, neither at this point. We always have caps when we launch something.

We once had “unlimited # simulations” and one user filled up our database.

philjoe · May 21, 2020, 2:03pm

Ahhh OK I see, perhaps you can differentiate between back tests based on how many factors.

Just to clarify, S&P will offer the 10k licenses for one more year and then thats it? Also, are you referring to a CapitalIQ license?

I’m still learning about how all of this works, do you store the API output on your database after we output it?

dnevin123 · May 21, 2020, 2:15pm

marco, is there not a way to throttle based on a % of available computational bandwidth instead of a fixed number of requests? The fixed number of requests is somewhat arbitrary and not necessarily “fair” (i.e. some requests will require significantly more computational bandwidth than others).

For example, the task I was attempting to complete was to grab the rank for a small number of stocks (2 to 3 on average) for a number of different transaction dates. For a single simulation, this quickly runs into hundreds of requests, but I would imagine (I might be wrong) that this uses far fewer resources than a single request for a Rolling BackTest. If both the Rolling Backtest and calculating the Ranks of a couple 100 stocks on a single day are treated as “1 Request”, then the usage metric being used seem a bit arbitrary. (I could be misunderstanding something here).

IMO, this is similar to the limit on the number of nodes in a Ranking System. Where I can’t have more than 200 nodes (or whatever it is) in a single ranking system, but I can have 5 or so ranking systems in a simulation that each have 199 nodes.

Anyway just wanted to poke the bear a bit, and see if there might be an alternative way of managing things. I completely understand that there has to be some way of sharing resources between users. However, I would much rather have to wait a long time to get my request fulfilled, then run up against a hard limit and not be able to make my request at all until the next month.

Thanks,

Daniel

philjoe · June 2, 2020, 8:31pm

Was this ever changed? I just ran marketcap data for 10 years on a weekly bases and now I’m done for the month…

The pull only ran for about 45 seconds, and resulted in a 9MB csv file.

marco · June 3, 2020, 12:59am

Sorry we did not get back to this. Several things kept us quite busy. We’ll come up with some plans for API & Miner very soon.

Also I hope to finalize something with S&P tomorrow for those that want to continue on with an S&P data & license. Anyone interested in this , please let me know as it will give us some leverage. Thanks

Quantonomics · June 3, 2020, 12:47pm

I would be interested in keeping S&P for another year of a few months. This will allow most issues to be ironed out and a smoother transition.

dnevin123 · June 3, 2020, 1:40pm

I concur, another few months with S&P will probably be necessary.

philjoe · July 8, 2020, 8:16pm

Can I ask about updates on this or is it too soon still?

philjoe · August 5, 2020, 12:31pm

Any updates on the restrictions on this feature?

yuvaltaylor · August 5, 2020, 2:52pm

We’ll be changing the restrictions and offering pricing quite soon. In the meantime, please share your experiences with the Data Miner. What do you like and what don’t you like?

philjoe · August 6, 2020, 1:36pm

The restrictions make it useless. I pulled one factor (say market cap) for 10 years and then it said I was done for my monthly usage.

Other than that making it useless, everything else was great.

Jrinne · August 6, 2020, 2:05pm

Yuval,

As it is, some of the web scrapers may begin to use the Data Miner App a little. It will remain a niche market BUT THERE IS A HUGE POTENTIAL HERE.

I am not as sophisticated as Philip as far as programming. But once the data has been manipulated (sliced, sorted, concatenated etc) using Python, I think I can hold my own as far as building a neural net, performing a ridge regression, boosting, using a random forest, etc.

So, I am trying to say that I defer to Philip and others as to whether the downloads are adequate and can be manipulated into a usable form.

But, unless Philip posts again and says that he loves the formats available, I think there is a lot of room for improvement.

Most important, a label is required. That would be returns over the rebalance period. If one is rebalancing weekly that would be the returns for the next week.

Ideally the output would have a usable index (e.g., a hierarchical index consisting of date and ticker).

But whatever index you use, I believe the best format for a download is an m x n matrix (or array) with column index, label (returns), and factors (P123 factors and functions).

Factors are just the factors (or functions) from P123 in a column with the ticker and date in a row (indexed).

This format would allow you to do bootstrapping, build a neural net, perform ridge regression, do a random forest, boosting etc, etc, etc,….WITH NO FURTHER MANIPULATION.

Anyway, as I said, if Philip or anyone else has a format that they prefer based on their programming skills then I think I can learn to manipulate their format as long as it has a label and a usable index.

But you cannot do anything without being able to manipulate the label into a usable format (generally an array, matrix or DataFrame). Even things considered unsupervised learning like principle component analysis and k-means clustering need the returns to construct the correlations (correlation matrix).

Thank you for your question.

Best,

Jim