NEW: Data Miner App & P123 API -- v1.0 (beta)

Looks like it’s in place now - I think I’ll go for the 500K option right away 🙂

Well, I bought 500k extra. The dashboard did not update - my limit is still stuck at 10k, but I am able to use the API again so I guess this is only a display bug.

I also notice that the API “price” for downloading ranks is now much lower - thanks!

The dashboard only reflects usage against the monthly quota for now; it will be changed to include add-on extra request credits as well.
However, you can see detailed API quota info (including purchased add-on statistics) in the DataMiner & API tab under account settings.

Marco, you said that volume factors & functions need not be restricted. It’s technical data and it does not come from FactSet.

But when using, for example, FHist("AvgVol(20)", -4) in the “additionalData” option field with the “ranks” API, I am getting this error:
A data license is required for this operation
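For reference, a rough sketch of the kind of call that triggers this (the client construction and the method/parameter names here are my best guesses from this thread and the p123api examples, not confirmed signatures):

```python
import p123api

# Sketch only: 'MyRankingSystem' and the credentials are placeholders,
# and rank_ranks / additionalData follow the naming used in this thread.
client = p123api.Client(api_id='YOUR_API_ID', api_key='YOUR_API_KEY')

result = client.rank_ranks({
    'rankingSystem': 'MyRankingSystem',
    'asOfDt': '2021-01-04',
    'universe': 'SP500',
    'additionalData': ['FHist("AvgVol(20)", -4)'],  # this line triggers the error
})
```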

Please fix this. Thank you.

We don’t allow FHist, that’s the problem. We could, but then we would need to look inside the quotes to make sure it’s not FactSet data.

It’s much easier (and clearer to use) for us to supply future functions/factors for volume, like we did with Future%Chg and FutureRel%Chg.

We could start off with FutureAvgVol(bars [,offset]) and FutureVol(bars).

So, for example (see the sketch after this list):

  1. To get the volume 10 trading days after the as-of date, you would use: FutureVol(10)

  2. To get the average volume for the next 10 trading days, you would use: FutureAvgVol(10)

  3. To get the average volume for 10 trading days after skipping the next month (21 bars), you would use: FutureAvgVol(10,21)
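A rough pandas sketch of the intended semantics, as one reading of the proposal (illustrative only, not P123 code):

```python
import pandas as pd

def future_vol(volume: pd.Series, asof: int, bars: int) -> float:
    """FutureVol(bars): the volume `bars` trading days after the as-of bar."""
    return volume.iloc[asof + bars]

def future_avg_vol(volume: pd.Series, asof: int, bars: int, offset: int = 0) -> float:
    """FutureAvgVol(bars, offset): average volume over `bars` trading days,
    starting `offset` bars after the as-of bar."""
    start = asof + 1 + offset
    return volume.iloc[start:start + bars].mean()

vol = pd.Series(range(100), dtype=float)           # dummy daily volume series
future_vol(vol, asof=10, bars=10)                  # FutureVol(10)
future_avg_vol(vol, asof=10, bars=10)              # FutureAvgVol(10)
future_avg_vol(vol, asof=10, bars=10, offset=21)   # FutureAvgVol(10,21)
```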

Will this work? We can add these relatively easily.

Thanks

Yes, it will work, but being able to get data with a negative FHist would be much better, since that would allow us to create many more labels, and we would not need a new Future function every time we want to access technical data.

1/ If a negative FHist is not possible, then it would be great to have functions to get future volatility, HighVal, HighValBar, LowVal, LowValBar, and correlation. Maybe allow negative values within these existing functions instead of adding new Future functions.

2/ The “Loop” functions such as “LoopMax” are capable of looking inside the formula and detecting whether it uses FactSet data or not. Why wouldn’t “FHist” be able to do the same?

I have tested LoopMax with technical data and it works. When using fundamental data, it returns “A data license is required for this operation”.

Thank you.

OK. We’ll see how feasible it is to allow FHist.

Warning to everyone using the Python P123 API. I purchased extra API requests, and during my data pull the P123 API threw a “not enough API requests left” error for no reason; I wasn’t even close to the limit. The result was that all the data I had pulled up to that point was lost, but I had still “spent” the API requests.

Hi Phil,

The only error we see on our end is “Connection reset by peer”. This usually means the connection between our server and your application was lost. Our API has automatic retries, but it cannot recover if the connection is shut down by the peer. Do you have any logs that show the “not enough API requests” error so we can dig deeper? Also, your application should save after every iteration and be able to resume from a failure, to avoid losing work.
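A minimal sketch of that save-and-resume pattern (pull_chunk, the file name, and the date list are placeholders, not part of the p123api library):

```python
import os
import pandas as pd

CHECKPOINT = 'pull_progress.csv'          # placeholder file name

def pull_chunk(date):
    # Stand-in for one real API request; replace with your p123api call.
    return pd.DataFrame({'date': [date], 'value': [0.0]})

dates = ['2020-01-31', '2020-02-28', '2020-03-31']   # example pull schedule

# Resume: skip dates already saved by a previous (interrupted) run.
done = set()
if os.path.exists(CHECKPOINT):
    done = set(pd.read_csv(CHECKPOINT)['date'])

for date in dates:
    if date in done:
        continue
    chunk = pull_chunk(date)              # may raise on a dropped connection
    chunk.to_csv(CHECKPOINT, mode='a', index=False,
                 header=not os.path.exists(CHECKPOINT))   # append; earlier work survives
```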

So at this point we can’t see issues on our end, but we’re happy to apply the 50K API credit you requested.

Thanks for your feedback.

NVM Phil. We do see a potential issue that caused “not enough API requests”. Sorry about that.

Marco - I don’t know what philjoe is actually trying to do, but he is an early adopter. If he has troubles then it is likely that others will as well. And if you want this to succeed then you need to be sympathetic to the needs of early adopters. So make sure you are not throwing your apps over a brick wall and telling customers to deal with it. For Big Data projects it seems reasonable that the driver “P123 API” (whatever) should take care of saving intermediate work. Or there should be a library of support software to deal with such things. Maybe if philjoe opens up about what he is attempting to accomplish then the community can chip in to help.

Steve

Steve, I was just using your Python script to do a big data pull. It was interrupted by an error when it should not have been. Because of the interruption, all of the pull was lost.

Marco, you’re right: in an ideal world, each iteration of the data pull would be saved to disk. But because I don’t have top-of-the-line equipment, that write process would take way too long.

philjoe - I am actually surprised that the s/w still works. I was under the impression that I had to make some changes to the driver. In any case, how much data are you trying to pull? Are you using Colab/Drive or some other configuration?

Marco - I can try to help out with this application but I would need to have a large amount of API resources to design/test an app that does intermediate storage. It sounds like it is something that is needed.

Steve

It was about 500mb of data in the end… I am not sure what you mean by Colab/Drive, but I am working with a Jupyter notebook on my regular desktop computer.

philjoe -

Is that 500 megabytes or megabits? I can never get it straight.

What is your date range and how many factors are you using?

I assume that you are using the ranking system API? I think that I need to redesign that code anyway to make it work with the latest P123 interface.

Steve, the p123api library has some level of retrying built in. But I’m not sure a thin API layer should have the complexity that you suggest.

The way I understand the ML process is: data collection, data cleansing/checking, training.

The DataMiner should be used for data collection. It’s a big step that can take a while, and bad things happen when pulls last too long. It can auto-save, and you can easily tell what failed so you can retry later. It generates clean CSV files that can be readily converted to pandas DataFrames.
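For example (the file and column names are placeholders for whatever DataMiner wrote):

```python
import pandas as pd

# Load a DataMiner CSV export into a DataFrame ready for an ML pipeline.
df = pd.read_csv('dataminer_output.csv')   # placeholder file name
df['Date'] = pd.to_datetime(df['Date'])    # assuming a 'Date' column exists
print(df.head())
```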

But to be honest, I have not had a chance to train a model yet so I might be missing something.

Done. Click on the Dashboard link to see them.

Marco,

P123 already does the data collection and data cleansing/checking. That is what P123 does, and it does it well.

Yuval is constantly posting about how factors are created—handling NAs, using fallbacks, etc. There are also frequent threads where errors from the data provider are corrected. I think this work is already done, and done VERY well.

Also, if you can run a sim on the data you can do machine learning with it.

That goes for a rank performance test, which is a very legitimate statistical-learning method. P123’s data, WITH NO FURTHER CLEANSING, does this just fine, and I never hear any complaints. But I can attest that boosting also does just fine with the data as it is.

The only additional thing you need is the excess returns in an array with date and ticker.
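In pandas terms, something like this (values and column names are purely illustrative):

```python
import pandas as pd

# The label array: one row per (date, ticker), with the excess return
# over the benchmark as the training target. Values are made up.
labels = pd.DataFrame({
    'date':          ['2020-01-31', '2020-01-31', '2020-02-28'],
    'ticker':        ['AAPL', 'MSFT', 'AAPL'],
    'excess_return': [0.012, -0.004, 0.021],
})
```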

Some scrapers interested in machine learning have emailed me in the past. They use Holding → Ranks → Include Composite & Factor Ranks to get almost every column heading they need for their machine learning. This works without any further cleansing or manipulation of the data.

Why not start with the sim data, as they have done and found useful? They are EXTREMELY intelligent and savvy people and may have something to offer. After a sim is created, why not use that data and put it into an array? You even have excess returns relative to the benchmark in the sim data, don’t you?

Me, I still prefer going to the sim for the data I need over the API: P123-provided CSV downloads, as I do not know how to scrape data.

Anyway, I think you may be making this too hard. Too hard on yourself and maybe members too. All the data (nicely cleansed data) is already there for download after a sim is completed.

Even if people had to split up the data—downloading a few years at a time—this would clearly be the easiest method, conceptually. And again, all the data needed is there after a sim is run.

I am ignorant regarding any programming issues but I would hope that there are easier ways to get access to the data that is already there after running a sim: easier for everyone including P123, perhaps.

If you do not take this thread to heart, then I suggest you see what Kaggle has to say. I think they will ask for one simple array, perhaps a text or CSV file. Nothing more. But whatever they ask for, my suggestion is to give P123 members the same thing, in the same format, in one simple download.

Put a box at the end of the sim to click on for the download (with the charge) in the format that Kaggle asks for if they get back to you.

Jim

Steve, that is megabytes.

philjoe - I am working on the assumption that you are using the original code that I gave out. I have updated it here: https://colab.research.google.com/drive/1uPpHnwqCdXRoPtIGsYEABXertD6kcD7a?usp=sharing

This new code will keep trying up to ten times to get data (across all dates). If it fails ten times, it will save what it managed to retrieve and stop. Honestly, if you get to that point, something else is wrong. You should never get 10 failures, at least in my opinion.

You will then have to do some manual steps in order to continue. You will need to (1) copy the file that was generated; (2) perform a new p123api pull with the dates adjusted; (3) combine the two files when finished (see the sketch below).
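Step (3) can be done with pandas, for example (file names are placeholders):

```python
import pandas as pd

# Combine the original partial pull with the follow-up pull.
part1 = pd.read_csv('pull_part1.csv')
part2 = pd.read_csv('pull_part2.csv')

combined = (pd.concat([part1, part2], ignore_index=True)
              .drop_duplicates())          # guard against overlapping dates
combined.to_csv('pull_combined.csv', index=False)
```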

I hope this helps. This is all I can do without further communication on your end.

Steve

OK, great, thanks.