Thanks Dan - I understand that this activity may place a strain on P123 systems. I’m hoping that the new API that P123 comes up with will concatenate all historical dates into one call. If that happens then API call usage is minimal and not a big concern.
Right now, I am evaluating 5 years of monthly frequency to see how that performs with the AI models and maybe that will be enough. But I am assuming weekly data is required…
The following usage is for model development. Once the model is developed and deployed then API calls will be minimal, like maybe once or twice per week per neural net. I am expecting to develop ~10 neural networks in total over time. Once the first NN is developed then I will be moving on to the next.
Per neural net:
Preferred - 52 weeks x 5 years = 260 API calls (weekly frequency)
Less desirable - 12 months x 5 years = 60 API calls (monthly frequency)
Inputs or rank nodes: 5-10
Target: 1
300 tickers
Note: The problem is that development is iterative - finding and testing inputs. I don’t know how many iterations it will take to settle on the exact inputs I want to use in the final model. I could structure it so that I guess at which inputs might be useful and collect many more inputs than I think I need, thus minimizing overall API-call usage. In that case, I would increase the number of input nodes to something substantially higher than 10.
And as I said, in the long run I hope to develop on the order of 10 NNs. I don’t know how long each will take. It could be spread out over months or if I burn the midnight oil it could be a few weeks.
Dan - I don’t know where the exact issues lie, whether P123 is paying third-party usage fees or whether it is a question of taxing the P123 servers. I just want to mention that Quandl dealt with the problem of servicing massive numbers of API calls by introducing a two-step process: the request for data is made, then some time later the data is available for download. Initially they were talking about a 12-hour wait period for the data to be ready, but eventually it turned out to be more like 30 seconds to a minute. Something like that would be OK for me for historical data - even 12 to 24 hours after making an API request (not for the current week, however).
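To make the two-step idea concrete, here is a minimal submit-then-poll sketch. The server class is a simple in-memory stand-in written for illustration only - none of these endpoint or method names come from P123 or Quandl:

```python
import time

class AsyncDataServer:
    """Mock server: a job becomes ready after a fixed number of status polls.
    Stand-in for a real service that prepares bulk historical data offline."""

    def __init__(self, polls_until_ready=3):
        self.polls_until_ready = polls_until_ready
        self.jobs = {}

    def submit(self, query):
        # Step 1: client submits a request and gets back a job id.
        job_id = len(self.jobs) + 1
        self.jobs[job_id] = {"query": query, "polls": 0}
        return job_id

    def status(self, job_id):
        # Each status check moves the mock job closer to "ready".
        job = self.jobs[job_id]
        job["polls"] += 1
        return "ready" if job["polls"] >= self.polls_until_ready else "pending"

    def download(self, job_id):
        # Step 2: once ready, the prepared dataset can be downloaded.
        return f"historical data for {self.jobs[job_id]['query']}"

def fetch(server, query, poll_interval=0.01):
    """Submit a request, then poll until the result is available."""
    job_id = server.submit(query)
    while server.status(job_id) != "ready":
        time.sleep(poll_interval)  # wait between polls instead of hammering the server
    return server.download(job_id)

server = AsyncDataServer()
data = fetch(server, "300 tickers, weekly ranks, 5 years")
print(data)
```

The point of the pattern is that the expensive work happens asynchronously on the server's schedule, so a long queue of historical requests never blocks interactive users.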
Keep in mind that literally every software company in the world is introducing AI into its product stream. I think that P123 should be thinking along those lines as well. But to be successful, P123 needs to have the resources for Big Data available.
Steve - We are still finalizing the formulas for calculating the request costs, and I have not seen any requirements yet for the new API endpoint you and Marco have been discussing. But based on how we are calculating the costs for the other endpoints, 300 tickers and 10 nodes should not be a problem, even with multiple iterations. Worst case, if you were to increase the scope of your project and need more requests allocated to your account during the initial development phase of your NNs, you could purchase additional requests with an AddOn that will be available soon at a reasonable cost.
That sounds good! I am going to use monthly data to start and when I get to finalizing my NN I’ll switch to weekly. That should reduce any demands on the system.
Marco,
Recommendation to include if you do an educational doc about how to use the P123 ML API:
Users must be very careful how they split the dataset into training and validation sets. Training and validation sets (whether a simple split or K-fold) MUST cover SEPARATE TIME PERIODS.
Some ML tools split data into RANDOMLY INTERTWINED sets by default. For example, I think shuffling is the default in scikit-learn for one of its most used dataset-splitting functions. That is the best approach for many ML applications, but not when working with time series! P123 ML users have to set the correct parameters, or write their own function to prepare the data, so that training and validation sets are not intertwined in time and, if possible, are SEPARATED by an UNUSED data period. Otherwise the model will be trained with massive data leakage, resulting in wildly overestimated predictive ability.
To understand the problem, just consider that if the sets are randomly intertwined on a daily basis, you will have lots of records with almost identical features and labels in both the training set and the validation set: the same fundamental features (ranks), almost the same technical features, and almost the same label (forward return).
To be sure of eliminating data leakage, the training and validation sets should ideally be separated by an unused period of one quarter (so that ranks don't draw on fundamentals from the same earnings reports) and/or the longest look-back period used in the technical features (maybe half of it would be enough). Hope it’s clear; otherwise let’s set up a call.
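A minimal sketch contrasting the two splits (the dates, split point, and one-quarter embargo are illustrative, not a prescription):

```python
from datetime import date, timedelta
import random

# 5 years of weekly observation dates, as in the use case discussed above.
weeks = [date(2018, 1, 5) + timedelta(weeks=i) for i in range(260)]

# BAD: random shuffling intertwines train and validation in time, so
# near-identical rows (same fundamentals, overlapping forward returns)
# land on both sides of the split -> data leakage.
shuffled = weeks[:]
random.shuffle(shuffled)
bad_train, bad_valid = shuffled[:200], shuffled[200:]

# GOOD: split chronologically and discard an embargo period between the
# two sets (~13 weeks, roughly one quarter) so that no earnings report
# or technical look-back window spans the boundary.
embargo_weeks = 13
split = 200
train = weeks[:split]
valid = weeks[split + embargo_weeks:]

assert max(train) < min(valid)                        # strictly earlier
assert (min(valid) - max(train)).days >= embargo_weeks * 7
print(f"train ends {max(train)}, validation starts {min(valid)}")
```

scikit-learn's `TimeSeriesSplit` provides the same chronological ordering for cross-validation, and recent versions accept a `gap` parameter that serves as the embargo.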
There is certainly lots to think about with regard to use of the data. But let’s not have P123 get bogged down in writing application notes at this point in time. Let them focus on getting the data out in one simple 2D array with column headers. The ML application of the data can be left to third parties, but that won’t happen without the data first.
So Marco, basically the stats on the Summary, Holdings, Statistics & Charts pages. Is there somewhere on this website where I can find instructions on how to easily grab that data?
Not lots. Data leakage is a major issue with timeseries in ML. I know Marco is very interested in how data will be used, because he told me so. A nice API without clear guidelines may do more harm than good.
“Data leakage is a major issue with timeseries in ML”
I agree. But that is an end-application problem, not a data delivery problem. And there are different types of application - not only time series but also cross-sectional analysis. So I don’t want the message back to Marco to be that we get training data separately from validation and test data. Let’s concentrate on getting data now, in a form that everyone can use and that doesn’t presuppose a specific intent. Presumably the software programmers at P123 are smart enough to deliver date-stamped data organized from oldest to newest, or vice versa.
Right. We have to have someone controlling how we use data.
Crazy that P123 could mess it up from here. Such a simple request from Steve. Steve is willing to share how he does it. Let people look at it in what he calls a “peer review process.” Although I do not get why he should have to do that.
We need data police? Frederic, what country are you from?
“Although I do not get why he should have to do that.”
Jim - my results look too good to be true, so I want other eyes on what I am doing to keep me honest. That is why.
Jim, I wrote “I know Marco is very interested in how data will be used, because he told me so”, you are speaking of data police.
You are making a very strange interpretation of my words. I don’t speak for him. We just had a conversation on another subject that turned to ML. He was not speaking of controlling; I understood it as educating and explaining. Hence my post on data leakage.
Steve just wants a 2D array. Let him have it then.
Again, I think Marco can speak for himself.
BTW, when I want lectures on machine learning I get them from professionals at Coursera. Tell me again, when was your last machine learning project?
This is not the only active thread about people wanting some data. I do not think we need to spend any time asking them what they want it for so that we can make sure they are using it right either.
For the immediate future, I can work with what I have. So it might be best to postpone the arguments here until after I have presented my approach and given out some software (in a few weeks). Then Fred can have his kick at the can. You know, maybe I am doing something incorrectly and that will come out in the wash. Everyone will be able to see what I am doing, and then P123 can decide how to proceed. I just want to make sure that no one is putting limitations on this without understanding the potential. That is a quick way to turn something useful into something useless. Documentation can come later.
Phase 1 is just an ML enabler project: we’re just going to enhance the existing “get ranks” API to return the data needed. It won’t be ML-specific. What I proposed a few pages back still seems to fit the requirements. Everything being discussed seems to be post-processing after one has the data. The use or misuse of the data is entirely on the user. Once we gain more knowledge of ML and its needs, perhaps we’ll think of ML-specific functionality and apps.
mm123, we’ll add a new endpoint for port/book stats.
The one thing that would be nice is to be able to concatenate all of the dates into one API call. That may reduce stress on your servers. Other than that, thanks!
Would you be requesting ranks on mixed dates with somewhat random gaps between them?
Or could we create a ranks endpoint, like the RanksPeriod operation in DataMiner, that takes a start date, end date, and frequency (1 wk, 4 wk, etc.)?