Python code for calling the P123 API

So that we don’t get bogged down, we would just add a new API endpoint specifically for generating ML data: features (ranks of nodes and composites) and labels (based on technical data only). Packaging the data for consumption would be done afterwards, either in a DataMiner operation or in a Python program that could be community-developed.

Fred’s suggestions for labels are great. Can we get more specific examples of types of labels? We can probably just let you write your own formula for the label, but having some examples helps. Is this a good start for labels?

  • Total Return after X bars
  • Relative Return after X bars
  • Future Volatility (parameters TBD)
  • Sector or Industry Return after X bars

Is there anything missing that cannot be derived from this list?
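
For concreteness, the first three could be derived in a few lines of pandas (a rough sketch only; the file layout and column names are hypothetical):

import pandas as pd

prices = pd.read_csv("prices.csv", index_col=0)  # one row per bar, one column per ticker
X = 4                                            # horizon in bars

total_ret = prices.shift(-X) / prices - 1                      # Total Return after X bars
relative_ret = total_ret.sub(total_ret.mean(axis=1), axis=0)   # Relative Return vs. the universe average
future_vol = prices.pct_change().rolling(X).std().shift(-X)    # Future Volatility over the next X bars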

Also, the question of whether we need other types of ranking was never really addressed. I guess for now we will just use what we have: rank by sorting, place NAs in the middle or at the bottom, and derive the percentile.

Thanks Marco.

I would like the excess returns of the ticker over the next rebalance period: the next week’s excess returns for a weekly rebalance, or the next month’s for a monthly rebalance.

Preferably in excess of the equal-weighted return of the universe rather than of a separate cap-weighted benchmark.

I can show you that this minimizes the noise from the market and just works, at least for the data in the image above.

And that is what I want to know (predict) when I buy a stock.
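
As a sketch (assuming a DataFrame of weekly close prices, one column per ticker):

ret_next = prices.pct_change().shift(-1)              # next week's return for each ticker
excess = ret_next.sub(ret_next.mean(axis=1), axis=0)  # minus the equal-weighted universe return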

Jim

Steve, it looks great. Your code seems to do a big part of the job for a generic “P123 to supervised ML” interface (I have not looked into it closely, though). The table you provide as a result is very close to the structure I described, and it should go through pandas.read_csv() to make a dataframe for platform-agnostic data wrangling. It needs to be generalized and packaged in an API with appropriate parameters (ranking system used for features, label/target definition, date range, etc.). It would be great to have the opinion of other people working in other ML environments to see if they have specific requirements (personally, I am not really involved in ML projects right now).

Marco, “Sector or Industry Return after X bars” is not a good label, because we must assume it cannot be predicted from a single stock’s features. Industry attributes can be features, not labels, at least in this context. They could serve as labels if we create datasets where the index is (industry, date) instead of (ticker, date), to train algorithms to make predictions on industries.
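
If we did want industry-level datasets, the reshaping is a one-liner in pandas (a sketch; column names hypothetical):

industry_df = df.groupby(["Industry", "Date"]).mean(numeric_only=True)  # index becomes (industry, date)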

Frederic,

Steve got that format from me when we were doing neural nets and boosting, and I can say it works for all machine learning methods that use REGRESSION: regularized regression (including LASSO and Ridge regression), kernel regression, polynomial regression, robust regression using Huber, LOESS, CUBE, Random Forests (a regression-tree algorithm that can also do classification), boosting (another regression-tree algorithm that can also do classification), neural nets (which can be regression or classification), etc.

You have mentioned in previous posts that you would also like to look at classification. There is a lot of literature on classification for stocks, and my experience is that it tends to work just as well. You have a great point.

This is one example of where CSV downloads onto a local hard drive have helped me. I am sure Steve can do it in Python, or Marco can make it available.

But I have just sorted the label in a spreadsheet and started a new “Classification” column next to the label column, then put a 1 in the classification column where the returns (next to it) were positive and a 0 where they were not.

When trained, this will predict the probability that the week will have a positive return. It seems to work about the same as regression, as I said. Your idea was a good one.

Would you or Steve prefer to program this in Python, or just do it in a spreadsheet? Whatever your answer is, the spreadsheet has the ability to visualize, right there, that you have done it right. And it offers more flexibility for those not as experienced in Python.
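
For the Python route, the same column is a one-liner in pandas (a minimal sketch; the file and column names are hypothetical):

import pandas as pd

df = pd.read_csv("MLFactorUpload.csv")                       # the downloaded table
df["Classification"] = (df["ExcessReturn"] > 0).astype(int)  # 1 where the return is positive, else 0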

So anyway, Steve’s format (preferably with excess returns as an option for the label) works for everything I have tried in Python and R, for both regression and classification, assuming a CSV download onto a local drive to get the classification label.

I like your idea of using classification. It works in my experience.

Jim

Jim, I don’t have specific programming requirements at this time because I have no ML project. You have visualization packages in Python to draw all kinds of charts. In any case, pandas has functions to make a dataframe from a csv file and the other way around (with some requirements on the csv structure: https://pandas.pydata.org/pandas-docs/version/0.23.1/generated/pandas.read_csv.html#pandas.read_csv).

Frederic,

You are absolutely correct on this. Steve likes Colab, and we have worked on getting the data into Colab together (with Steve actually figuring it out). I like Colab too, for a lot of reasons.

But Anaconda still works on my computer, and I use the code you mention. I appreciate your pointing this out, in case I was not familiar with it.

For anyone else reading this, here is an example of this command that works:

alldata = pd.read_csv("~/opt/ReadFile/MLFactorUpload.csv")

(The “~” is not a Mac-specific thing; it is the Unix-style shorthand for the home directory, which pandas expands.)

Here is a write command that works:

s.to_csv("desktop/s.csv", index=False)

Jim

Marco - I don’t understand your preoccupation with technical analysis and labels. Labels should be user-specified, and every column should be user-specified. Don’t make assumptions or try to force things that people don’t want. The return for Python should be a 2D array.

Here’s my proposal for enhancements to the API to retrieve ranks so that it’s better suited to generate data for an ML system.

You will be able to define any number of ‘Extra Data’ items, which will be actual, “raw” values, not ranks. Without a data license you will only be able to write formulas using technical data (prices, dividends, and splits for stocks and ETFs). With a data license you can use the full set of factors/functions.

The ‘Extra Data’ items can be either a feature or a label (in ML speak). If you use a negative value you are specifying a label; otherwise it’s a feature. It is up to you to keep this straight and not feed label data in as feature data when you submit the data for processing. I used “label” and “feature” in the column names just to help identify them.

Attached is a screenshot of what the settings would look like for the API call (using pseudo code) and the output it would generate. I color-coded the reference columns in gray and the ‘actual value’ columns, which can be either labels or features, in green; the rank data is always feature data.

Notice the n/a for 6/1/2020 for the 6mo label since Dec 2020 has not happened yet.
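
To give a rough idea in plain text (a sketch only; every parameter name here is hypothetical, the real settings are in the screenshot):

settings = {
    "rankingSystem": "MyRankingSystem",  # rank columns are always features
    "universe": "SP500",
    "startDt": "2015-06-01",
    "endDt": "2020-06-01",
    "extraData": [
        "feature_vol: <some technical formula>",  # raw value used as a feature
        "label_1mo: <forward 1-month return>",    # raw value used as a label
        "label_6mo: <forward 6-month return>",    # n/a near the end of the range
    ],
}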

Let me know what you think. Does this make generating ML data a snap?

And this should not take long to do.

Thanks


Marco - it looks fine, except I hope the Python routine doesn’t return the first line of the spreadsheet, just a 2D array with column names. If the ranking nodes are complex, they should still be ‘flattened’ so we end up with a 2D array.

Marco - the more I think about it, the standard deviation rank would probably be a good thing to bring out. Thanks!

Marco - I have another idea… instead of basing this API call on ranking systems, why not create a new factor for the screener and have the API call the screener instead? The new screener factor would be like ShowVar(), except you could call it APIVar(). APIVar() would give an error when purely fundamental data is called for, but anything else would be allowed. The API call would only return data specified by APIVar(), and the variable name specified in APIVar() would be the column header for the data returned by the API. Just a thought; I don’t want to slow down your efforts, though. Expediency is important.

With this solution, you aren’t building in custom solutions to handle technical data.

Marco,
Your idea of adding extra data for technical features and labels is great. Some possible tweaks:

  • It would be nice to allow the InList() function in extra-data formulas. I understand we cannot export Sector and Industry without a license, but we should be allowed to approximate them by creating lists of tickers from ETF holdings (not point-in-time, but better than nothing for users who can’t buy a license). I think I have also read in the forum that a kind of inETF() function to get PIT ETF holdings was in your plans. If that’s true, it would be great to allow it in extra-data formulas too.

  • For convenience, it would be better to add the extra-data columns on the right, after the rank columns. Reason: in the examples, tutorials, and courses on the ML platforms I have tried (Python/sklearn and Azure ML Studio), labels are usually in the last column of the dataset. It makes table manipulation a bit easier.

  • I don’t know if anyone will need the id column. It’s easy to drop later, but dropping it from the output format may avoid it being taken mistakenly as a feature (unless Steve or someone else thinks it is useful).

A clarification: I post my ideas to help; I will not be an ML user in the short term, but possibly in the future.

Marco,

So P123 is mostly there, I think. I do not really like DataMiner, mostly because I can only use it at my office. And I had some coding issues for a while that Steve has helped me with. He says this code does a lot of what I want:

RankData = client.rank_ranks({"pitMethod": "Complete", "rankingSystem": RankingSystemID, "asOfDt": RunDate, "universe": "Digital Transformation", "includeNodeDetails": "true"})

I cannot confirm that this will work, because I am not at the office with a Windows machine. But I am pretty sure that I can get something to work with Steve’s help.

The only change I might want would be excess returns, as discussed above. And honestly, I think you have a point. You might want to make it so that someone could use this with JASP without having to ask Steve about the Python code. He is not paid to help everyone (not yet, anyway).

I have asked for help on this in the forum many times, and Yuval was not able to offer anything more than me paying for a one-time download of data. I think you should figure something else out if you want to attract a lot of new customers.

You are probably aware that as long as the data is there in a column, Python can find it.

For example, here is the code for setting up the training data:

f1_train = f1[['Factor2', 'Factor3', 'Factor4', 'Factor5', 'Factor6', 'Factor7']].values
f1_label = f1['ExcessReturn'].values

Notice I have left out Factor1, as I have found it isn’t really predictive. But I can easily add it back, or remove Factor5 if I want to. You understand this better than I do.

But I do not think where you put the columns is an issue.

If there were no download issues you could have multiple labels. I could easily change the label in Python to f1_label = f1['P123sFavoriteTechnicalLabel'] on the same download.

Again, I understand that you know this better than I do.
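
For illustration, here is a minimal scikit-learn sketch of the next step (assuming a Random Forest, just one of the many methods mentioned above):

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(f1_train, f1_label)      # train on the factor columns from above
preds = model.predict(f1_train)    # in practice, predict on held-out rows instead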

Anyway, I appreciate that you are there, or almost there. I agree that making more labels available might be helpful. I do not know whether DataMiner is the best program or not.

I suspect that there are a lot of people already using this. For whatever reason, people highly skilled in Python do not post much. And they NEVER discuss what they are doing.

But for every person using it here, I think there are thousands of people out there more skilled at Python than your average P123 user.

I guess it remains to be seen whether you can market to them. But I would not use your average forum poster as a gauge for the potential of this business model.

I cannot imagine that there is a great cost to this, but if it does not bring in new customers and is not worth it, you should abandon it.

It is only my personal opinion that there are a good number of people like Steve, Frederic, the silent people on the forum, and me waiting to be marketed to. My apologies if I am wrong about that.

But I would recommend perfecting this in whatever way you think is best and trying some marketing of it. You can always abandon it later.

FWIW, in gauging interest in this: Steve is not the first member who has asked me to help him build a neural net. Steve is just the first who wanted to discuss it in the forum.

You have already thanked Steve for sharing, rightfully so.

Best,

Jim

Marco / P123 staff - I get a “request quota exceeded” error when I try to run my code now. This is happening on the first API call and has been happening since yesterday afternoon. What are the actual resource limits? I thought it was 100 per hour or something like that?

Marco - the problem here is that you have introduced a new piece of functionality with the extra data. The requests for new data items are going to grow with time. I have several that I would like as well, including FMedian(), FOrder(), Aggregate(), etc.

I originally based this exercise on the Ranking System module because it allows historical access without a data license; I thought that would have the minimum impact from P123’s perspective. But since you are going to this level of customization, my preference would be to use the screen module, with access to APIVar() as I described in a previous post. APIVar() would be identical to ShowVar() in every respect, except it would only allow items that are not raw fundamentals. It would allow such things as FRank(), FOrder(), FMedian(), Aggregate(), FSum(), industry factors, any technical formula, any InList() operation, and other APIVar() variables as part of a formula.

This provides maximum flexibility, and we won’t keep coming back with requests for other types of data, because we would already have access to everything except raw fundamentals. To make this work, the API call would need to be able to run on historical dates. Also, the column header should have the @ stripped off the front of the variable name; some programs don’t allow that symbol. MySQLi, for example, croaks when presented with an @ in a column header.

Also, my opinion is that five years of weekly data is the minimum needed to make AI work. So either one API call should concatenate all of the dates into one returned array, or think about how you can structure resources versus P123 membership to make it happen.

Thanks!

Steve, I bumped your API limits to 5K. We have not formally launched the API and are still tinkering with limits.

Your proposals could create lots of backdoors for downloading data, and we can’t have backdoors (it would not remain a secret for long). So we are trying to keep this simple initially. We should start thinking about what an API specifically for generating ML input data should look like, but right now the quickest way to get most of the way there is to make some mods to the existing rank API. Using the ranking system API simplifies things a lot for us, since there’s no way to download raw data with it.

Let’s see where this goes. I’d also want to try it myself; maybe you want to do a demo for me and others? SIDE NOTE: we want to find ways for expert users on P123 to do way more than just Designer Models. This stuff is not easy, DIY investing is not easy, and having an army of experts (that get paid somehow, of course) is the way to get P123 going somewhere at last.

Jrinne, you lost me there a bit with a few things. You don’t need a Windows machine: Python and DataMiner run on Macs, Linux, and Windows. Also, I’m not sure what you are referring to regarding JASP and the one-time data download.

Thanks

Thanks for upping my API limit. Is there an hourly limit? My ideal application would use 5 years of weekly data. With my current software, I have to make 5 x 52 = 260 API calls to collect one set of data. I can change this to a monthly frequency, but the end result won’t be as good.
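
For reference, the loop looks roughly like this (a sketch built on the rank_ranks call Jim posted above; the dates and variable names are placeholders):

import pandas as pd

frames = []
for run_date in pd.date_range("2015-06-06", periods=260, freq="W-SAT"):  # 5 years of weekly dates
    data = client.rank_ranks({
        "pitMethod": "Complete",
        "rankingSystem": RankingSystemID,
        "asOfDt": run_date.strftime("%Y-%m-%d"),
        "universe": UniverseName,
        "includeNodeDetails": "true",
    })
    frames.append(pd.DataFrame(data))  # assumes each response converts to a table

alldata = pd.concat(frames, ignore_index=True)  # one combined (ticker, date) table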

As for help, I will gladly help in exchange for perks such as the increased API limit. Is there any chance of setting up a separate forum so that the regular P123 folk are not bothered by this activity? I don’t know about a demo, but I can supply everyone with most of my Python code and can assist with any problems with Python or Google Colaboratory.

I would like you to think about the possibility of a new type of membership for P123: Big Data or ML. Dedicate much higher resources to API calls and de-emphasize other resources such as portfolios and sims. Right now, I am only using a handful of portfolios, for example.

Marco,

Thank you for your response.

I did not know that DataMiner runs on a Mac!!! So thank you very much.

I already have Python (via Anaconda) up and running on my Mac.

So with JASP I just need a CSV file, which I now understand I can probably get this weekend at home, without staying late at the office, using DataMiner.

And pretty good column headers, thanks to Steve’s help. Frankly, however, I doubt that it will be useful until I can get excess returns as a label. Still, this is all good news to me.

I understand that this was directed toward Steve, and I assume he can help you.

But I was thinking today that I might be able to give a simple demonstration with JASP that you should be able to duplicate. I think I can, and you could move to XGBoost or TensorFlow later. If I get something that I can easily demonstrate, then you can upload my Excel file or duplicate what I have done with your own factors.

My main question with JASP is how large a data volume it can handle. But I think it is capable enough to hint at what XGBoost can do. The main advantage for now is that you should be able to get up and running quickly and test it for yourself, without having to use any data that may be cherry-picked by me.

I do machine learning pretty well (I think), but not so much some of the Python munging tasks. I appreciate your help. Thank you again for the information about DataMiner on Macs!!!

Best,

Jim

Hi Steve, it would be helpful if you could give me more details on your typical use case so I can use it to test the calculation of the cost in ‘requests’ and make sure it is reasonable. You mentioned 5 years of weekly data. How many tickers? How many rank nodes (aka factors/formulas)?