Data download the day of rebalance for machine learning

Yuval,

Glad to hear you are now a big advocate for machine learning at P123.

As someone in charge of product development, could you please help make sure that Dan has what he needs to make a convenient rebalance-day download for someone running, for example, a random forest? Maybe take suggestions from Jonpaul on what he needs, since this matters to any machine learner. I did not see you respond to him previously in the forum. He wants to look into using the download for XGBoost, which I understand from one of your recent posts is one of the machine learning methods you believe is useful.

Also, can you look into what P123 wants to charge for those downloads? You will be attracting a lot of newbie machine learners to P123, and the cost of daily rebalances can add up, possibly requiring an Ultimate membership for a new member to do machine learning. I am fine if you do not change the price, but you might want to consider the question of charges for newbies wanting to do machine learning.

This is something I have been asking for for over 10 years, and I missed your previous interest in and concern about this important feature for machine learners. Perhaps you missed my previous request. Maybe you can talk to Jonpaul about this now that you are aware of the need and think machine learning is important.

Seriously, this is important and has always been important for anyone interested in machine learning at P123. Dan has already taken the time, and I appreciate that. I wonder if you might help make sure this happens, since you now clearly understand its importance to P123.

Jim

It would be helpful if you would address requests like these to Portfolio123 in general or to Marco in particular rather than to me. I am not in charge of product development. My current title is “Director of Special Projects and Product Evangelist.” Next time I talk to Marco about the machine learning project I will mention your request, but I’m certainly not the most effective advocate for it since it’s not clear to me what you require. Thanks in advance.

Yuval and Jim,

Yuval - nice title and understood this is not your area of responsibility (especially if you have semi-retired now)!

I have not started doing machine learning yet, but I was planning on posting about my download format, and maybe my colab code for the download, once I have done a few tests. That way hopefully I am posting something useful and not extra noise!

That being said, here is my best understanding of what is needed for XGBoost and the requested update.

Table format for training: I already have this and weekly is good enough for training (for now anyway)


Now the key is that the ranks are from the previous Friday. Price information you can fudge by using the negative offset or the bars offset, but it is a bit difficult to keep track of.
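To show the bookkeeping I mean, here is a small pandas sketch (dates, column names, and prices are all made up for illustration): building the forward return with a shift instead of negative offsets, so the "future price does not exist yet" case falls out naturally as NaN.

```python
import pandas as pd

# Made-up Monday opens for four consecutive weeks (illustration only).
prices = pd.Series(
    [100.0, 102.0, 99.0, 103.0],
    index=pd.to_datetime(["2023-07-03", "2023-07-10", "2023-07-17", "2023-07-24"]),
    name="monday_open",
)

# Forward 1-week return: next Monday's open vs this Monday's open.
# The last row is NaN because that future price does not exist yet.
fwd_ret = (prices.shift(-1) / prices - 1.0).rename("fwd_1wk_ret")
print(pd.concat([prices, fwd_ret], axis=1))
```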

Table for rebalancing: I can get this for Monday rebalance, but not any other day


So the specific ask is that the factor ranks be based on the as-of date, not the previous Friday, for the API ranks or universe download!

Jim please comment if this seems off!

So far the community seems pretty engaged and I am pleased with the responses from Dan! I understand that making changes without messing things up can take a while.

Thanks,
Jonpaul

If I understood the prior threads, the API can only return weekend data. By weekend, p123 means a Saturday asofdate. But this is the same data that is used for Monday morning rebalancing - in spite of the Sunday evening Factset update.

EDIT: by data, I was referring to rank data

P123,

Thank you for your interest.

TL;DR: My needs may be pretty similar to the coding for a sim (training) and a port (predicting). If so, that is no coincidence in my mind, as I believe P123 is just one example of machine learning or reinforcement learning. Anyway, nothing mysterious for any P123 coder, I believe.

The download below is what I use to gather historical data. It works well: thank you.

If I could just continue to use it today, for example, to get the overnight update of the ranks, that would be ideal.

For machine learning this will be a pattern, I think. You basically use the same features (or factors, at P123) to make predictions as you used to train the model. Future returns, in this case, are the “target,” which is what you are trying to predict (with the predictions not yet known).
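A minimal sketch of that pattern (all names and numbers below are made up; the trivial “model” is a stand-in, where in practice you would fit an XGBoost or random forest regressor on the same arrays):

```python
# Same feature columns, same order, for both training and rebalance day.
FEATURES = ["ValueRank", "MomentumRank", "QualityRank"]  # hypothetical factors

# Historical rows: features plus the KNOWN future 1-week return (the target).
train_rows = [
    {"ValueRank": 90, "MomentumRank": 80, "QualityRank": 70, "fut_1wk_ret": 0.02},
    {"ValueRank": 10, "MomentumRank": 20, "QualityRank": 30, "fut_1wk_ret": -0.01},
]

# Stand-in "model": predict the mean training target (illustration only;
# replace with xgboost.XGBRegressor().fit(X_train, y_train) in practice).
mean_target = sum(r["fut_1wk_ret"] for r in train_rows) / len(train_rows)

# Rebalance-day rows: same feature columns, target not yet known.
today_rows = [{"ValueRank": 55, "MomentumRank": 60, "QualityRank": 65}]
X_today = [[r[f] for f in FEATURES] for r in today_rows]  # same column order

predictions = [mean_target for _ in X_today]  # one predicted return per ticker
print(predictions)
```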

If today you could get it to give me "Future 1wkRet: Future%Chg(5)", that would be particularly appreciated. Obviously that data is not available. Maybe there would be an NA there. Maybe I would have to remove that from my DataMiner code, **assuming I cannot get a direct download button on the P123 website for a particular ranking system (or in a sim, where the universe is already defined).** Please charge me the cost plus some profit for any downloads, however I may get them. Or take account of my expected usage in my membership level.

But all the other column data should be available. Obviously, any unnecessary columns could be sliced from the Pandas DataFrame in Python (and would be).

Here is a screenshot of the download with the first three factors and the code in DataMiner that produced it below that:

```yaml
Main:
    Operation: Ranks
    On Error: Stop # ( [Stop] | Continue )
    Precision: 4 # ( [2] | 3 | 4 )

Default Settings:
    PIT Method: Prelim # ( [Complete] | Prelim )
    Ranking System: 'M3DM'
    Ranking Method: NAsNeutral
    Start Date: 2005-01-02
    End Date: 2010-01-01
    Frequency: 1Week
    Universe: Easy to Trade US
    Columns: factor # ( [ranks] | composite | factor )
    Include Names: true # ( true | [false] )
    Additional Data:
        - Future 1wkRet: Future%Chg(5)
        - FutureRel%Chg(5,GetSeries("$SPALLPV:USA")) # Relative return vs $SPALLPV:USA
```

I don’t see how any of this is hard. But there are no tricks, shortcuts, or substitutions: you always need the same factors that you used to train the model. Preferably in an array ready to be converted to a DataFrame without a lot of munging needed.
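For instance, loading a download like that into a DataFrame and slicing off the target takes only a few lines (the inline CSV below is a tiny made-up stand-in, not real data, and the column names are hypothetical):

```python
import io
import pandas as pd

# Tiny stand-in for a DataMiner Ranks CSV (made-up tickers and values).
csv_text = """Date,Ticker,ValueFactor,MomFactor,Future 1wkRet
2005-01-08,AAA,87.5,12.5,0.013
2005-01-08,BBB,50.0,75.0,-0.004
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
df = df.set_index(["Date", "Ticker"])

# Slice off the target column; everything else is the feature matrix.
X = df.drop(columns=["Future 1wkRet"])
y = df["Future 1wkRet"]
print(X.shape, y.shape)  # (2, 2) (2,)
```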

Jim

Hi Jim,

I just noticed that in the DataMiner, Future%Chg(5) calculates the return as from Friday to Friday.

Using 100*((Close(-6)/Close(-1))-1) works for Monday to Monday returns. However, when asked to return future values that haven’t yet occurred, it doesn’t return NAs. I haven’t looked at what is returned.
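A quick numeric check with made-up prices (my reading of the negative offsets here is an assumption: bar -1 is the next trading day, Monday, and bar -6 is six trading days out, the following Monday):

```python
# Made-up closes, reading the negative offsets as future bars:
# Close(-1) = next Monday's close, Close(-6) = the following Monday's close.
close_minus_1 = 100.0
close_minus_6 = 103.0
ret_pct = 100 * ((close_minus_6 / close_minus_1) - 1)
print(round(ret_pct, 6))  # 3.0, i.e. a 3% Monday-to-Monday return
```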

Walter,

Thank you VERY much. This is the code that P123 suggested for me. I felt bad about not being a good enough coder to figure it out for myself.

What you have said means my downloads for the last 2 months have serious look-ahead bias, because the ranks I downloaded were from Saturday (or perhaps Monday morning) while the purchases would have been made Monday. Any real purchases could not have been made before Monday, but the prices are from Friday.

It is so helpful to know that my data has look-ahead bias. The API credits I used: burned for nothing of use…

I think some of this is easy for some people most of the time, but I think we can all get stuck and need some help at times. And perhaps this is an example where it may not be trivial, exactly, even for an advanced coder.

I was told this should be easy and did not even require knowledge of Python.

Sometimes it is just a small feature we need, whether it fits into someone else’s established protocols or not. Features have not been quick to come from P123. And essentially none of the ones that have been implemented have been for statistical metrics or machine learning.

As far as feature requests go: even with your changes in my future downloads, it will be difficult to use that information without the download Jonpaul (and I) are requesting.

All that is needed is a simple matrix (or relatively simple, I should say, seeing how this is going) that I have been requesting for 10 years. In truth, assuming one has access to a powerful enough computer, nothing else is needed to do machine learning. The rest of what is needed can be found at Scikit-Learn.
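To make that concrete: given the matrix, a model is a few lines of scikit-learn. The data below is random stand-in data, nothing downloaded from P123:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Random stand-in matrix: 5 factor ranks for 200 stock-weeks, with a
# noisy synthetic target playing the role of future returns.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 5))
y = 0.001 * X[:, 0] + rng.normal(0, 1, 200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
preds = model.predict(X[:3])  # rebalance-day rows would go here instead
print(preds.shape)  # (3,)
```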

I think maybe things are changing at P123. I hope so. You being here helps.

Thank you for your help. I am EXTREMELY GRATEFUL!

Jim

We’re in uncharted territory.

I’m working through these issues just like you are. What I’m seeing makes sense after working through the downloaded data.

For a given weekend, the Rank operation will return ranks identical to that available Monday morning. I did confirm that.

However, w/r/t prices, since the Monday market open hasn’t happened yet (from the POV of lookup), the latest pricing date is from Friday. So functions like Future%Chg(5) would have to use that. It was just unanticipated. Unfortunately, that function doesn’t support an offset parameter. If it did, something like Future%Chg(5,-1) would work - I think.

Another issue I need to check is why looking up a future close that hasn’t yet occurred doesn’t return NA. I consider this issue important. I wanted to have DataMiner return three targets: 1-, 4-, and 13-week future returns. Not returning NAs when the data doesn’t exist complicates the issue.

Finally, my findings need to be confirmed.


Which P123 could help you with, and which it will do if and when it gets serious about machine learning.

And it is not even as if people who do things that they do not consider to be machine learning could not use this too.

Reminder: At the end of the day it is just an array. Features, target, an index and preferably no look-ahead bias. Updated data at the time of rebalance.

I came to the same conclusion about future returns, and since I want Monday-morning-to-Monday-morning returns I did:
(Open(-6)-Open(-1))/Open(-1)

This is the same equation as Walter’s, just in a slightly different (unsimplified) form. Also open instead of close.

I have not looked into what it returns for a date that does not exist, but I did check the open prices were correct for Monday to Monday when both Mondays have happened.
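For the record, a quick check with two made-up prices shows the two forms agree (mine is Walter’s expression rearranged and without the factor of 100):

```python
# Two made-up Monday opens one week apart (illustration only).
p_near, p_far = 100.0, 103.0

jonpaul = (p_far - p_near) / p_near           # (Open(-6)-Open(-1))/Open(-1)
walter = 100 * ((p_far / p_near) - 1) / 100   # Walter's form, rescaled to a fraction
print(jonpaul, walter)
```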

I checked that in the screener and it also shows the problem. I think OHLC prices with a negative offset are broken.

This feature request encompasses so many other possible feature requests that are already available at Sklearn.

Non-linear methods, early stopping, bootstrapping, K-fold cross-validation, recursive feature elimination, regularization, methods addressing collinearity issues (duckruck’s PCA suggestion and others), and not just the ability to calculate metrics not provided by P123, but the ability to use them for cross-validation. No spreadsheet required.
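To show how little glue is needed, here is one item from that list, recursive feature elimination, end to end on synthetic stand-in data (nothing here comes from P123):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data stands in for downloaded factor ranks: 8 candidate
# features, only 3 of which actually carry signal.
X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                       random_state=0)

# Recursively drop the weakest features until 3 remain.
selector = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
print(selector.support_)  # boolean mask of the retained features
```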

Many of these are features that P123 might not be able to provide quickly, if they are interested in making them available on their platform at all. It remains to be seen how interested P123 is in responding to machine learning feature requests or taking suggestions for improvements in their initial AI/ML offering going forward. But addressing this one request would allow for more focused planning at P123, as no single feature would be crucial to anyone using the downloads with Sklearn. More and more people are able to use those downloads with advanced methods because of a lot of new developments, including (but not limited to) ChatGPT’s code interpreter and Colab.

P123, I am sure you are working on this considering how many features can be addressed at once and the enthusiasm you have expressed for machine learning.

Thank you for continuing to work on this.

Here is some academic support for this feature request. It would be nice to rebalance a model, with updated data, that is trained using this: A Gentle Introduction to k-fold Cross-Validation

Here is what would then be available thru Sklearn: K-Folds cross-validator

Notice Sklearn’s use of random_state, which would replace mod() and has more functionality. Also, shuffle=True and shuffle=False have different uses covering many previously described at P123 (e.g., shuffle=True is similar to an even/odd universe with more options, and shuffle=False allows for selecting different time periods).
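A short sketch of both settings, with toy data standing in for time-ordered stock-weeks:

```python
import numpy as np
from sklearn.model_selection import KFold

# shuffle=False keeps rows in order, so with a time-sorted index each test
# fold is a contiguous time block; shuffle=True with a fixed random_state
# mixes rows reproducibly (akin to an even/odd universe split, with options).
X = np.arange(10).reshape(-1, 1)  # stands in for 10 time-ordered rows

ordered = KFold(n_splits=5, shuffle=False)
mixed = KFold(n_splits=5, shuffle=True, random_state=42)

ordered_folds = [test.tolist() for _, test in ordered.split(X)]
print(ordered_folds)  # contiguous blocks: [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```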

Sklearn allows for optimization with different metrics, and you can code your own metrics if Sklearn does not have a metric you like. And automate this optimization.
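For example, a custom metric wired into an automated hyperparameter search. The data is synthetic, and the hand-written metric below is just mean absolute error, chosen to show the mechanism rather than to recommend a metric:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

# A custom metric (here, hand-written mean absolute error).
def mean_abs_err(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

scorer = make_scorer(mean_abs_err, greater_is_better=False)

# Synthetic stand-in data: 4 features, one noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -0.5, 0.0, 0.2]) + rng.normal(scale=0.1, size=100)

# Automated optimization: try three regularization strengths under the
# custom metric with 3-fold cross-validation.
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, scoring=scorer, cv=3)
search.fit(X, y)
print(search.best_params_)
```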

This is something only available through uploads and downloads of CSV files to and from Python now. And Yuval has himself stressed the usefulness of dividing data into discrete universes, something I agree with. So there seems to be wide support for making this available through spreadsheets at least. Perhaps it would not be wrong to make this available through multiple different means, including through Python.

Python does make this easier and better, and it adds functionality compared to a spreadsheet using mod(). And I can run it while I sleep.

Bootstrapping, subsampling, recursive feature elimination, and model averaging, as well as many other features, might also be useful to some P123 members.

Earlier in this thread I said I would speak to the developers about the requested enhancement to the API to add the ability to return daily data instead of the weekly data it currently returns. The development team would need resources allocated to this to accurately determine the scope of the project, but the expectation is that this will be a fairly large project. The developers are fully allocated to other projects for the near future including the P123 AI functionality. This enhancement is not on the schedule at this time, so I cannot provide an estimated date for completion.

This thread is a Feature Request created by Jim. As of today, it has received no votes from the user community. Votes are important in determining priority.

Dan,

Thank you for taking the time to understand the request in the first place, look into it, give it serious consideration and for getting back to us.

Best,

Jim

Hi Jim,
You wrote in another thread today that I said ‘no’ to this request. I just want to clarify that the answer was not no. The answer was ‘not right now’. We have a lot of projects already in progress and everybody is fully allocated to those. We will look into enhancing the API so it can retrieve daily data, but I can’t tell you when.


Daily fundamentals is an old Feature Request (2013!). I moved it to the Roadmap here: Feature Request: Backtests with daily rebalance for a fee

It is something we want to do and might be able to start after we launch AI/ML, so please vote if you want it more than other things.

Thanks

So Marco.

I just need it today (literally just for 8/23/23). Just today. Today. I have no use for daily historical data.

DataMiner download or download from my port, ranking system or anywhere for today. Just each factor and an index of the tickers.

Like we get with the screener (which has a 500-row limit) for today, and can use in our classic ports that use the ranking system (machine learning ports needing something a little different).

I understand we can get it now with screenRun from DataMiner.

But you can get only one factor at a time in the screener. It would be nice to get an array (csv file) with all of the factors you use in your system as one download (ticker as index is the only other column needed).

I really think this will at some point give you a good cost/benefit ratio for machine learners!!!

EVERY MACHINE LEARNER WILL USE THIS!!!

Anyway I am fine if you disagree on the profitability or importance once you understand my request and how useful it is to machine learners.

Thank you very much. Truly. For looking at this.

Jim

Aaron pointed out an important clarification regarding retrieving daily data from the API. The API can return the latest daily data, but only if the asOfDt is today or today-1. For example, today is Friday 8/25/23. An API call with asOfDt set to 2023-08-24 or 2023-08-25 will return the latest available daily data. An asOfDt <= 2023-08-23 will return the weekly data from the Saturday prior to the asOfDt. So those wanting to retrieve the latest daily data to feed into machine learning scripts can use the data_universe or ranks endpoints.

The DataUniverse and Ranks operations in DataMiner cannot return today’s daily data because the minimum Frequency setting is ‘1Week’ which will return data only for Saturday dates. We will look into an enhancement to enable DataMiner to return the daily data for the current day.

I am not sure I understand. Let’s say next week, on Wednesday the 30th (after market close), I want to get the data for Wednesday to rebalance the next day. If I make a call with an asOfDt of 8/30/23, will I get that day’s information, or will I get the 8/26/23 (last Saturday) data? Using the Python rank_ranks API.
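To check my own reading of Dan’s rule, here it is written out as plain date logic. This is my interpretation, not official API code, and what happens when asOfDt falls exactly on a Saturday is my guess:

```python
from datetime import date, timedelta

def data_returned(today: date, as_of: date) -> str:
    """Dan's stated rule: an asOfDt of today or today-1 returns the latest
    daily data; anything older falls back to weekly (Saturday) data."""
    if as_of >= today - timedelta(days=1):
        return "latest daily data"
    # Weekly data from the Saturday on or before asOfDt (the on-a-Saturday
    # case is an assumption on my part).
    days_back = (as_of.weekday() - 5) % 7  # weekday(): Mon=0 ... Sat=5, Sun=6
    saturday = as_of - timedelta(days=days_back)
    return f"weekly data as of {saturday.isoformat()}"

today = date(2023, 8, 25)                       # the Friday in Dan's example
print(data_returned(today, date(2023, 8, 24)))  # latest daily data
print(data_returned(today, date(2023, 8, 23)))  # weekly data as of 2023-08-19

# The Wednesday 8/30 case: if the rule holds, a call that evening gets daily data.
print(data_returned(date(2023, 8, 30), date(2023, 8, 30)))  # latest daily data
```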