Is there a good download for the rebalance day (for systems that use DataMiner data and not ranks)?

A machine learning model is developed with a csv file (or DataFrame within Python) that has feature columns and the target (often returns), usually with ticker and date as the index.

P123’s DataMiner has a nice ability to get csv files for that. Nice job P123!

I cannot find a convenient DataMiner download for the day of the rebalance. Specifically, all of the same factors used to train the model as columns, again with ticker and date as the index.

Or basically one download to rebalance with (once you have the model trained).

I am not a great programmer, but so far the only workable approach I can identify is to run each factor as a screen, sort each screen's ticker column alphabetically to align the tickers, and then concatenate the columns across factors.
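Roughly, that workaround looks like this in pandas (just a sketch; the file names and the assumption that each screen export has Ticker and Rank columns are mine, not anything P123 documents):

import pandas as pd

# One screen download per factor; file names here are hypothetical.
factor_files = {
    'EarnYield': 'screen_earnyield.csv',
    'Pr2SalesQ': 'screen_pr2salesq.csv',
}

frames = []
for factor, path in factor_files.items():
    df = pd.read_csv(path)                        # assumes a Ticker column and a Rank column
    df = df.set_index('Ticker').sort_index()      # sort so every factor aligns on the same tickers
    frames.append(df[['Rank']].rename(columns={'Rank': factor}))

features = pd.concat(frames, axis=1)              # one row per ticker, one column per factor
print(features.head())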

This is one of the reasons I abandoned an XGBoost model several years ago. It is borderline, in my opinion, whether it is practical for me to run an XGBoost program now. But P123 would not have to change much to make it much, much easier. Maybe just explain to someone how to use DataMiner to get a simpler download. Or consider making one available if it does not already exist.

While slightly easier with the DataMiner screener, this still means a large number of individual downloads, sorts, and concatenations for many users. And there could be a cost with DataMiner just to rebalance an already trained model. I assume P123 set the data limits for DataMiner with the intention of avoiding large costs (it is a nice service that should be paid for by the members using it, so that is just a consideration).

So I must be missing something, as P123 is attracting (and wanting to attract) machine learners now. The way I described above to get the data needed (downloading multiple screens) could be streamlined, and I would be surprised to find that P123 has not done this already. Any help from P123 is appreciated. Personally, I cannot see any other good way to do machine learning at P123 without something like this, so this could be useful information for others.

Thank you in advance,

Jim

So, just for anyone considering the P123 platform for machine learning:

Unless you have the Ultimate membership, you get 1,000 API units per month or less.

Assuming there is no better way and you have to download the screening data through DataMiner to rebalance each day, that is 2 credits per screen (P123 already said they would not allow more than 500 rows of downloads through the regular screener download).

If you want to rebalance just one model every day, that limits you to 25 nodes for the model.

Calculation: (25 nodes) * (20 days in a month) * (2 credits per screen) = 1,000

And there would be no credits left over to develop a new system.
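Just to show the arithmetic (the 2-credits-per-screen and 1,000-unit figures are as stated above):

monthly_api_units = 1000        # non-Ultimate monthly quota (as stated above)
credits_per_screen = 2          # cost of one screen download
rebalance_days = 20             # trading days in a month

max_nodes = monthly_api_units // (credits_per_screen * rebalance_days)
print(max_nodes)                # 25 -> one screen per node per day uses the entire quota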

TL;DR: I think you cannot consider doing any serious machine learning (with more than 25 nodes) at P123 without the Ultimate membership ($2,400 per year) or buying additional credits. Someone please correct me if my math is wrong or I am missing a convenient download.

Hi Jim,
I have not done much with machine learning, so I apologize if this is a dumb question, but I always assumed that when a P123 user created a model using machine learning, the final product would be a live strategy in P123 with a ranking system, buy/sell rules, etc. created using the info they gathered during the machine learning project. Are you saying that your process, once your model goes live, will be to download the FRanks, ZScores, technical data, etc. from P123 and use that data in some other system to rebalance your portfolio?

Don't forget that there is no way to get raw fundamental data out of DataMiner unless you have a vendor data license. You can only get FRank, ZScore, etc. and technical data.

If you want to download “the same factors used to train the model as columns and again ticker and date as index,” then ScreenRun is probably not helpful since it does not return any factor data. It only returns UID, Ticker, Name, Last, Rank, MktCap, SectorCode and IndCode. Doesn't the DataUniverse operation in DataMiner work for you for this purpose?

Dan,

I am trading a model like this now that is doing very well, so this can be done. Sometimes machine learning can be used to determine the weights in a ranking system, as you suggest…

Let me briefly add that Duckruck was trying to help do this in many of his posts. He wanted to use linear models like PCA or elastic net whose results could be put back into the ranking system. I will let Duckruck expand on this, but I believe he was trying to give sound advice to P123 on how to maximize the number of machine learning models that can be used to build a ranking system.

But I do not believe you can do that with random forests, boosting or neural nets (non-linear models).

Yes, that is what I am saying for most models, yes.

I definitely understand this. For boosting, random forests and neural nets, ranks will do the same thing as the fundamentals. You just have to have the order correct, and the ordering is the same whether you use the fundamentals or the ranks of the fundamentals.

I do believe that the API and DataMiner were started for machine learning by Marco when I convinced him of this (that ranks work for most of us). I do not need fundamentals. For boosting and random forests I am very sure of this; in fact, I am 100% certain.

BTW, z-score is just as good as fundamentals for regressions.

Another way to put it: I am extremely happy with the DataMiner download!!! Thank you particularly for making this available for macOS.

So, again, I am really bad with DataMiner for some reason. But I will learn it, and I like the flexibility.

It can be done with the screener, but it is really FAR from ideal: pretty labor-intensive, involving sorting and concatenation. I was running a boosting system before DataMiner using screener downloads and I hated it. That was one of the reasons I stopped, but it worked pretty well.

Can you give me some sample code with DataUniverse? I will look at Marco's writings too. I hope you are right that it will be better.

Much appreciated,

Jim

Thanks for explaining your use case.

This thread had an example of the DataUniverse operation in DataMiner and the output:
https://community.portfolio123.com/t/thank-you-dan/63529/2

Let me know if the DataUniverse operation in DataMiner doesn't work for you. The only case I can think of that would be an issue is if you wanted to pull data for weekdays. The DataUniverse operation has parameters for the start and end dates and frequency (i.e. weekly, 2 weeks, etc.). The script runs using weekend dates only.

If that is an issue, then an option is to use the Data_Universe endpoint in the API directly instead of DataMiner. That endpoint has a parameter for the asOfDt, not a date range like DataMiner has. I just did a quick test and when I run it on different weekdays in the same week, my ranks for fundamental and technical data did change each day. ← 8/5 update: this statement about it returning daily data was incorrect. It returns weekly data.

You would have to write code of course, but you seem comfortable with Python and this is pretty easy to use since the output can be a dataframe. Let me know if you want to try that and I will send you a sample Python file. This might be a better solution for you since it eliminates the step of exporting the data to a file and then importing it to use in your script.
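Something along these lines (a rough, untested sketch; it assumes the p123api package's Client and its data_universe method, and the parameter names shown here, asOfDt, universe, formulas, includeNames, should be checked against the API documentation):

import p123api

# Assumes API credentials from your account settings.
client = p123api.Client(api_id='YOUR_API_ID', api_key='YOUR_API_KEY')

params = {
    'universe': 'DJIA',                       # the universe to pull data for
    'asOfDt': '2023-08-03',                   # the rebalance date you want data for
    'formulas': [
        'FRank("EarnYield", #ALL, #DESC)',    # one column per feature the model was trained on
        'ZScore("Pr2SalesQ", #All)',
    ],
    'includeNames': False,
}

resp = client.data_universe(params)           # the raw response
print(resp)                                   # inspect the layout before building a DataFrame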

I am interested in a Python API implementation of what Jim is talking about so I can write to a pandas DataFrame. I started writing one that would pull from an existing ranking system using rank_ranks. Is there any benefit to data_universe over rank_ranks?

Also, can I pull data like future returns for my custom universe? To my understanding, the alpha of the ranking system vs. the universe it uses is one of the best performance metrics.

Thanks,
Jonpaul

Dan,

For training the models, your recent example with “Operation: Ranks” works very well. I cannot immediately think of what else I need to train a model. I will look at DataUniverse too for training models.

What I am interested in, in addition to training a model, is what I would need today to rebalance an already trained model. If I had a trained random forest model that I wanted to use today (a Friday) to pick stocks or rebalance a port with the most recent FactSet data uploaded to P123's server overnight, would DataUniverse work? Your example did not work with 8/4/23 or 8/3/23 entered as dates, and you talk about weekend data for DataUniverse in your post. See below for what I entered. There was no output.

Obviously the screener is built to give us today's ranks. It works, but it is cumbersome, as you only get one column of ranks: one factor at a time for a rebalance today. I think any serious machine learner (not just me) will have a use case for getting the most recent data, whether through the screener or through downloads that have more than one column of recent data (each column a factor or node). I do not see how you can do serious machine learning without rebalancing a port with the most recent data.

Once a model is trained, it would be nice if a machine-learning rebalance did not take a lot longer than rebalancing a regular port with P123 today. Thank you for helping to facilitate that with DataMiner.

Thank you for your help. I am sure I am missing something, and what you have shown me already is a big help… What I tried is below:

Main:
    Operation: DataUniverse
    On Error: Stop
    Precision: 4

Default Settings:
    PIT Method: Prelim
    Start Date: 2023-08-03
    End Date: 2023-08-03
    Frequency: 1Week # ( [ 1Week ] | 2Weeks | 3Weeks | 4Weeks | 6Weeks | 8Weeks | 13Weeks | 26Weeks | 52Weeks )
    Universe: DJIA
    Include Names: false
    Formulas:
        # Target, i.e. future excess returns
        - FutRelRet_SPY: FutureRel%Chg(20,GetSeries("SPY")) # 4 week future total return relative to the SPY ETF
        - FutRelRet_Ind: FutureRel%Chg(20,#Industry) # 4 week future total return relative to its industry
        # Other 'future' return functions: FutureRel%Chg_D() FutureRel%Chg_W() Future%Chg() Future%Chg_D() Future%Chg_W()

        # Factors. Up to 100 formulas.
        - FRank("EarnYield",#ALL,#DESC)
        - ZScore("Pr2SalesQ",#All)

Best,

Jim

Ok, I'm going to wade into a subject I know nothing about … yet.

Is this (DataUniverse/screener) really the best process for getting ranks out of p123? I mean, for a given ranking system, the rank module already provides rank values for all rank factors and all stocks in a selected universe. Those values are available for d/l as a csv file.

The only thing missing is the target field(s).

Isn't there an API endpoint to do such a thing?

If there were, the flow would be

  1. create an investigatory factor/feature ranking system
  2. d/l multiple rank/target values for some time period
  3. train your model
  4. and then generate the final ML optimized ranking system from the factors the ML discovered


Not that I don't appreciate it: absolutely none of this would be even remotely possible without P123 AND without Dan's help. I absolutely appreciate that and will never forget it. Truly appreciated. Full stop.

But if I may, one-click rebalancing while the coffee is brewing, while I am trying to wake up after a night on call and remember whether I have to go to the hospital or the office first, would not be hated by me. Charge me for the download. I will happily cover any additional costs incurred by P123 for large downloads. No one else should be footing the bill for my downloads, for sure. But from a business perspective, I would set things up so a beginner machine learner could be attracted to P123 without a $2,400 Ultimate membership. Would a recent college grad just put that on a credit card and hope it was mostly paid off before the next year rolled around? I do not have an MBA (I do own a business and take calls in the middle of the night, from a service perspective, however). So who am I to say what is best.

With good service in mind though, I wonder if Dan ever does any remote IT services (for my electronic medical records)? I truly recognize and appreciate that this would not even be possible without P123. But Python is an integral part of P123 now for everyone. I think there are machine learners out there to be recruited to P123, as well as old-timers who remember how to use dropdown menus.

Jim

Jim,

If data_universe or rank_ranks in the Python API returned the up-to-date information for the week, would that work for your use case?

Dan,

I ran both the rank_ranks and the data_universe Python API calls and I am not getting values for that day. Am I missing something?

Code for data_universe (I have it as a class, but that should not affect things):

Code to run it:
date = '2023-07-24' # this is a Monday
ranking_univ = downloader.download_universe(date)

Output for both with some formatting: note that on that day TSP closed at $2.1 not $2.23

Next day in case I don’t understand what close(0) does:

We can see that none of the values changed even though I changed the date, including close(0) and the market cap (which is formula2).

Thanks,
Jonpaul

Jonpaul,

Up-to-date data for the week would be good for Mondays, yes.

When I get a system I like, I also tend to run a weekly rebalance on Tuesdays, another weekly rebalance on Wednesdays, etc. So five ports, each with a weekly rebalance. This should help with any slippage problems and may tame the market volatility slightly. I am not saying it has a huge effect on volatility, but if a stock has a large drop on Monday it makes me feel better if it shows up in a Tuesday rebalance and I can buy it at a lower average price.

I would want overnight data for each port. I would not want to be buying on a Thursday with data from the last week.

BTW, you clearly have the skills to learn about machine learning over at Scikit-Learn on your own. I would be interested in your general progress (without giving away any privately developed factors or whatever you do not wish to share).

I will be happy to share most of my insights. But the cool thing about Scikit-Learn is you can test your ideas without having to get anyone else's opinion.

If you find a great way to rebalance please share it with me.

As near as I can tell, you have not found one yet? It seems possible to me that Dan might be working on something. Dan seems pretty open to other people's use cases. He made a special effort to make DataMiner available for the Mac, I think. Maybe not just for me, but he considered other people's needs when he did it, even though I do not get the impression that he uses a Mac himself. Assuming we have made our use case clear and it seems reasonable, I am sure he will at least consider this.

It is not impossible to do with ScreenRun. Let me know if you need any help with how to do it with ScreenRun.

Jon, so cool what you are doing! Thank you for your help.

Best,

Jim

P123,

I had a nice port a while ago (before DataMiner) that used XGBoost, something P123 is now working on nearly a decade later.

As you know, the screener is limited to 500 rows. So the screen download could not be used for my universe (because my universe had more than 500 tickers).

I asked then if I could get more rows. That could not be done immediately because of a contract with Compustat at the time. But the data companies generally have fewer limits on ranks, and I asked that this be considered as part of any future contract negotiations.

Nothing ever happened with the screen downloads.

In addition, Marco kept thinking machine learners needed fundamental data.

Several years later Marco came to understand the true potential of ranks for machine learners. He began developing AI/ML and made DataMiner and the API available to machine learners.

It seems like the API and DataMiner may be creating some revenue at P123. Marco has thanked me for some of my machine learning ideas. To be sure, none of that would have been possible without the awareness that ranks are as good as fundamentals for many machine learning applications.

Very nice P123!!! Truly I am grateful.

I do have ScreenRun now, which works better than copying the screen data from the web page. And Ranks with DataMiner is a HUGE improvement over how I had to get the training data before.

I get that you are developing something with AI/ML, but I am concerned that you are not getting input from people like Jonpaul now, and P123 is poorly equipped to take suggestions through the forum on machine learning. We will see if the person developing AI/ML has thought of everything when it is unveiled. But P123 could easily make itself a machine learning heaven now, regardless of what is developed separately with regard to AI/ML. And that could be a revenue stream now through DataMiner, the API, and any other downloads that P123 can keep track of and charge for.

P123, I do think Jonpaul's use case is very mainstream for machine learners. You should continue to address his questions in the forum until he can do advanced machine learning. His mainstream ideas will serve to attract other machine learners, I believe.

It is good business if nothing else. And thank you for the improvements over what I was trying to do years ago, when I was in Jonpaul's situation. I find P123 useful now and I am grateful. It could be much better at no additional cost, without a doubt.

None of this is new; in fact, it is very old at P123. I have made an effort in this regard for nearly a decade and P123 has adopted many of my ideas along the way. What Jonpaul wants to discuss has always been crucial to any machine learning. It is fine to be able to train a model, but then you have to be able to use it. Or not. But then many will ask: why use the API to train my models and upgrade my membership to Ultimate? The ranking system works okay.

TL;DR: I am one of those asking the question above. I have not upgraded to Ultimate, BTW. I would like to be able to use the systems I develop (with a rebalance) for some reason. Is that weird?

Best,

Jim

You can use 'asofdate' as a formula. Doing that, I think you'll find the date is always that of the past Saturday. So, whatever Friday's close was, you'll see that for each day of the following week via the API calls.

Since I'm interested in rank values, I would like to know if the Sunday night update is used in any way.

Overall, the api wrapper rank_ranks() provides all the data I feel I need for now.

EDIT: it looks like data_universe() returns an asofdate that matches asOfDt, but the close(0) data is from the prior Friday. At least rank_ranks() gets it right, i.e. the returned asofdate and close(0) data align.
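The check itself is just something like this (a sketch, assuming the p123api Client mentioned earlier; 'asofdate' is passed as a formula exactly as described above and the parameter names may need adjusting):

import p123api

client = p123api.Client(api_id='YOUR_API_ID', api_key='YOUR_API_KEY')

params = {
    'universe': 'DJIA',
    'asOfDt': '2023-08-03',                 # a Thursday
    'formulas': ['asofdate', 'Close(0)'],   # asofdate shows which data date was actually used
}

resp = client.data_universe(params)
print(resp)   # compare the returned asofdate and close(0) values with the requested date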

Walter,

I fully agree that what P123 provides is excellent and may not need any changes or additions for training data (from my point of view at least).

But I rebalanced a port yesterday (a Friday) with P123 using their ranking method (a cool thing that I give P123 credit for making possible).

But if I were using a random forest, I would need data from the Thursday night/Friday morning FactSet uploads to P123 (or use data from last Friday's close, or maybe Sunday's update, as you are asking about, which is not even clear to you or me).

Not having that in an easy download (other than ScreenRun) would be a problem for anyone running a random forest with 150 nodes. Am I wrong about that? Is there any serious question about that?

So it may be a little different for me, but Jonpaul has said he is looking at random forests and Whycliffes is now looking at 150 nodes.

A real problem, IMHO, if P123 is serious about attracting people who do machine learning to its platform.

Is there a different way than ScreenRun to get recent data? I have asked P123 this question for a decade now (at the time, asking to get more than 500 rows from a screen).

Honestly, I have bypassed some of that need with a unique system that I have had to develop without that option, but not everyone is going to stick around for 10 years at P123 to figure out their own useful (but somewhat limited) methods.

So, I do not really care that much myself anymore, but P123 should appreciate that Jonpaul is making a sincere and polite effort to have his very mainstream ideas addressed by P123. P123 will be shooting itself in the foot by not continuing to engage him until he is fully satisfied, IMHO. It is not anything that will cost P123 one cent, and it will generate revenue through increased demand by machine learners for the API and DataMiner.

Jim

Walter: Yes, the rank_ranks endpoint in the API can be used to return the ranks as well as the values for other formulas like future returns.

jlittleton: Thanks for running that test. You are correct that the data_universe endpoint is returning weekly data. I didn't take screenshots of my results, so I don't know exactly what I tested. I know I saw the values (FRank results) change slightly day to day, but I can't reproduce that now. The lead API developer is out next week, but I will talk with him when he gets back to see what the options are regarding returning daily data from the API.


Dan, you are the best.

A true asset to P123. And if anyone were to ask I think a raise and a couple of new titles are in order!!!

Can we expect that rank_ranks() will miss the Sunday night data update? If that’s the case, then the returned values may not match what is used for Monday morning rebalancing. Is there a good use case for rank_ranks()?

I think the choice between data_universe and rank_ranks mainly depends on whether you want to create a ranking system on the website which contains all your factors. If so, then use rank_ranks so you don't have to write out all the formulas in the API script.

The only other distinction I can think of is how NAs are handled. In rank_ranks, factors like FCFYield would be in the ranking system. The ranking system has a setting for the ranking method where you can specify whether NAs are treated as negative or neutral.

In data_universe you would have a formula like FRank("FCFYield"). FRank doesn't have an option for treating NAs as 'neutral'. From the FRank help file:
“incl_na: Whether to include stocks with NA values in the rank
#InclNA - include NA values in the set to be ranked (default)
#ExclNA - exclude NA values from the set to be ranked, assigning NA to such stocks
NA’s, if included, are always placed at the bottom of the array and all get the same percentile.”

There are functions available to return future performance results. They are in the Technical, Performance section of the Reference and have names that start with ‘Future*’. You can add these in the formulas section for data_universe or the additionalData section in rank_ranks.
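For example, something like this (again a hedged, untested sketch using the p123api Client; 'rankingSystem', 'asOfDt' and 'additionalData' are the parameter names as I understand them and should be verified against the API reference, and the ranking system name is hypothetical):

import p123api

client = p123api.Client(api_id='YOUR_API_ID', api_key='YOUR_API_KEY')

params = {
    'rankingSystem': 'My ML Feature System',        # a ranking system holding your factors (hypothetical name)
    'universe': 'DJIA',
    'asOfDt': '2023-07-24',
    'additionalData': [
        'Future%Chg(20)',                           # 4-week-ahead return, e.g. as a training target
        'FutureRel%Chg(20, GetSeries("SPY"))',      # future return relative to SPY
    ],
}

resp = client.rank_ranks(params)
print(resp)   # node ranks plus one column per additionalData formula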

I’ve also used close() w/ negative bar count e.g. close(-20). I need to double check that it’s returning the correct values, but it didn’t throw an error.

I just checked and Close(-1) actually returns the Monday of the week that the specified date is in. So Tuesday would be -2 and Friday would be -5. This works for things that you can specify an offset on, like price and volume.

However, for factors where you cannot set an offset, you get the Friday value (as far as I can tell).