Boosting your returns

Hi Jim, after my last post on data leakage, you asked where I come from and about my last ML project, but please consider reading that post again and thinking through its possible consequences for your process.

I had a quick look at the JASP online materials (without testing the software). It seems the data-splitting process is random: you just specify a percentage for the test set and a validation method (some parameters may override the randomness, possibly the “test set indicator”). Since the “test set indicator” is set to none in your screenshot, it is likely that the test set and the validation subsets used in the k-fold validation are all picked randomly from a single time period and a single ticker universe. Please check this in the full JASP documentation, because if it really works this way, your training set, your k-fold validation subsets, and your test set are randomly intertwined in time and in ticker universe, whereas they should be independent in at least one dimension of the double index (date, ticker). That can result in massive data leakage, and the results may be very misleading if I am interpreting what JASP does correctly. Moreover, your P123 simulation may also cover the same time period (?).
You should look into this “test set indicator” parameter to see whether it is possible to segregate the training/validation/test sets into different time periods and/or different ticker subsets.
If that turns out not to be possible, JASP (in its current version) may not be the right tool for financial time series.
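To make the concern concrete, here is a minimal sketch (pandas, with made-up dates and tickers — none of this is from JASP or P123) of the kind of split I am describing for a (date, ticker) panel, where the training and test sets never share a date:

```python
import pandas as pd

def time_split(panel, train_end, test_start, date_col="date"):
    """Split a (date, ticker) panel so train and test never share a date.

    Rows strictly between train_end and test_start are simply dropped,
    which leaves a buffer between the two periods.
    """
    train = panel[panel[date_col] <= train_end]
    test = panel[panel[date_col] >= test_start]
    return train, test

# Toy panel: two tickers observed on six month-end dates.
dates = pd.to_datetime(["2015-01-30", "2015-02-27", "2015-03-31",
                        "2015-04-30", "2015-05-29", "2015-06-30"])
panel = pd.DataFrame({"date": list(dates) * 2,
                      "ticker": ["AAA"] * 6 + ["BBB"] * 6})

# Train through March 2015; test from May 2015 on.
train, test = time_split(panel, "2015-03-31", "2015-05-01")
```

A random split, by contrast, would mix March rows of AAA into training while March rows of BBB land in the test set — which is exactly the intertwining I am worried about.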

I made a similar mistake at the beginning of my “last ML project”: I used the default random-splitting behavior of a sklearn function and only caught it after a few days. After correcting that, my super model with its flawed 80% predictive power turned out to be much less attractive.
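The default I tripped over is easy to reproduce (toy data below; the exact sklearn function I used back then is beside the point). `train_test_split` shuffles rows by default, so with ordered data the “test” set contains past observations; passing `shuffle=False` keeps the test set strictly at the end:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten ordered observations standing in for a time-ordered series.
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# Default behavior: rows are shuffled before splitting, so "future"
# observations can end up in the training set.
X_tr_rand, X_te_rand, _, _ = train_test_split(X, y, test_size=0.3,
                                              random_state=0)

# shuffle=False preserves the order: the test set is the latest rows.
X_tr, X_te, _, _ = train_test_split(X, y, test_size=0.3, shuffle=False)
```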
Just some random thoughts from a guy living in the third world with no professional experience in ML who wants to help.

Frederic,

You certainly make some points worth considering.

So first, if one wants to do it your way: JASP gives you the ability to pick and mark the data that you want in the test set. And this is how I usually do it in my studies.

I chose the simpler approach for illustration. While simpler, I believe it is valid and is the preferred method for some situations.

So one can, if they want, train and validate on earlier data (say 2005 to 2015) and test on data from 2015 to 2020. But I disagree that one is always required to do it this way.

I do agree that one should do this with time-series data. And the “embargo period” that you mention is also a good idea. I assume you know what I mean by “embargo” but perhaps you call it something different.
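For readers unfamiliar with the term: an embargo drops the observations just after the training period, so that labels built from forward returns cannot straddle the train/test boundary. A minimal sketch (plain Python, made-up dates — one of many ways to do it):

```python
import datetime as dt

def embargoed_split(dates, train_end, embargo_days=30):
    """Train on dates up to train_end, then skip an embargo window.

    Dates falling inside the embargo window are discarded entirely, so a
    label computed from, say, one-month forward returns in late training
    cannot overlap the test period.
    """
    test_start = train_end + dt.timedelta(days=embargo_days)
    train = [d for d in dates if d <= train_end]
    test = [d for d in dates if d >= test_start]
    return train, test

# Twenty weekly dates starting in January 2020.
dates = [dt.date(2020, 1, 6) + dt.timedelta(weeks=i) for i in range(20)]
train, test = embargoed_split(dates, dt.date(2020, 3, 2), embargo_days=30)
```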

Mine was NOT time-series data.

I completely disagree with what you are saying for a cross-sectional study on data that is i.i.d. or ergodic, and I cannot imagine where you would get that.

For example, if a company wants to test whether a green web page gets more clicks than an orange web page, do they have to separate out the test data to be later in time than the training and validation data?

Simple answer: no. And again I do not understand where you would get that.

The company could test their web-pages with exactly the same method I used—using JASP if they want.
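The point about exchangeability is easy to illustrate (simulated click data below — the colors and rates are made up, not from any real study). When rows carry no time index, any row is exchangeable with any other, so a purely random split cannot leak “future” information into training — there is no future:

```python
import random

random.seed(0)

# Simulated i.i.d. click data: each row is (color, clicked).
rows = ([("green", int(random.random() < 0.12)) for _ in range(500)]
        + [("orange", int(random.random() < 0.08)) for _ in range(500)])
random.shuffle(rows)

# With exchangeable rows, a random 80/20 split is legitimate: shuffling
# changes nothing that a model could exploit.
split = int(0.8 * len(rows))
train, test = rows[:split], rows[split:]
```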

I think a full discussion of this is beyond the scope of this thread. [b]But I think the people who designed JASP kind of knew what they were doing when they made this method an option for cross-sectional studies.[/b]

But if you prove me wrong, maybe we can call JASP, as well as Amazon and a few other operators of web pages in Silicon Valley, and tell them how they should be doing this.

P123 can hire a professional to supervise this if they want but I do not think some committee from the present forum has the understanding to supervise how others use this in any intelligent manner.

That was my only point in my previous post and it seems even more true now.

Jim

Delete or edit, I guess. I had a separate point, but I think I will stick with the idea that the present forum is not capable of deciding this.

For sure, I am not applying for the job of moderator of machine learning in the forum. I was glad to try to help out (when asked), as I did when I shared my knowledge of Colab and XGBoost and TensorFlow (and Anaconda, which Steve did not appreciate much compared to Colab) with Steve—and I guess Marco too, as he said: “I’d also want to try it myself.”

I think Marco and others have an easy and valid first-way to do this for cross-sectional data. I am not sure whether Marco has moved on to XGBoost yet or not. If so, it is probably with Steve’s help.

People can modify what I have done above to their heart’s content. Expand upon it. Decide for themselves.

And Marco, if you are considering putting some committee from the present forum (as it is now) in charge of this you should just shelve this project. It will not end up being worth anything and it will be a waste of everyone’s time.

And Marco, thank you. We do not have anyone c*ssing and calling everyone a quant yet. Or lecturing us about how the Theil-Sen estimator is the only acceptable quantitative method. BTW, I do not see you stopping anyone from using the Theil-Sen estimator if they want. I know you had a lot to do with that. I think you will find that to be a wise business decision (or not). You can still shelve the project if you think that is best—with no complaints from me.

Jim, your ideas are valuable and certainly not a waste of time.
It’s also the purpose of a forum to confront different points of view in a civil way. We don’t need police or a committee.
You want to promote ML in P123, and I am absolutely on your side. But we have to be a bit careful when we write here, because forum threads may be taken as a reference. In this particular case, your objective is not very explicit. You are showing parameters of JASP that may result in using ML algos for curve-fitting rather than for generating a predictive model. Then you show the feedback of the curve-fitting process in a simulation. I think I understand why you do that, but it may be misleading for readers who arrive here without understanding everything involved behind it.

Frederic,

If you look back at my threads you will find that I have been a strong advocate of certain people picking up a book. I kind of get that. Or, even better, P123 could hire a consultant again if they want to actually be involved in how any AI/machine learning models are being used. They have done it before.

Maybe P123 could get professional confirmation as to whether a rank (as long as this “transformation” preserves the order of the data) is as good as raw data for a boosting model, for example.
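On that rank question, here is a quick sanity check one can run (sklearn, synthetic data — a sketch, not a proof, and it sidesteps real-world wrinkles like ties or ranks computed cross-sectionally per date). A regression tree splits on thresholds, so its fit depends only on the ordering of a feature’s values, not their scale; a boosting model stacks many such trees, so a strictly monotone transform like a rank leaves the learned partition unchanged:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

# Rank transform: strictly monotone in x (continuous draws, so no ties),
# hence it preserves the ordering of every pair of values.
ranks = x.argsort().argsort().astype(float)

# Fit identical trees on the raw feature and on its ranks.
raw_pred = DecisionTreeRegressor(max_depth=3, random_state=0) \
    .fit(x.reshape(-1, 1), y).predict(x.reshape(-1, 1))
rank_pred = DecisionTreeRegressor(max_depth=3, random_state=0) \
    .fit(ranks.reshape(-1, 1), y).predict(ranks.reshape(-1, 1))
```

The two sets of training predictions come out identical, since both trees choose the same partitions of the samples. The same invariance does not hold for linear models or neural nets, which do use the scale of the inputs.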

I am sorry you do not get it. I do think Steve has gotten it. He has been willing to put some time into it on his own.

I am happy to help Philip, Steve and others who may want to take what I have done and expand upon it and improve upon it to see for themselves. To help them get started if they are interested. I do not plan to fully expand upon all of the possible improvements in this thread. Or go into all of the theoretical justifications for my methods here. Members can find their own (probably better) methods if they do not like mine.

It isn’t practical or even possible in this format anyway. My posts are long enough as it is. And I think Marco sees that.

This discussion is already boring me and I like machine learning!!!

I await your more complete, peer-reviewed study when you have your own findings to contribute. And in the meantime, please, take over with your own methods and findings on this forum.

Until then, I think Marco is smart to allow members the opportunity to find their own use for the API even allowing them to make some mistakes along the way.

Unless Marco wants to hire a professional consultant, it is my opinion that the “peers” (as in peer-review) on this forum are not up to it.

This could change. I think P123 can attract some people well-versed in machine learning. Some are already here (but not posting in the forum). They have contacted me by email. I am not holding my breath waiting for them to share their ideas with us on the forum, however.

Marco can continue to help them with data as long as it fits his business model, I think. Or not. I leave it to Marco.

Best,

Jim

I just need to be able to build the target column (i.e., forward one-month returns) from the API without running out of room. I want data for a 3000-stock universe with, say, 20 nodes as rankings, plus the one-month forward return for each stock, over a period of 10 years. If this can’t be automated, I refuse to pull it one week at a time, i.e., pull one week, reset my API key, and then repeat for 10 years.

If someone can show me that (Steve’s python code was close, but my API key keeps running out) I will happily expand on what Jim has started and report back.
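For what it’s worth, the week-at-a-time problem is mostly bookkeeping. Below is a hedged sketch in plain Python — `fetch_week` is a hypothetical placeholder for whatever P123 API call returns one date’s ranks and forward returns, not a real P123 function — that appends each week to a CSV and skips weeks already on disk, so a quota reset does not force re-pulling everything:

```python
import datetime as dt
import os

def weekly_dates(start, end):
    """Yield dates from start to end inclusive, one per week."""
    d = start
    while d <= end:
        yield d
        d += dt.timedelta(days=7)

def pull_incrementally(start, end, out_path, fetch_week):
    """Append one week of rows per fetch, skipping weeks already on disk.

    Each output line starts with the ISO date, so on restart we can read
    the file back and resume where the quota cut us off.
    """
    done = set()
    if os.path.exists(out_path):
        with open(out_path) as f:
            done = {line.split(",")[0] for line in f if line.strip()}
    with open(out_path, "a") as f:
        for d in weekly_dates(start, end):
            key = d.isoformat()
            if key in done:
                continue  # already pulled in an earlier run
            for row in fetch_week(d):
                f.write(key + "," + ",".join(map(str, row)) + "\n")
```

Run it, let the key run out, reset the key, and run it again with the same arguments — only the missing weeks are fetched.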

P123 would be wise to waive any fees and facilitate this for Philip or anyone else willing to share at this point.

Whatever Philip finds would be well worth any network or data costs incurred. P123 would either know to put this project on the back-burner or possibly get an early advertisement for their AI/machine learning project should they decide to continue it. Either way, well worth the investment.