Literally EVERYONE AT P123 is an advocate for what K-fold cross-validation and subsampling do

TL;DR: THIS IS WHY I DO MACHINE LEARNING MORE THAN ANY OTHER REASON!!! Yuval has helped popularize two great ideas at P123. My only point has always been that they don't necessarily have to be done in spreadsheets. Is it bad that I want the computer to do some of this?

I agree 100% and always have. I use K-fold cross-validation for this because Python will do it for me automatically. Maybe it's just me, but I don't mind the computer doing some of this for me.
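
To make that concrete, here is a minimal K-fold sketch of the kind of thing scikit-learn automates. The data and the Ridge model are just stand-ins for illustration, not a recommendation:

```python
# Minimal K-fold cross-validation sketch; X, y, and Ridge are placeholders.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.random((1000, 5))   # stand-in feature matrix
y = rng.random(1000)        # stand-in target

kf = KFold(n_splits=5, shuffle=False)  # shuffle=False keeps the row order intact
scores = []
for train_idx, test_idx in kf.split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))

print(f"Out-of-fold R^2 per fold: {np.round(scores, 3)}")
```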

@yuvaltaylor does another cool thing that seems similar but is actually different. You may want to do both, I think. I do, anyway.

Yuval and many other people use mod() on stock ID. This is clearly different.

K-fold will divide things up by date if your CSV file is sorted by date (and generally you should sort it by date), while mod() divides up the stocks (possibly mixing up the dates).
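
Here is a sketch of the difference, using hypothetical 'date' and 'stock_id' column names: K-fold on a date-sorted file gives contiguous blocks of dates, while mod() gives buckets of stocks that each span every date:

```python
# K-fold on a date-sorted file vs. mod() on stock ID.
# 'date' and 'stock_id' are hypothetical column names.
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "date": pd.date_range("2020-01-03", periods=12, freq="W").repeat(10),
    "stock_id": list(range(10)) * 12,
}).sort_values("date").reset_index(drop=True)

# K-fold without shuffling: each fold is a contiguous block of dates.
for fold, (_, test_idx) in enumerate(KFold(n_splits=4).split(df)):
    dates = df.loc[test_idx, "date"].dt.date
    print(f"Fold {fold}: {dates.min()} to {dates.max()}")

# mod() on stock ID: each bucket holds some stocks across ALL dates.
df["bucket"] = df["stock_id"] % 4
print(df.groupby("bucket")["date"].nunique())  # every bucket spans all 12 dates
```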

They have different purposes, and I think I would get bogged down trying to explain them here. I am trying to be a better poster, and if you don't understand ChatGPT on this point I will be unable to do better. You will just have to accept that you are actually doing different things (right?), and that they have different effects and different uses. Another way to see that they are different is that ChatGPT may have you use both K-fold cross-validation and bootstrapping or subsampling in the same code. If they were doing the same thing, you would only need one of them.

But stock ID and mod() can be duplicated in Python with subsampling and bootstrapping. You can do bootstrapping with a separate library function and/or with just a couple of lines of random sampling with replacement.
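
For example, a minimal sketch of both routes, reusing the stand-in X and y from the earlier sketch: sklearn.utils.resample is the library helper, and the rng.choice line is the couple-of-lines version:

```python
# Bootstrapping two ways, reusing the stand-in X and y from above.
import numpy as np
from sklearn.utils import resample

# The library route:
X_boot, y_boot = resample(X, y, replace=True, random_state=42)

# The couple-of-lines route: draw row indices with replacement.
rng = np.random.default_rng(42)
idx = rng.choice(len(X), size=len(X), replace=True)
X_boot2, y_boot2 = X[idx], y[idx]
```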

Sklearn's random forest will do bootstrapping by default. You can subsample with XGBoost.
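
A quick sketch on the scikit-learn side (bootstrap=True really is the default, so you get it without asking):

```python
# scikit-learn bootstraps each tree's training rows out of the box.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500, bootstrap=True)  # True is the default
rf.fit(X, y)  # X and y from the stand-in data above
```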

When you use subsampling with XGBoost, it becomes stochastic gradient boosting.

You can do stochastic gradient boosting with XGBoost without mod(), stock IDs, or subindustries. It speeds things up, too.
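
A minimal XGBoost sketch of what I mean; the hyperparameter values here are illustrative, not recommendations:

```python
# subsample < 1.0 turns XGBoost into stochastic gradient boosting:
# each boosting round trains on a random fraction of the rows.
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    subsample=0.5,         # 50% of rows per round -> stochastic gradient boosting
    colsample_bytree=0.8,  # optionally subsample columns, too
)
model.fit(X, y)  # X and y from the stand-in data above
```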

Friedman's stochastic gradient boosting paper came out in 1999, and boosting and subsampling go back well before that. So, yep: good idea. I am not disputing that it is a good idea, and I never have.

I think working toward a consensus on what we all actually agree upon would not necessarily be a bad thing. Maybe the person doing AI/ML could automate some of this for those of you using the optimizer, since absolutely everyone agrees on it. Or you can just use Python if you want it automated.

Jim

BTW, can I have an updated overnight DataMiner download for rebalancing the things I do with Sklearn, so that I don't have to try to convince P123 that they have had a great idea all along and that automating it can sometimes make things easier and better? Or that it might be possible to take these great ideas and improve them somehow?

With the download I can just do it myself. That way you can discuss different ways to do this in a spreadsheet in the forum without my suggestions making you wonder whether it could be improved upon, augmented, made faster, or just made easier in some way.

Does anyone doubt that the computer could, at a minimum, do more of this for you? Like 10,000,000 times overnight?

And make no mistake: you will benefit from doing bootstrapping 10,000 times or more (compared to however many times you can do it with mod() in your lifetime).
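
A sketch of what I mean by 10,000 times, on made-up stand-in returns; the point is just that the loop is cheap for a computer:

```python
# 10,000 bootstrap replicates of a mean, on made-up stand-in returns.
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(0.002, 0.02, size=1000)  # hypothetical weekly returns

boot_means = np.array([
    rng.choice(returns, size=len(returns), replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: [{lo:.4f}, {hi:.4f}]")
```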

Just stick with mod() without my interference if you want. The DataMiner download will keep me occupied while you are posting about spreadsheets.