Python API/ML Data Download Tips

yuvaltaylor · August 20, 2023, 4:33pm

I’m not an expert on machine learning but I have done a huge amount of backtesting, so I have a few suggestions to avoid this kind of thing.

Run your ML on multiple discrete universes, and run different ML algorithms on them. Don’t allow any overlap between universes, and don’t let anything learned in one universe run carry over to another. The goal would be to come up with, say, twenty wildly different systems. The final system (the one you use on the out-of-sample hold-out period) would somehow combine all the different systems.
Anything run on a five-year period is almost sure to perform badly out-of-sample. In all my correlation testing, a five-year lookback period performs the worst (even worse than three years, which benefits from a little factor momentum). I would advise using eight to ten years.
Always trim outliers, especially if you’re using regression.
Make sure to take transaction costs into account as much as you can.
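
As a rough illustration of the universe-splitting, outlier-trimming, and transaction-cost suggestions above, here is a minimal plain-Python sketch. Everything in it is an assumption made for illustration: the function names (`split_universes`, `trim_outliers`, `net_return`) are hypothetical, "trimming" is interpreted as clipping values to percentile bounds (winsorizing) rather than dropping them, and costs are modeled as a flat fraction of turnover. None of this is taken from any particular platform's API.

```python
import random


def split_universes(tickers, n_universes, seed=0):
    """Randomly split tickers into non-overlapping universes.

    Hypothetical helper: each ticker lands in exactly one universe,
    so no model trained on one universe sees another's stocks.
    """
    shuffled = list(tickers)
    random.Random(seed).shuffle(shuffled)
    # Striding through the shuffled list guarantees disjoint groups.
    return [shuffled[i::n_universes] for i in range(n_universes)]


def trim_outliers(values, lower_pct=0.05, upper_pct=0.95):
    """Clip values to percentile bounds (winsorizing).

    Assumption: "trim" is read as clipping extreme values,
    not removing the observations altogether.
    """
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[int(lower_pct * (n - 1))]
    high = ordered[int(upper_pct * (n - 1))]
    return [min(max(v, low), high) for v in values]


def net_return(gross_return, turnover, cost_per_unit_turnover=0.002):
    """Subtract a flat cost proportional to turnover.

    Assumption: a simple linear cost model; real costs depend on
    spreads, liquidity, and order size.
    """
    return gross_return - turnover * cost_per_unit_turnover
```

For example, `trim_outliers([1, 2, 3, 4, 100], 0.25, 0.75)` clips the extreme values to give `[2, 2, 3, 4, 4]`, and `net_return(0.10, 2.0)` turns a 10% gross return at 200% turnover into roughly 9.6% net. How to combine the per-universe systems is left open here, as it is in the suggestion itself.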