TL;DR: Thank you, Pitmaster, for bringing up the issue of feature autocorrelation. I had previously missed that de Prado was addressing feature correlation in Advances in Financial Machine Learning with regard to cross-validation; I had been thinking he was talking about the target. But from the text: "Consider a serially correlated feature X…" I am not saying this is the final word, and it is only about cross-validation at that. I think you may have broader concerns beyond cross-validation. My cross-validation methods may well have benefited from this discussion.
So there are things we can do to improve our models, including addressing autocorrelation. That is a given.
benhorvath says it well here:
I think he is right about that. A wise professor of medicine once said to me: "The reason there are so many treatment options for this condition is that they are all flawed. If there were a really good treatment, we would all be using the same one." Encouraging words indeed, as I was about to start a treatment.
Maybe that is an analogy for the large number of published methods for predicting the market with machine learning. More simply: "It is hard," as benhorvath says.
So I have this question, mainly for myself. Knowing that I am a flawed individual using flawed methods, with limited time to perfect my models: is there a good way to test whether my model is really better than listening to Jim Cramer on CNBC before I fund it?
I note, when I ask this, that scikit-learn provides a cross-validation method called TimeSeriesSplit.
It seems it is intended to address, at least in part, this problem: "Time series data is characterized by the correlation between observations that are near in time (autocorrelation)".
I leave it to someone smarter than me to discuss how well the scikit-learn developers have addressed the problem of autocorrelation with this method. But I did use it before funding my present model (not a random forest), knowing it would never be perfect. I used it because autocorrelation can be a problem, I think, and I will continue to look for ways to mitigate it in my models and in my cross-validation of those models.
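For anyone who wants to see what TimeSeriesSplit actually does, here is a minimal sketch (my own toy data, not anyone's trading model). The key property is that every training fold consists only of observations that come before the test fold in time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(12, 1)  # 12 time-ordered observations
tscv = TimeSeriesSplit(n_splits=3)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Test indices always come strictly after all training indices.
    assert train_idx.max() < test_idx.min()
    print(f"fold {fold}: train {train_idx.tolist()} -> test {test_idx.tolist()}")
```

This is walk-forward in spirit: the training window grows and the test window slides forward, so the model is never scored on data older than what it trained on. Note that by default there is no gap between train and test; adjacent observations still sit on either side of the boundary.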
Addendum (for geeks like me mainly):
De Prado discusses this in depth in Advances in Financial Machine Learning.
An excerpt from 7.3 WHY K-FOLD CV FAILS IN FINANCE:
Leakage takes place when the training set contains information that also appears in the testing set. Consider a serially correlated feature X that is associated with labels Y that are formed on overlapping data: Because of the serial correlation, Xt ≈ Xt+1.
de Prado, Marcos López. Advances in Financial Machine Learning (p. 195). Wiley. Kindle Edition.
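To make the quoted point concrete, here is a small illustration of my own (not from the book): in a serially correlated series, each observation is nearly equal to its neighbour, so the lag-1 autocorrelation is close to 1. Adjacent observations split across a k-fold boundary therefore carry almost the same information.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
phi = 0.95  # AR(1) coefficient: x[t] = phi * x[t-1] + noise
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

# Lag-1 autocorrelation: correlation between x[t] and x[t+1].
lag1 = np.corrcoef(x[:-1], x[1:])[0, 1]
print(f"lag-1 autocorrelation: {lag1:.2f}")  # close to phi, i.e. near 1
```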
One just needs to understand:
- Leakage is a bad thing, and it can be caused by autocorrelation in k-fold cross-validation.
- Serial correlation is the same thing as autocorrelation in this context.
- An embargo, used with k-fold cross-validation and also with walk-forward cross-validation, is proposed in the book as a solution to the autocorrelation problem.
- Bagging, as used in random forests, is also proposed as a solution, at least for classifiers (de Prado, Marcos López. Advances in Financial Machine Learning (p. 196). Wiley. Kindle Edition.)
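To show the embargo idea, here is a minimal sketch of my own (not de Prado's implementation, and the function name is made up): a walk-forward split where a small "embargo" of observations after each training window is simply dropped before the test window begins, so serially correlated neighbours cannot sit on both sides of the boundary.

```python
def embargoed_walk_forward(n_samples, n_splits, embargo):
    """Yield (train_indices, test_indices) with an embargo gap between them."""
    test_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = k * test_size
        test_start = train_end + embargo  # skip the embargoed observations
        test_end = min(test_start + test_size, n_samples)
        yield list(range(train_end)), list(range(test_start, test_end))

for train, test in embargoed_walk_forward(20, 3, embargo=2):
    print(f"train ends at {train[-1]}, test runs {test[0]}..{test[-1]}")
```

The embargo width is a judgment call: it should cover roughly the span over which the serial correlation (or label overlap) persists. Too small and leakage remains; too large and you throw away data.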
Jim