K-fold cross-validation: The most widely used method for addressing overfitting on the planet

@Marco,

TL;DR: P123 cannot expect to attract machine learners without giving access to this somehow. The best evidence of the need is the overfitting of the designer models, which have produced no excess returns (on average, over the last two years). K-fold cross-validation is the most widely used method for preventing overfitting on the planet!

If P123 will not make it easy to download updated data and run k-fold cross-validation in Python, I wonder whether P123 might make it available as a feature within its platform.

Here is the request to make it easier with Python: Data download the day of rebalance for machine learning - #13 by Jrinne

Use case: machine learning and AI are not possible without some sort of cross-validation; serious machine learning is never done anywhere without it. If P123 wants to attract machine learners, this will be necessary. And it would not hurt if it were easy, I would think.
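
To make the use case concrete, here is a minimal sketch of plain k-fold cross-validation with scikit-learn. The data is randomly generated stand-in data (not P123 data), and the model choice is arbitrary:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data: 500 samples, 10 features, a weak linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 0.1 * X[:, 0] + rng.normal(size=500)

model = RandomForestRegressor(n_estimators=100, random_state=0)

# The in-sample fit looks great because the forest memorizes noise...
model.fit(X, y)
print("In-sample R^2:", model.score(X, y))

# ...but 5-fold cross-validation scores every sample out of sample,
# which is what exposes the overfitting.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kf, scoring="r2")
print("Out-of-sample R^2 per fold:", scores)
```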

If I am wrong about k-fold being the most used, then some of the other methods below, available in scikit-learn and/or implementable within the P123 platform, could be given higher priority if we want to be serious about machine learning:

Other Techniques: There are many other methods specifically designed to prevent overfitting:

  • Regularization: Techniques like L1 (Lasso) and L2 (Ridge) regularization directly address overfitting by adding penalty terms to the model’s loss function (see the sketch after this list).
  • Pruning: Used in decision trees to remove branches that have little power in predicting the target variable.
  • Dropout: Commonly used in neural networks to prevent overfitting by randomly setting a fraction of input units to 0 at each update during training.
  • Early Stopping: In iterative algorithms like gradient descent, training can be stopped early if validation performance stops improving, preventing overfitting.
  • Data Augmentation: Especially in deep learning, augmenting the data by applying random transformations can prevent overfitting.
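
Here is that sketch: a minimal example of L1 and L2 regularization with scikit-learn on synthetic data; `alpha` scales the penalty term added to the loss:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data: only 2 of the 50 features carry signal, so an
# unpenalized fit would happily chase noise in the other 48.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# alpha scales the penalty added to the loss function;
# larger alpha means stronger shrinkage of the coefficients.
for name, model in [("L1 (Lasso)", Lasso(alpha=0.1)),
                    ("L2 (Ridge)", Ridge(alpha=1.0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean out-of-sample R^2 = {scores.mean():.3f}")
```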

But you will never see machine learning done anywhere without one or more of these methods. It should not be a goal of P123 to restrict access to them. That would go by the name "unforced error," "shooting oneself in the foot," etc., if the goal is attracting machine learners.

Jim

I have not done the research to verify this yet, but the scikit-learn Python package probably does not split by stock ID by default (I don’t know how it would). My concern is that mixing companies across folds/splits would increase information leakage.

I am curious whether anyone else has looked into this. But unless someone has, I would recommend creating your own folds/splits using the stock ID; a minimal sketch follows.
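
Something like this, where the column names (`stock_id`, `date`, `feature`) are placeholders for illustration rather than anything P123-specific:

```python
import numpy as np
import pandas as pd

# Hypothetical long-format panel: one row per (stock, date).
df = pd.DataFrame({
    "stock_id": ["AAPL", "AAPL", "MSFT", "MSFT", "IBM", "IBM"],
    "date": pd.to_datetime(["2023-01-31", "2023-02-28"] * 3),
    "feature": np.arange(6, dtype=float),
})

# Assign each unique stock ID to exactly one fold, so a company's rows
# never straddle the train/validation boundary.
n_folds = 3
ids = df["stock_id"].unique()
rng = np.random.default_rng(0)
rng.shuffle(ids)
fold_of = {sid: i % n_folds for i, sid in enumerate(ids)}
df["fold"] = df["stock_id"].map(fold_of)

for k in range(n_folds):
    train, valid = df[df["fold"] != k], df[df["fold"] == k]
    # Verify no company leaks across the split.
    assert set(train["stock_id"]).isdisjoint(set(valid["stock_id"]))
```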

I think this is probably correct. You might be able to get scikit-learn to do that by making the stock ID the index.

ChatGPT is busy, but Bard seems to agree that it is possible, with this answer (clear even without the question, I believe): “Yes, you can make the stock IDs your index and use k-folds to make the stock IDs exclusive to one fold.”

I did not include the code for that as…

I have a concern about that, as the stock ID could appear in more than one row (with different dates). I suggested double indexing to resolve that:

A: "Sure, you can also use double indexing with k-fold cross-validation.

Double indexing is a way of assigning multiple indices to a single data point. In the case of stocks, you could use the stock ID as the first index and the date as the second index. This would allow you to track the performance of each stock over time.

To use double indexing with k-fold cross-validation, you would need to create a k-fold object that takes into account the double indexing. This can be done by using the GroupKFold class from the scikit-learn library.

Here is an example of how you can do this in Python:"

Q: And to be clear, each stock will be in only one fold?

A: “Yes, each stock will be in only one fold.”

Edit: I think the code is straightforward, and ChatGPT or Bard can reproduce it, so I deleted it.
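
For anyone who wants a concrete starting point anyway, here is a minimal sketch of my own (not the deleted answer), assuming a DataFrame with a (stock_id, date) MultiIndex; scikit-learn's GroupKFold keeps all rows of a given stock in a single fold:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical panel with a (stock_id, date) MultiIndex.
dates = pd.to_datetime(["2023-01-31", "2023-02-28", "2023-03-31"])
idx = pd.MultiIndex.from_product([["AAPL", "MSFT", "IBM", "GE"], dates],
                                 names=["stock_id", "date"])
df = pd.DataFrame({"feature": np.arange(12, dtype=float),
                   "target": np.arange(12, dtype=float)}, index=idx)

# The groups argument is the stock ID level of the index; GroupKFold
# guarantees each group lands in exactly one validation fold.
groups = df.index.get_level_values("stock_id")
gkf = GroupKFold(n_splits=2)
for train_rows, test_rows in gkf.split(df[["feature"]], df["target"], groups=groups):
    train_ids, test_ids = set(groups[train_rows]), set(groups[test_rows])
    assert train_ids.isdisjoint(test_ids)  # each stock is in only one fold
    print("validation stocks:", sorted(test_ids))
```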

Jim