No need for raw scores

I had thought this but wanted ChatGPT to confirm it.

ChatGPT: “In multivariate regression, whether you use raw values or z-score normalized values, the predictions themselves will remain the same.”

So if you run regressions with predictions in mind, the raw scores are not needed. Full stop. Just some confirmation that what Marco, Riccardo, and others at P123 have done is incredibly cool!

The same goes for random forest and XGBoost, where only the order matters, and order is preserved by both ranks and z-scores. Neural nets do not even work without standardization, though I am not much of an expert on the best way to standardize inputs for neural nets. However, I suspect having the data already standardized as ranks or z-scores will be a good thing.
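You can see the regression claim directly with a quick scikit-learn sketch on made-up data: z-scoring the features changes the fitted coefficients, but the predictions come out identical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [10.0, 0.5, 100.0]  # features on very different scales
y = X @ [1.0, -2.0, 0.3] + rng.normal(size=200)

# z-score each column: subtract the column mean, divide by the column std
Xz = (X - X.mean(axis=0)) / X.std(axis=0)

pred_raw = LinearRegression().fit(X, y).predict(X)
pred_z = LinearRegression().fit(Xz, y).predict(Xz)

# The coefficients differ, but the predictions are the same
print(np.allclose(pred_raw, pred_z))  # True
```

This works because z-scoring is an affine transformation of the columns, and OLS with an intercept is invariant to that when the model is refit on the transformed data.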

But if I understand correctly, rank and z-score will be different. Rank is actually standardized each Monday, while the z-score is standardized over the entire period of the download.

I would like clarification if I am wrong on that. But as I understand it now, this gives you two different good options and is pretty complete.
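To illustrate what I mean with a toy panel (made-up tickers and factor values): ranks reset within each date, while the z-score pools the entire download period.

```python
import pandas as pd

# Toy panel: two weekly dates, three tickers, one factor (all values made up)
df = pd.DataFrame({
    "date":   ["2024-01-01"] * 3 + ["2024-01-08"] * 3,
    "ticker": ["A", "B", "C"] * 2,
    "factor": [1.0, 5.0, 3.0, 100.0, 500.0, 300.0],
})

# Rank: standardized within each date (each Monday gets its own 0-1 scale)
df["rank"] = df.groupby("date")["factor"].rank(pct=True)

# Z-score: standardized over the entire download period
df["zscore"] = (df["factor"] - df["factor"].mean()) / df["factor"].std()

print(df)
```

The second week's factor values are 100x the first week's, yet the within-date ranks are identical across the two dates, while the pooled z-scores differ hugely.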

Anyway, nice job, P123! I think there is a group of people who already understand how useful this is (e.g., engineers, Kaggle competitors, and many undergrads these days). I think this will succeed if those people are made aware of it. I will certainly market it to anyone who asks.

TL;DR: Raw data adds nothing to machine learning. IMHO, this is an incredible achievement, and to the best of my knowledge something you cannot get anywhere else as a retail investor.

Jim

But this is important: if you are going to use z-scores for predictions, the data you are using for predictions will HAVE TO BE STANDARDIZED IN THE SAME WAY AS THE TRAINING DATA.

I am not sure of the best way to do that, but I believe it will not work without it.

@Marco, I believe this is important. Ranks are standardized each week. There is a difference in how you can use ranks and z-scores, and it is an important one.

Here is some confirmation of that. Question to ChatGPT: “To use z-scores for predictions, do you need to standardize the data you are making predictions with in the same way you did for training?”

Answer: “Yes, to make predictions using a model trained on z-score standardized data, you need to standardize the new data using the same mean and standard deviation used for the training data. This ensures consistency between the training and prediction phases.”

I am not aware of any way to get around that simple statement.
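In scikit-learn terms, this is just fitting the scaler on the training data once and re-using its stored mean and std at prediction time. A sketch with made-up data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_train = rng.normal(loc=50.0, scale=10.0, size=(500, 2))
y_train = X_train @ [0.5, -1.0] + rng.normal(size=500)

# Fit the scaler on the TRAINING data only; it stores the training mean and std
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

# New data MUST be transformed with the stored training stats,
# not re-standardized using its own mean and std
X_new = rng.normal(loc=50.0, scale=10.0, size=(5, 2))
right = model.predict(scaler.transform(X_new))                # correct
wrong = model.predict(StandardScaler().fit_transform(X_new))  # inconsistent with training

print(right)
```

The second call gives different (and meaningless) predictions because the model's coefficients were learned in the training data's z-score units, not the new batch's.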


BTW, it would be trivial for me, and I think for most of us, to find the mean and standard deviation of the training data in a spreadsheet and enter that information into the download.

No raw data needed. The output would be the z-score only, based on the user-provided mean and standard deviation.

For P123 to do it from scratch each time for predictions would require a large download, or computations that may not be that resource-intensive (i.e., just the mean and standard deviation) over a larger amount of data. But that would work too.

The way around it at the moment is to use the same “ML Training End” date. This option appears when you choose “Entire Dataset”.

The problem with our current solution is that you will need to download more and more data for predictions. For example, if your model was trained using 10 years of data and 4 years of validation, then when you make predictions a week later you have to download 14 years + 1 week and use the exact same “ML Training End”. From this download you only use the new, single date, which has been normalized using the same stats as the original training.

It’s very inefficient and costly. We planned to store the mean and std for re-use but wanted to hear some feedback first.
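Conceptually, storing the stats for re-use could be as simple as the sketch below. The factor names are hypothetical, and JSON is just a stand-in for wherever the stats would actually be persisted.

```python
import json
import numpy as np

# Hypothetical factor columns; the stats come from the training download
factor_names = ["ROE", "EPS_growth"]
X_train = np.array([[12.0, 0.05], [8.0, 0.10], [15.0, 0.02]])

# Compute and serialize the training mean/std once
stats = {
    name: {"mean": float(m), "std": float(s)}
    for name, m, s in zip(factor_names, X_train.mean(axis=0), X_train.std(axis=0))
}
stats_json = json.dumps(stats)  # in practice, store this alongside the model

# Later, at prediction time: load the stored stats and z-score the new row,
# with no need to re-download the training history
stats = json.loads(stats_json)
x_new = {"ROE": 10.0, "EPS_growth": 0.07}
z_new = {k: (v - stats[k]["mean"]) / stats[k]["std"] for k, v in x_new.items()}
print(z_new)
```

The stored stats replace the full 14-year re-download: only the new date's raw values are needed at prediction time.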


Awesome. And I am impressed. You really understand this stuff and clearly you have taken a personal interest in this project. I know you will make a wise choice on the best solution.

Thank you.

Another thing to keep in mind is that if you are doing k-fold cross-validation, preprocessing at each fold might be overkill. Using the entire dataset to calculate distribution stats might be totally fine.

In fact, I’d even venture to say that it’s preferable to have as complete a picture of the volatility of a factor as possible, even if that means using the “forbidden” data reserved for validation.
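For what it's worth, here is a sketch (scikit-learn, synthetic data) of the two approaches being compared: the textbook-strict version that refits the scaler inside each fold via a pipeline, versus scaling once on the entire dataset before cross-validating. On reasonably large, well-behaved data the two scores tend to come out very close, which is the argument for the pragmatic version.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4)) * [1.0, 10.0, 100.0, 0.1]  # mixed scales
y = X @ [1.0, 0.1, 0.01, 5.0] + rng.normal(size=300)

# Textbook-strict: the scaler is refit inside each fold, so no
# validation data ever touches the distribution stats
per_fold = cross_val_score(make_pipeline(StandardScaler(), Ridge()), X, y, cv=5)

# Pragmatic: scale once on the entire dataset, then cross-validate
global_scaled = cross_val_score(Ridge(), StandardScaler().fit_transform(X), y, cv=5)

print(per_fold.mean(), global_scaled.mean())
```

The only "leak" in the pragmatic version is the mean and std themselves, which is usually a mild one compared to, say, leaking the target.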