Using a Random Forest Classifier is easy

I have been playing with a Random Forest classifier. In this case, each ETF is classified by whether its monthly return was above or below the median return of all the ETFs included in my little study. It is easy. It also seems to work. Does it work well enough to be implemented? That remains to be seen. But you can check it out for yourself using your own data if you are interested. The features I used are relative strength, but you can use anything, including aggregate series from P123. What a gold mine P123’s data is: a gold mine in the sense of its usability, downloadability, access to Fed data, and the small amount of code required for a random forest classifier to use this data:

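#Training the data
import sklearn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
#df is assumed to already hold the spreadsheet data (for example, loaded with pd.read_csv)
X = df[['xsxlkc', 'xsxlec', 'xsxluc', 'xsxlpc', 'xsxlbc', 'xsxlyc', 'xsxlic', 'xsxlvc', 'xsxlfc', 'xstltc', 'xsgldc', 'xsqqqc', 'xsijrc', 'xsmdyc', 'xsiwcc', 'xsicfc']]
y = df['classxlk']  #the class column described below
#max_features='sqrt' is what 'auto' meant for classifiers in older scikit-learn versions
model = RandomForestClassifier(n_estimators=10000, oob_score=True, min_samples_leaf=5, random_state=1, max_features='sqrt').fit(X, y)
print(model.oob_score_)  #out-of-bag accuracy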

In this example, classxlk is the class (whether XLK had returns above or below the median of all of the ETFs that month) and, as an example of a feature, xsxlbc is the excess (xs) return for XLB (c at the end for close) over a previous time period.

This will give you an out-of-bag score that you can use in place of cross-validation. So easy, but in truth I never even have to use this. I just leave the settings at max_features='auto' (the same as 'sqrt' for a classifier) and min_samples_leaf=5. So cross-validation is easy, or one does not even have to do it, for a random forest classifier. If you want to look at the feature importance:

#Feature importance
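
A minimal sketch of reading the out-of-bag score and the feature importances from a fitted forest. It runs on synthetic data rather than the ETF spreadsheet: the column names f1 to f4 and the way the labels are generated are assumptions made up for illustration. oob_score_ and feature_importances_ are the scikit-learn attributes available after fitting.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the spreadsheet: 4 made-up features, 300 rows.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f1", "f2", "f3", "f4"])
# Hypothetical label: driven mostly by f1, plus noise.
y = (X["f1"] + 0.5 * rng.normal(size=300) > 0).astype(int)

model = RandomForestClassifier(
    n_estimators=200, oob_score=True, min_samples_leaf=5, random_state=1
).fit(X, y)

# Accuracy measured on the rows each tree did not see in its bootstrap sample.
print("OOB score:", model.oob_score_)

# Impurity-based importances, one per column; they sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

With real data, a feature that dominates this list is worth a second look for leakage before trusting it.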

Did it work? Run the predictions on some separate test data. Output the predictions to a spreadsheet and merge the predictions with your test data. Then check, in your own way, whether the predictions actually matched the classifications for your ETFs:

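#Make predictions on test data
#Use the same columns, in the same order, as the training data
Xt = dft[['xsxlkc', 'xsxlec', 'xsxluc', 'xsxlpc', 'xsxlbc', 'xsxlyc', 'xsxlic', 'xsxlvc', 'xsxlfc', 'xstltc', 'xsgldc', 'xsqqqc', 'xsijrc', 'xsmdyc', 'xsiwcc', 'xsicfc']]
pred = model.predict(Xt)  #predicted class for each test row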

This is for a Mac. You may have to change the file paths in your code for Windows (or other operating systems).
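
A minimal sketch of the predict, merge, and export step, assuming synthetic data in place of the real training and test spreadsheets. The feature names, the cls column, and the output file name rf_predictions.csv are all hypothetical. Using pathlib is one way to avoid rewriting file paths between Mac and Windows.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

cols = ["f1", "f2", "f3"]  # hypothetical feature names
rng = np.random.default_rng(1)

def make_frame(n):
    """Build a synthetic frame with features and a 0/1 class column."""
    frame = pd.DataFrame(rng.normal(size=(n, len(cols))), columns=cols)
    frame["cls"] = (frame["f1"] > 0).astype(int)
    return frame

df, dft = make_frame(200), make_frame(50)  # training and test data

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(df[cols], df["cls"])

# Keep the predictions next to the test rows they belong to.
out = dft.copy()
out["prediction"] = model.predict(dft[cols])

# pathlib builds the path the same way on macOS, Windows, and Linux.
out_path = Path(tempfile.gettempdir()) / "rf_predictions.csv"
out.to_csv(out_path, index=False)
print("wrote", out_path)
```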

My results: Using a Random Forest Classifier improves the predictive power (precision) of relative strength as a signal. In other words, a test sample suggests that more ETFs will beat the median return if you use the above random forest, with relative returns (relative to the median return of the above ETFs) as features, than if you simply invested in all ETFs that had a positive relative return without using the random forest classifier (rebalancing monthly). That does not mean that I will necessarily use it or recommend it. But it serves as an example of an accepted ML method that is easy to implement and can be run on a lot of different data (including Fed data) pretty quickly. Full code included.
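
To make the precision comparison concrete, here is a small sketch with made-up numbers. These are not the results described above: the three arrays are invented purely to show how precision would be computed for the plain relative-strength signal and for the forest's picks.

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up test sample: 1 = the ETF beat the median that month, 0 = it did not.
actual = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
# Baseline signal: 1 where the prior-period relative return was positive.
baseline_pick = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 0])
# Hypothetical random forest predictions for the same rows.
forest_pick = np.array([1, 0, 1, 0, 0, 0, 1, 1, 0, 0])

# Precision: of the ETFs a signal said to buy, the share that beat the median.
baseline_precision = precision_score(actual, baseline_pick)
forest_precision = precision_score(actual, forest_pick)
print("baseline:", baseline_precision)  # 3 of 6 picks were right -> 0.5
print("forest:  ", forest_precision)    # 3 of 4 picks were right -> 0.75
```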

Cool stuff… Thanks for sharing.


I use P123’s sims, ports, rank performance etc. I appreciate the value of what P123 does. If you are looking at using machine learning too, here is some code for a walk-forward implementation of a random forest classifier.

I find that things change. Sometimes I am better off (machine) learning from the past year or two rather than thinking what happened in 2005 has much to do with what is happening now.

Does the financial crisis of 2008 have much similarity with what is happening now? Are we even in a crisis now? Maybe what a program learns from that period does not help and may even harm my results at times.

Walk-forward forgets the distant past and is more attuned to what is happening now, including the present correlations of the factors I use. Correlations change. Here is a very NON-PYTHONIC example of how to walk-forward a random forest classifier learning one year at a time and predicting the next month. It does run in a reasonable amount of time despite running a new random forest for each month’s prediction over multiple years.

I am eyeing Apple’s new processors with 20 cores to speed this up. But the truth is I do not need it (with what I have done so far). And by the way, you can get more than one thread to a core, as many of you know, so a lot of parallel processing can be done at home.

‘y’ here is a column heading for a classification. For example, ‘1’ if the ETF outperformed its benchmark and ‘0’ if it did not. When using the program for predicting, ‘1’ might be a buy recommendation and ‘0’ a suggestion to buy something else.

It would be trivial to copy this into a Jupyter Notebook and upload your own spreadsheet if you have your own factors that you want to explore. BTW, the lines under the for loop must be indented for Python to run them. Forum formatting tends to drop indents, and editing with spaces here does not work either, as Python programmers probably already noticed.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# df is assumed to already hold the spreadsheet, one row per month in date order
X = df[['feature1', 'feature2', 'feature3', 'feature4', 'feature5', 'feature6', 'feature7', 'feature8', 'feature9', 'feature10', 'feature11', 'feature12']]
y = df['XXX Class']
z = df['Date']
pp = []  # collects one prediction per test month
for i in range(24, len(X)):
    # train on the previous 12 months, then predict the current month
    train_X = X[i-12:i]
    train_y = y[i-12:i]
    test_X = X[i:i+1]
    # max_features='sqrt' is what 'auto' meant for classifiers in older scikit-learn versions
    model = RandomForestClassifier(n_estimators=1000, oob_score=True, min_samples_leaf=1, random_state=1, max_features='sqrt', n_jobs=-1, verbose=0, criterion='gini', class_weight='balanced_subsample').fit(train_X, train_y)
    pred = model.predict(test_X)
    p = pred[0]
    pp.append(p)
    print('Date =', z.iloc[i], 'Prediction =', p, 'actual =', y.iloc[i])
c = y[24:len(X)]
c = pd.DataFrame(c)
accuracy = accuracy_score(c, pp)
conf_matrix = confusion_matrix(c, pp)  # new name so the imported function is not overwritten
print('accuracy =', accuracy)
print('confusion_matrix =', conf_matrix)


Very nice. This is all Greek to me (including Python). Where do you suggest one start, if one is inclined to learn? Assume I am old and slow :)

Hi RT,

I tried to make it so that you could just copy and paste this into Python. I think the code will work with just about any Excel spreadsheet saved in csv format (after changing the column heading names in the code to the ones in your spreadsheet). One caveat: the forum tends to drop the indents that Python needs, so check that the lines under the for loop are indented.

The indents are trivial and I could probably re-post the code with an asterisk where you should use the tab key on the computer keyboard. I could help you set up your first spreadsheet also. But you will want to look at your own ideas.

Getting Python onto your computer (downloading Anaconda) is not hard compared to troubleshooting a lot of programs that you have probably installed over the years–especially if you used a computer in the days of DOS or still use Windows even now. But still not necessarily something I would wish for you.

Colab does allow you to skip the installation. Steve Auger (InspectorSector) is a big fan of Colab.

You would have to look up how to upload your spreadsheet into Colab (I have not done that for a while). InspectorSector and others are adept with Colab and could help with questions about this platform.

Then just create the spreadsheet you want to upload. See what accuracy you get (0.5 is right half the time and you obviously want better or a higher number). My code will give you the prediction for each month and then what actually happened each month as well as a confusion matrix which breaks down the overall results on the predictions.
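
For reading those two outputs, here is a tiny sketch with made-up outcomes and predictions (not taken from any real run). In scikit-learn's confusion matrix, the rows are the actual classes and the columns are the predicted classes, in the order 0 then 1.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

actual = [1, 0, 1, 1, 0, 0, 1, 0]     # made-up outcomes
predicted = [1, 0, 0, 1, 0, 1, 1, 0]  # made-up predictions

acc = accuracy_score(actual, predicted)
cm = confusion_matrix(actual, predicted)

print(acc)  # 6 of 8 months predicted correctly -> 0.75
print(cm)
# [[3 1]   actual 0: 3 predicted 0, 1 predicted 1
#  [1 3]]  actual 1: 1 predicted 0, 3 predicted 1
```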

Hope that is a start.



Thanks. I will give it a whirl.

RT and all,

There is an error in the code above (I did not reset the index for c when I needed to).

Incorrect above: c = pd.DataFrame(c)

Correct: c = pd.DataFrame.reset_index(c, drop=True)
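
A small sketch of why the index matters, using a made-up six-row series in place of the real class column. A slice keeps its original row labels, and pandas lines objects up by label rather than by position, so a slice that starts at row 3 does not match predictions labelled from 0 until its index is reset.

```python
import pandas as pd

y = pd.Series([0, 1, 1, 0, 1, 0], name="cls")
c = y.iloc[3:]                          # slice keeps the original labels 3, 4, 5
pp = pd.Series([0, 1, 0], name="pred")  # predictions collected in a loop: labels 0, 1, 2

print(list(c.index))  # [3, 4, 5]

# pandas aligns on labels, so none of these rows line up.
misaligned = pd.concat([c, pp], axis=1)
print(misaligned)

# After resetting the index, both objects are labelled 0, 1, 2.
c_reset = c.reset_index(drop=True)
aligned = pd.concat([c_reset, pp], axis=1)
print(aligned)
```

The method form c.reset_index(drop=True) does the same job as the corrected line and works whether c is a Series or a DataFrame.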

I had the chance to rerun some of this, and the walk-forward random forest classifier still outperforms all the other methods for classifying ETF returns I have looked at (so far, with my limited data and resources). I continue to investigate this without any final conclusions at this time.

I am sorry for the error.



Nice exercise. Congrats Jim, your code is quite clean and readable. A critical point that I hope you will take as constructive:
“learning one year at a time and predicting the next month”: Check IID assumptions, this is overfitting. Financial data (technical or fundamental) have a lookback period or are based on price/volume snapshot. In all cases they are autocorrelated and you have data leakage in the process of predicting a month with the 12 previous ones. And even bagging all successive models doesn’t fix it: this is still a bag of overfitted models.
By the way, you can do the same in 10 minutes in an auto-ML platform without a line of code and with no risk of bugs (maybe one line of code for the loop). Coding is not necessary any more in many ML problems, but it is good for neuroplasticity, and I still like it too.

Hi Frederic,

Thank you for your comments. I am glad to hear that my code probably will not crash too often now that I have fixed that resetting-the-index problem.

With regard to ML programs I agree those can work well. In fact, I might take this opportunity to mention JASP again as a free download that has a usable machine learning library now. Frederic, you may have other specific programs that you like. I think you mentioned Azure before?

RT, Frederic has a point here. Instead of starting with Python, you could start with JASP, which has dropdown menus and an easy upload of a spreadsheet. It is free. It loads on a Mac with no problems and I expect it does well on Windows too. The documentation is somewhat lacking, as with many open source programs, however. Still, it is a great way to start.

I do like some of the control that Python gives. For example, class_weight='balanced_subsample', which I use, or min_impurity_decrease=x, which I was playing with today. It also lets me slice the data when I use an ‘embargo.’

Frederic, de Prado makes the same points you do about autocorrelation (and i.i.d) in his writings. I think you have both used the term “embargo” in fact. You (and de Prado) are obviously right about that.

That is why I do use embargos–especially to protect my holdout test data. It is also true that I do some truly forbidden things (like use oob_score) at times for validation: just because it seems to work. It is an art and I can always (do in fact) recheck it the correct way before I fund it.
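
A minimal sketch of slicing a walk-forward loop with a gap before each test month. The function name walk_forward_splits, the 12-row training window, and the one-row embargo are assumptions for illustration. This is a simple gap only, not the full purged cross-validation de Prado describes, and not necessarily the embargo used for the results in this thread.

```python
def walk_forward_splits(n_rows, train_size=12, embargo=1):
    """Yield (train_rows, test_row) positions with a gap before each test row.

    The `embargo` rows just before the test row are left out of training,
    so overlapping lookback windows are less likely to leak into the model.
    """
    for test_row in range(train_size + embargo, n_rows):
        train_end = test_row - embargo  # first row excluded from training
        train_rows = list(range(train_end - train_size, train_end))
        yield train_rows, test_row


splits = list(walk_forward_splits(n_rows=16, train_size=12, embargo=1))
for train_rows, test_row in splits:
    print("train", train_rows[0], "to", train_rows[-1], "-> test", test_row)
```

The positions can be passed to X.iloc[train_rows] and X.iloc[[test_row]] inside the loop. How long the embargo should be depends on the lookback period of the features.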

Nice to hear you are actively doing machine learning. And again, thank you for your comments!!!