I have been playing with a random forest classifier. In this case, each ETF is classified by whether its returns for the month were above or below the median return of all the ETFs included in my little study. It is easy, and it also seems to work. Well enough to be implemented? That remains to be seen, but you can check it out for yourself using your own data if you are interested. The features I used are relative strength, but you can use anything, including aggregate series from P123. WHAT A GOLD MINE P123's data is: a gold mine in the sense of its usability, downloadability, access to Fed data, and the small amount of code a random forest classifier needs to use it:
# Training the data
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('~/Desktop/RFtrain.csv')
X = df[['xsxlkc', 'xsxlec', 'xsxluc', 'xsxlpc', 'xsxlbc', 'xsxlyc', 'xsxlic', 'xsxlvc',
        'xsxlfc', 'xstltc', 'xsgldc', 'xsqqqc', 'xsijrc', 'xsmdyc', 'xsiwcc', 'xsicfc']]
y = df['classxlk']

# max_features='auto' was removed in newer scikit-learn; 'sqrt' is the
# equivalent setting for classifiers.
model = RandomForestClassifier(n_estimators=10000, oob_score=True,
                               min_samples_leaf=5, random_state=1,
                               max_features='sqrt')
model.fit(X, y)
model.oob_score_
In this example, classxlk is the class (whether XLK had returns above or below the median of all the ETFs that month), and xsxlbc, as an example, is the excess (xs) return for XLB over a previous time period (the trailing c is for close).
This will give you an out-of-bag score that you can use for cross-validation. So easy that, in truth, I never even have to use it: I just leave max_features at its default ('auto' in older scikit-learn) and min_samples_leaf=5. So cross-validation is easy for a random forest classifier, or you do not even have to do it. If you want to look at the feature importances:
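If you do want an explicit cross-validation check alongside the OOB score, scikit-learn's cross_val_score makes it a one-liner. This is a sketch only: the synthetic X and y below stand in for the RFtrain.csv features above, since I don't have that file.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 16 excess-return features, as in the post.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 16))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)  # above/below-median class

model = RandomForestClassifier(n_estimators=500, oob_score=True,
                               min_samples_leaf=5, random_state=1)
model.fit(X, y)
cv_scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV accuracy
print(round(model.oob_score_, 3), round(cv_scores.mean(), 3))
```

If the OOB score and the mean CV score are close, that is some reassurance that neither is an artifact of how the data was split.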
#Feature importance
model.feature_importances_
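The raw feature_importances_ array is just numbers in column order; pairing it with the column names and sorting makes the strongest signals easy to spot. A minimal sketch, again on synthetic stand-in data with illustrative column names:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for RFtrain.csv; the class is driven by the first column.
rng = np.random.default_rng(1)
cols = ['xsxlkc', 'xsxlec', 'xsxluc', 'xstltc']
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=cols)
y = (X['xsxlkc'] > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

# Importances sum to 1; sort descending so the top feature comes first.
imp = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(imp)
```

With real data you would run this on the fitted model from above; here the first feature should dominate by construction.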
Did it work? Run the predictions on some separate test data, output the predictions to a spreadsheet, and merge the predictions with your test data. Then see whether the predictions actually predicted the classifications for your ETFs:
# Make predictions on test data
# Note: the column order must match the training features exactly.
dft = pd.read_csv('~/Desktop/RFtest.csv')
Xt = dft[['xsxlkc', 'xsxlec', 'xsxluc', 'xsxlpc', 'xsxlbc', 'xsxlyc', 'xsxlic', 'xsxlvc',
          'xsxlfc', 'xstltc', 'xsgldc', 'xsqqqc', 'xsijrc', 'xsmdyc', 'xsiwcc', 'xsicfc']]
pred = model.predict(Xt)
pred = pd.DataFrame(pred)
pred.to_csv('~/Desktop/pred.csv')
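Instead of exporting to a spreadsheet and merging by hand, you can also attach the predictions directly to the test frame in pandas and inspect them there. A sketch under synthetic data (the column names f1 and f2 are hypothetical; your dft would come from RFtest.csv):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
cols = ['f1', 'f2']  # hypothetical feature columns
train = pd.DataFrame(rng.normal(size=(200, 2)), columns=cols)
ytrain = (train['f1'] > 0).astype(int)
dft = pd.DataFrame(rng.normal(size=(50, 2)), columns=cols)

model = RandomForestClassifier(n_estimators=200, random_state=1).fit(train, ytrain)

# assign aligns on the row index, so predictions line up with their rows.
dft = dft.assign(pred=model.predict(dft))
print(dft.head())
```

From there you can compare the pred column against the realized above/below-median outcomes for each month.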
This is for a Mac. You may have to change the file paths for Windows (or other operating systems).
My results: using a random forest classifier improves the predictive power (precision) of relative strength as a signal. In other words, in my test sample, more ETFs beat the median return when I used the above random forest, with relative returns (relative to the median return of the ETFs above) as features, than when I simply invested in every ETF with a positive relative return, rebalancing monthly, without the classifier. That does not mean I will necessarily use it or recommend it. But it serves as an example of an accepted ML method that is easy to implement and can be run on a lot of different data (including Fed data) pretty quickly. Full code included.
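The precision comparison described above can be sketched with sklearn.metrics.precision_score: the classifier's precision versus the naive rule "positive excess return means above-median next month." All data below is synthetic; the real comparison uses your own test sample, so the numbers here mean nothing by themselves.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)
Xtr, Xte, ytr, yte = X[:300], X[300:], y[:300], y[300:]

model = RandomForestClassifier(n_estimators=300, min_samples_leaf=5,
                               random_state=1).fit(Xtr, ytr)

# Precision: of the ETFs each method picks, what fraction actually beat the median?
rf_precision = precision_score(yte, model.predict(Xte))
naive_precision = precision_score(yte, (Xte[:, 0] > 0).astype(int))
print(round(rf_precision, 3), round(naive_precision, 3))
```

Precision is the right metric here because you only act on the positive predictions: it measures how often the ETFs you actually buy end up above the median.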