I am running ML on my computer (or Colab) so that I can use sklearn pipelines with things like recursive feature elimination (RFE) and different feature scaling methods. However, I very quickly run out of RAM, and it causes both Colab and my computer to crash. This is frustrating, as I have about 52 GB of RAM available and more gets expensive quickly.
So my question is: if I want to run ML with, say, 500 features, but predict every week to upload to P123 later, how do I get this to run on 52 GB of RAM? I currently use Dask and pandas DataFrames and am fairly careful about deleting unused objects. I use Dask until I load my outer-loop data, which is sampled to limit the rows.
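The pattern I use is roughly this (the file name and sampling fraction here are made up for illustration):

import dask.dataframe as dd

ddf = dd.read_parquet("weekly_features.parquet")  # lazy: nothing materialized in RAM yet
# Sample rows while still lazy, then materialize only the reduced frame as pandas.
df = ddf.sample(frac=0.25, random_state=42).compute()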
Any advice? I will also check sklearn, and struggle along with ChatGPT, but P123 seems to have some backend magic that keeps RAM use way down that I would love to know even a little bit about.
I think different scaling methods can be accomplished by modifying the formulas for the features. The problem is mainly the lack of support for categorical features.
What I think is needed is automatic feature engineering rather than RFE.
I am using tree-based algorithms, and XGBoost was actually the worst for RAM usage, especially in combination with RFE. So I will look into max_features='log2' for the algorithms that support it.
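For the estimators that expose it, I gather the setting looks like this (the estimator choices and n_estimators are just placeholders):

from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# max_features='log2' evaluates only log2(n_features) candidate columns per split,
# which cuts both fit time and memory relative to considering all 500 columns.
et = ExtraTreesRegressor(n_estimators=300, max_features="log2", n_jobs=1, random_state=42)
rf = RandomForestRegressor(n_estimators=300, max_features="log2", n_jobs=1, random_state=42)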
I have not tried float32 yet, but that was actually on my mind to try next. I think the original thing that got me was that my nested loop was storing the data in RAM multiple times. I think I solved that with Dask, but now I run out of SSD space. So I think I need to do all of my pre-processing, including the outer CV folds, leverage float32, and then save to disk. Then load each fold one at a time...
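Something like this sketch is what I have in mind, assuming my outer_cv yields index arrays like in the loop I post below (the file names are hypothetical):

import pandas as pd

# Downcast once, then persist each outer fold so only one fold sits in RAM at a time.
float_cols = df.select_dtypes(include="float64").columns
df[float_cols] = df[float_cols].astype("float32")

for i, (train_idx, test_idx) in enumerate(outer_cv):
    df.loc[train_idx].to_parquet(f"fold_{i}_train.parquet")
    df.loc[test_idx].to_parquet(f"fold_{i}_test.parquet")
del df  # free the full frame; load folds back one at a time with pd.read_parquet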
I am using sklearn pipelines with GridSearchCV and HalvingRandomSearchCV for my scaling, NA imputation, and hyperparameter tuning. So I cannot really touch the inner CV generation.
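Roughly how that gets wired up, for anyone following along (the pipeline, search space, CV settings, and training data names here are placeholders):

from sklearn.experimental import enable_halving_search_cv  # noqa: F401  (must come before the import below)
from sklearn.model_selection import HalvingRandomSearchCV, TimeSeriesSplit

search = HalvingRandomSearchCV(pipeline, param_distributions,
                               cv=TimeSeriesSplit(n_splits=3),
                               random_state=42, n_jobs=1)
search.fit(X_train, y_train)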
I feel like this is where a formal education in big data/computer science would help, haha.
Any thoughts on what automatic feature engineering would look like compared to RFE (other than, you know, good old already knowing which features work well together)? I have not isolated the RAM use, but RFE sure slows everything down, and I suspect it is part of the RAM problem.
Sklearn's RFE, I believe, looks at the feature weights for a chosen model and removes the least important. There are other brute-force methods, like removing features one at a time, but those are very costly time-wise.
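For reference, the basic sklearn call looks something like this, if I understand it right (the estimator and the numbers are placeholders):

from sklearn.feature_selection import RFE
from sklearn.ensemble import ExtraTreesRegressor

# RFE refits the estimator repeatedly, dropping the lowest-ranked features
# (by feature_importances_ or coef_) 'step' features at a time.
selector = RFE(ExtraTreesRegressor(n_estimators=100, random_state=42),
               n_features_to_select=150, step=10)
selector.fit(X_train, y_train)
X_train_reduced = X_train.loc[:, selector.support_]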
You could eliminate unimportant features using, e.g., scikit-learn's permutation_importance or a model's feature_importances_, but do it once per training run rather than recursively as RFE does.
Tree-based models are relatively good at selecting the best features for each node, so I would not expect a significant gain from using RFE, especially in noisy financial markets... the goal is to lower variance rather than to gain 1% more return.
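A minimal sketch of that one-pass idea (model, X_valid, y_valid, and the zero-importance cutoff are just placeholders):

from sklearn.inspection import permutation_importance

# Fit once, score importances once on held-out data, keep only the useful columns.
model.fit(X_train, y_train)
result = permutation_importance(model, X_valid, y_valid,
                                n_repeats=5, random_state=42, n_jobs=1)
keep = X_train.columns[result.importances_mean > 0]
X_train_reduced = X_train[keep]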
Edit: You could reduce n_jobs to cut the memory used. Fewer parallel processes slow things down a bit but keep the memory requirements down at the same time. Probably something you have already considered, however.
I get the desire to do all of this with nested cross-validation. Also, I am not sure exactly how you are doing your nesting. I understand you could be using RFECV as part of your nested cross-validation (using sklearn's RFECV or essentially duplicating it through code).
Personally, I would feel pretty confident just using plain RFECV (without the nesting) to narrow the number of features down to, say, 200 or 150, and moving back to nested cross-validation after that.
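Something like this, run once outside the nesting, is all I have in mind (the estimator, step, and CV settings are placeholders):

from sklearn.feature_selection import RFECV
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import ExtraTreesRegressor

# One standalone RFECV pass to narrow the feature set before going back to nesting.
rfecv = RFECV(ExtraTreesRegressor(n_estimators=100, random_state=42),
              step=10, min_features_to_select=150,
              cv=TimeSeriesSplit(n_splits=3), n_jobs=1)
rfecv.fit(X, y)
selected_cols = X.columns[rfecv.support_]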
I would definitely reduce the number of column features with XGBoost at the same time. Using colsample_bynode is analogous to max_features in the Extra Trees Regressor. My recent experience with XGBoost is limited, but max_features does not hurt performance with the Extra Trees Regressor, and theoretically it can help by reducing variance (and overfitting).
Reducing the features speeds things up as well, so running fewer jobs may be more practical.
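Just to illustrate the analogy (the 0.3 fraction and the other settings are arbitrary placeholders):

from xgboost import XGBRegressor
from sklearn.ensemble import ExtraTreesRegressor

# Both parameters subsample the candidate columns at each split/node.
xgb_model = XGBRegressor(n_estimators=300, colsample_bynode=0.3,
                         tree_method="hist", random_state=42)
et_model = ExtraTreesRegressor(n_estimators=300, max_features=0.3,
                               random_state=42)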
I do not think the ultimate answer you get will be different. I basically think the features you discard will be found to be noise, or redundant with the features that are kept, no matter what method you use to confirm your result with future high-performance machines.
I checked my answer with Claude 3. All it did was echo what I said, but it did not find fault, I think: "This advice is solid and should be helpful for the user in addressing their memory constraints while still performing a robust analysis."
I am actually using n_jobs=1, as that was the biggest knob for reducing RAM. I think I will work on removing the RFE from my pipeline and just use a tree-based method outside of my inner loop, as pitmaster and Jim have suggested.
For those interested, here is my general loop structure.
The outer for loop is like a time-series CV with a gap between train and test for the label. This is to simulate what I would actually be doing as time goes on: retraining my model around once a year or so.
The inner loop takes the training data and does hyperparameter tuning on another time-series CV using grid search. Then the best parameters are retrained on the full outer-loop training data. Finally, the model predicts on the test data. So I don't really see the inner CV, as it is handled fully within GridSearchCV.
I think what I need to add is a step before the grid search to select the important features, and then remove the RFE from the pipeline (see the sketch after the loop below).
So like this, but real code:
from sklearn.model_selection import GridSearchCV

for train_idx, test_idx in outer_cv:
    train_data, test_data = df.loc[train_idx], df.loc[test_idx]
    grid_search = GridSearchCV(pipeline, param_grid, cv=inner_cv)
    # features / label are placeholders for my feature columns and target column
    grid_search.fit(train_data[features], train_data[label])
    preds = grid_search.predict(test_data[features])
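And a sketch of how the pre-selection step might slot into that loop. SelectFromModel with a tree-based estimator is one ready-made way to do a single-pass importance cut; the estimator, max_features=200, and the pipeline/param_grid/features/label names are placeholders:

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
import numpy as np

# TimeSeriesSplit's gap parameter is one stand-in for the custom outer CV with a label gap.
outer_cv = TimeSeriesSplit(n_splits=5, gap=4).split(df)

for train_idx, test_idx in outer_cv:
    train_data, test_data = df.iloc[train_idx], df.iloc[test_idx]
    # Single-pass importance cut before tuning, replacing RFE inside the pipeline;
    # threshold=-np.inf keeps exactly the top max_features by importance.
    selector = SelectFromModel(ExtraTreesRegressor(n_estimators=100, random_state=42),
                               threshold=-np.inf, max_features=200)
    selector.fit(train_data[features], train_data[label])
    cols = [f for f, keep in zip(features, selector.get_support()) if keep]
    grid_search = GridSearchCV(pipeline, param_grid, cv=inner_cv, n_jobs=1)
    grid_search.fit(train_data[cols], train_data[label])
    preds = grid_search.predict(test_data[cols])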
Example pipeline for HGB:
Pipeline([
    ('scaler', RobustScaler()),
    ('imputer', SimpleImputer(strategy='median')),
    ('regressor', HistGradientBoostingRegressor(random_state=42)),
])