LightGBM has several seeds that must be fixed if you want the same results on the same dataset (with rows in the same order). To set all of them:
import lightgbm as lgb
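params = {
    'data_random_seed': 42,       # Seed for sampling rows when building histogram bins
    'bagging_seed': 42,           # Seed for row sampling (bagging)
    'feature_fraction_seed': 42,  # Seed for column sampling (feature_fraction)
    'seed': 42,                   # Global seed; used to derive any seed not set explicitly
}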
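Alternatively, set a single global seed and enable deterministic mode. `seed` is used to generate the other seeds unless they are set explicitly, and `deterministic` requests stable results for the same data and the same parameters: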
params = {
    'deterministic': True,  # Forces determinism
    'seed': 42,             # Sets global seed
}
Note: These settings ensure reproducibility only if the row order of the dataset remains unchanged. LightGBM’s histogram-based training accumulates gradient statistics into feature bins. Floating-point addition is not associative, so if the row order changes, the values are summed in a different sequence, which can lead to slight differences in the split calculations.
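To make the floating-point point concrete, here is a minimal pure-Python sketch. It is not LightGBM code: the `accumulate` helper and the three "gradient" values are made up for illustration, and the loop only mimics adding values into a single histogram bin one row at a time.

```python
def accumulate(values):
    """Add values one at a time, the way a running bin total would be built."""
    total = 0.0
    for value in values:
        total += value
    return total

# Hypothetical per-row values that all fall into the same bin.
gradients = [0.1, 0.2, 0.3]

forward = accumulate(gradients)             # rows in original order
backward = accumulate(reversed(gradients))  # same rows, reversed order

print(forward == backward)      # False
print(abs(forward - backward))  # tiny, but not zero
```

The two totals differ only in the last bits, but when two candidate splits have nearly equal gain, a difference of that size can be enough to pick a different split.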
Reduce Sensitivity to Dataset Order by Turning Off Bagging:
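Bagging introduces randomness by sampling a subset of rows during training. Even when a bagging seed is set, the sample depends on where each row sits in the dataset, so a different row order produces a different subset. Bagging can improve generalization, but boosting does not require it. To disable bagging (these values are also LightGBM's defaults, so setting them explicitly guards against bagging being enabled elsewhere in your configuration):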
params = {
    'bagging_fraction': 1.0,  # Use 100% of the data
    'bagging_freq': 0,        # Do not perform bagging
}
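The effect of row order on a seeded sample can be illustrated with a simplified model. The sketch below is an assumption-laden toy, not LightGBM's actual sampler: it uses Python's `random` module to pick row positions with a fixed seed, then applies those positions to the same rows in two different orders. The row IDs and the `sample_rows` helper are hypothetical.

```python
import random

def sample_rows(rows, fraction, seed):
    """Pick a seeded sample of row positions and return the rows found there."""
    count = int(len(rows) * fraction)
    positions = sorted(random.Random(seed).sample(range(len(rows)), count))
    return positions, [rows[p] for p in positions]

row_ids = list(range(100, 110))      # ten hypothetical row IDs
reordered = list(reversed(row_ids))  # same rows, different order

pos_a, picked_a = sample_rows(row_ids, fraction=0.5, seed=42)
pos_b, picked_b = sample_rows(reordered, fraction=0.5, seed=42)

print(pos_a == pos_b)                # same seed, same positions
print(set(picked_a), set(picked_b))  # but different rows end up in the sample
```

The seed fixes which positions are drawn, not which rows occupy those positions, which is why a fixed bagging seed alone does not protect against a reordered dataset.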
Combined Configuration (reproducibility with no bagging):
params = {
    'deterministic': True,    # Forces determinism
    'seed': 42,               # Sets global seed
    'bagging_fraction': 1.0,  # Use 100% of the data
    'bagging_freq': 0,        # Disable bagging
}
Note: Disabling bagging may lead to overfitting on some small datasets, as it removes a form of regularization. However, this tradeoff may be acceptable when consistency is the priority, such as during a grid search or other hyperparameter tuning. With reproducible runs, differences in results come from the parameters being compared rather than from random variation, which makes it easier to identify the best configuration.
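For a grid search, the reproducibility settings can be kept in one base dictionary and merged into every candidate. This is a small sketch using only the standard library. `REPRODUCIBLE_BASE`, `param_grid`, and the search space values are hypothetical names and numbers chosen for illustration, while `num_leaves` and `learning_rate` are ordinary LightGBM parameters.

```python
from itertools import product

# Settings shared by every candidate so that runs are comparable.
REPRODUCIBLE_BASE = {
    "deterministic": True,
    "seed": 42,
    "bagging_fraction": 1.0,
    "bagging_freq": 0,
}

def param_grid(search_space):
    """Yield one params dict per combination, each including the base settings."""
    names = list(search_space)
    for values in product(*(search_space[name] for name in names)):
        candidate = dict(REPRODUCIBLE_BASE)  # copy so the base is never mutated
        candidate.update(zip(names, values))
        yield candidate

search_space = {
    "num_leaves": [15, 31],
    "learning_rate": [0.05, 0.1],
}

candidates = list(param_grid(search_space))
for candidate in candidates:
    print(candidate)
```

Each dictionary can then be passed to training as-is, so the only thing that varies between runs is the parameter combination under test.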