Using Log Transformation

I have seen that it's a common practice to use a log transformation on features for many reasons, but the ultimate goal is to improve results. I haven't experimented a lot, but I understand that you would typically use a log on features that have a skewed distribution or quite a few extreme values. However, with the little testing I have done, I haven't really seen any improvements in results.

Does anyone have any feature examples where using a log has clearly improved the results of your AI factor?

If you are using LN() within a function it could make a difference.

But if you have a value x or function f(x) as a feature, then using ln(x) or ln(f(x)) will not make any difference at all for random forests, LightGBM and some other P123 models.

It could make a difference for linear regressions if you are using z-score or min/max but not if you are using rank.

Caveat: this only applies to log transformation of the feature or factor and not the target for P123 models.
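A quick way to convince yourself of this is to fit the same random forest on a raw feature and on its log or rank transform and compare the predictions. This is just a minimal sketch (scikit-learn as a stand-in, synthetic data, and arbitrary hyperparameters of my own choosing, not anything from P123):

```python
# Minimal sketch: fit the same random forest on a raw feature and on its
# log / rank transforms, then compare out-of-sample predictions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)        # skewed, positive feature
y = np.sin(np.log(x)) + 0.1 * rng.normal(size=2000)      # some nonlinear target

X_raw = x.reshape(-1, 1)
X_log = np.log(x).reshape(-1, 1)                             # monotonic transform
X_rank = x.argsort().argsort().reshape(-1, 1).astype(float)  # rank transform

preds = {}
for name, X in [("raw", X_raw), ("log", X_log), ("rank", X_rank)]:
    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(X[:1500], y[:1500])
    preds[name] = rf.predict(X[1500:])

print(np.max(np.abs(preds["raw"] - preds["log"])))    # expect ~0 (ties aside)
print(np.max(np.abs(preds["raw"] - preds["rank"])))   # expect ~0 (ties aside)
```

With the same random_state, all three fits see the same bootstraps and feature draws, so any differences should come down to tie-breaking or floating point.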

When I checked for errors, Claude 3 put it this way: "Tree-based models are invariant to monotonic transformations of individual features."

BTW, this is what makes downloads of data with the API useful. Raw data cannot be downloaded under P123's contract with the vendor. However, any monotonic transformation (e.g., rank) will give the same answer with a random forest.

So, good to know at times. It has been crucial for making P123's API useful to machine learners.

2 Likes

It actually makes a difference because it affects the z-score conversion. With LightGBM, the z-score results will be better than Rank's results on some feature sets, so it's not just about order.

Do you have any good feature examples where a log transformation truly improves the outcome?

This is a subtle thing—to the point that Claude 3 thought you might be right initially. I checked with Claude 3 trying to find any mistakes.

Claude 3 does allow that rounding error, the order of feature selection, random seeds, and binning in LightGBM (which I question) can give slightly different answers.

None of those make the log transformation inherently superior, I think.

Claude 3 (emphasis mine for the most part): "You're absolutely right, and I appreciate you pointing that out. I made an error in my reasoning, and I'm glad you've caught it. Let me correct and clarify:

You are correct - I was mistakenly conflating the effect of transformations on the target variable with their effect on the features. Let's rectify this:

  1. Tree splits and monotonic transformations:
  • In a decision tree (and by extension, in tree-based ensemble models like Random Forests and LightGBM), the splits are indeed found at specific points in the feature space.
  • If we apply a monotonic transformation to a feature, the split point will change, but the data points going to each side of the split will remain the same.
  • For example, if we have a split at x = 5, and we apply a log transformation, the new split would be at log(5), but the same data points would go to each side.
  2. Invariance in tree-based models:
  • This is precisely why tree-based models are invariant to monotonic transformations of individual features.
  • The structure of the tree and the predictions it makes will not change, regardless of whether we use the original feature values, their logarithms, their ranks, or any other monotonic transformation.
  3. Preprocessing methods (z-score vs. rank):
  • When we apply z-score normalization or ranking to the features before building the tree, we're not changing how the tree algorithm works or how it finds splits.
  • These preprocessing steps don't affect the invariance property of the tree-based models.
  4. Why preprocessing might still matter:
  • In practice, preprocessing can still affect model performance, but not because it changes how the trees are structured.

  • It may impact things like:
    • Numerical precision and stability in the calculations
    • The initial ordering of features when the algorithm considers which splits to try first
    • How the model handles missing values or extreme outliers

  5. LightGBM specifics:
  • LightGBM uses histogram-based algorithms for finding splits, which can be sensitive to how the data is binned.
  • Different preprocessing methods might lead to different binning strategies, which could indirectly affect model performance.

Thank you for catching this misconception. You're entirely correct that the split is found at a point, and monotonic transformations don't change the fundamental structure or decisions of the tree-based models. The effects of preprocessing in tree-based models are much more subtle than I initially suggested, and they don't negate the invariance property of these models to monotonic transformations of individual features."

For linear regression using z-score or min/max there may, or may not, be an improvement. You would have to do a cross-validation to see which works better for your data.

Generally speaking, in this instance using ln() might help with skew, as you suggested, and with outliers.

It is also possible that using rank might handle outliers better. This would technically be called a rank regression, but whatever you call it, you might find it just works. You could get a definite answer on how this sorts out with linear regression (or rank regression) using cross-validation.
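Here is a minimal sketch of that kind of cross-validation check. The synthetic data stands in for whatever features you would download from P123, and QuantileTransformer is used as a rank-like stand-in; none of the names or settings come from P123 itself:

```python
# Minimal sketch: cross-validate raw z-score vs. log-then-z-score vs. a
# rank-like quantile transform in front of a linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (FunctionTransformer, QuantileTransformer,
                                   StandardScaler)

rng = np.random.default_rng(0)
X = rng.lognormal(size=(2000, 3))                    # skewed, positive features
y = np.log(X[:, 0]) + 0.5 * np.log(X[:, 1]) + rng.normal(scale=0.1, size=2000)

candidates = {
    "z-score":       make_pipeline(StandardScaler(), LinearRegression()),
    "log + z-score": make_pipeline(FunctionTransformer(np.log1p),
                                   StandardScaler(), LinearRegression()),
    "rank-like":     make_pipeline(QuantileTransformer(output_distribution="uniform"),
                                   LinearRegression()),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:>13}: R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Whichever preprocessing wins on your own data is the one to keep; the point is the comparison, not these particular results.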

In addition to linear regression, there might be a difference with support vector machines, which I never use, if only because they take so long to run.

I don't have access to the AI system at the moment, so I can't answer that.

Be careful to use logarithms only with positive values. You can't take the logarithm of zero or of a negative number, so using logarithms for returns or earnings will be problematic.

1 Like

For P123 returns, which are often in percent, you will invariably use =ln(1 + [return cell]/100) as a spreadsheet equation to make the best use of log returns. I am sure there are exceptions.
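For anyone doing this in Python rather than a spreadsheet, the equivalent of that formula looks like this (the values below are made up):

```python
# Percent returns to log returns via ln(1 + r/100), i.e. =ln(1 + cell/100).
import numpy as np
import pandas as pd

returns_pct = pd.Series([5.0, -3.2, 0.0, 12.5])    # e.g. weekly returns in percent
log_returns = np.log1p(returns_pct / 100.0)

# Only defined for returns above -100%; a -100% return (a total loss) or
# anything below it has no logarithm.
print(log_returns)
```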

Great point Yuval!

I'm a super noob when it comes to machine learning, so I'm trying to understand the split finding. From what I understand, the histogram is built up based on a frequency distribution table. So when changing the values of the feature, the values might end up mixed into the bins a bit differently compared to the original feature values.


I caught that from this video: https://youtu.be/N39NE4Nj6vc?si=wB_KaNbLWYivmnBA

This is the answer I got from ChatGPT when I asked.

  • Assume that the feature has positive values and I use the natural log on the feature. How will that affect the frequency distribution table and ultimately the splitting point?

ChatGPT said:

When you apply the natural log (ln) transformation to a feature with positive values, it will have a noticeable impact on the frequency distribution table and the splitting points in LightGBM. Here’s how the transformation affects these aspects:

1. Effect on the Feature Values:

  • The natural log compresses the range of the feature values, particularly for larger values. Small values are expanded, and large values are compressed. For example:
    • ln(10) ≈ 2.3
    • ln(100) ≈ 4.6
    • ln(1000) ≈ 6.9
  • As a result, the feature values after the log transformation are spread more evenly across a smaller range compared to the original values.

2. Impact on the Frequency Distribution Table:

  • Binning Process: When LightGBM bins the log-transformed values, the distribution of these values across bins will be different compared to the original feature. The log transformation can lead to a more uniform distribution of data across the bins, especially if the original feature had a skewed distribution with a long tail.
  • Frequency Counts: The frequency distribution table will reflect the new, more compressed range of values. Bins that might have been sparsely populated with the original feature could become more densely populated after the log transformation. This can change the way the model perceives the importance of different parts of the feature’s range.

3. Effect on the Histogram and Splitting Points:

  • Histogram Construction: The histogram created from the log-transformed feature will now reflect the more evenly distributed (or differently distributed) values. The sum of target values (or gradients) for each bin will correspond to the log-transformed data.
  • Split Point Selection:
    • The log transformation can make it easier for the algorithm to find meaningful split points, particularly in cases where the original feature had a skewed distribution.
    • The splits will be based on the log-transformed values, meaning that the decision boundaries (split points) will likely occur at different places compared to where they would have been with the original, non-transformed data.
    • The model might find better splits if the log transformation leads to a more linear or uniform relationship between the feature and the target variable.

Example:

  • Original Feature: Suppose the original feature had values ranging from 1 to 1000, with most values clustered between 1 and 10, and a long tail extending to 1000.
  • After Log Transformation: The log-transformed feature will compress the values in the long tail, making them more comparable to the values in the lower range.
  • Effect on Splitting: This transformation might make it easier for the model to find splits that distinguish between important ranges of the feature, such as splitting between ln(10) ≈ 2.3 and ln(100) ≈ 4.6, rather than struggling with the original feature's wide range.

Summary:

Applying the natural log to a positive feature in LightGBM affects the frequency distribution by compressing large values and expanding small ones, leading to a different binning structure. This change can help the model make more meaningful splits, particularly when dealing with skewed data. The resulting split points will reflect the transformed, more evenly distributed feature values, which can lead to better model performance if the log transformation enhances the relationship between the feature and the target.

So I think this is true:

LightGBM aims to create bins with approximately the same number of data points, regardless of the scale of the feature.

Doing it another way would not make any sense to anyone trying to preserve the non-parametric nature of most tree methods, I think.

Again, there may be some issues with rounding or numerical stability, as well as the lack of a fixed random seed, giving different answers at times:

" However, the log transformation might still be beneficial in some cases:*

  • It could help with numerical stability for features with very large ranges.
  • It might improve the model's ability to capture certain non-linear relationships."

Here is how Claude 3 specifically addresses the alleged use of frequency binning in LightGBM:

" Frequency distribution table:

  • The frequency distribution of the transformed values will indeed change, as ChatGPT described.
  • However, LightGBM's binning strategy means this change in distribution shouldn't significantly alter how the algorithm bins the data or finds splits."
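One way to check that claim for yourself is a quick experiment along these lines. This is only a minimal sketch with synthetic data and arbitrary settings of my own choosing, not a definitive test of LightGBM's binning internals:

```python
# Minimal sketch: fit LightGBM on a long-tailed positive feature and on its
# log transform, then compare out-of-sample predictions.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=2.0, size=5000)    # long-tailed feature
y = np.log(x) + 0.1 * rng.normal(size=5000)

X_raw = x.reshape(-1, 1)
X_log = np.log(x).reshape(-1, 1)

def fit_predict(X):
    model = lgb.LGBMRegressor(n_estimators=200, random_state=42)
    model.fit(X[:4000], y[:4000])
    return model.predict(X[4000:])

diff = np.abs(fit_predict(X_raw) - fit_predict(X_log))
print(diff.max())   # with LightGBM's roughly equal-count default binning,
                    # this should be zero or very close to it
```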

ChatGPT mentions this:

3. Equal-Width vs. Quantile-Based Binning:

  • Equal-Width Binning: If the feature's values are uniformly distributed, LightGBM may use equal-width binning, where each bin covers an equal range of the feature's values.
  • Quantile-Based Binning: LightGBM often uses quantile-based binning, where bins are created such that each bin contains approximately the same number of data points. This approach is particularly useful for features with highly skewed distributions, as it ensures that the model considers all parts of the feature's range.

So I guess there are two ways to go about the binning, but how do I choose which binning method to use?

I will switch to ChatGPT.

I learned something. Specifically, different binning strategies can be used.

Q: What is the default?

A: The default binning strategy in LightGBM is quantile binning.

In quantile binning, LightGBM attempts to create bins such that each bin contains approximately the same number of data points.

Q: Okay. So if you make a monotonic transformation of a feature using this default, would LightGBM make the splits in about the same positions, allowing for different random seeds, rounding errors, etc.? Or not?

Answer in full from ChatGPT:

Yes, if you apply a monotonic transformation to a feature (such as scaling, translation, or other transformations that preserve the order of the data points), LightGBM's default quantile binning strategy should result in splits occurring at roughly the same positions, allowing for minor variations due to different random seeds, rounding errors, and the specifics of the implementation.

Here's why:

  1. Order Preservation: A monotonic transformation preserves the order of the data points. Since quantile binning in LightGBM is based on the rank or order of the data points, the transformed feature will have the same ranking of data points as the original feature.

  2. Quantile Binning: Since the order is preserved, the quantile binning will create bins that contain approximately the same number of data points as before the transformation. Therefore, the bins (and thus the split points in the decision tree) will be in roughly the same positions.

  3. Minor Variations: As you noted, minor variations can occur due to different random seeds, floating-point precision issues, or slight differences in how the bins are calculated and assigned. However, these differences should be minimal and should not drastically change the resulting splits.

In summary, after a monotonic transformation, LightGBM will generally create splits in similar positions as it would with the original feature, although some minor variations might occur.

ChatGPT's summary:

  • Default Binning: LightGBM uses Quantile-Based Binning by default.
  • Changing Binning: Direct switching between binning methods (like equal-width) isn't natively supported. You would have to preprocess the data for custom binning.
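If you did want to force a particular binning, one simple route is to preprocess the feature yourself before handing it to LightGBM. Here is a minimal sketch using pandas; the bin counts and column name are just placeholders I made up:

```python
# Equal-width bins (pd.cut) vs. equal-count quantile bins (pd.qcut).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"feature": rng.lognormal(size=10_000)})   # skewed feature

# Equal-width: every bin spans the same range of raw values.
df["equal_width_bin"] = pd.cut(df["feature"], bins=50, labels=False)

# Quantile: every bin holds roughly the same number of rows.
df["quantile_bin"] = pd.qcut(df["feature"], q=50, labels=False, duplicates="drop")

# The integer bin codes could then be used as features in place of the raw values.
print(df["equal_width_bin"].value_counts().describe())
print(df["quantile_bin"].value_counts().describe())
```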

So there's no point doing a log transformation with LightGBM then...

Not with the default (or what is "natively supported"), I think. I think you are right about that.

Thank you for the discussion. I did not know about the option of different binning strategies with LightGBM (not natively supported, but a consideration with coding, it seems), so I learned something from this discussion. I am not going to pursue that line of research now, so I'm not sure how that would work out.

Thanks!

If you use "entire dataset", it doesn't make sense. But I always use "by date'

ZGWZ,

Thank you. You are right, I think. If you use z-scores and scale by date, there will be a difference when you use a log transformation with z-scores, I believe.

I tend to use rank by date and was not considering z-score by date.

It took your last post for me to understand what you were doing.

Basically, to reconcile any apparent paradox in the discussion, the data points from the different dates get "shuffled" into different relative positions after the transformation of each individual date. The data for each date represents a "monotonic transformation" but each date is scaled individually (not with the same scaling parameters).

The shuffled data is not a single monotonic transformation by any stretch of the imagination. Rather, it is a shuffling of a large number of different monotonic transformations (one per date). A transformation will make a difference in this situation, I believe: data points will be in different relative positions after the shuffling of multiple, individual, and different transformations.
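Here is a small pandas sketch of that shuffling, with synthetic dates and values I made up for illustration: within each date the order is preserved, but the pooled order across dates is not.

```python
# z-scoring by date is a different affine map on each date, so taking ln()
# first changes the pooled ordering across dates even though the order
# within any single date is preserved.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": np.repeat(pd.date_range("2024-01-05", periods=4, freq="W"), 50),
    "value": rng.lognormal(mean=rng.normal(size=200), sigma=1.0),
})

def zscore_by_date(s):
    return s.groupby(df["date"]).transform(lambda v: (v - v.mean()) / v.std())

z_raw = zscore_by_date(df["value"])
z_log = zscore_by_date(np.log(df["value"]))

# Pooled (cross-date) ordering changes after the log transformation...
print((z_raw.rank() != z_log.rank()).mean())     # fraction of pooled ranks that moved
# ...while within each date the ordering is untouched.
same_within = df.groupby("date").apply(
    lambda g: (z_raw.loc[g.index].rank() == z_log.loc[g.index].rank()).all())
print(same_within.all())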

Best,

Jim

(Deleted by author)

For some reason, my "entire dataset" models always perform worse than their "by date" cousins, despite their theoretical information advantage.

1 Like