For ML Feature Selection, Can a Larger MSE Actually Mean a Better Model?

Yuval,

Without getting into a debate over scikit-learn’s r2_score and its use in machine learning, I think it’s worth pointing out that R² has at least two distinct interpretations.

In the ML context, R² typically measures how much better a model performs compared to simply predicting the mean — which can be meaningful in some regression tasks, but much less so in others (like ranking stocks). That may not be how you’re using it.

As a coefficient of determination, R² is more traditional: it captures the proportion of variance explained by a linear model. If your use aligns more closely with this interpretation — as in Excel regression or similar — then it seems quite reasonable. In that case, adjusted R² also makes sense to account for model complexity.

That said, regularized models like ridge regression intentionally introduce bias. In those cases, R² can penalize models that are actually better at ranking, even if their predicted returns are biased. Here, the error R² is penalizing isn’t really meaningful — because the goal is stock selection, not return prediction.
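
To make that concrete, here is a minimal sketch with toy numbers (nothing to do with any real model): a prediction that is heavily shrunk and biased, but ranks the outcomes perfectly, still gets a poor r2_score.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import r2_score

y_true = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])  # realized returns, in %
y_pred = 0.1 * y_true + 2.0                        # shrunk and biased, but order preserved

print(r2_score(y_true, y_pred))          # low: the bias and shrinkage are punished
print(spearmanr(y_true, y_pred)[0])      # 1.0: the ranking is perfect
```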

So as Ben mentioned, R² may not be ideal in ML settings where ranking matters more than raw predictive accuracy. But from what I can tell, your use seems closer to the traditional coefficient-of-determination view — which may still be quite appropriate.

Depending on your specific goals, I think you could both be right.

Even that is problematic. Say you fit y ~ x and get an R² of 0.8. You would say "x explains 80% of the variance in y." But reverse it and fit x ~ y. The R² is the same 0.8, so "y explains 80% of the variance in x" is also true. But how could that be? We can avoid getting into metaphysical issues about "explain" by simply using another measure that doesn't carry all this baggage. Nothing would be lost.

(Again, your mileage may vary)
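
A quick illustration of that symmetry, using made-up data (any x and y will do): the R² of the regression is the same in both directions because both equal the squared Pearson correlation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)      # y driven by x plus noise

r2_y_on_x = r2_score(y, LinearRegression().fit(x.reshape(-1, 1), y).predict(x.reshape(-1, 1)))
r2_x_on_y = r2_score(x, LinearRegression().fit(y.reshape(-1, 1), x).predict(y.reshape(-1, 1)))

print(r2_y_on_x, r2_x_on_y)   # identical up to floating point
```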

Fair point. That said, I don’t worry too much about using Excel to get a rough estimate of something like transaction costs. I’ve done things similar to what Yuval describes — and for that kind of task, I don’t feel the need to switch over to Python for a different metric.

But I think you’re absolutely right. Given the models I use — and the number and type of variables involved in our ports — I’ve also moved away from R² in favor of other approaches for machine learning.

So Ben makes some important points. But maybe an equally important question is this: what is the default metric in XGBoost? Is it reg:squarederror? And is it as misleading as everything in this thread suggests it might be? You might not just be off by a little; you could be led in exactly the wrong direction at times. That could clearly affect a lot of people if that is the case. If R² has similar problems, that makes it worse.

Either way, this isn’t a small issue if it turns out to be true.
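
For what it's worth, you can check the default straight from the estimator. In recent XGBoost versions the sklearn-style regressor defaults to reg:squarederror (plain squared error), with RMSE as the reported evaluation metric:

```python
import xgboost as xgb

# Default objective for the sklearn-style regressor (recent XGBoost versions).
print(xgb.XGBRegressor().get_params()['objective'])   # 'reg:squarederror'
```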

Remember, this becomes most apparent using make_regression, where we know with 100% certainty which factors are informative and which are noise. We can set that ourselves and know with absolute certainty. It's not like with our own features, where feature inversion, interactions, etc. mean we can never be sure. Anyone can look at the code and run it themselves.
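
For anyone who wants to reproduce that kind of experiment, here is a minimal sketch of the setup (the parameter values are arbitrary). With coef=True, make_regression hands back the true coefficients, so the informative columns are known exactly; the question raised in this thread is what the cross-validated MSE does when you add or drop them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

# 10 features, only 4 of which actually drive y; coef tells us which ones.
X, y, coef = make_regression(n_samples=2000, n_features=10, n_informative=4,
                             noise=10.0, coef=True, random_state=0)
informative = np.flatnonzero(coef)          # known with certainty by construction

def cv_mse(cols):
    model = ExtraTreesRegressor(n_estimators=200, random_state=0)
    scores = cross_val_score(model, X[:, cols], y,
                             scoring='neg_mean_squared_error', cv=5)
    return -scores.mean()

print(cv_mse(informative))                  # informative features only
print(cv_mse(np.arange(X.shape[1])))        # informative plus the noise columns
```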

What metric are you going to use in XGBoost if you don't get a different result? Or ExtraTreesRegressor for that matter?

Are there a lot of choices in ExtraTreesRegressor for metrics?
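
As far as I can tell, not many. In current scikit-learn the criterion argument accepts 'squared_error' (the default), 'absolute_error', 'friedman_mse', and 'poisson', for example:

```python
from sklearn.ensemble import ExtraTreesRegressor

# Split criterion options in current scikit-learn:
# 'squared_error' (default), 'absolute_error', 'friedman_mse', 'poisson'.
# Note that 'absolute_error' is noticeably slower to train.
model = ExtraTreesRegressor(n_estimators=300, criterion='absolute_error',
                            random_state=0)
```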

These algos are all about minimizing error/loss rather than maximizing reward/gain. One needs a clever approach to get closer to a reward-maximizing framework. Rewriting the loss function is probably needed.

The desired function:

Penalizes false positives the most: predicted gains that turn out to be large losses.

Penalizes false negatives less.

Penalizes large losses in a non-linear way.

Does not penalize large gains, or penalizes them only linearly.

I want to find or write something that does that, and MSE and MAE are not it.
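
Nothing off the shelf does exactly that as far as I know, but just to make the wish list concrete, here is a rough sketch of a per-observation loss with that shape (the 4x weight and the use of returns in percent are my own assumptions, purely illustrative):

```python
import numpy as np

def asymmetric_loss(y_true, y_pred, fp_weight=4.0):
    """Illustrative only: y_true and y_pred are returns in percent."""
    err = y_pred - y_true
    # Realized losses get a quadratic (non-linear) penalty, heavier still when
    # a gain was predicted, which is the false-positive case we care most about.
    loss_side = np.where(y_pred > 0, fp_weight, 1.0) * err ** 2
    # Realized gains get only a linear penalty, so under-predicted winners
    # (false negatives included) are punished far less.
    gain_side = np.abs(err)
    return np.where(y_true < 0, loss_side, gain_side)
```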

Huber loss in XGBoost, i.e. objective='reg:pseudohubererror'. The others are harder.
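
For reference, a minimal sketch of wiring that up (the hyperparameters are placeholders; in recent XGBoost versions huber_slope sets where the loss switches from quadratic to linear):

```python
import xgboost as xgb

# Pseudo-Huber: roughly quadratic for small errors, linear for large ones.
model = xgb.XGBRegressor(objective='reg:pseudohubererror',
                         huber_slope=1.0,        # transition point, default 1.0
                         n_estimators=500, learning_rate=0.05)
# model.fit(X_train, y_train)
```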

Yeah, except Huber has the same fundamental problem, though it does work better. Both Huber and MAE work better than MSE. Huber is quadratic (squared loss) for small differences and linear for big ones; MAE is all linear; MSE is all quadratic. Of these three I would go with Huber, but it still penalizes large gains and treats losses and gains the same, despite it taking a larger gain to offset a loss.

It's a tricky problem, but I am sure the big shops have it figured out.

There is an aside. As embarrassing as it might be, I would love to be 100% wrong on this. I would love to keep using criterion='squared_error' in my ExtraTreesRegressor model.

So the aside is that ChatGPT 5.0 is available and I just hate it immensely. I will be keeping my paid Anthropic plan. But as advertised, it is like talking to a PhD, one that uses integrals and summation signs in every discussion. But I thought it might prove me wrong, and as I said, that would be the best outcome:

It said:

You can honestly say:

MSE can fail as a guide for feature selection — there are realistic cases where adding truly informative features makes MSE worse. So at least in some situations, it’s not a reliable sole metric.

That’s an existence argument: you’ve shown with your make_regression example that it can happen, so MSE isn’t foolproof.

Q: Okay, not foolproof, but for feature selection is it not so bad usually, or is something better?

This answer was not terrible, I think. But I am still keeping my Anthropic subscription for now:
If these were real returns from the same stock universe and time frame, the variance of y wouldn’t jump around so much, so MSE would be more trustworthy — but still, if volatility regimes shift, you could run into the same issue.

If you keep to the same time frame, the same universe, and essentially the same set of returns, you might be better off. I wanted ChatGPT 5.0 to prove me wrong, and it did at least make me question my method. Now I really hate it. :wink:

It should be said that the problem is much deeper, because beyond MSE being used to evaluate, the loss function itself is what the trees use to learn. This is the real crux of the issue. They are training to be precise rather than to maximize outcomes, and we know how important outliers are with stocks.

Here are the loss functions supported by the mlforecast multi-series time series forecasting library, as a handy start: PyTorch Losses - Nixtla

Quantile loss "pays more attention to under or over estimation" depending on the parameter. I believe it is used in finance, but I'm not sure if it's typically used for returns prediction. The Huberized quantile loss also has this property.

The first thing that comes to mind for all your criteria is creating terms like:

L(y, y_hat) = penalty for FP + FN penalty + underestimated penalty + "normal penalty"

where, e.g., in the case of an FN penalty, the first and third terms end up as 0, and you're left with just the FN penalty and normal penalty.

I have no idea if a function exists that looks like this and has the desirable mathematical properties.

I think the key here is to offer the user numerous options for loss function.
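
For anyone who wants to experiment today, XGBoost's native API does accept a user-written objective as a gradient/Hessian pair. Here is a rough sketch of the FP/FN-weighted idea in that form (the weights are arbitrary assumptions, and the commented-out training call assumes you already have X and y):

```python
import numpy as np
import xgboost as xgb

def weighted_squared_error(y_pred, dtrain):
    """Squared error, re-weighted per observation: 'predicted a gain, realized
    a loss' hurts the most; 'predicted a loss, missed a gain' hurts the least.
    The weights below are illustrative, not anything standard."""
    y_true = dtrain.get_label()
    w = np.ones_like(y_true)
    w[(y_pred > 0) & (y_true < 0)] = 4.0    # false positive: heaviest penalty
    w[(y_pred < 0) & (y_true > 0)] = 0.5    # false negative: mildest penalty
    grad = 2.0 * w * (y_pred - y_true)
    hess = 2.0 * w
    return grad, hess

# dtrain = xgb.DMatrix(X, label=y)
# booster = xgb.train({'max_depth': 4, 'eta': 0.05}, dtrain,
#                     num_boost_round=500, obj=weighted_squared_error)
```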

One way to deal with that is to change your target. For example, I've been creating a ML method for buying OTM puts, so I really didn't want any stocks in my portfolio with a projected 100-day gain higher than -15%. I didn't give a fig about what the predicted returns were for 80% of the stocks. Any stock that makes a gain or even a small loss results in a 100% loss at expiration. So I set my target as Eval (Future%Chg_D (71) > -15, -15, Future%Chg_D (71)).

But I wouldn't apply this logic for going long. Perhaps a better target might be Eval (FutureRel%Chg_D(71) < 10 and FutureRel%Chg_D (71) > -10, 0, FutureRel%Chg_D (71)). That basically tells the program to ignore accuracy and search out the extremes.
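
If anyone wants to do the same thing outside of P123, both targets are one-liners in numpy. Here, fut_ret_71d and fut_rel_71d are assumed arrays of 71-day forward (relative) returns in percent:

```python
import numpy as np

# Eval(Future%Chg_D(71) > -15, -15, Future%Chg_D(71)):
# cap everything better than -15% at -15, keep the deep losers as-is.
put_target = np.minimum(fut_ret_71d, -15.0)

# Eval(FutureRel%Chg_D(71) < 10 and FutureRel%Chg_D(71) > -10, 0, FutureRel%Chg_D(71)):
# zero out the middle of the distribution, keep only the extremes.
long_target = np.where(np.abs(fut_rel_71d) < 10.0, 0.0, fut_rel_71d)
```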

Yes, I think a custom function could be required, but I doubt that is currently supported at p123. Thanks for your feedback; I will revisit it once I am further along. This type of (real-life investor) function usually goes under the name of a cost-minimizing function. For example, it penalizes a false positive a lot more than a false negative, etc.

If I were to swim in the deep "do it yourself" waters right now: LightGBM as an algo supports not just regression but also ranking objectives, so that is another idea I would test. How well can it predict the order of the returns rather than the magnitude? This would actually be more akin to P123 ranking and worth exploring, in my opinion.
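
For completeness, a rough sketch of what that looks like with LightGBM's ranker. LambdaRank wants integer relevance labels, so the forward returns would need to be binned per date first; X, return_decile, and stocks_per_date below are placeholder names, not anything from P123:

```python
import lightgbm as lgb

# LambdaRank learns the ordering within each group (here, each rebalance date).
ranker = lgb.LGBMRanker(objective='lambdarank',
                        n_estimators=500, learning_rate=0.05)

# return_decile: integer relevance grade per stock (e.g. decile of forward return)
# stocks_per_date: list of the number of stocks on each date, in row order
ranker.fit(X, return_decile, group=stocks_per_date)
scores = ranker.predict(X)   # higher score means predicted to rank higher
```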

Thanks, I like it. I will definitely try customizing targets like this once I get there, as it is closer to what I am trying to do as well. I basically want a list of, say, 50-100 stocks with high potential and then pick from there like I do with my current system. I don't require them all to do well, since if one is risky in an obvious way I probably would not have bought it anyway. Mostly I want to make sure I don't penalize the high outliers, and catch more of them. A lot of what I have spent years of time on is in not excluding them.

Yes, that would be a great addition. Maybe not a great number, but a number of great ones.

Yes, exactly. ML is itself an optimization problem, e.g., to minimize the MSE.

Yuval, in your alternative formula shouldn’t it be “and” instead of “or”?

Yes, of course. Silly me. I changed it above.