Asset Pricing and Machine Learning: A critical review

https://onlinelibrary.wiley.com/doi/full/10.1111/joes.12532

Interesting:

Despite the recent noticeable efforts to apply new methods in Asset Pricing, a major insight that transpires from this emerging literature is that the factor zoo issue is still unresolved. Every method we review indeed detects different groups of prominent factors for the cross-section, and at this stage it is complicated to tell which approach to prefer, especially considering that several well-known anomalies have become less relevant over time (Chordia et al., 2014) and that the predictive power of some characteristics varies after conditioning on the others (Freyberger et al., 2020). Notwithstanding, some factors tend to stick out more often than others, namely past returns (e.g., STR and momentum), liquidity factors (e.g., Pástor & Stambaugh, 2003), and trading frictions (e.g., SUV). Pinning down a small set of risk sources robust to various identification algorithms ultimately improves our comprehension of what drives returns and provides an excellent starting point for further research. In particular, this information can assist researchers with the question of the sparsity/density of the SDF, which remains open. While the majority of the papers find redundancies in the factor zoo (e.g., Feng et al., 2020), others claim that one needs to consider all anomalies to achieve good performance (Kozak et al., 2020). The point is far-reaching as it guides researchers' views in the process of building theoretical models to explain these empirical facts.

A second notable finding is that nonlinearities matter. Interactions among covariates play a big role in several studies, and matter more than nonlinearities in single characteristics (Bianchi et al., 2021), although not everybody agrees (Kozak et al., 2020).

Finally, new patterns have been discovered in specific applications thanks to ML. For example, high-frequency stock returns are largely driven by industry factors, in contrast to traditional characteristics-based factors (Pelger, 2020). Returns of assets traditionally difficult to describe with factor models, like options, are easily captured by a few economically interpretable factors (Giglio et al., 2022). Firm characteristics and first moments of returns contain valuable information to build low-dimensional but highly efficient statistical factor representations (Kelly et al., 2019; Lettau & Pelger, 2020b).

From the conclusion:

The risk of overfitting the data and the difficult interpretation of the procedures employed are the price to pay for the flexibility and performance of ML methods. Common grounds for data sample, evaluation metrics, and tools to identify the contributions of characteristics to expected returns are vital for future research.


From the paper:

“Regression trees are invariant to monotonic transformations of the data…”

This is one of the things that led P123 to think there would be a market for downloads of ranks, z-scores, and Min/Max, I believe. That was quite a while ago now (counting the API and DataMiner). Perhaps Marco was right about the market for this.
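To make that concrete, here is a minimal sketch (synthetic data, scikit-learn; not P123 code) of what the invariance means in practice: a regression tree fit on raw factor values and one fit on their ranks pick the same splits, so their in-sample predictions should match.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.lognormal(size=(500, 3))   # skewed "raw" factor values
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Rank-transform each column: a monotonic transformation, like P123 ranks.
X_rank = np.column_stack([rankdata(X[:, j]) for j in range(X.shape[1])])

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_rank = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_rank, y)

# Same orderings -> same candidate partitions -> same fitted tree,
# so in-sample predictions should agree exactly.
print(np.allclose(tree_raw.predict(X), tree_rank.predict(X_rank)))  # expect: True
```

That is why rank or z-score transformations cost tree-based models nothing, while making the data much easier to work with.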

Also, as a major heading in the paper: "Regularization…"

This will keep coming up as a topic as long as overfitting is a concern. Regularization and cross-validation are the primary remedies for overfitting (with ensemble techniques perhaps a distant third, which will be provided in P123's random forest models). P123 will most likely include cross-validation methods along with regularization, even if just early stopping in XGBoost.
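As a hedged illustration of how the two work together (synthetic data, scikit-learn; the walk-forward splitter and the ridge penalty grid are my choices for the example, not P123's implementation):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))    # stand-in for ranked factor data
y = X[:, :3] @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=1000)

# Walk-forward splits respect time ordering, unlike plain k-fold,
# which matters for return data.
cv = TimeSeriesSplit(n_splits=5)

# Cross-validation picks the strength of the ridge (L2) penalty.
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0, 100.0]}, cv=cv)
search.fit(X, y)
print(search.best_params_)
```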

And: " Dimension reduction…."

This has been part of P123's methods from the beginning in the form of composite nodes. E.g., the dimensionality of value factors can be reduced by putting them all into one node. Likewise, growth factors can be grouped into a node that serves the purpose of "dimension reduction."
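A toy version of a composite node (the column names are hypothetical, purely for illustration): z-score a few value factors and average them into a single feature, which is exactly the dimension reduction being described.

```python
import numpy as np
import pandas as pd

# Hypothetical per-stock value factors; the column names are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["earnings_yield", "book_to_price", "sales_to_price"])

def zscore(s: pd.Series) -> pd.Series:
    return (s - s.mean()) / s.std()

# Three correlated value factors collapse into one "value" feature,
# reducing dimensionality before any model sees the data.
df["value_composite"] = df.apply(zscore).mean(axis=1)
```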

P123 has been a great machine learning platform all along, using most of the methods discussed in this paper. With "monotonically transformed" data now available in the API, it is an advanced machine learning tool.

TL;DR: P123 is already a great machine learning tool requiring some manual optimization for now. Except for (possibly) regularization, none of the concepts in this paper are actually new to P123. Marco and Riccardo seem to have a good understanding of that already.

I appreciate that P123 already understands much of what can be used on this platform from the paper and continues to work on making these concepts available in a practical way, adding substantially to what it has already done in this regard.

Jim


I am personally glad to use the transformed data, and not have to fuss with the original numbers. Saves a bit of headache for me.

XGBoost has a number of hyperparameters for regularization (alpha, lambda, etc.) built in. Early stopping will help, too, and save computing power on P123’s end.
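For anyone who wants to try this, a minimal sketch with the native XGBoost API (synthetic data; the parameter values are placeholders, not tuned recommendations):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 30))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=2000)

# Time-ordered split: stop training based on the most recent slice.
dtrain = xgb.DMatrix(X[:1600], label=y[:1600])
dval = xgb.DMatrix(X[1600:], label=y[1600:])

params = {
    "objective": "reg:squarederror",
    "eta": 0.05,       # shrinkage, itself a mild regularizer
    "max_depth": 3,
    "alpha": 1.0,      # L1 penalty on leaf weights
    "lambda": 5.0,     # L2 penalty on leaf weights
}

# Early stopping halts boosting once validation error stops improving,
# which limits overfitting and saves compute.
booster = xgb.train(
    params, dtrain, num_boost_round=2000,
    evals=[(dval, "val")], early_stopping_rounds=50, verbose_eval=False,
)
print(booster.best_iteration)
```

The penalties act on the leaf weights directly, while early stopping caps the number of boosting rounds, so the two attack overfitting from different directions.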
