P123 users create models based on ideas of their own or ideas drawn from academic research, hoping those ideas will prove profitable out of sample. Designers who rely on factors from their own backtests face the challenge of determining whether the productive factors they find are genuinely robust: there is no practical way to have their conclusions peer-reviewed, which is a requirement of real science. For this reason, many turn to academic research for ideas. But has that research itself been independently replicated and verified? Many may be surprised to learn that, unlike traditional scientific research, it almost never has been, and so it faces the same challenge.

Last week, researchers published a new study on the NBER website (“[color=blue]Replicating Anomalies[/color],” Hou, Xue, and Zhang) which may be of interest to the P123 community. The study examines the reliability of 447 different anomalies uncovered by quantitative finance academics and other researchers since the 1980s.

The authors found that more than eight out of ten anomalies vanish when rigorous tests are applied. Among those failing to reach statistical significance: an anomaly recently proposed by the godfathers of quantitative finance, Nobel-winning economist Eugene Fama and his colleague Kenneth French.

EXCERPTS

“We replicate the entire anomalies literature in finance and accounting by compiling a largest-to-date data library that contains 447 anomaly variables. 286 anomalies (64%) are insignificant at the conventional 5% level. Imposing the cutoff t-value of three raises the number of insignificance to 380 (85%). Even for the 161 significant anomalies, their magnitudes are often much lower than reported.”

“[The list of anomalies] includes 57, 68, 38, 79, 103, and 102 variables from the momentum, value-versus-growth, investment, profitability, intangibles, and trading frictions categories, respectively. To control for microcaps that are smaller than the 20th percentile of market equity for New York Stock Exchange (NYSE) stocks, we form testing deciles with NYSE breakpoints and value-weighted returns. We treat an anomaly as a replication success if the average return of its high-minus-low decile is significant at the 5% level (t >= 1.96). Our results indicate widespread p-hacking in the anomalies literature.”
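The replication test described above boils down to computing a t-statistic on the average monthly return of a high-minus-low decile spread and comparing it to a cutoff (1.96 for the conventional 5% level, 3.0 for the stricter hurdle). The sketch below illustrates that calculation on simulated data; the spread series, its mean, and its volatility are invented for illustration and are not taken from the paper.

```python
import numpy as np

def t_stat(monthly_returns):
    """t-statistic of the mean of a return series (mean / standard error)."""
    r = np.asarray(monthly_returns, dtype=float)
    return float(r.mean() / (r.std(ddof=1) / np.sqrt(len(r))))

# Hypothetical high-minus-low decile spread: 0.3% mean monthly return,
# 3% monthly volatility, 30 years of data (assumed numbers, not the paper's).
rng = np.random.default_rng(0)
spread = rng.normal(loc=0.003, scale=0.03, size=360)

t = t_stat(spread)
print(f"t-statistic: {t:.2f}")
print("significant at the 5% level (|t| >= 1.96):", abs(t) >= 1.96)
print("passes the stricter cutoff (|t| >= 3.0):", abs(t) >= 3.0)
```

The point of the stricter cutoff is visible here: a spread can clear 1.96 yet fail at 3.0, which is exactly how the paper's count of insignificant anomalies jumps from 286 to 380.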

EXAMPLES OF FAILED ANOMALIES

• “Out of 447 anomalies, 286 (64%) are insignificant at the 5% level. Imposing the cutoff t-value of three raises the number of insignificant anomalies further to 380 (85%).”

• “The biggest casualty is the liquidity literature. In the trading frictions category that contains mostly liquidity variables, 95 out of 102 variables (93%) are insignificant.”

• “The distress anomaly is virtually nonexistent in our replication. The Campbell-Hilscher-Szilagyi (2008) failure probability, the O-score and Z-score studied in Dichev (1998), and the Avramov-Chordia-Jostova-Philipov (2009) credit rating all produce insignificant average return spreads.”

• “Even for significant anomalies, their magnitudes are often much lower than originally reported. Prominent examples include the Jegadeesh-Titman (1993) price momentum; the Lakonishok-Shleifer-Vishny (1994) cash flow-to-price; the Sloan (1996) operating accruals; the Chan-Jegadeesh-Lakonishok (1996) earnings momentum, formed on standardized unexpected earnings, abnormal returns around earnings announcements, and revisions in analysts’ earnings forecasts; the Cohen-Frazzini (2008) customer momentum; and the Cooper-Gulen-Schill (2008) asset growth.”

SOME TAKE-AWAYS

The market anomalies that passed the new study’s tests for statistical significance include several of the biggest: cheap stocks do beat expensive ones, share prices have momentum, companies that invest heavily underperform, and quality of earnings matters. However, 85% of the anomalies did not pass, including many that are well known and frequently referenced by the P123 community. Serious P123 model designers may wish to download and review the paper ($5 from NBER or SSRN) for insight into which factors remain durable under rigorous, unbiased testing by peers, a fundamental tenet of scientific legitimacy.

Chris