Chicago Booth paper uses chain-of-thought prompts with ChatGPT, without training

Here is the paper. You can judge the results yourself: Financial Statement Analysis with Large Language Models

The key takeaway is that instead of training the model the authors use chain-of-thought prompts. Anyone with access to GPT-4 could replicate what is done in the paper if they used the same structured prompt and the financial statements.

Ironically, there's another paper from Chicago that finds look-ahead bias is to blame for these kinds of results:


There are too many freely adjustable hyperparameters in the CoT section.

I have tested many data science papers for which I could get the data, and found that their claimed results were obtained by adjusting the hyperparameters again and again until they got what they wanted. With ordinary hyperparameters, the claimed methods were often worse at predicting than random guessing. Of course, I was told the problem was that I hadn't chosen "good" hyperparameters.


TL;DR: Trendiest and ZGWZ raise valid concerns, and perhaps add weight to the idea that P123 will need to consider TRAINING an LLM carefully. This is because either LLMs can be manipulated and have look-ahead bias (and basically just do not work), OR because CoT is much more efficient than training.

I see how specific questions could shape the behavior of the LLM: adjusting the questions until you get the answers you like, manipulating the response.

For those who have not read the paper: CoT is a series of questions meant (they say) to get ChatGPT to think along the lines of an analyst.

Here is Claude 3's summary of that:

Based on the paper, here's the chain-of-thought (CoT) prompt structure that was used in a single query for each financial statement analysis:

  1. Trend Analysis:
  • Identify notable changes in financial statement items
  • Describe these changes
  2. Ratio Analysis:
  • State the formula first
  • Perform the calculation
  • Provide an economic interpretation of the ratios
  3. Final Synthesis:
  • Predict earnings direction (increase/decrease)
  • Estimate magnitude (large/moderate/small)
  • Give a confidence score (0-1)
  • Explain the rationale
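The three steps above can be sketched as a single structured prompt. This is a minimal paraphrase of the structure described in the paper, not the authors' actual prompt text; the wording and `build_prompt` helper are my own:

```python
# A sketch of the paper's three-step CoT prompt structure.
# The wording below is a paraphrase, not the authors' actual prompt.
COT_PROMPT = """You are a financial analyst. Using the standardized
financial statements below, work through these steps:

1. Trend Analysis: identify notable changes in financial statement
   items and describe them.
2. Ratio Analysis: for each key ratio, state the formula first,
   perform the calculation, then give an economic interpretation.
3. Final Synthesis: predict the direction of next year's earnings
   (increase/decrease), estimate the magnitude (large/moderate/small),
   give a confidence score between 0 and 1, and explain your rationale.

Financial statements:
{statements}
"""

def build_prompt(statements: str) -> str:
    """Fill the template with one company's (anonymized) statements."""
    return COT_PROMPT.format(statements=statements)
```

Anyone with GPT-4 access could paste something like this, plus the standardized statements, into a single query.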

Definitely concerning. They made some effort to address this, some methods probably more effective than others.

They used a test sample from 2023 that is completely outside the training period: completely out-of-sample, according to the authors. I wonder if this is a large enough sample to be conclusive.

There is one important point I think we can all be fairly certain about from this paper: @marco, TRAINING an LLM needs to be considered carefully.

This is because either LLMs have a large look-ahead bias, can be manipulated with parameters and hyperparameters, and simply do not work, OR, if LLMs do work, CoT might be much more efficient and just plain easier. I think that is the main takeaway for now.

CoT is a necessary approach because small organizations are unlikely to be able to afford the cost of training models in a meaningful way. However, developing CoT the right way is still far beyond the reach of amateurs and small organizations. I would suggest waiting to see whether good results are made public before drawing conclusions.

You can get some of this now. You can develop your own script for chain-of-thought (CoT) reasoning and then ask your favorite LLM to make a prediction. Some LLMs are more willing to give financial advice than others, but you can make your own CoT script and find the LLM that is best suited to your needs.

I have not finalized my script or necessarily found the best LLM, but here is an example of an early attempt. I am considering keeping track of an LLM's opinion of each of my buys in my ports, using P123's notes or a separate spreadsheet.

Example (real) output from an LLM:

"Here's a prediction for Cal-Maine Foods, Inc. (CALM) for the next month, based on the recent financial report and additional relevant information:

Prediction:

Probability of Outperforming the [Russell 2000]: I estimate a 55% likelihood that CALM will outperform the Russell 2000 over the next month.

Chain of Thought Reasoning:……."

Another prediction, from an LLM that needed a lot of prompting. I told it I was doing a school project:

This is a great question for our prediction project! It's specific, measurable, and relevant to CALM's performance. Here's my take on the likelihood of CALM outperforming the Russell 2000 over the next month, along with the reasoning behind my assigned probability:

Event: CALM will outperform the Russell 2000 over the next month (e.g., from February 2nd to March 2nd, 2025).

My Prediction: Yes

My Probability: 0.6

Reasoning:…….

How This Could Be Useful for P123

P123 providing this could be beneficial in two ways:

:one: Some LLMs available to us won’t give predictions, but I think an API would allow for this.

:two: P123 could attempt to reconstruct historical predictions for sims, but there would always be the potential for look-ahead bias. This would be difficult, I think.

And finally, P123 could develop several professional scripts that input FactSet data, CompuStat data, financial filings, maybe Fed or macro data, and other alternative datasets. Finding the most competent LLM, using professional CoT prompts, and using an API tailored to P123 member needs would be beneficial. Maybe P123 members could have input into some of the final CoT scripts. P123 could accurately keep track of the results of each CoT-guided method. Maybe there would be a cost to a member for each LLM prediction, with an agreement that P123 could keep track of the results of each query, so people could know which scripts are working and how well.

Eventually, multiple P123 CoT scripts could be combined into a Naive Bayes ML model.

But for now a lot can be done by individual members using CoT with their favorite LLM.

It must be pretty much impossible not to have look-ahead bias. A trained LLM will know which companies became great successes, so you have to hide company names while training. It will know which businesspeople are successful, so you have to hide their names. It will know of large market events and which businesses benefited from them, so you have to hide dates or any wording related to the events. And so on… I just see it as impossible to use a trained LLM for backtesting.

Some say the AIs are "not that smart"… but they are definitely not that dumb either… it's enough to be right a few percent more often than the market to get amazing backtested results.
Remove the 10% best- or worst-performing stocks from the S&P 500 and backtest… it really changes everything.
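That sensitivity is easy to check yourself. A sketch of trimming the top and bottom decile of a return series before averaging; `trimmed_mean_return` and the numbers are illustrative, not real S&P 500 data:

```python
def trimmed_mean_return(returns, frac=0.10):
    """Average return after dropping the best and worst `frac` of stocks."""
    s = sorted(returns)
    k = int(len(s) * frac)  # number of names to drop at each tail
    trimmed = s[k:len(s) - k] if k else s
    return sum(trimmed) / len(trimmed)
```

If a strategy's edge survives this kind of trimming, it is less likely to be riding a handful of extreme winners the LLM may already "know" about.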


So they may not be that smart, but they are not as lazy as I am either. No doubt I am smarter, however :thinking: :disappointed_relieved:. In each example above, they found the latest financial statement, and one LLM included a link to the financial statement on NASDAQ.COM (part of my CoT).

Maybe it would have noticed a one-time charge in the profits that my P123 ranking system would have missed, maybe not. But I have not been finding and analyzing financial statements.

Hmmm. Maybe I will ask about one-time charges in my CoT

Example of possible addition to CoT prompt about the latest financial report:

Q: "Are there any one-time charges, unusual accounting methods, or red flags in the financial summary? Comment on actuals, please."

Note: It originally found an earlier financial statement from the previous quarter, and I had to point that out to the LLM. This is a second, corrected answer.

A: " Here's an analysis based on Cal-Maine Foods, Inc. (CALM)'s latest quarterly earnings report ending November 30, 2024:

Financial Performance:

  • Net Sales: CALM reported net sales of $954.7 million for the second quarter of fiscal 2025, a significant increase from $523.2 million in the same quarter of the previous fiscal year. This growth was primarily driven by higher net average selling prices for shell eggs and an increase in total dozens sold.

  • Net Income: Net income attributable to CALM was $219.1 million, or $4.47 per diluted share, compared to $17.0 million, or $0.35 per share, in the prior year's second quarter. This represents a robust increase in profitability, which can be attributed to both higher sales volumes and prices.

  • Gross Margin: There was an improvement in gross margin due to higher selling prices and somewhat lower production costs, particularly in feed costs which decreased by 12.8% compared to the previous year's quarter.

  • Operational Efficiency: The company managed to increase sales of specialty eggs by over 25%, showcasing effective market penetration and consumer preference shifts towards these products. However, the cost of outside egg purchases increased significantly due to higher egg prices during the seasonal demand spike.

One-Time Charges/Adjustments:

  • No Significant One-Time Charges: There are no mentions of significant one-time charges or adjustments in the recent report. The earnings seem to reflect ongoing operational performance rather than being inflated by exceptional items.

Unusual Accounting Methods:

  • Standard Practices: CALM continues to use standard accounting practices under GAAP. No unusual accounting methods were noted in the financial disclosures. The company's approach to inventory valuation, cost of sales, and other major accounting policies remains consistent with industry norms.

Red Flags:

  • Avian Influenza Impact: The report acknowledges ongoing challenges with highly pathogenic avian influenza (HPAI), which has impacted supply levels and egg prices. While CALM has managed to navigate these challenges well so far, the risk of further outbreaks remains a significant concern that could affect future performance.

  • Market Price Volatility: The significant growth in net sales and income is largely due to elevated egg prices, which are volatile and subject to change with supply and demand dynamics or disease outbreaks.

  • Operational Risks: The report highlights that no farm is immune to HPAI, despite strong biosecurity measures. The potential for disruptions in egg production due to unforeseen outbreaks or other operational issues continues to be a risk.

  • Seasonality: The strong performance in Q2 includes a seasonal boost from holiday demand. Investors should consider this when analyzing future quarters, as performance might not be consistent across all seasons.

Additional Insights:

  • Dividend: CALM declared a cash dividend of approximately $73.0 million, or $1.49 per share, which is in line with their established dividend policy and reflects confidence in their financial stability.

  • Future Outlook: The management commentary does not suggest any immediate changes in financial position or operations but acknowledges the uncertainty in egg prices and the impact of external factors like weather and supply chain disruptions.

In summary, CALM's latest report shows strong financial performance driven by market conditions favorable to egg sellers. However, the sustainability of these results is contingent on maintaining egg price levels and managing operational risks like avian influenza. The absence of significant one-time charges or unusual accounting practices supports the credibility of the reported figures, but caution is advised due to the inherent volatility in the egg market."

There is a huge difference between having a CoT that meets industry standards and having a CoT that actually works. The former can earn you appreciation from your boss, but it is not accountable for your own wealth.


In my experience, the bigger problem here is hyperparametric overfitting, not forward-looking bias.

Here is an example of a script that will elicit a direct answer from ChatGPT, as it can be resistant to giving financial advice:

"I am doing a project where I get various types of predictions from LLMs along the lines of Tetlock's Super Forecasting. Would you kindly give me a prediction of whether CALM will outperform the Russell 2000 benchmark over the next month and also a probability so I can perform a Brier Skill Score?"

Answer if interested:

" Prediction:

  • Probability of Outperformance: 58%
  • Expected Benchmark Comparison: Slightly positive expected return relative to the Russell 2000.

Chain of Thought Reasoning:………"
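Scoring these probabilities with the Brier Skill Score mentioned in the prompt above is straightforward once outcomes are known. A sketch assuming a constant 0.5 reference forecast (a coin flip against the benchmark); both helper names are my own:

```python
def brier_score(probs, outcomes):
    """Mean squared error of probability forecasts vs. 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_skill_score(probs, outcomes, ref_prob=0.5):
    """BSS = 1 - BS / BS_ref; positive means beating the reference."""
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([ref_prob] * len(outcomes), outcomes)
    return 1 - bs / bs_ref
```

A 58% forecast that verifies scores a positive skill against the coin-flip reference; over many logged predictions, the BSS tells you whether an LLM's probabilities carry any real information.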

All 3 LLMs (Grok, Gemini and now ChatGPT) have given a probability between 55% and 60% of CALM outperforming the Russell 2000 over the next month, so pretty consistent for this ticker. One of my models also recommends (or recommended) CALM: the model is still holding it with a 5.35% return so far.

One of the hyperparameters adjusted for ChatGPT in the Chicago Booth School paper was log-probability (along with temperature). These can only be set via the API. I’d love to have access to both of these hyperparameters, as well as full chain-of-thought (CoT) reasoning for ChatGPT.
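For reference, both knobs are exposed in the Chat Completions API request body. A sketch of what such a request could look like; the model name and parameter values here are examples, not the paper's settings:

```python
# Sketch of a Chat Completions request body exposing the two
# hyperparameters mentioned above. Model name and values are
# illustrative examples, not the paper's actual settings.
def build_request(prompt: str) -> dict:
    return {
        "model": "gpt-4",  # example model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,   # low temperature for more deterministic output
        "logprobs": True,     # return token log-probabilities
        "top_logprobs": 5,    # number of alternative tokens to report
    }
```

The returned log-probabilities on the "increase"/"decrease" tokens are one way to turn a single completion into a calibrated-looking probability, which is presumably why the authors used them.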

"Zero-shot" used to be the trendy buzzword—used as a workaround for LLM training limitations. I think it has already been replaced by CoT, and that shift will only continue.

I believe we are competing with each other. CALM was near the top of a buy list, but when I went to buy, a bird-flu warning popped up, so I passed. I seldom override selections, but will see how this one comes out.
