Don't use LLMs for accounting measures

I want to warn users not to use LLMs (e.g. ChatGPT) to figure out accounting measures to use in screens and ranking systems. They make a huge number of horrible mistakes.

For example, I wanted to figure out how to get from a mining company's financial statements to a number that would resemble the all-in sustaining costs. The number of blatant errors that ChatGPT made would make an accounting professor pull her hair out, and all the while it was pretending that it was an absolute authority. To name a few:

  • Using Compustat codes, it wanted to add XRD to XSGA when XSGA already includes XRD.
  • It thought that exploration costs for mining companies would be classified as R&D expenses. That's almost never the case.
  • It suggested that I use operating cash flow as part of a formula for expenses. Operating cash flow is adjusted income, not expenses.
  • It suggested subtracting investing cash flow from that, but since investing cash flow is usually negative, that would give the opposite result from what it suggested.
  • It suggested that D&A is always included in COGS, while in many income statements it's a separate line item that distinguishes EBIT from EBITDA.
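To make two of these pitfalls concrete, here is a minimal sketch with made-up numbers; the Compustat item names are the ones mentioned above, and the figures are purely illustrative:

```python
# Illustrative only: made-up numbers, Compustat-style item names from the list above.
xsga = 500.0   # SG&A as reported (XSGA), which already includes R&D
xrd = 120.0    # R&D expense (XRD), a component of XSGA
cogs = 2000.0  # cost of goods sold (COGS)

# Wrong: adding XRD on top of XSGA double-counts R&D.
opex_wrong = cogs + xsga + xrd    # 2620, overstated by 120
# Better: XSGA already contains XRD, so don't add it again.
opex_better = cogs + xsga         # 2500

# Sign pitfall: investing cash flow is usually negative (cash spent on capex),
# so "subtracting" it actually increases the total instead of reducing it.
operating_cf = 800.0
investing_cf = -300.0                               # an outflow
free_cash_flow = operating_cf + investing_cf        # 500, the intended result
opposite_of_intent = operating_cf - investing_cf    # 1100, the opposite effect

print(opex_wrong, opex_better, free_cash_flow, opposite_of_intent)
```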

I only caught these errors because I've studied accounting measures and have been using them for years. I can imagine a user who's unfamiliar with accounting measures being convinced that the nonsensical formulas that AI comes up with have some validity. I haven't figured out the answer to my question yet, but using ChatGPT was a huge time suck. I now understand why coders have been found to work faster without using LLMs.

Many users have found AI useful for answering questions about machine learning. Maybe it's a lot better dealing with machine learning questions than accounting questions, but I'm not going to trust it for practically anything anymore.

4 Likes

The way I experience these tools is much like how I experienced lecture summaries in college. These were summaries that other students made while attending the lectures and assignment classes - while I was, well… doing other things.

I did buy these class summaries, and for me they made studying much more efficient. I would recognize inconsistencies and would spot the mistakes pretty quickly. When I spotted them, I would do my own research, and that way I got a pretty quick overview of the course material and some idea of how other students perceived the texts and classes. It gave me a good idea of what other students thought was hard and what they found easy (because the easy parts wouldn't contain that many mistakes or inconsistencies). That gave me an advantage, because I could see which parts of the material to focus on - which parts separated the wheat from the chaff. I usually got good grades using those summaries.

Some other students, however, had a horrible time using those summaries. They would often tell me that sometimes these summaries said "yes" when they should have said "no", and that they wouldn't recommend using them.

Interestingly enough, it was usually the very good students and the very bad ones (defined by their past grades) who made these comments. The very good students liked to go through the material and classes on their own and spent a lot of time making better notes, to the point that they saw no use for the summaries. They were less interested in taking a meta view of what other students were thinking, because what they did just worked. The bad students bought the summaries while putting in zero effort to think on their own. So they got horrible grades.

For ChatGPT, I would say, it is quite similar.

If you have done a lot of work yourself, to the point that there's limited knowledge to add, and you have had great results that way, there's probably a better use for your time. Perhaps telling it to reference reputable sources for each statement it makes can help. But that's it (at least for now).

If you haven't done any work, then do the work first, because otherwise you will be misled and get bad results.

Or if you are a bit like me and enjoy analyzing reality, but also enjoy the thought patterns that govern other people's behavior, spotting mistakes and inconsistencies, then it might be time well spent.

I'd be curious to see if some of the pros here would test out Claude Sonnet 4.5, as it is supposed to score quite well (relatively!) on the financial literacy expected of an entry-level financial analyst: Finance Agent

1 Like

More practical input:

Did you use “GPT 5.0 - auto / instant” (top left of the dashboard)?

Manually selecting "GPT 5.0 - Thinking", and then selecting 'long think' at the bottom, will usually produce better answers. Though I cannot guarantee it would lead to better results than you got.

Thanks Yuval,

Accounting measures are not really my strength for now.

But for coding it is good to have one LLM checking another. And it's not just me who finds this to be true. Grok 4 Heavy is actually just multiple Grok 4 models working in parallel and selecting the best answer.

Great idea but at $400 a month I think I will pass.

Instead, I tend to use ChatGPT, then have Claude and Gemini look for flaws and improvements. Major mistakes have been found, as well as big improvements.

Note, this is different from, and possibly better than, Grok 4. Grok 4 is not adversarial; it doesn't have one model questioning and looking for flaws in another LLM's answer. And allowing the models to improve their responses based on feedback from the other LLM helps. As a bonus, it's quite a bit cheaper than Grok 4 Heavy.
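For what it's worth, the workflow I'm describing is roughly this sketch. The `ask()` helper is a hypothetical stand-in for whatever client or UI you use for each model, not a real API call:

```python
# Sketch of the adversarial cross-checking workflow: one model drafts an answer,
# two others critique it, and the first model revises based on the critiques.

def ask(model: str, prompt: str) -> str:
    # Hypothetical stand-in: in practice this would wrap whatever chat API or UI
    # you use for the given model. Here it just returns a placeholder string.
    return f"[{model}'s answer to: {prompt[:40]}...]"

def cross_check(question: str) -> str:
    draft = ask("chatgpt", question)

    # Have two other models act as adversarial reviewers of the first answer.
    critiques = [
        ask(reviewer,
            f"Find flaws, errors, or improvements in this answer.\n\n"
            f"Question: {question}\n\nAnswer: {draft}")
        for reviewer in ("claude", "gemini")
    ]

    # Feed the critiques back so the original model can revise its response.
    return ask("chatgpt",
               f"Question: {question}\n\nYour earlier answer: {draft}\n\n"
               f"Reviewer comments:\n" + "\n\n".join(critiques) +
               "\n\nRevise your answer, addressing any valid criticisms.")

print(cross_check("How should I proxy all-in sustaining costs from financial statements?"))
```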

1 Like

You are absolutely right, Yuval… it's a professional liar… you should know a lot about the matter before asking something, and take the answers with a pinch of salt, like brainstorming, never as fact… it's like a baby with some good and some horrible ideas… in very specific contexts it sometimes just gives horrible answers.

I asked my agent for All-In Sustaining Cost (AISC) proxy and it gave me this formula:
CostGTTM + SGandATTM + DepAmortTTM

And added explanations:

Correctness Check Summary:

  • CostGTTM + SGandATTM: Correctly captures the vast majority of cash and non-cash operating costs and corporate overheads (items a, b, c, e, f, g, h, i, j, m).

  • DepAmortTTM: Serves as a stable, accounting-based proxy for sustaining capital and reclamation costs (items n, o, p, q, r).

  • Limitations Acknowledged: The model cannot subtract by-product credits (k) and uses an accounting proxy for cash-based sustaining capital. However, it is the most robust and stable methodology possible given the constraints of standardized financial data.

What does this mean? An LLM you trained yourself? I have my doubts about DepAmort because it's a non-cash item, but I have heard of it being used as a proxy for maintenance capex.

I've been using the "Auto" setting. Switching to the "Thinking" setting might help.

I think it's terrific that they gave all these LLMs a test with a database of 537 questions. The fact that the Claude models all outperformed ChatGPT 5.0's reasoning model is nice, but the level of outperformance is not terribly impressive. ChatGPT scored around 47% and Claude's thinking models scored between 51% and 56%. Both would likely get failing grades in a finance class.

1 Like

Yeah, they're improving, but they're still confidently wrong far too often to be trusted with critical tasks like understanding and interpreting accounting measures, unless you yourself understand enough to spot the errors when they make them. And in most cases you're probably better off just using your own expertise. This thread is a good warning about it.

I want to ensure I understand this thread. What’s the issue with non-cash expenses in your use case?

So if the errors are random and the models are missing 50% of the questions, then the chance that ALL of the LLMs will be wrong on a particular question is 0.5^4, or just over 6%, with 4 LLMs.

6% is an A in some classes. That assumes you have a way of finding which LLM is right before you give the final answer of course. Maybe we open a book or do a separate search.

But easier than that, when you ask them to compare notes (responses) they usually come to quick agreement.

Maybe the errors are not random, but that is a pretty simple calculation (0.5^4) that Elon Musk might have run when he decided that he could get $400 per month for Grok Heavy, which basically generates multiple answers and finds the best one. BTW I could not find how many separate "agents" Grok Heavy uses, but it's probably more than 4.

Correlation of the errors is a potential problem, but having the LLMs look for faults in each other's analyses is an improvement over Grok Heavy, I suspect.

I have not looked at this in depth; correlation of errors is the biggest potential problem.
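As a back-of-the-envelope illustration (my own made-up correlation structure, not real data), a quick simulation shows how correlated errors inflate the chance that all four models are wrong at once, compared with the independent 0.5^4 = 6.25% case:

```python
# Rough simulation: probability that ALL 4 LLMs miss the same question,
# independent errors vs. errors driven by a shared cause. Numbers are illustrative.
import random

random.seed(0)
N_TRIALS, N_MODELS, P_WRONG = 100_000, 4, 0.5

def prob_all_wrong(shared_weight: float) -> float:
    """shared_weight = 0 gives independent errors; 1 makes every model
    fail on exactly the same questions (fully correlated)."""
    count = 0
    for _ in range(N_TRIALS):
        hard_question = random.random() < P_WRONG   # common source of failure
        wrong = [
            hard_question if random.random() < shared_weight
            else random.random() < P_WRONG          # model's own independent error
            for _ in range(N_MODELS)
        ]
        count += all(wrong)
    return count / N_TRIALS

print(prob_all_wrong(0.0))  # ~0.0625, the 0.5^4 case
print(prob_all_wrong(0.7))  # roughly 0.26: much higher once errors share a cause
print(prob_all_wrong(1.0))  # ~0.5: no benefit at all from the extra models
```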

A human should be involved in the "discussion" also, taking the error rate down further.

I can say it makes a HUGE difference in coding.

I don't accept a single answer from ChatGPT either. We certainly agree on that. On important questions I check it 3 times (2 LLMs and me).

So we're in close agreement. But no one said it would be THAT easy to beat the hedge funds: ChatGPT and you're done? :thinking: :rofl:

1 Like

I use the same technique, using either the same model and/or a different model to check the outputs, which does seem to effectively (at least in my anecdotal usage) catch hallucinations, simplified explanations or interpretations, and other errors.

And yes, if it were only that easy to make that stock market money, unfortunately not!

2 Likes

While I use Grok 4, I have seen many benchmarks rank Gemini as the least likely to hallucinate, sometimes hallucinating only half as much as ChatGPT or Grok, so it is perhaps the winner there.

2 Likes

Gemini is strong on some types of coding. Not sure I can explain which ones. But, for example, ChatGPT gave me some good code. Claude liked it too. It ran fine and gave good answers.

When I showed it to Gemini, it said this is “embarrassingly parallel” which is a programming term that I am embarrassed to say I was previously unaware of.

My program, thanks to Gemini and Joblib, now runs eight times faster using 8 of my CPU cores. And I know what they mean by "embarrassingly parallel" and how to use Joblib as a bonus.
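For anyone curious, the pattern is as simple as this. A sketch with a dummy work function, not my actual program:

```python
# Minimal example of the "embarrassingly parallel" pattern with joblib:
# each task is independent of the others, so they can be farmed out to
# separate CPU worker processes with no coordination needed.
from joblib import Parallel, delayed

def run_one(param):
    """Stand-in for an independent unit of work, e.g. one backtest or one fold."""
    return param ** 2

params = range(1000)

# n_jobs=8 uses 8 worker processes; n_jobs=-1 would use every available core.
results = Parallel(n_jobs=8)(delayed(run_one)(p) for p in params)
print(results[:5])
```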

All those AIs are very good if you are a master of your own subject; if not, you're lost in space.

1 Like

I used Gemini Pro 2.5, which in my view is the best model for designing stock factors. (Claude is the best for me when it comes to complex programming tasks.) I use Gemini quite often, asking it to design sets of specific factors and complex formulas. It consistently provides correct Portfolio123 syntax, and in most cases the quality of the factors is "okay", though not groundbreaking.

I asked Gemini how it would describe my prompting strategy and this is the answer from Gemini:

In-Context Learning: This is the primary term. It refers to the LLM's ability to "learn" and perform a task based only on the information provided within the context window of the prompt. The model's underlying weights are not changed (this is not training). It simply uses the data you provided as its source of truth for the duration of our conversation to generate a specific, constrained output.

Grounding: You are grounding the model's response in a specific set of facts. A common problem with LLMs is "hallucination," where they invent plausible-sounding but incorrect information. By providing the CSVs, you grounded my output to your specific reality (your factors and functions), dramatically reducing the chance of hallucination and ensuring the output is factually correct relative to your system.

Retrieval-Augmented Generation (RAG): What you did is a manual implementation of the core principle behind RAG. RAG is an advanced architecture where, in response to a user query, a system first retrieves relevant documents (like your CSVs) from a knowledge base and then augments the user's prompt by adding this retrieved information as context. The final, augmented prompt is then sent to the LLM. You acted as the "retrieval" system, found the relevant documents, and added them to the prompt for me to use. This is a state-of-the-art technique for making LLMs more accurate, reliable, and context-aware.

Few-Shot Prompting: The examples_of_factors.csv file also makes this an example of "few-shot prompting." By showing the model examples of what you don't want (and implicitly, the style of what you do want), you are giving it reference points to guide its generation, leading to a much better and more aligned result than if you had given it zero examples.

To summarize, you used a sophisticated prompting strategy that leveraged in-context learning and grounding, manually performing the steps of a Retrieval-Augmented Generation (RAG) system to ensure a precise, non-hallucinated, and executable response. This is precisely how LLMs are intended to be used for complex, domain-specific tasks.
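In practice, the "manual RAG" step is nothing more than pasting the reference files into the prompt. Something like this sketch: `examples_of_factors.csv` is the file mentioned above, while `p123_functions.csv` and the example task are placeholder names I'm using for illustration.

```python
# Sketch of the manual grounding / RAG-style prompt assembly described above:
# paste the reference CSVs into the prompt so the model treats the actual
# factor and function lists as its source of truth.
from pathlib import Path

def build_prompt(task: str) -> str:
    functions_csv = Path("p123_functions.csv").read_text()      # placeholder file name
    examples_csv = Path("examples_of_factors.csv").read_text()  # file from the post above
    return (
        "Use ONLY the functions and factors listed below as your source of truth.\n\n"
        f"Available Portfolio123 functions/factors:\n{functions_csv}\n\n"
        f"Examples of existing factors (few-shot reference for style):\n{examples_csv}\n\n"
        f"Task: {task}\n"
        "Return only valid Portfolio123 syntax."
    )

# Usage (assuming both CSV files exist alongside the script):
# prompt = build_prompt("Design five quality factors for small-cap industrials.")
# The assembled prompt is then pasted into, or sent to, the LLM of your choice.
```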

2 Likes

I just had a discussion with ChatGPT and Gemini about overfitting and Kolmogorov complexity, which is the core idea of Algorithmic Information Theory (AIT). It's actually a simple concept: the longer it takes to describe a method, the more complex the method is (complexity being measured by the information needed to describe the method). How simple is that?

I am comparing 2 methods: one that would take a very long post to describe, which would be difficult for anyone to wade through (code-savvy or not), and one that I can fit into a sentence. According to AIT, the one that is difficult to describe has a lot of embedded information, which can lead to overfitting.
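One crude but concrete way to put a number on "how long it takes to describe a method" is to compare the compressed size of each description, a common practical stand-in since true Kolmogorov complexity is uncomputable. A toy sketch with placeholder text, not my actual methods:

```python
# Toy illustration: compressed length of a method's written description as a
# crude proxy for its descriptive complexity. The strings are placeholders.
import zlib

def description_complexity(text: str) -> int:
    return len(zlib.compress(text.encode("utf-8")))

simple_method = "Rank stocks by one factor and buy the top decile monthly."
complex_method = (
    "Step 1: compute a dozen intermediate scores with special-case rules; "
    "Step 2: blend them with hand-tuned weights per sector and regime; "
    "Step 3: override the blend under various market conditions; ..."
    # placeholder standing in for the multi-paragraph write-up
)

print(description_complexity(simple_method))
print(description_complexity(complex_method))
# By the AIT argument above, the description with the larger compressed size
# carries more embedded information, and with it more room to overfit.
```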

This is the same code where I now take advantage of the "embarrassingly parallel" nature of my code, as pointed out by Gemini.

I think the LLMs got that (and more) right. The forum would not have given me a similar discussion. And I will not bore you with the full details here.

For code and simple ML concepts like AIT, LLMs are indispensable.

I guess the other thing that actually confirms what Yuval and others are saying is that the LLMs probably got it wrong initially when judging which method is more likely to lead to overfitting. Initially, they ALL liked the model that took paragraphs to describe. Looking at all of the LLMs' answers, synthesizing what they wrote, and pushing back gave a more nuanced answer at least. And probably a better one.

If the point is that you cannot blindly follow anyone (or even a group of LLMs) I could not agree more.

1 Like

Gemini performed significantly worse than Claude and ChatGPT on the finance exam referenced by parallaxblue above, FWIW. I don't know. Every time I switch to a new AI model I end up getting garbage answers to straightforward questions. Claude Sonnet was fine until I asked it a legal/tax question and it cited cases in which the ruling was opposite to what it was arguing for. I simply can't understand why nobody seems to have trained these AI models to check their sources when they're presenting information. It seems it would be so easy to program that one little extra step.

1 Like