Gemini LLM accurate so far (but you would not have made money)

I started randomly selecting 50 stocks then asking Gemini Flash 2.0 with apps to predict the probability that the randomly chosen stocks would outperform the Russell 2000 using a long chain-of-thought prompt. Finally, I formed a portfolio of the 5 stocks with the highest probability of outperforming according to Gemini and another portfolio of the 5 stocks with the lowest probability of outperforming.
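The selection step described above can be sketched in a few lines. This is a hypothetical illustration, not the actual code or prompt output: the tickers, probabilities, and the `form_portfolios` helper are all made up.

```python
# Hypothetical sketch: given an LLM's predicted probability that each
# randomly chosen ticker outperforms the Russell 2000, take the 5
# highest-probability names and the 5 lowest.
# Tickers and probabilities below are invented for illustration.

def form_portfolios(probs, n=5):
    """probs: dict mapping ticker -> predicted P(outperform benchmark)."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:n], ranked[-n:]  # (highest-probability, lowest-probability)

example = {"AAA": 0.72, "BBB": 0.31, "CCC": 0.55, "DDD": 0.18,
           "EEE": 0.64, "FFF": 0.47, "GGG": 0.83, "HHH": 0.29,
           "III": 0.51, "JJJ": 0.40, "KKK": 0.66, "LLL": 0.22}
high, low = form_portfolios(example, n=5)
print(high)  # 5 tickers with the highest predicted probability
print(low)   # 5 tickers with the lowest
```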

I plan to rebalance weekly; this is the first week.


High probability of outperformance according to Gemini (3/5 correct):

Likely to underperform according to Gemini (4/5 correct):

Accuracy 70% so far (the Russell 2000 was little changed, so checking for outperformance gives the same result as positive vs. negative returns). Early, but not a bad start as far as accuracy goes.

I don't mean to select data to present an overly optimistic picture. For the 50 stocks (not just the top and bottom 5) the accuracy was just 50% (consistent with random chance) and the Brier skill score was slightly negative. Still early.
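For anyone unfamiliar with the Brier skill score mentioned above, here is a stdlib-only sketch. The reference forecast is taken to be the base rate of outperformance in the sample (one common choice); the probabilities and outcomes below are invented for illustration, not my actual data.

```python
# Brier skill score sketch: BSS = 1 - BS / BS_ref, where BS_ref is the
# Brier score of a naive forecast that always predicts the base rate.
# BSS > 0 means the probabilities beat the naive forecast; BSS < 0 means
# they did worse. All numbers below are illustrative.

def brier_score(probs, outcomes):
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_skill_score(probs, outcomes):
    base_rate = sum(outcomes) / len(outcomes)
    bs_ref = brier_score([base_rate] * len(outcomes), outcomes)
    return 1.0 - brier_score(probs, outcomes) / bs_ref

probs    = [0.8, 0.7, 0.6, 0.4, 0.3]  # LLM-predicted P(outperform)
outcomes = [1,   0,   1,   0,   1]    # 1 = stock actually outperformed
print(round(brier_skill_score(probs, outcomes), 3))  # -0.117, slightly negative
```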


Great idea.
I did something similar for selecting country ETFs and launched a live portfolio. We have ~50 tradeable country ETFs.
As for the prompt, I asked Gemini for an optimised CoT prompt for this task.

Provide an example of a proper and optimised prompt that will return a list of 5 global country-specific ETFs (out of the ETF list attached) that are likely to perform the best over the next 3-6 months. What should be a proper prompt using CoT (chain of thought)? Please follow a structured reasoning process based on publicly available macroeconomic trends, sector performance, and market conditions / sentiment.

I provided only the list of 44 ETFs from Country ETF and ETN Data | International Funds | Seeking Alpha.
Gemini and CG parsed all the publicly available data (incl. World Bank databases) and returned a ranked list of the 44 ETFs.

Here is my ranking based on the Gemini and CG responses. Interestingly, CG ranked China in the lowest position (Hmm).



Excellent! Seeing such high agreement, i.e. a strong Spearman rank correlation between the LLMs' rankings, is definitely encouraging. It suggests their CoT reasoning is anchored in sound financial principles. The model-averaging approach is a nice touch as well.
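For reference, the Spearman rank correlation between two models' rankings can be computed with the standard formula rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid when there are no ties. The rankings below are invented for illustration, not the actual Gemini/CG output.

```python
# Spearman rank correlation sketch for two models' rankings of the same
# ETFs (1 = best rank, no ties). Rankings below are illustrative only.

def spearman_rho(rank_a, rank_b):
    """rank_a, rank_b: dicts mapping ETF -> rank (1 = best), no ties."""
    n = len(rank_a)
    d2 = sum((rank_a[etf] - rank_b[etf]) ** 2 for etf in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))

gemini = {"EWJ": 1, "EWG": 2, "MCHI": 3, "INDA": 4, "EWZ": 5}
cg     = {"EWJ": 2, "EWG": 1, "MCHI": 3, "INDA": 5, "EWZ": 4}
print(spearman_rho(gemini, cg))  # 0.8, i.e. strong agreement
```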

Interesting. Have you considered using a geometric (product) rank instead of, or in addition to, the arithmetic one? It would give weight to a low rank being present and could rearrange the overall ranking a fair bit.

Cheers,
Rich

I appreciate this point, but I'm hung up on one critical issue: where is the hypothesis testing behind these ideas? How do we know its suggestions have any veracity?

I've always thought a scraper for social media was by far the most promising use of these models, because a lot has already been done testing their ability to gauge sentiment from text.

The question is whether the predictive power of media sentiment indicators has any practical value. Previous evidence points in the direction of “mediocre predictive power, but not practical”.

That’s a good point. I was too loose with my wording. Using LLMs to gauge the sentiment of any text is the most obvious value of this new technology.

Reading Seeking alpha, for example, is a part of that but not necessarily the most useful.

Thanks for pointing that out!

The purpose of this post initially was to add an example of Gemini Flash 2.0’s analysis of a ticker giving some idea of what it can search for and find on the web. And people could then judge how well it did with the search and analysis on their own.

But Gemini Flash 2.0 with Apps' analysis, using my CoT prompt, turns out to be too detailed for practical formatting in a P123 post. The CoT response was long enough that it didn't fit within P123's code fencing for proper readability in this post.

The Challenge of Citing Literature

One issue with referencing existing research is that nothing I can find truly replicates what we do at P123, or an LLM's ability to search for and assimilate a HUGE amount of data over a short period: financial statements, analysts' recommendations, news, insider activity, technical analysis, sector comparisons, idiosyncratic considerations such as avian flu affecting egg prices for ticker CALM, legal risks (lawsuits), macro data, etc. I did not see any studies that let an LLM look at all of that. Additionally, LLMs have exceptional memory for historical details, which makes any backtest highly susceptible to data leakage. This raises an important question:

Are they actually predicting future performance, or just recalling how a ticker has performed in the past when doing a retrospective study?

For example, prompts like "Pretend you don’t remember how NVDA actually performed when making your prediction" don’t work. LLMs don’t "forget" information in the way a human would, making it tricky to separate genuine prediction from historical recall.

So I'll skip any discussion of the literature I am aware of when it comes to predicting whether LLM analysis can be useful for our purposes at P123. I don't think what I have read in the literature answers my question.

Obviously, the null hypothesis (an LLM's predicted probability that a stock will outperform the Russell 2000 provides no meaningful predictive value and is not useful for stock selection in a portfolio) would have to be accepted based on the data I have now.
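That "accept the null" call can be made concrete with an exact binomial test on the hit rate: treat each stock's outperform/underperform call as a Bernoulli trial and test against p = 0.5. A stdlib-only sketch; the 25-of-50 count is illustrative, chosen to match the ~50% accuracy on all 50 stocks reported earlier.

```python
# Two-sided exact binomial test: sums the probability of every outcome
# as likely or less likely than the observed count under the null p = 0.5.
# Counts below are illustrative, not my exact data.
from math import comb

def binomial_p_value(successes, trials, p=0.5):
    pmf = lambda k: comb(trials, k) * p**k * (1 - p)**(trials - k)
    observed = pmf(successes)
    return sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed + 1e-12)

print(round(binomial_p_value(25, 50), 3))  # 1.0: no evidence against the null
```

With 25 correct calls out of 50 the p-value is 1.0, so the null cannot be rejected; something like 40 of 50 would be needed before the test spoke clearly.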

India has gone down the toilet since Sep. Not sure this is working?

One problem with LLMs is that they cannot be backtested easily, so people have to gather a lot of data to draw any meaningful conclusions. This takes time and patience. I am not sure anyone (including me) has had enough time since the first post in this thread to draw any meaningful conclusions from the data they have.

Here is my own data about Grok 3 predictions. I have switched to Grok 3 DeepSearch over Gemini. Grok 3 was not available at the time of my first post.

5 stocks with highest probability of positive returns (out of 20) as per Grok 3:

5 stocks with lowest probability of positive returns (out of 20) as per Grok 3:

I am not going to say Grok 3 DeepSearch does well (or badly) based on this small amount of data. It would be difficult for anyone to have gathered enough data in the time since I first posted to form any meaningful impression of whether the results have been good or bad so far.


Looks like the first portfolio lost money. Mostly winners but the few losers lost big.

Maybe it is better at shorts! See how they go!

Gemini 2.5 has just been launched. Looks really good.


In reality, the problem of recall is not serious; because of the attention mechanism itself, an LLM's "forgetting" is not far removed from a human's.

The real problem is overfitting.

Thanks! Gemini 2.5 has a huge context window and good results on the benchmarks as you know.