Gemini LLM accurate so far (but you would not have made money)

I started randomly selecting 50 stocks then asking Gemini Flash 2.0 with apps to predict the probability that the randomly chosen stocks would outperform the Russell 2000 using a long chain-of-thought prompt. Finally, I formed a portfolio of the 5 stocks with the highest probability of outperforming according to Gemini and another portfolio of the 5 stocks with the lowest probability of outperforming.
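The selection step described above can be sketched in a few lines. This is a hypothetical illustration, not the actual code or prompt output: the tickers, probabilities, and the `form_portfolios` helper are all made up.

```python
# Hypothetical sketch: given an LLM's predicted probability that each
# randomly chosen ticker outperforms the Russell 2000, take the 5
# highest-probability names and the 5 lowest.
# Tickers and probabilities below are invented for illustration.

def form_portfolios(probs, n=5):
    """probs: dict mapping ticker -> predicted P(outperform benchmark)."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:n], ranked[-n:]  # (highest-probability, lowest-probability)

example = {"AAA": 0.72, "BBB": 0.31, "CCC": 0.55, "DDD": 0.18,
           "EEE": 0.64, "FFF": 0.47, "GGG": 0.83, "HHH": 0.29,
           "III": 0.51, "JJJ": 0.40, "KKK": 0.66, "LLL": 0.22}
high, low = form_portfolios(example, n=5)
print(high)  # 5 tickers with the highest predicted probability
print(low)   # 5 tickers with the lowest
```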

I plan to rebalance weekly; this is the first week.


High probability of outperformance according to Gemini (3/5 correct):

Likely to underperform according to Gemini (4/5 correct):

Accuracy 70% so far (the Russell 2000 was little changed, so checking for outperformance gives the same result as positive vs. negative returns). Early, but not a bad start as far as accuracy goes.

I don't mean to select data to present an overly optimistic picture. For the 50 stocks (not just the top and bottom 5) the accuracy was just 50% (consistent with random chance) and the Brier skill score was slightly negative. Still early.
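For anyone unfamiliar with the Brier skill score mentioned above, here is a stdlib-only sketch. The reference forecast is taken to be the base rate of outperformance in the sample (one common choice); the probabilities and outcomes below are invented for illustration, not my actual data.

```python
# Brier skill score sketch: BSS = 1 - BS / BS_ref, where BS_ref is the
# Brier score of a naive forecast that always predicts the base rate.
# BSS > 0 means the probabilities beat the naive forecast; BSS < 0 means
# they did worse. All numbers below are illustrative.

def brier_score(probs, outcomes):
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_skill_score(probs, outcomes):
    base_rate = sum(outcomes) / len(outcomes)
    bs_ref = brier_score([base_rate] * len(outcomes), outcomes)
    return 1.0 - brier_score(probs, outcomes) / bs_ref

probs    = [0.8, 0.7, 0.6, 0.4, 0.3]  # LLM-predicted P(outperform)
outcomes = [1,   0,   1,   0,   1]    # 1 = stock actually outperformed
print(round(brier_skill_score(probs, outcomes), 3))  # -0.117, slightly negative
```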


Great idea.
I did something similar for selecting country ETFs and launched a live portfolio. We have ~50 tradeable country ETFs.
As for the prompt, I asked Gemini for an optimised CoT prompt for this task.

Provide an example of a proper and optimised prompt that will return a list of 5 global country-specific ETFs (out of the ETF list attached) that are likely to perform the best over the next 3-6 months. What should be a proper prompt using CoT (chain of thought)? Please follow a structured reasoning process based on publicly available macroeconomic trends, sector performance, and market conditions / sentiment.

I provided only the list of 44 ETFs from Country ETF and ETN Data | International Funds | Seeking Alpha.
Gemini and CG parsed all the publicly available data (incl. World Bank databases) and returned a ranked list of the 44 ETFs.

Here is my ranking based on the Gemini and CG responses. Interestingly, CG ranked China in the lowest position (Hmm).



Excellent! Seeing such high agreement, i.e. a strong Spearman rank correlation between the LLMs' rankings, is definitely encouraging. It suggests their CoT reasoning is anchored in sound financial principles. The model-averaging approach is a nice touch as well.
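For reference, the Spearman rank correlation between two models' rankings can be computed with the standard formula rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid when there are no ties. The rankings below are invented for illustration, not the actual Gemini/CG output.

```python
# Spearman rank correlation sketch for two models' rankings of the same
# ETFs (1 = best rank, no ties). Rankings below are illustrative only.

def spearman_rho(rank_a, rank_b):
    """rank_a, rank_b: dicts mapping ETF -> rank (1 = best), no ties."""
    n = len(rank_a)
    d2 = sum((rank_a[etf] - rank_b[etf]) ** 2 for etf in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))

gemini = {"EWJ": 1, "EWG": 2, "MCHI": 3, "INDA": 4, "EWZ": 5}
cg     = {"EWJ": 2, "EWG": 1, "MCHI": 3, "INDA": 5, "EWZ": 4}
print(spearman_rho(gemini, cg))  # 0.8, i.e. strong agreement
```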

Interesting. Have you considered using a geometric (product) rank instead of, or in addition to, the arithmetic one? It would give weight to a low rank being present and could rearrange the overall ranking a fair bit.

Cheers,
Rich

I appreciate this point, but I'm hung up on one critical issue: where is the hypothesis testing behind these ideas? How do we know its suggestions have any veracity?

I've always thought a scraper for social media was by far the most promising use of these models, because a lot has already been done testing their ability to gauge sentiment from text.

The question is whether the predictive power of media sentiment indicators has any practical value. Previous evidence points in the direction of “mediocre predictive power, but not practical”.

That’s a good point. I was too loose with my wording. Using LLMs to gauge the sentiment of any text is the most obvious value of this new technology.

Reading Seeking alpha, for example, is a part of that but not necessarily the most useful.

Thanks for pointing that out!

The purpose of this post initially was to add an example of Gemini Flash 2.0’s analysis of a ticker giving some idea of what it can search for and find on the web. And people could then judge how well it did with the search and analysis on their own.

But Gemini Flash 2.0 with Apps' analysis, using my CoT prompt, turns out to be too detailed for practical formatting in a P123 post. The CoT response was long enough that it didn't fit within P123's code fencing for proper readability in this post.

The Challenge of Citing Literature

One issue with referencing existing research is that nothing I can find truly replicates what we do at P123, or an LLM's ability to search for and assimilate a HUGE amount of data over a short period: financial statements, analysts' recommendations, news, insider activity, technical analysis, sector comparisons, idiosyncratic considerations such as avian flu affecting egg prices for ticker CALM, legal risks (lawsuits), macro data, etc. I did not see any studies that let an LLM look at all of that. Additionally, LLMs have exceptional memory for historical details, which makes any backtest highly susceptible to data leakage. This raises an important question:

Are they actually predicting future performance, or just recalling how a ticker has performed in the past when doing a retrospective study?

For example, prompts like "Pretend you don’t remember how NVDA actually performed when making your prediction" don’t work. LLMs don’t "forget" information in the way a human would, making it tricky to separate genuine prediction from historical recall.

So I'll skip any discussion of the literature I am aware of when it comes to predicting whether LLM analysis can be useful for our purposes at P123. I don't think what I have read in the literature answers my question.

Obviously, the null hypothesis (an LLM's predicted probability that a stock will outperform the Russell 2000 provides no meaningful predictive value and is not useful for stock selection in a portfolio) would have to be accepted based on the data I have now.
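That "accept the null" call can be made concrete with an exact binomial test on the hit rate: treat each stock's outperform/underperform call as a Bernoulli trial and test against p = 0.5. A stdlib-only sketch; the 25-of-50 count is illustrative, chosen to match the ~50% accuracy on all 50 stocks reported earlier.

```python
# Two-sided exact binomial test: sums the probability of every outcome
# as likely or less likely than the observed count under the null p = 0.5.
# Counts below are illustrative, not my exact data.
from math import comb

def binomial_p_value(successes, trials, p=0.5):
    pmf = lambda k: comb(trials, k) * p**k * (1 - p)**(trials - k)
    observed = pmf(successes)
    return sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed + 1e-12)

print(round(binomial_p_value(25, 50), 3))  # 1.0: no evidence against the null
```

With 25 correct calls out of 50 the p-value is 1.0, so the null cannot be rejected; something like 40 of 50 would be needed before the test spoke clearly.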

India has gone down the toilet since Sep. Not sure this is working?

One problem with LLMs is that they cannot be backtested easily, so people have to gather a lot of data to draw any meaningful conclusions. This takes time and patience. I am not sure anyone (including me) has had enough time since the first post in this thread to draw any meaningful conclusions from the data they have.

Here is my own data about Grok 3 predictions. I have switched to Grok 3 DeepSearch over Gemini. Grok 3 was not available at the time of my first post.

5 stocks with highest probability of positive returns (out of 20) as per Grok 3:

5 stocks with lowest probability of positive returns (out of 20) as per Grok 3:

I am not going to say Grok 3 DeepSearch does well (or badly) based on this small amount of data. It would be difficult for anyone to have gathered enough data in the time since I first posted to form any meaningful impression of whether the results have been good or bad so far.


Looks like the first portfolio lost money. Mostly winners but the few losers lost big.

Maybe it is better at shorts! See how they go!

Gemini 2.5 has just been launched. Looks really good.


In reality, the problem of recall is not serious; because of the attention mechanism itself, an LLM's "forgetting" is not far removed from a human's.

The real problem is overfitting.

Thanks! Gemini 2.5 has a huge context window and good results on the benchmarks as you know.