Has anyone successfully tried other LLM models? Such as FinGPT, FinBERT, or Llama 2?

I came across this.

Here is a summary from the study:

Optimized Multi-LLM Strategy for Investment Analysis and Development

1. Ideation Phase

Best Choice: ChatGPT

  • Reason: Exceptional at creative thinking and generating novel investment hypotheses.
  • Example: ChatGPT effectively synthesizes abstract concepts, such as market psychology, with technical indicators.
  • Complement: Claude’s strong logical reasoning capabilities make it ideal for validating investment ideas.

2. Research Phase

Best Choice: Gemini

  • Reason: Superior integration with datasets and more precise factual recall.
  • Example: Analyzes market data and research papers with higher accuracy.
  • Complement: Claude excels in detailed literature reviews and methodology analysis.

3. Backtesting

Best Choice: Gemini

  • Reason: Superior integration with computational tools and data analysis.
  • Example: Handles large datasets and complex statistical calculations more efficiently.
  • Complement: GPT-4 assists in interpreting results and suggesting refinements.

4. Strategy Optimization

Best Choice: Gemini

  • Reason: More precise numerical optimization and better handling of multiple constraints.
  • Example: Excels in parameter tuning and risk management calculations.
  • Complement: Claude provides rigorous mathematical proofs for optimization approaches.
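The parameter tuning mentioned above can be sketched as a simple grid search. This is a toy illustration, not from the study: `backtest_score` is a hypothetical stub standing in for a real backtest that would replay history and return a Sharpe-like score.

```python
import itertools

# Hypothetical backtest stub: returns a made-up score for a
# (stop_loss, position_size) pair; a real implementation would
# replay historical data and compute a risk-adjusted return.
def backtest_score(stop_loss, position_size):
    return -(stop_loss - 0.05) ** 2 - (position_size - 0.10) ** 2

# Exhaustively score every parameter combination and keep the best.
grid = itertools.product([0.03, 0.05, 0.08], [0.05, 0.10, 0.20])
best = max(grid, key=lambda p: backtest_score(*p))
print(best)  # (0.05, 0.1) for this toy objective
```

With a real backtest in place of the stub, the same loop is where overfitting risk creeps in, which is why the study pairs the optimizer with a second model for validation.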

5. Trading Strategy Design

Split Approach:

  • Gemini: For quantitative components.
  • ChatGPT: For strategy logic and market condition analysis.
  • Example: Use Gemini for precise signal generation and ChatGPT for adapting to market regimes.
  • Complement: Claude’s logical reasoning helps identify edge cases in strategies.

6. Simulation

Best Choice: Gemini

  • Reason: Better at managing complex numerical simulations and integrating with external tools.
  • Example: Delivers more accurate Monte Carlo simulations and stress testing.
  • Complement: Claude validates simulation assumptions.
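The Monte Carlo stress testing mentioned above can be sketched in a few lines. This is a minimal illustration assuming a geometric-Brownian-motion price model with made-up drift and volatility numbers; it is not from the study.

```python
import math
import random
import statistics

def monte_carlo_terminals(s0, mu, sigma, days, n_paths, seed=42):
    """Simulate terminal prices under a simple geometric Brownian motion."""
    rng = random.Random(seed)
    dt = 1 / 252  # one trading day in years
    terminals = []
    for _ in range(n_paths):
        price = s0
        for _ in range(days):
            z = rng.gauss(0.0, 1.0)
            price *= math.exp((mu - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * z)
        terminals.append(price)
    return terminals

paths = monte_carlo_terminals(s0=100.0, mu=0.07, sigma=0.20, days=252, n_paths=2000)
print(round(statistics.mean(paths), 1))  # should land near 100 * e^0.07 ≈ 107
```

A stress test would rerun this with shocked parameters (e.g. doubled sigma) and look at the left tail of the terminal distribution rather than the mean.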

7. Execution Modeling

Best Choice: Gemini

  • Reason: Superior real-time data handling and optimization capabilities.
  • Example: Minimizes transaction costs through real-time market data analysis.
  • Complement: Claude aids in transaction cost analysis methodology.

Key Recommendations

1. Employ a Multi-LLM Approach

  • Gemini: For quantitative tasks and real-time analysis.
  • ChatGPT: For creative strategy development and market narrative analysis.
  • Claude: For rigorous methodology validation and logical reasoning.

2. Leverage Complementary Strengths

  • Gemini: Utilize its search integration for market data.
  • ChatGPT: Apply its natural language processing for news and sentiment analysis.
  • Claude: Use its analytical capabilities for strategy validation.

3. Build an Integration Layer

  • Develop APIs to coordinate between different LLMs.
  • Create validation frameworks to cross-check results.
  • Implement feedback loops for continuous improvement.
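The integration layer described above could be as simple as a router that sends each task type to its preferred model and fans a prompt out to all models for cross-checking. This is a hedged sketch: the lambdas below are hypothetical stand-ins for real API clients (OpenAI, Anthropic, Google), and the route names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# An LLM client is anything that takes a prompt and returns an answer string.
LLMClient = Callable[[str], str]

@dataclass
class Router:
    clients: Dict[str, LLMClient]  # model name -> client
    routes: Dict[str, str]         # task type -> preferred model name

    def ask(self, task_type: str, prompt: str) -> str:
        """Send the prompt to the model preferred for this task type."""
        return self.clients[self.routes[task_type]](prompt)

    def cross_check(self, prompt: str) -> Dict[str, str]:
        """Send the same prompt to every model so answers can be compared."""
        return {name: client(prompt) for name, client in self.clients.items()}

# Toy demo with canned responses in place of real API calls.
router = Router(
    clients={
        "gemini": lambda p: "quant answer",
        "chatgpt": lambda p: "narrative answer",
        "claude": lambda p: "validated answer",
    },
    routes={"quant": "gemini", "ideation": "chatgpt", "validation": "claude"},
)
print(router.ask("quant", "Optimize portfolio weights"))  # quant answer
print(sorted(router.cross_check("Is this signal robust?")))
```

A validation framework would then diff the `cross_check` answers and flag disagreements for human review, which is the feedback loop the recommendation calls for.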

4. Risk Management

  • Cross-validate critical decisions using multiple LLMs.
  • Ensure human oversight for key strategy components.
  • Regularly audit the performance of LLM recommendations.

I thought I would have some fun with this, asking ChatGPT 4o to do a self-assessment. Here is its response. (It appears that modesty isn't one of its strengths.)

Here's how I would rate myself across the stock investment phases:

Ideation: 4
I excel at generating ideas based on market trends, sectors, and themes. I can analyze macroeconomic indicators and historical trends to identify opportunities, but I lack the human intuition and creativity that some expert investors might have.

Research: 5
Research is my strongest area. I can quickly aggregate and analyze large amounts of data, including company fundamentals, technical indicators, financial ratios, and news, to provide in-depth insights.

Backtesting: 4
I can help create backtesting frameworks by writing code to simulate the historical performance of strategies. However, I depend on user-defined rules and data to execute backtests, as I don't have direct access to live financial systems or historical databases.

Optimization: 4
I can assist in optimizing strategies by running simulations with various parameter tweaks (e.g., stop-loss levels, asset allocation, etc.). However, optimization depends heavily on the input data and requires careful scrutiny to avoid overfitting, where human insight can complement my work.

Strategy Design: 5
I am highly capable of designing algorithmic or discretionary investment strategies based on user goals (e.g., risk tolerance, time horizon, or market conditions). My systematic and data-driven approach is a strong asset here.

Simulation: 4
I can simulate strategies in hypothetical environments with clear rules and parameters. While accurate for testing concepts, simulations depend on user-provided or external datasets for realism.

Execution: 3
I can assist in designing execution algorithms or workflows but cannot directly execute trades. Execution also involves real-time decision-making and trade management, which I can simulate but not perform live.

Overall, I'm strongest in Research, Strategy Design, and Optimization, but my practical limitations lie in Execution and relying on external systems for data and backtesting.


Then I asked ChatGPT to rate the other services. Amazingly, it did so.

Here's a general assessment and rating of Gemini, Claude, BloombergGPT, FinGPT, FinBERT, FinMA, AUCARENA, and Llama 2 for the investment phases. The ratings range from 1 (poor) to 5 (excellent) and are based on their capabilities as described in public information:

1. Ideation

| Model | Rating | Explanation |
| --- | --- | --- |
| Gemini | 4 | Strong at generating innovative investment ideas based on data-driven insights. |
| Claude | 3 | General-purpose but less focused on financial-specific ideation. |
| BloombergGPT | 5 | Built for financial data and market insights, making it highly capable of generating investment themes. |
| FinGPT | 5 | Designed for financial tasks, adept at ideating based on market data and trends. |
| FinBERT | 3 | More focused on sentiment analysis, not ideation. |
| FinMA | 4 | Provides thematic and data-driven insights for ideation. |
| AUCARENA | 3 | Focused on financial workflows but less on creative ideation. |
| Llama 2 | 3 | Good for general ideation but lacks specialization in financial markets. |

2. Research

| Model | Rating | Explanation |
| --- | --- | --- |
| Gemini | 5 | Excellent at aggregating and synthesizing large amounts of financial data. |
| Claude | 4 | Good at general research, but lacks domain-specific financial focus. |
| BloombergGPT | 5 | Exceptional for financial research, leveraging Bloomberg's financial datasets. |
| FinGPT | 5 | Designed for financial research, excellent at data aggregation and analysis. |
| FinBERT | 4 | Strong in sentiment analysis but not comprehensive in financial research. |
| FinMA | 4 | Good at providing market-related insights, but more niche-focused. |
| AUCARENA | 3 | General financial data support, but less specialized for in-depth research. |
| Llama 2 | 3 | General research capabilities, not tailored to financial markets. |

3. Backtesting

| Model | Rating | Explanation |
| --- | --- | --- |
| Gemini | 4 | Can assist in coding backtesting frameworks but relies on user inputs and tools like Python. |
| Claude | 3 | Limited capabilities for quantitative backtesting. |
| BloombergGPT | 4 | Strong integration with financial data, useful for backtesting but requires user-defined strategies. |
| FinGPT | 5 | Excellent at quantitative analysis and building backtesting models with data integrations. |
| FinBERT | 2 | Not built for backtesting, focused on text and sentiment analysis. |
| FinMA | 4 | Can simulate backtesting but depends on external tools. |
| AUCARENA | 3 | Supports financial workflows but limited in robust backtesting frameworks. |
| Llama 2 | 2 | Not designed for financial backtesting tasks. |

4. Optimization

| Model | Rating | Explanation |
| --- | --- | --- |
| Gemini | 4 | Strong for algorithmic optimization but depends on external inputs and datasets. |
| Claude | 3 | Limited financial-specific optimization capabilities. |
| BloombergGPT | 5 | Excellent at optimizing financial strategies with integrated datasets. |
| FinGPT | 5 | Tailored for financial optimization tasks. |
| FinBERT | 2 | Not designed for optimization tasks. |
| FinMA | 4 | Good at specific optimization tasks but less versatile. |
| AUCARENA | 3 | Focused on process efficiency rather than deep financial optimization. |
| Llama 2 | 2 | General optimization capabilities, not financial-specific. |

5. Strategy Design

| Model | Rating | Explanation |
| --- | --- | --- |
| Gemini | 5 | Excellent at designing investment strategies based on user-defined rules and data. |
| Claude | 4 | Good for general strategy design but less domain-specific. |
| BloombergGPT | 5 | Tailored for financial strategy design with deep market insights. |
| FinGPT | 5 | Built for financial strategy development and modeling. |
| FinBERT | 3 | Can provide sentiment-driven insights but limited in strategy design. |
| FinMA | 4 | Good for specific financial strategy design tasks. |
| AUCARENA | 3 | Supports workflows but not highly focused on strategy creation. |
| Llama 2 | 3 | Capable of general design but not financial-specific. |

6. Simulation

| Model | Rating | Explanation |
| --- | --- | --- |
| Gemini | 4 | Can simulate strategies using external data and coding platforms. |
| Claude | 3 | Limited to theoretical simulations without deep financial integration. |
| BloombergGPT | 5 | Excels at simulating financial strategies using integrated Bloomberg datasets. |
| FinGPT | 5 | Tailored for robust financial simulations. |
| FinBERT | 3 | Focused on sentiment, limited in broader simulation capabilities. |
| FinMA | 4 | Good for specific market simulations but less versatile. |
| AUCARENA | 3 | Supports workflows but limited in advanced simulations. |
| Llama 2 | 2 | Not designed for detailed financial simulations. |

7. Execution

| Model | Rating | Explanation |
| --- | --- | --- |
| Gemini | 3 | Can help design execution frameworks but cannot execute trades. |
| Claude | 2 | Limited execution capabilities, theoretical support only. |
| BloombergGPT | 5 | Directly integrates with Bloomberg systems, making execution efficient. |
| FinGPT | 5 | Excellent for designing and supporting execution algorithms. |
| FinBERT | 2 | Not designed for execution tasks. |
| FinMA | 3 | Supports execution workflows but lacks direct trading functionality. |
| AUCARENA | 3 | Workflow-focused, not execution-driven. |
| Llama 2 | 2 | General-purpose, not equipped for live trade execution. |

Summary

  • Best for Financial Tasks Overall: BloombergGPT, FinGPT.
  • Best for Sentiment Analysis: FinBERT.
  • Best for Strategy Design and Research: Gemini, BloombergGPT, FinGPT.
  • Execution and Integration Leader: BloombergGPT.

Each tool has its strengths, so the best choice depends on your specific phase and use case!

How does this compare to the study? ChatGPT seems to like BloombergGPT and FinGPT a lot.

BloombergGPT has gone dark after failing to outperform!

From Ethan Mollick, Associate Professor at Wharton:

"This remains one of the most consequential experiments in AI: Bloomberg spent over $10M training a GPT-3.5-class AI on their own financial data last year…

…only to find that GPT-4 8k, the AI available to billions of people around the world, and without specialized finance training, beat it on almost all finance tasks!

There was a moment when we thought proprietary data would let organizations train specialized AIs that could compete with frontier models; it turns out that probably isn't going to happen. The largest frontier models are just much better at most complex tasks than smaller models."



A new open-source LLM to consider. A NY Times article: How Chinese A.I. Start-Up DeepSeek Is Competing With Silicon Valley Giants

I think it deserves a mention because of this from the NY Times article: "DeepSeek is run by a quantitative stock trading firm called High Flyer."

So: potentially designed from the outset for quantitative finance, open source, more computationally efficient than ChatGPT, with less code, and on par with ChatGPT on most metrics.

Mixture-of-Experts (MoE) Architecture:

  • Like Grok, which I find to be super fast, DeepSeek employs a Mixture-of-Experts model, a cutting-edge approach that activates only a subset of its parameters for any given task. This makes it computationally efficient while maintaining high performance.
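
The MoE idea described above can be illustrated with a toy gating layer. This is a minimal sketch, not DeepSeek's actual architecture: random gate weights and trivial stand-in "experts" are invented here purely to show how only the top-k experts are evaluated per input.

```python
import math
import random

random.seed(0)

# Toy mixture-of-experts layer: 8 experts, but only the top-2 scoring
# experts run for any given input, so most parameters stay inactive.
N_EXPERTS, TOP_K, DIM = 8, 2, 4
gate_w = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_EXPERTS)]
experts = [lambda x, i=i: [v * (i + 1) for v in x] for i in range(N_EXPERTS)]

def moe_forward(x):
    # Gate scores every expert, but we only evaluate the top-k of them.
    scores = [sum(w * v for w, v in zip(row, x)) for row in gate_w]
    top = sorted(range(N_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    exp_s = [math.exp(scores[i]) for i in top]
    weights = [s / sum(exp_s) for s in exp_s]  # softmax over the chosen experts
    out = [0.0] * DIM
    for w, i in zip(weights, top):
        for d, v in enumerate(experts[i](x)):
            out[d] += w * v
    return out, top

out, active = moe_forward([1.0, -0.5, 0.3, 0.7])
print(len(active))  # only TOP_K of the 8 experts were evaluated
```

The efficiency win is exactly this: the forward pass touches 2 of 8 experts, so compute scales with the active subset rather than total parameter count.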

CNBC

Fortune. Use the reader with Safari to bypass the paywall.

" It’s David versus Goliath in the world of AI.

On one side: The U.S. and its new Stargate Project, a massive $500 billion pledge with backing from the biggest names in tech to build data centers across the U.S. On the other, a scrappy Chinese startup that’s released an impressive open-source model trained on a shoestring budget."

Wow! This today at CNBC: Nvidia falls 11% in premarket trading as China’s DeepSeek triggers global tech sell-off

The end of the bubble or just a one-day thing? It does seem like OpenAI will copy DeepSeek's Mixture-of-Experts (MoE) architecture, which is open source (if OpenAI wants to copy anything), and fewer GPUs will be needed.

People will be rethinking the Stargate Project, I would hope, even if they go ahead with a revised plan.

The Global AI Index by Tortoise still ranks the US at 100, China at 54, and tiny Singapore at 32.

But I believe a large part of the US number is based on infrastructure rather than the number of AI researchers and developers. A few years ago I started noticing the very large number of AI open-source software downloads going into China, and the more than 2,000 Chinese STEM students who were here before Florida's governor DeSantis signed a law halting all Chinese researchers and students.

DeepSeek was far better at answering my questions about Adobe Lightroom Classic than ChatGPT 4.0, FWIW. I really like it so far.

A couple of good pieces to put things in perspective: DeepSeek FAQ – Stratechery by Ben Thompson and a thread on Twitter by Karen Hao

Me, I'm happy because my put-options-based hedge appreciated 43% today (I think). APLD (down 28%) was a big part of that.


DeepSeek’s transparency in its "chain-of-reasoning" approach is particularly intriguing, as it often outlines the logic it employs before delivering a final answer, keeping the reasoning process and conclusion distinct.

For example, I received the following response to a question:

…might lead to better generalization.
But I need to check if this logic holds. Maybe in backtesting, …Also, random forests might overfit to spurious patterns in the training data, which don’t hold up in validation. The user’s method, by focusing…

This response spanned a couple of pages of detailed reasoning for a single question. Following this, I received a concise answer—similar to what ChatGPT might generate—along with the logical breakdown.

What’s fascinating is the ability to review the logical steps taken. It’s almost as if the LLM is asking itself a series of questions to systematically work through the problem.

This transparency in process allows users to see exactly how DeepSeek arrives at its conclusions. Articles and discussions have praised this feature, and I see what they mean.
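
Programmatically, the reasoning and the final answer can be kept distinct with a simple split. This is a sketch assuming the `<think>…</think>` convention used by DeepSeek-R1's open-weights release; the sample text below is invented for illustration.

```python
import re

def split_reasoning(raw: str):
    """Split a response into (reasoning, answer), assuming the reasoning
    is wrapped in <think>...</think> as in DeepSeek-R1's open release."""
    m = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if not m:
        return "", raw.strip()  # no reasoning block found
    reasoning = m.group(1).strip()
    answer = raw[m.end():].strip()
    return reasoning, answer

raw = ("<think>Random forests may overfit spurious patterns...</think>"
       "Your method should generalize better.")
reasoning, answer = split_reasoning(raw)
print(answer)  # Your method should generalize better.
```

Keeping the two parts separate is what makes it possible to review (or log and audit) the model's chain of reasoning independently of the concise answer it hands the user.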



DeepSeek has its blind spots . . .


I have tried it with the same questions I posed to ChatGPT. DeepSeek comes up with utter nonsense.

At least it was objective about Portfolio 123. Very positive and pretty informed recommendation: "Portfolio123 is a powerful tool for serious investors who are comfortable with quantitative analysis and strategy development…"

haha, who is "we" ?

There's the proof of a real and present danger if "they" are in charge.

I saw a post on X where they got rid of the censored responses by asking it to answer using special characters for some of the letters (4 for "a", 3 for "e", etc.).

DeepSeek first cracks: super slow?

Below is link of a live video of the thought process of DeepSeek responding to "How many Rs are in the word 'strawberry' ?"

Conclusion: I think the world is safe for now!

It's soooo slow! It takes 1.5 minutes to respond! And what's up with all those self-doubts over a stupidly simple prompt? How long will it take to process a 10-K filing? Or even a press release? Sure, the X thread is about a local DeepSeek deployment on a $6,000 dual-CPU system, so a GPU server would be faster, but still. Not a cheap setup for something that thinks like a three-year-old. And CPUs only make it slower, not dumber.

Thoughts? Or specific use-cases of LLMs for P123 that are low hanging fruits?

And no, I don't think LLMs will be better than our AI Factors, or replace them. After all, numbers don't lie and express no opinions, while LLMs hallucinate and can be easily fooled. But LLMs could be useful for generating new simple data points, and for customer service.

Thanks!

PS. I'm a newbie. I'm starting to familiarize myself with LLMs to see how we can use them, whether to self-host or not, etc. It seems like inference at scale is very expensive. Self-hosting an LLM to process nightly data may be OK, but a live LLM answering user questions is probably better handled through an API. Processing huge amounts of data, like 10-Ks, would require tremendous investment (whether self-hosted or not), with dubious results.

Video

Screenshot of thought process

WTF?


This is a well-known weakness of LLMs. I received a long technical explanation about why they struggle with these simple tasks, but I can't remember the details. In general, though, the more complex the prompt, the better the answer.

For whatever reason, I can't log in to my ChatGPT right now, but normally, simply separating the letters or repeating the word a few times makes it easier for the LLM to handle simple tasks. For example:
"How many 'r's are there in the word 's t r a w b e r r y'?"

Anyway, DeepSeek managed it without complicating the question.
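
Part of why this failure looks so silly is that the task is trivial for deterministic code, and the letter-spacing trick from the prompt above is equally trivial to apply:

```python
word = "strawberry"
print(word.count("r"))  # 3 -- trivial for deterministic code

# The spacing trick: separating the letters gives the model one token
# per character instead of opaque multi-letter tokens.
spaced = " ".join(word)
print(spaced)  # s t r a w b e r r y
```

The usual explanation is that LLMs see tokens rather than characters, so "strawberry" arrives as a couple of opaque chunks; spacing the letters forces a one-token-per-letter view, which is consistent with the trick working.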

The field of artificial intelligence is a perfect fit for China, even more so than the internet sector. Just look at how TikTok beat Meta (and Alphabet) in the video-recommendation modeling race.

This could be because developed countries have less of an established advantage in this field and there are more open-source code and papers to learn from, or because China is intentionally trying to outperform the West here, as it has done with its development of electric cars.


You should reconsider the power of LLMs. The key, if I may suggest, is understanding how they work. The market isn’t dumb—LLMs provide a unique lens to analyze and interpret it in ways that traditional methods might overlook.

There are huge possibilities with P123 as well. Thinking differently may open up entirely new perspectives and opportunities.

For example, here's a system that ChatGPT built for me entirely on its own, without seeing any of the data, and it seems pretty solid. That should say something about the potential here, I think. It's a pretty good cold-ish start if you are brand new to building complex ranking systems.

https://www.portfolio123.com/rank_about.jsp?rankid=506300