I thought I would have some fun with this, asking ChatGPT 4o to do a self-assessment. Here is its response. (It appears that modesty isn't one of its strengths.)
Here's how I would rate myself across the stock investment phases:
Ideation: 4
I excel at generating ideas based on market trends, sectors, and themes. I can analyze macroeconomic indicators and historical trends to identify opportunities, but I lack the human intuition and creativity that some expert investors might have.
Research: 5
Research is my strongest area. I can quickly aggregate and analyze large amounts of data, including company fundamentals, technical indicators, financial ratios, and news, to provide in-depth insights.
Backtesting: 4
I can help create backtesting frameworks by writing code to simulate the historical performance of strategies. However, I depend on user-defined rules and data to execute backtests, as I don't have direct access to live financial systems or historical databases.
Optimization: 4
I can assist in optimizing strategies by running simulations with various parameter tweaks (e.g., stop-loss levels, asset allocation, etc.). However, optimization depends heavily on the input data and requires careful scrutiny to avoid overfitting, where human insight can complement my work.
Strategy Design: 5
I am highly capable of designing algorithmic or discretionary investment strategies based on user goals (e.g., risk tolerance, time horizon, or market conditions). My systematic and data-driven approach is a strong asset here.
Simulation: 4
I can simulate strategies in hypothetical environments with clear rules and parameters. While accurate for testing concepts, simulations depend on user-provided or external datasets for realism.
Execution: 3
I can assist in designing execution algorithms or workflows but cannot directly execute trades. Execution also involves real-time decision-making and trade management, which I can simulate but not perform live.
Overall, I'm strongest in Research, Strategy Design, and Optimization, but my practical limitations lie in Execution and relying on external systems for data and backtesting.
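As a concrete illustration of the Backtesting and Optimization phases described above, here is a toy Python sketch. Everything in it is hypothetical: the returns are random synthetic data and the stop-loss rule is made up. It sweeps a stop-loss parameter in-sample, picks the "best" value, then checks it out-of-sample, which is exactly the kind of exercise where the overfitting risk mentioned above bites.

```python
import random

# Synthetic daily returns (entirely made up); split into an in-sample
# window for optimization and an out-of-sample window for validation.
random.seed(42)
returns = [random.gauss(0.0005, 0.01) for _ in range(1000)]
in_sample, out_sample = returns[:700], returns[700:]

def backtest(daily_returns, stop_loss):
    """Cumulative equity of a naive long strategy: cap each day's loss
    at the stop-loss threshold, then sit flat the following day."""
    equity, sit_out = 1.0, False
    for r in daily_returns:
        if sit_out:                       # flat the day after a stop-out
            sit_out = False
            continue
        if r < -stop_loss:
            equity *= 1.0 - stop_loss     # loss capped at the stop level
            sit_out = True
        else:
            equity *= 1.0 + r
    return equity

# "Optimize": try several stop-loss levels on the in-sample window.
candidates = [0.005, 0.01, 0.02, 0.05]
best = max(candidates, key=lambda sl: backtest(in_sample, sl))

print("best in-sample stop-loss:", best)
print("in-sample equity:  %.3f" % backtest(in_sample, best))
print("out-of-sample equity: %.3f" % backtest(out_sample, best))
```

Because the data is pure noise, whichever parameter wins in-sample carries no real edge, and the out-of-sample number is the honest one. That gap is what careful scrutiny is supposed to catch.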
Then I asked ChatGPT to rate the other services. Amazingly, it did so.
Here's a general assessment and rating of Gemini, Claude, BloombergGPT, FinGPT, FinBERT, FinMA, AUCARENA, and Llama 2 for the investment phases. The ratings range from 1 (poor) to 5 (excellent) and are based on their capabilities as described in public information:
1. Ideation

| Model | Rating | Explanation |
|-------|--------|-------------|
| Gemini | 4 | Strong at generating innovative investment ideas based on data-driven insights. |
| Claude | 3 | General-purpose but less focused on financial-specific ideation. |
| BloombergGPT | 5 | Built for financial data and market insights, making it highly capable of generating investment themes. |
| FinGPT | 5 | Designed for financial tasks, adept at ideating based on market data and trends. |
| FinBERT | 3 | More focused on sentiment analysis, not ideation. |
| FinMA | 4 | Provides thematic and data-driven insights for ideation. |
| AUCARENA | 3 | Focused on financial workflows but less on creative ideation. |
| Llama 2 | 3 | Good for general ideation but lacks specialization in financial markets. |
2. Research

| Model | Rating | Explanation |
|-------|--------|-------------|
| Gemini | 5 | Excellent at aggregating and synthesizing large amounts of financial data. |
| Claude | 4 | Good at general research, but lacks domain-specific financial focus. |
| BloombergGPT | 5 | Exceptional for financial research, leveraging Bloomberg's financial datasets. |
| FinGPT | 5 | Designed for financial research, excellent at data aggregation and analysis. |
| FinBERT | 4 | Strong in sentiment analysis but not comprehensive in financial research. |
| FinMA | 4 | Good at providing market-related insights, but more niche-focused. |
| AUCARENA | 3 | General financial data support, but less specialized for in-depth research. |
| Llama 2 | 3 | General research capabilities, not tailored to financial markets. |
3. Backtesting

| Model | Rating | Explanation |
|-------|--------|-------------|
| Gemini | 4 | Can assist in coding backtesting frameworks but relies on user inputs and tools like Python. |
| Claude | 3 | Limited capabilities for quantitative backtesting. |
| BloombergGPT | 4 | Strong integration with financial data, useful for backtesting but requires user-defined strategies. |
| FinGPT | 5 | Excellent at quantitative analysis and building backtesting models with data integrations. |
| FinBERT | 2 | Not built for backtesting, focused on text and sentiment analysis. |
| FinMA | 4 | Can simulate backtesting but depends on external tools. |
| AUCARENA | 3 | Supports financial workflows but limited in robust backtesting frameworks. |
| Llama 2 | 2 | Not designed for financial backtesting tasks. |
4. Optimization

| Model | Rating | Explanation |
|-------|--------|-------------|
| Gemini | 4 | Strong for algorithmic optimization but depends on external inputs and datasets. |
BloombergGPT has gone dark after failing to outperform!
From Ethan Mollick, Associate Professor at Wharton:
"This remains one of the most consequential experiments in AI: Bloomberg spent over $10M training a GPT-3.5-class AI on their own financial data last year...
...only to find that GPT-4 8k, the AI available to billions of people around the world, and without specialized finance training, beat it on almost all finance tasks!
There was a moment when we thought proprietary data would let organizations train specialized AIs that could compete with frontier models; it turns out that probably isn't going to happen. The largest frontier models are just much better at most complex tasks than smaller models."
I think it deserves a mention because of this from the NY Times article: "DeepSeek is run by a quantitative stock trading firm called High Flyer."
So it was potentially designed from the outset for quantitative finance: open source, more computationally efficient than ChatGPT, with less code, and on par with ChatGPT on most metrics.
Mixture-of-Experts (MoE) Architecture:
Like Grok, which I find to be super-fast, DeepSeek employs a Mixture-of-Experts model, a cutting-edge approach that activates only a subset of its parameters for a given task. This makes it computationally efficient while maintaining high performance.
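The routing idea behind MoE can be sketched in a few lines of Python. This is a toy, not DeepSeek's actual implementation: all dimensions, weights, and the expert count are made up for illustration. A small gating layer scores every expert for each input, but only the top-k experts actually compute, so most parameters stay inactive on any given token.

```python
import math
import random

random.seed(0)
d_model, n_experts, top_k = 8, 4, 2   # toy sizes, purely illustrative

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

W_gate = rand_matrix(d_model, n_experts)                  # gating weights
experts = [rand_matrix(d_model, d_model) for _ in range(n_experts)]

def matvec(M, v):
    """Row-vector times matrix: len(v) == rows of M, output len == cols."""
    return [sum(v[i] * M[i][j] for i in range(len(v)))
            for j in range(len(M[0]))]

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def moe_layer(token):
    scores = softmax(matvec(W_gate, token))   # gate scores every expert...
    chosen = sorted(range(n_experts), key=lambda i: scores[i])[-top_k:]
    total = sum(scores[i] for i in chosen)
    out = [0.0] * d_model
    for i in chosen:                          # ...but only top-k run
        w = scores[i] / total                 # renormalized gate weight
        y = matvec(experts[i], token)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

out = moe_layer([random.gauss(0, 1) for _ in range(d_model)])
print(len(out))  # 8
```

With 4 experts and top-2 routing, half the expert parameters are untouched per token; production MoE models push this much further (many experts, very few active), which is where the compute savings come from.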
Fortune. Use Reader view in Safari to bypass the paywall.
" It’s David versus Goliath in the world of AI.
On one side: The U.S. and its new Stargate Project, a massive $500 billion pledge with backing from the biggest names in tech to build data centers across the U.S. On the other, a scrappy Chinese startup that’s released an impressive open-source model trained on a shoestring budget."
The end of the bubble or just a one-day thing? It does seem like OpenAI will copy DeepSeek's Mixture-of-Experts (MoE) architecture, which is open source (if OpenAI wants to copy anything), and fewer GPUs will be needed.
People will be rethinking the Stargate Project, I would hope, even if they go ahead with a revised plan.
The Global AI Index by Tortoise still ranks countries with the US at 100, China at 54, and tiny Singapore at 32.
But I believe a large part of the US number is based on infrastructure rather than the number of AI researchers and developers. A few years ago I started noticing the very large number of AI open-source software downloads going into China, and the more than 2,000 Chinese STEM students who were here before Florida's governor DeSantis signed a law halting all Chinese researchers and students.
DeepSeek’s transparency in its "chain-of-reasoning" approach is particularly intriguing, as it often outlines the logic it employs before delivering a final answer, keeping the reasoning process and conclusion distinct.
For example, I received the following response to a question:
…might lead to better generalization. But I need to check if this logic holds. Maybe in backtesting, …Also, random forests might overfit to spurious patterns in the training data, which don’t hold up in validation. The user’s method, by focusing…
This response spanned a couple of pages of detailed reasoning for a single question. Following this, I received a concise answer—similar to what ChatGPT might generate—along with the logical breakdown.
What’s fascinating is the ability to review the logical steps taken. It’s almost as if the LLM is asking itself a series of questions to systematically work through the problem.
This transparency in process allows users to see exactly how DeepSeek arrives at its conclusions. Articles and discussions have praised this feature, and I see what they mean.
At least it was objective about Portfolio 123. Very positive and pretty informed recommendation: "Portfolio123 is a powerful tool for serious investors who are comfortable with quantitative analysis and strategy development…"
I saw a post on X where they got rid of the censored responses by asking it to answer using special characters for some of the letters (4 for 'a', 3 for 'e', etc.).
Below is a link to a live video of the thought process of DeepSeek responding to "How many Rs are in the word 'strawberry'?"
Conclusion: I think the world is safe for now!
It's soooo slow! It takes 1.5 minutes to respond! And what's up with all those self-doubts over a stupidly simple prompt? How long will it take to process a 10-K filing? Or even a press release? Sure, the X thread is about a local DeepSeek deployment on a $6,000 dual-CPU system, so a GPU server would be faster, but still. Not a cheap setup for something that thinks like a 3-year-old. And CPUs only make it slower, not dumber.
Thoughts? Or specific use cases of LLMs for P123 that are low-hanging fruit?
And no, I don't think LLMs will be better than our AI Factors, or replace them. After all, numbers don't lie and express no opinions, while LLMs hallucinate and can be easily fooled. But LLMs could be useful for generating new simple data points and for customer service.
Thanks!
PS. I'm a newbie. I'm starting to familiarize myself with LLMs to see how we can use them, whether to self-host or not, etc. It seems like inference at scale is very expensive. Self-hosting an LLM to process nightly data may be OK, but a live LLM answering user questions is probably better served via an API. Processing huge amounts of data, like 10-Ks, will require a tremendous investment (whether self-hosted or not) with dubious results.
This is a well-known weakness of LLMs. I received a long technical explanation about why they struggle with these simple tasks, but I can't remember the details. In general, though, the more complex the prompt, the better the answer.
For whatever reason, I can't log in to my ChatGPT right now, but normally, simply separating the letters or repeating the word a few times makes it easier for the LLM to handle simple tasks. For example: "How many 'r's are there in the word 's t r a w b e r r y'?"
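For comparison, counting letters is trivial for ordinary code; the difficulty is specific to how LLMs see text as subword tokens rather than characters. A two-line Python check, with the spaced-out version mirroring the letter-separation trick above:

```python
# Plain code counts characters directly, so this is trivial.
word = "strawberry"
print(word.count("r"))    # 3

# Separating the letters, as in the prompt trick above, effectively
# hands the model a character-level view of the same word.
spaced = " ".join(word)   # 's t r a w b e r r y'
print(spaced.count("r"))  # 3
```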
Anyway, DeepSeek managed it without complicating the question.
The field of artificial intelligence is a perfect fit for China, even more so than the internet sector. Just look at how TikTok beat Meta (and Alphabet) in the video-recommendation modeling race.
This could be because developed countries have fewer established advantages in this field and there are more open-source code and papers to learn from, or because China is intentionally trying to outperform the West here, as it has done with its development of electric cars.
You should reconsider the power of LLMs. The key, if I may suggest, is understanding how they work. The market isn’t dumb—LLMs provide a unique lens to analyze and interpret it in ways that traditional methods might overlook.
There are huge possibilities with P123 as well. Thinking differently may open up entirely new perspectives and opportunities.
For example, here's a system that ChatGPT built for me entirely on its own, without seeing any of the data, and it seems pretty solid. That should say something about the potential here, I think. A pretty good cold-ish start if you're brand new to building complex ranking systems.