A few months ago I asked an LLM to generate twenty trading strategies.
Fourteen were the same thing.
Not similar ideas. Not variations on a theme. The same mean-reversion logic with different lookback windows and parameter names.
I gave it historical price data, told it to find patterns, output entry/exit rules in Python. Ten minutes later I had twenty strategies. Clean code, proper docstrings, sensible-looking parameters.
I backtested all twenty. Twelve looked profitable. Some showed 200%+ annual returns.
Then I actually read the code.
Same structure. Same assumptions. Same failure mode: in a trending market, they'd all keep buying into a falling asset with no awareness anything had changed.
That's when I stopped thinking of LLMs as strategy generators and started thinking of them as very confident interns who hand you the same report twenty times with different cover pages.
The demos don't help
On GitHub right now there's a repo with 56K stars where LLM personas of Warren Buffett and Charlie Munger debate trades. I watched a similar multi-agent setup for a while. Four agents, elaborate memory system, consensus mechanism. The actual trade logic underneath could have been a moving average crossover.
nof1.ai gave six frontier models $10K each in real money last October. Two made money. Four got destroyed. In their second round, on US stocks, Grok won with +12.1%, mostly because it was processing 68 million tweets per day while the others were stuck on 15-minute delayed summaries.
People keep asking "which LLM is best for trading" and it's just the wrong question. The data pipe is doing most of the work.
How we got here
Trading software has been through a few cycles of this same pattern. Tools get better, people find faster ways to fool themselves.
MT4 was when indicators became actual software. RSI, moving averages, MACD stopped living in books and forums and turned into drag-and-drop, reusable components. Before MT4, that stuff was tribal knowledge: you picked it up from other traders, maybe a book if you were lucky.
The Python stack pushed things up a level: Backtrader, freqtrade, vnpy. People started packaging full strategies: entries, exits, sizing, optimization. Genetic algorithms to find "optimal" parameters, which in practice usually meant finding parameters that happened to work on that exact dataset. I burned a lot of time on that before I figured out what was happening.
Then ML platforms. QuantConnect, WorldQuant BRAIN. Less about tuning rules, more about building a feature pipeline that can survive training, validation, and execution. At that point the pipeline is the product.
Each cycle crystallized something. Indicators, then strategies, then systems. Each one also hit the same wall: backtest looks great, live performance doesn't.
And now LLMs show up and people try to skip the entire stack. All of it. The indicators, research workflows, validation, execution logic. Stuff that took each previous generation years to build up.
I get why. LLMs have absorbed all of those frameworks through training data: indicator libraries, strategy templates, backtesting patterns, risk heuristics, market commentary going back decades. Ask one for a strategy and it can produce something that sounds like it has years of market practice baked in.
Then you try to run it and realize fees aren't modeled. Or the backtest assumed you could fill at the close. Or the position sizing doesn't account for slippage.
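One cheap sanity check is to price the same trades twice: once with the backtest's free-fill assumption, once with fees and slippage charged on every change in position. A minimal sketch, assuming positions are expressed as a fraction of capital; the fee and slippage numbers are placeholders, not estimates:

import pandas as pd

def naive_pnl(prices: pd.Series, position: pd.Series) -> float:
    # every fill happens at the close, for free
    returns = prices.pct_change().fillna(0.0)
    return float((position.shift(1).fillna(0.0) * returns).sum())

def costed_pnl(prices: pd.Series, position: pd.Series,
               fee=0.001, slippage=0.0005) -> float:
    # same trades, but every change in position pays fees + slippage
    returns = prices.pct_change().fillna(0.0)
    gross = (position.shift(1).fillna(0.0) * returns).sum()
    turnover = position.diff().abs().fillna(position.abs()).sum()
    return float(gross - turnover * (fee + slippage))

If the strategy only survives in the naive version, the edge was in the execution assumptions, not the signal.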
What actually breaks
After the twenty-clones incident and watching arena results, two failure modes keep showing up.
Strategy Hallucination. The LLM generates strategies that look structurally valid but encode no real market insight. My clones were this. Proper entry/exit logic, proper position sizing. Also all exploiting the same artifact in the training data.
A human quant would have caught it in five minutes. I caught it in two hours. Someone less experienced might not catch it at all.
Backtest Overfitting Blindness. The LLM doesn't understand that a beautiful backtest is a warning sign. When I asked it to generate strategies with "strong backtesting performance," it optimized for exactly that. Curve-fitted parameters, lookahead bias in feature construction, survivorship bias in asset selection. Every quant knows these traps. The LLM walked into all of them with total confidence.
Here's what one looked like:
# What the LLM generated (looks clean):
import pandas as pd

def signal(prices: pd.Series, window=14, threshold=2.0):
    zscore = (prices - prices.rolling(window).mean()) / prices.rolling(window).std()
    return zscore < -threshold  # buy when "oversold"

# What it didn't tell you:
# - window=14 was fit to this specific dataset
# - threshold=2.0 maximized backtest returns
# - this exact pattern appears in 14 of 20 "different" strategies
# - in a trending market, zscore stays below -threshold for weeks
#   and you keep buying into a falling knife
These compound. The LLM hallucinates strategies, then fits them perfectly to historical data. And the more strategies you generate, the more likely at least one shows amazing backtest results purely by chance.
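The "by chance" part is easy to demonstrate without any market data at all. A small sketch: random coin-flip strategies against random daily returns, keep the best one. All the numbers here are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
n_days, n_strategies = 252, 50

market = rng.normal(0, 0.01, n_days)   # random daily returns standing in for "the market"

best = -np.inf
for _ in range(n_strategies):
    position = rng.choice([-1, 1], n_days)   # pure coin-flip signals
    best = max(best, float((position * market).sum()))

print(f"best random strategy 'returned' {best:.1%} with zero insight")

Pick the best of fifty coin flips and double-digit "returns" show up routinely. Generate more candidates and the best backtest only gets better, for exactly the wrong reason.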
The boring stack nobody wants to build
What all of the demos and arenas skip over is the infrastructure that previous generations had to build by hand: data cleaning, feature engineering, simulation assumptions, market impact, fee modeling, routing, inventory control, risk management. The model appears to have internalized it. So people don't build it. And then they're surprised when things break in the ways that stuff was supposed to prevent.
The trading agent experiments from last year showed this pretty clearly. The ones that held up had real infrastructure underneath: research loops, execution logic, constraints, context handling. The ones that blew up had an LLM and a brokerage API. One system I read about was basically polling a model every few seconds and sending market orders based on the response. That's not a trading system, that's a random number generator with extra steps.
Jane Street is interesting here. People point to them as proof that ML wins at trading. And they do use deep learning. Tens of thousands of GPUs, custom CUDA kernels, architectures from the same transformer research that produced LLMs. But what they're doing with all of that is market making. Pricing 16,000+ bonds in real time, handling 41% of US bond ETF volume. Their models process numerical market microstructure data. Not news, not tweets. One of their engineers described it as "1 unit of useful data and 99 units of garbage."
The model is one layer. Around it sits a pricing engine, execution logic that handles routing and queue position and partial fills, risk controls, inventory management, monitoring, post-trade review.
Model + tools. The model makes judgments, the tools constrain and execute and audit those judgments. Take away the tooling and you're left with confident numbers that nobody's checking.
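I have no idea what their tooling actually looks like, but the shape of "model + tools" is easy to sketch: the model proposes an order, and deterministic checks get veto power. The limits below are placeholders, not recommendations:

from dataclasses import dataclass

@dataclass
class RiskLimits:
    max_position: float = 10_000   # per-symbol notional cap (placeholder)
    max_order: float = 2_000       # single-order notional cap (placeholder)
    max_daily_loss: float = -500   # kill-switch threshold (placeholder)

def check_order(order_notional: float, current_position: float,
                daily_pnl: float, limits: RiskLimits) -> tuple[bool, str]:
    # dumb, deterministic, and runs on every model-proposed order
    if daily_pnl <= limits.max_daily_loss:
        return False, "kill switch: daily loss limit hit"
    if abs(order_notional) > limits.max_order:
        return False, "order too large"
    if abs(current_position + order_notional) > limits.max_position:
        return False, "would exceed position limit"
    return True, "ok"

None of this is clever. That's the point: the checks have to be dumber than the model so they can't be argued with.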
Where I landed
After the clone incident I changed how I use these models. They're good at proposing structure: indicator combinations, entry logic ideas, risk rules. But the moment they start picking specific numbers, I don't trust them. Those numbers will be curve-fitted to whatever history they've seen.
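What that looks like in practice: keep the structure, refit the numbers walk-forward, and only record out-of-sample results. A rough sketch using the signal() function from the block above; the candidate windows and split sizes are arbitrary:

import pandas as pd

def walk_forward(prices: pd.Series, windows=(5, 10, 14, 20, 30),
                 train=252, test=63):
    def score(px, w):
        # run the proposed signal shape with a candidate window, sum simple returns
        pos = signal(px, window=w).astype(float).shift(1).fillna(0.0)
        return float((px.pct_change().fillna(0.0) * pos).sum())

    oos = []
    for start in range(0, len(prices) - train - test, test):
        train_px = prices.iloc[start:start + train]
        test_px = prices.iloc[start + train:start + train + test]
        best_w = max(windows, key=lambda w: score(train_px, w))  # fit in-sample...
        oos.append(score(test_px, best_w))                       # ...judge out-of-sample
    return oos

It's not sophisticated, but it at least stops the model's favorite numbers from grading their own homework.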
The diversity problem turned out to be worse than I expected. If you generate fifty strategies without clustering them first, there's a good chance you end up with five actual ideas wearing ten costumes each. I should have clustered before getting excited about twelve profitable backtests.
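Clustering doesn't need to be fancy. Correlating the strategies' backtest return streams against each other is enough to expose the costumes; the 0.95 cutoff here is a guess, not a studied threshold:

import pandas as pd

def count_distinct(strategy_returns: pd.DataFrame, cutoff=0.95) -> int:
    # greedily group strategies whose daily-return series are nearly identical
    corr = strategy_returns.corr().abs()
    remaining = list(strategy_returns.columns)
    clusters = 0
    while remaining:
        head = remaining.pop(0)
        clusters += 1
        remaining = [c for c in remaining if corr.loc[head, c] < cutoff]
    return clusters

Something like this, run before the backtests, would probably have collapsed my fourteen clones into one column and saved the two hours of reading.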
And honestly I still don't have a clean workflow for this. Maybe I'm over-indexing on the diversity problem specifically. But whenever someone shows me an LLM trading system, the first thing I want to know is what catches the model when it's wrong. If the answer is "the model corrects itself," I've seen that movie.
What does your setup look like? Has anyone else tried running LLM-generated strategies through actual backtesting infrastructure and survived? Curious what failure modes you hit that I haven't.