A practical May 2026 breakdown of the best LLMs for coding, agents, multimodal work, retrieval, and production cost, based on the trade-offs that actually show up after launch.
I spent time re-checking the current LLM stack for the same reason most teams do: the leaderboard keeps moving, but production choices still feel harder than they should.
My main takeaway is that model choice still matters, but it explains less of the final outcome than it did a year ago.
The best model on paper is often not the model that holds up once you add cost, harness quality, and long agent traces, so I will start with the short version.
TL;DR
If I had to compress May 2026 into a few picks, this is where I would land.
- Multi-file coding: Claude Opus 4.7
- Raw coding benchmark dominance: Qwen 3.6 Max-Preview
- Agentic terminal work: GPT-5.5
- Hallucination resistance through multi-agent debate: Grok 4.20 Multi-Agent Beta
- Long-context multimodal work: Gemini 3.1 Pro
- Best open-weight model on the AA Intelligence Index: Kimi K2.6
- Cheap open-weight frontier option: DeepSeek V4-Pro
- Cheapest listed frontier-class API price in this snapshot: DeepSeek V4-Flash
- Largest MIT-licensed open model: GLM-5.1
- Best non-Chinese open-weight all-around option: Mistral Large 3
- Open-weight long context: Llama 4 Scout
- On-device and smaller self-hosted deployments: Mistral Small 4 or Gemma 4
- Capability ceiling, but not generally available: Claude Mythos Preview
If I had to reduce that even further, I would say GPT-5.5 for terminal agents, Claude Opus 4.7 for codebase reasoning, Qwen 3.6 Max-Preview for benchmark-heavy coding work, Grok 4.20 when hallucination cost is high, and DeepSeek V4-Flash when price dominates.
That quick list makes more sense once you zoom out and look at what actually changed in May.
What changed in May
The first week of May felt quieter at the model layer than late April, but the rest of the stack got louder.
Apple said iOS 27 will let users choose from multiple third-party AI models for text, editing, and image work. OpenAI launched an Ads Manager beta for U.S. advertisers. Greg Brockman also told the U.S. Senate that OpenAI expects to spend about $50 billion on infrastructure in 2026, up from $30 million in 2017.
The U.S. Commerce Department also widened pre-release safety testing access to more frontier labs, which means release timing is now partly a regulatory problem.
What surprised me here was not any single launch. It was how clearly the bottleneck moved above the model, which is why the closed models are still worth separating first.
Closed models
Claude Opus 4.7
If the job is multi-file code reasoning, this is still the cleanest pick for me.
Anthropic released it on April 16, 2026, with 1M context and native vision, and it posts 87.6% on SWE-bench Verified. It is especially compelling for codebase Q&A, ticket-to-PR workflows, and architectural review where broad context actually matters.
The trade-off is that it does not lead terminal work, does not beat Gemini on very long multimodal context, and is far from the cheapest option at about $5 input and $25 output per million tokens.
I still think this is the model to beat when the codebase itself is the problem, which naturally leads to the model that currently owns terminal automation.
GPT-5.5 and GPT-5.5 Pro
If the work looks more like terminal control, OS-level action, and function-heavy agents, GPT-5.5 is the stronger fit.
OpenAI released it on April 23, 2026, with native vision and audio, and the headline number here is 82.7% on Terminal-Bench 2.0. It also posts 78.7% on OSWorld-Verified, 84.9% on GDPval, about 88.7% on SWE-bench Verified, and 58.6% on SWE-bench Pro.
GPT-5.5 Pro adds parallel test-time compute, which means multiple reasoning paths get run and merged before answering. That is why it reaches 39.6% on FrontierMath Tier 4 and why external evaluators reported fewer major reasoning errors than GPT-5 thinking.
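OpenAI has not said exactly how the merge step works, so treat this as a minimal sketch of the general parallel-sampling pattern, not OpenAI's implementation. The `ask` completion function is something you supply, and it should sample with temperature above zero so the paths actually diverge:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def parallel_test_time_compute(ask: Callable[[str], str], prompt: str, paths: int = 8) -> str:
    """Run several independent reasoning paths in parallel, then merge.

    `ask` is any completion function you supply. The merge here is the
    simplest possible one: majority vote over final answers. A production
    merge might instead ask a model to adjudicate between the candidate
    traces before answering.
    """
    with ThreadPoolExecutor(max_workers=paths) as pool:
        answers = list(pool.map(lambda _: ask(prompt), range(paths)))
    return Counter(answers).most_common(1)[0][0]
```

The design point is that you pay for `paths` calls to get one answer, which is exactly why this only makes sense when your error cost is high.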
GPT-5.5 Instant also became the default ChatGPT model on May 5, with OpenAI claiming 52.5% fewer hallucinated claims on high-risk topics. I treat that as directional until it gets reproduced outside OpenAI.
This is the model I would start with when tool use is the center of the app, which makes the next comparison about cost and context.
Gemini 3.1 Pro
Gemini 3.1 Pro is still the most practical long-context multimodal closed model in the group.
Google released it on February 19, 2026, made it the API default on March 6, and it gives you a true 1M-token production context window. It also leads GPQA Diamond at 94.3% and remains the cheapest US frontier model in this snapshot at $2 input and $12 output per million tokens up to 200K prompt size.
With Forge Code as the harness, it also reaches 78.4% on Terminal-Bench 2.0, which matters because it shows how much the wrapper now changes the result.
If your stack needs long documents, image understanding, audio, and budget discipline in one place, this is still a very hard model to ignore, which makes the restricted ceiling worth mentioning next.
Claude Mythos Preview
Claude Mythos Preview matters because it sets the capability ceiling, even if most teams cannot use it.
Anthropic released it on April 7, 2026, but restricted it to Project Glasswing partners. Anthropic reports 93.9% SWE-bench Verified, 94.6% GPQA Diamond, 97.6% USAMO, 82.0% Terminal-Bench 2.0, 83.1% CyberGym, 100% pass@1 on Cybench, and 64.7% on Humanity's Last Exam with tools.
Anthropic's framing is that Mythos is a new tier above Opus, and the company said it found enough zero-day vulnerabilities across major operating systems and browsers that public release was too risky.
I include it mostly because it shows where the public ceiling is headed, and because that same gap between public and restricted models makes Grok's architecture more interesting.
Grok 4.20 Multi-Agent Beta
Grok 4.20 is the most structurally interesting model in this list.
xAI released it on March 9, 2026, and the key point is that the default inference path is a debate system, not a single pass. Low and medium reasoning use 4 agents, while high and xhigh use 16 agents arguing through the problem before a final synthesis.
That setup is why it posts 49 on the Artificial Analysis Intelligence Index with reasoning enabled, 78% on AA-Omniscience, 93.3% on AIME, a 68.7 agentic index score on AA, and a 2M-token context window with 267 tokens per second serving speed.
Pricing is $1.25 input and $2.50 output per million tokens, which is not cheap, but the architecture is doing more work than a single call. If hallucination cost is high, I think this is one of the few cases where paying for the structure can make sense.
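xAI has not published the debate protocol, so the sketch below is a toy version of the general pattern, not Grok's actual system: n agents draft independently, read each other's drafts, revise, and a synthesis pass merges the result. `ask` is again a completion function you supply.

```python
from typing import Callable, List

def debate(ask: Callable[[str], str], question: str, agents: int = 4, rounds: int = 2) -> str:
    """Toy multi-agent debate: independent drafts, cross-reading, synthesis."""
    drafts: List[str] = [ask(question) for _ in range(agents)]
    for _ in range(rounds):
        # Each agent sees the others' current drafts and may revise its own.
        drafts = [
            ask(
                f"Question: {question}\n"
                "Other agents answered:\n"
                + "\n---\n".join(d for j, d in enumerate(drafts) if j != i)
                + f"\nYour previous answer: {drafts[i]}\n"
                "Revise your answer if the others expose a flaw in it."
            )
            for i in range(agents)
        ]
    # Final synthesis pass over the surviving drafts.
    return ask(
        f"Question: {question}\nCandidate answers:\n"
        + "\n---\n".join(drafts)
        + "\nSynthesize the single best-supported answer."
    )
```

Even this toy version makes the economics visible: 4 agents and 2 rounds is already a dozen calls per question, which is where the "paying for the structure" framing comes from.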
That covers the main proprietary picks, which is exactly where the open-weight side gets more interesting than it used to be.
Open-weight models
The open-weight story in May 2026 is not just about price anymore.
There are now several open or downloadable options that you can defend on capability, not only on cost. I have stopped treating the open tier as a fallback because that mental model is already outdated.
DeepSeek V4-Pro and V4-Flash
DeepSeek V4 is the cost-performance story that production teams cannot ignore.
V4-Pro uses a 1.6T MoE with 49B active parameters per token and a hybrid attention design that combines Compressed Sparse Attention and Heavily Compressed Attention. V4-Flash is the smaller 284B total and 13B active version, also with hybrid attention and a 1M-token context window.
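"Active parameters" just means the router only fires a few experts per token. Here is a toy top-k gating sketch of a generic MoE layer, nothing DeepSeek-specific:

```python
import numpy as np

def topk_moe_forward(x: np.ndarray, experts: list, gate_w: np.ndarray, k: int = 2) -> np.ndarray:
    """Toy MoE layer: route a token to its top-k experts only.

    `experts` is a list of callables, but only k of them run per token.
    This is why a model can hold 1.6T parameters while each forward pass
    only touches a small active fraction of them.
    """
    logits = x @ gate_w                        # router scores, one per expert
    top = np.argsort(logits)[-k:]              # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over the selected k
    return sum(w * experts[i](x) for w, i in zip(weights, top))
```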
On benchmarks, V4-Pro posts 52 on the Artificial Analysis Intelligence Index, 80.6% SWE-bench Verified, 55.4% SWE-bench Pro, 93.5% LiveCodeBench, and 3,206 Codeforces Elo.
On API pricing, V4-Pro sits at $0.435 input and $0.87 output per million tokens at standard pricing, while V4-Flash drops to $0.04 input and $0.07 output per million tokens. That is the structural pricing fact of this whole snapshot.
The part I would not ignore is the gap between Verified and Pro. Production behavior on novel codebases is likely closer to 55.4% than 80.6%, so I would absolutely run a domain reproduction before trusting the headline score.
This is the model family that forces you to rethink cost first, which leads naturally into long context.
Llama 4 Scout and Maverick
Meta's Llama 4 line still matters mainly because of context.
Scout gives you 1M context, while Maverick pushes to 10M tokens. If you are building long-running agents or whole-codebase workflows that currently rely on repeated summarization, that changes the system shape more than a few benchmark points do.
The real value here is not novelty. It is removing the need to constantly compress history back into the model.
That makes Llama relevant in a different lane than Qwen, which is now a bigger strategic story than just another benchmark win.
Qwen 3.6 Max-Preview
Qwen 3.6 Max-Preview is probably the most important release in the late-April to early-May window for production teams.
It leads six major coding and agent benchmarks at once, including SWE-bench Pro, Terminal-Bench 2.0, SkillsBench, QwenClawBench, QwenWebBench, and SciCode. No other model in this snapshot owns that many top spots at the same time.
The bigger story is that Alibaba closed weights on the flagship. After years of Qwen releases with downloadable weights, the top tier is now API-only through Qwen Studio and Alibaba Cloud Model Studio.
It also ships with a 260K-token context window, OpenAI- and Anthropic-compatible endpoints, and a preserve_thinking flag that keeps the reasoning trace across multi-turn workflows, so the agent keeps its internal thread between tool calls.
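Only the flag name is public as far as I can tell, so the request shape below is my assumption of how it would sit in an OpenAI-compatible chat call; the endpoint URL and model id are also assumptions:

```python
# Hypothetical request shape. The preserve_thinking flag is named in the
# release, but its exact placement in the payload is my assumption.
import requests

resp = requests.post(
    "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions",  # assumed endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "qwen3.6-max-preview",  # assumed model id
        "messages": [{"role": "user", "content": "Refactor the failing test."}],
        "preserve_thinking": True,       # keep the reasoning trace across turns
    },
)
# With the flag on, the reasoning trace is meant to carry over between tool
# calls instead of being dropped at each turn boundary.
print(resp.json())
```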
If your workload maps closely to coding, terminal action, or web tasks, this is hard to ignore. If open weights are a hard requirement, the open Qwen family matters more.
Qwen 3.6 open-weight family
The open-weight Qwen 3.6 line is the best Apache 2.0 multilingual base in this group.
The April releases included Qwen 3.6-72B-dense, Qwen 3.6-35B-A3B, and Qwen 3.6-27B. The 72B dense version ships with 256K context, 94.8% HumanEval, 68.2% SWE-bench Verified, and 71.4% LiveCodeBench.
This family looks especially strong for multilingual production, East Asian language performance, self-hosted deployments, and fine-tuning bases where Apache 2.0 matters.
That still leaves the non-Chinese open-weight option many teams ask about first.
Mistral Large 3
Mistral Large 3 is still the strongest non-Chinese open-weight model I would put in a serious production shortlist.
It is a 675B total and 41B active MoE under Apache 2.0 with 256K context and multimodal support, shipped through Hugging Face as Mistral-Large-3-675B-Instruct-2512.
If your legal path, customer location, or internal policy makes Apache 2.0 and non-Chinese origin matter, this is the cleanest answer in that slot.
That also sets up the smaller self-hosted story, where the question is less about absolute quality and more about what can run locally.
Mistral Small 4, Gemma 4, GLM-5.1, and Kimi K2.6
Mistral Small 4 is the on-device or smaller self-hosted pick in spirit, though not in the laptop-demo sense people sometimes assume.
It uses 119B total parameters with 6.5B active per forward pass, and Mistral's own docs put minimum production infrastructure at 4×H100, 2×H200, or 1×DGX-B200. That is small relative to frontier MoEs, not cheap in an absolute sense.
Gemma 4 remains a strong Google-aligned open family for fine-tuning and smaller deployment sizes, with sizes up to 31B and good support across Vertex AI, Hugging Face, llama.cpp, and Ollama.
GLM-5.1 is strategically important because it is a 754B MoE under MIT license. That gives teams another permissive-license option for frontier-class self-hosting when DeepSeek V4 is too large or operationally expensive.
Kimi K2.6 is the current top open-weight model on the Artificial Analysis Intelligence Index. It is a 1.1T MoE, and the important claim is not just raw intelligence but longer agentic stability across extended tool-call sessions.
That rounds out the open-weight picture, which makes the multimodal split easier to read next.
Multimodal, audio, and retrieval
Multimodal is no longer one category. Vision, image generation, video generation, audio understanding, and embeddings each have different leaders and different economics.
For vision, Gemini 3.1 Pro is the strongest closed model I would start with for general image and document understanding. On the open side, Qwen 3.6-VL is the most practical answer right now.
For image generation, there is no single best model anymore. The field splits by use case:
- Imagen 4 Ultra for photorealism
- Recraft V4 for editorial and blog visuals
- Nano Banana 2 for speed
- FLUX 2 Pro for the best general balance
- Ideogram v3 for text in images
- Midjourney v8 for aesthetics
- Z-Image Turbo on pure price
For video generation, the split looks similar:
- Veo 3.1 as the best all-around option
- Kling 3.0 for multi-shot consistency
- Hailuo for faces and expressions
- Runway Gen-4.5 for granular creative control
- Seedance 2.0 for joint audio-video generation
- Sora 2 only as a sunset case, since OpenAI already announced the consumer product shutdown and the API discontinuation timeline
For audio understanding, GPT-5.5 is the leader because native speech input avoids the STT-then-LLM split and keeps prosody and other signal that transcription pipelines often flatten. Gemini 3.1 Pro is close enough to matter, while open-weight audio adapters still lag.
If you want the separate voice stack, the related post is here: Best Voice AI May 2026.
That said, for a lot of production apps the real bottleneck is still retrieval, not multimodal generation.
Embeddings and coding
Embeddings still come down to two production questions: score on your domain and cost at your scale.
For closed retrieval, Voyage-3-large is still a strong default. Gemini Embedding 2 Preview is the most important multimodal embeddings release because it handles text, image, video, audio, and PDF in one model, supports 100+ languages, uses native Matryoshka Representation Learning, and is priced at $0.20 per million tokens.
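The Matryoshka part matters operationally: MRL-trained models front-load information into the leading dimensions, so you can truncate and renormalize vectors instead of storing the full width. A minimal sketch with numpy:

```python
import numpy as np

def truncate_matryoshka(embedding: np.ndarray, dims: int = 256) -> np.ndarray:
    """Keep the first `dims` dimensions of an MRL embedding and renormalize.

    Because MRL training front-loads information, the truncated vector is
    still a valid (coarser) embedding. A common pattern is cheap first-pass
    retrieval on truncated vectors, with the full vector reserved for a
    rerank stage.
    """
    v = embedding[:dims]
    return v / np.linalg.norm(v)
```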
If you want cheaper but still strong options, Jina-embeddings-v3 remains compelling. For enterprise multilingual retrieval, Cohere embed-v4 still matters. For the OpenAI path, text-embedding-3-large is the general default and text-embedding-3-small is the cheap default.
On open-source embeddings, Jina v5-text-small, Microsoft Harrier-OSS-v1, and Qwen3-Embedding-8B are the names I would keep on the shortlist.
For reranking, I would still put Cohere Rerank v4 and Voyage-rerank-v3 in the production tier. If you are not reranking yet, you are probably leaving a few ranking points on the table for relatively little cost.
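If reranking is new to your stack, the shape is simple: over-retrieve with embeddings, then spend the reranker budget on a small slice. The `embed_search` and `rerank` callables below are placeholders for whichever vector store and vendor you actually use:

```python
from typing import Callable, List, Tuple

def retrieve_then_rerank(
    query: str,
    embed_search: Callable[[str, int], List[str]],                 # your vector-store lookup
    rerank: Callable[[str, List[str]], List[Tuple[str, float]]],   # your reranker, (doc, score) pairs
    overfetch: int = 50,
    final_k: int = 5,
) -> List[str]:
    """Over-retrieve cheaply, then rescore only the top slice."""
    candidates = embed_search(query, overfetch)
    scored = rerank(query, candidates)
    return [doc for doc, _ in sorted(scored, key=lambda x: x[1], reverse=True)[:final_k]]
```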
On coding specifically, the leaderboard is compressed enough that the harness now matters almost as much as the model. Claude Opus 4.7 leads multi-file code reasoning, GPT-5.5 leads terminal work, DeepSeek V4-Pro wins cost-sensitive coding and competitive programming, and the harness can move the result by 2 to 6 points on top of the base model.
I have become much less impressed by model-only coding leaderboards because I keep seeing retry logic and intermediate validation move the outcome more than a model swap does. That is why the reasoning section is really about system design too.
Reasoning and math
For graduate-level reasoning, Claude Mythos Preview is the reported ceiling at 94.6% on GPQA Diamond.
For generally available models, Gemini 3.1 Pro is right behind at 94.3%. Claude Mythos Preview also leads USAMO at 97.6%, while Grok 4 Heavy is roughly 100% on AIME 2025 and GPT-5.5 Pro reaches 39.6% on FrontierMath Tier 4.
The practical takeaway is not that one model solves reasoning. It is that test-time compute and inference structure are now product choices.
GPT-5.5 Pro is the clean example. It costs much more than base GPT-5.5, but it reduces major errors on hard reasoning calls. That is useful only if your error cost is high enough to justify the extra compute.
That brings the conversation to the only question that matters in production, which is how to choose when benchmark gaps are this small.
How I would pick one for production
If I were choosing for a production system today, I would not start from a public leaderboard.
I would start with three candidate models, run 100 to 500 of my real prompts through the exact harness I plan to ship, and score the results with an eval stack I trust. The gap between vendor numbers and your own reproduction is still the number that matters.
Then I would measure reliability under longer sessions. Most public benchmarks do not tell you how a model behaves after 50 or more tool calls, which is exactly where real agents start to get weird.
I would also cost-adjust the scores, because price differences of 5 to 10 times are common now. DeepSeek V4-Pro is a huge win if it reproduces on your domain, and a bad decision if the benchmark contamination gap maps directly onto your real traffic.
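To make that cost adjustment concrete, here is a minimal sketch of scoring candidates by quality per dollar on your own prompts. The `run_harness` and `grade` callables stand in for the exact harness and eval stack you plan to ship, and the price should be the blended per-prompt cost including retries:

```python
from typing import Callable, Dict, List

def cost_adjusted_eval(
    prompts: List[str],
    run_harness: Callable[[str, str], str],   # (model, prompt) -> output, via your real harness
    grade: Callable[[str, str], float],       # (prompt, output) -> score in [0, 1]
    price_per_run: Dict[str, float],          # blended $ per prompt per model, incl. retries
) -> Dict[str, Dict[str, float]]:
    """Score each candidate model on real prompts, then normalize by cost."""
    results = {}
    for model, price in price_per_run.items():
        scores = [grade(p, run_harness(model, p)) for p in prompts]
        mean = sum(scores) / len(scores)
        results[model] = {"score": mean, "score_per_dollar": mean / price}
    return results
```

The raw score tells you whether the vendor number reproduces; the score-per-dollar column is what usually changes the decision.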
If I had to reduce it to a simple framework, it would look like this:
- Choose Claude Opus 4.7 for multi-file coding agents, codebase Q&A, and longer code traces.
- Choose GPT-5.5 for terminal automation, function-heavy agents, and apps that need vision plus audio together.
- Choose Gemini 3.1 Pro for long-context multimodal work and budget-sensitive US frontier usage.
- Choose DeepSeek V4-Pro when cost dominates and you are willing to verify domain behavior carefully.
- Choose Mistral Large 3 when you want a strong non-Chinese open-weight Apache 2.0 option.
- Choose Llama 4 Maverick when 10M context is actually useful.
- Choose Mistral Small 4 or Gemma 4 for smaller self-hosted or fine-tuned deployments.
- Choose Grok 4.20 when hallucination resistance and current data grounding matter enough to pay for the architecture.
- Avoid Claude Mythos Preview unless you are actually in Project Glasswing.
The common mistakes are pretty consistent, which is why that list is easier to use after you know what to avoid.
Common mistakes
The first mistake is optimizing for one benchmark.
If you optimize around SWE-bench Verified alone, you can easily choose the wrong model for your actual workload. I would always pair public benchmarks with a domain reproduction.
The second mistake is ignoring total cost of ownership.
List price is not the real bill. Retry rate, test-time compute, tool-call failures, and recovery logic all change the cost curve once the agent is live.
The third mistake is sticking with a familiar brand without running the comparison.
Brand inertia keeps teams on GPT or Claude even when a much cheaper model might handle the workload just fine. I have seen enough teams miss obvious savings here that I no longer trust intuition on this part.
The fourth mistake is forcing the whole app through one model.
Routing is not optional anymore. Cheap models should handle easy work, frontier models should handle hard work, and there should usually be a fallback chain.
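A routing layer does not need to be clever to pay off. A minimal sketch follows: a difficulty check picks a tier, and each tier has a fallback chain. The model ids are illustrative, and `is_hard` is a placeholder for whatever classifier or heuristic you trust:

```python
from typing import Callable, Dict, List, Optional

# Illustrative tiers -- swap in your own model ids and a real classifier.
CHAINS: Dict[str, List[str]] = {
    "easy": ["deepseek-v4-flash", "gemini-3.1-pro"],
    "hard": ["claude-opus-4.7", "gpt-5.5", "gemini-3.1-pro"],
}

def route(prompt: str, call: Callable[[str, str], str], is_hard: Callable[[str], bool]) -> str:
    """Try each model in the tier's fallback chain until one succeeds."""
    chain = CHAINS["hard" if is_hard(prompt) else "easy"]
    last_err: Optional[Exception] = None
    for model in chain:
        try:
            return call(model, prompt)
        except Exception as err:  # timeout, rate limit, refusal, etc.
            last_err = err
    raise RuntimeError(f"all models in chain failed: {chain}") from last_err
```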
That is also why the last part matters more than the model comparison itself.
The layer above the model
The model is no longer the only decision that explains whether an AI product works.
The harness, the retry policy, the eval discipline, the routing logic, and the reliability instrumentation now decide as much as the model family in many real systems.
If your stack is on Google ADK, the ADK Production Eval Loop is a good example of the instrument, score, gate, simulate, and optimize cycle with runnable code.
If you want the previous snapshots, read Best LLMs April 2026 and Best LLMs March 2026.
If you want the voice side, read Best Voice AI May 2026.
The leaderboard still matters, but I would trust the system around the model before I trust a single headline score, which is the real point of the whole exercise.
What are you actually seeing in production right now: raw model limits, harness issues, or reliability decay over longer agent runs?
