
Stephan Miller

Originally published at stephanmiller.com

The Cheapskate's Guide to the Arena Leaderboard: Why I Stopped Paying Claude Opus Prices

I kept noticing this thing while writing the model roundup every week. The “best models” lists all lead with $25-per-million Claude Opus, and then I’d open the Arena leaderboard for creative writing and notice Gemini 3 Flash sitting above Claude Sonnet for one-tenth the price. Or open the coding leaderboard and find GLM 5.1 tying Claude Opus 4.6 inside the top ten while costing seven times less.

So I’d do the math. Every week. By hand. While writing about something else.

This week I made the math the centerpiece. Welcome to the Cheapskate Picks: the cheapest model within striking distance of the leader, for every Arena category that matters. This post started because I kept doing that math by hand; now it does it for you.

The Compression Problem (Or: Why You’re Probably Overpaying)

Here is the structural fact that powers everything else in this post: the Arena leaderboard’s Overall top 20 spans 35 rating points. From #1 (claude-opus-4-7-thinking at 1503) down to #20 (claude-opus-4-5 at 1468). That’s it. The entire visible top end of the leaderboard fits in less than 3% of the rating scale.

Meanwhile the prices fan out 30x. Claude Opus 4.7 costs $25/M output. Gemini 3 Flash, which sits at #16 in that same Overall top 20 with a rating of 1474, costs $3/M output. Twenty-nine rating points apart, about 2% on the scale, eight times the price.

That is the cheapskate problem stated as a math equation. Nobody is going to feel a 2% rating gap. They will absolutely feel an 8x cost difference when the bill arrives.
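To see what that difference looks like on an invoice, here's a back-of-envelope sketch. The 200M output tokens per month is a made-up volume for illustration; only the per-token prices come from the leaderboard data above.

```python
# Hypothetical monthly volume to illustrate the gap; swap in your own numbers.
OUTPUT_TOKENS_PER_MONTH = 200_000_000  # 200M output tokens (assumption, not from the Arena data)

opus_price_per_m = 25.00   # Claude Opus 4.7, $/M output tokens (from this post)
flash_price_per_m = 3.00   # Gemini 3 Flash, $/M output tokens (from this post)

opus_bill = OUTPUT_TOKENS_PER_MONTH / 1_000_000 * opus_price_per_m
flash_bill = OUTPUT_TOKENS_PER_MONTH / 1_000_000 * flash_price_per_m

print(f"Opus bill:  ${opus_bill:,.0f}")   # $5,000
print(f"Flash bill: ${flash_bill:,.0f}")  # $600
print("Rating gap you paid for: 29 points on a ~1500-point scale (~2%)")
```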

So here is the heuristic I’m using from now on:

  1. Anchor on the category leader’s Arena rating
  2. Define a competitive band: default 50 rating points below the leader
  3. Sort models in the band by output price
  4. Cheapest in the band is the cheapskate pick. Report rating delta and price ratio so you can judge the trade

The reason this beats “best models under $1” thresholds is that different categories have different price floors. Vision is more expensive than text. Math has its own dynamics. A fixed dollar threshold breaks every category that doesn’t match it. The score-gap-vs-price-gap framing adapts on its own.
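Here's a minimal sketch of that heuristic in Python. The leaderboard rows are hand-typed from this week's numbers just to show the shape; in practice you'd feed it whatever you scrape from the Arena category pages.

```python
# Minimal sketch of the cheapskate-pick heuristic.
# Rows are (model, arena_rating, output_price_per_M) -- illustrative values from this post.
OVERALL = [
    ("claude-opus-4-7-thinking", 1503, 25.00),
    ("gemini-3-flash-preview",   1474,  3.00),
    ("claude-opus-4-5",          1468, 25.00),
]

def cheapskate_pick(rows, band=50):
    """Cheapest model within `band` rating points of the category leader."""
    leader = max(rows, key=lambda r: r[1])
    in_band = [r for r in rows if leader[1] - r[1] <= band]
    pick = min(in_band, key=lambda r: r[2])
    delta = pick[1] - leader[1]            # rating gap vs. the leader
    ratio = leader[2] / pick[2]            # how many times cheaper
    return pick[0], delta, round(ratio, 1)

print(cheapskate_pick(OVERALL))
# ('gemini-3-flash-preview', -29, 8.3)
```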

I am not saying that Claude Opus 4.7 is bad. It’s the leader on Arena Overall and Coding and Multi-Turn. But the gap you’re paying $22/M extra for might not be there. And in some categories, coding most loudly, there’s a model in the band that outperforms the leader on the benchmark that actually maps to your job.

Speaking of which.

The Cheapskate Picks (May 1–8, 2026)

Methodology in plain English: cheapest model within 50 rating points of the category leader. The default 50-point band worked for every category this week, because the data was unusually compressed across the board.

Overall: Gemini 3 Flash, $0.50/$3.00

  • Leader: claude-opus-4-7-thinking — rating 1503 — $25/M output
  • Cheapskate pick: Gemini 3 Flash Preview — rating 1474 — $3/M output
  • Δ rating: −29 points. Price ratio: 8.3x cheaper.

OpenRouter slug: google/gemini-3-flash-preview. Multimodal. 1M context. The boring correct answer of mid-2026.
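The slug drops straight into any OpenAI-compatible client pointed at OpenRouter. A minimal sketch, assuming the openai Python package and an OPENROUTER_API_KEY in your environment; the prompt is just a placeholder.

```python
import os
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint; the model slug is the only thing that changes.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="google/gemini-3-flash-preview",  # the cheapskate pick's slug
    messages=[{"role": "user", "content": "Summarize this release note in two sentences: ..."}],
)
print(resp.choices[0].message.content)
```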

If you have one model running for general daily-driver work and you are paying $25/M for output, you are subsidizing margin. Twenty-nine rating points on a 1500-point scale is below the threshold any human would notice in an A/B test, much less a production workflow.

Coding: GLM 5.1, the SWE-Bench Pro Killer

  • Leader: claude-opus-4-7-thinking — rating 1569 — $25/M output
  • Cheapskate pick: GLM 5.1 (Z.ai) — rating 1525 — $3.50/M output
  • Δ rating: −44 points. Price ratio: 7.1x cheaper.

OpenRouter slug: z-ai/glm-5.1. MIT-licensed. Weights on Hugging Face.

Here is where the cheapskate framing stops being polite. GLM 5.1 beats Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro on SWE-Bench Pro with a score of 58.4. SWE-Bench Pro is the benchmark where the model has to actually fix real GitHub issues in real codebases. The thing the leader is supposed to be the leader at.

So the situation is: on Arena’s vibes-based head-to-head vote (people picking which output looks nicer), Opus 4.7-thinking wins. On the benchmark that maps to the job you are actually paying these models to do, an open-weight Chinese model from a lab most readers haven’t heard of wins. And it is seven times cheaper.

Honorable mention: Kimi K2.6 (Moonshot) at rating 1519 / $3.50: same price tier, similar profile, also open-weight. If you don’t like Z.ai’s politics or licensing, Moonshot is the same trade.

Creative Writing: Gemini 3 Flash

  • Leader: claude-opus-4-6-thinking — rating 1494 — $25/M output
  • Cheapskate pick: Gemini 3 Flash Preview — rating 1459 — $3/M output
  • Δ rating: −35 points. Price ratio: 8.3x cheaper.

This is the category that triggered the methodology. Gemini 3 Flash sits at rating 1459 in creative writing. Claude Sonnet 4.5 sits at 1451. The cheap Google Flash model outranks the mid-tier Anthropic model for prose generation, while costing five times less than Sonnet and more than eight times less than the actual category leader.

If you’re writing fiction or marketing copy or anything generative-prose-shaped and paying Sonnet pricing, you are losing on both ends.

Daredevil pick: DeepSeek V4 Pro at rating 1449 / $0.87/M output — that’s 28.7x cheaper than the leader, and it sits at the band edge with −45 rating points. You give up another 10 rating points (still a sub-1% gap on the scale) and save another 3.4x on top of Gemini 3 Flash. For batch creative work where you don’t care about multimodal input, V4 Pro is the cheapest defensible answer.

Math: DeepSeek V4 Pro Thinking, the 17x Discount

  • Leader: gpt-5.4-high — rating 1515 — about $15/M output (gpt-5.4 base; high-reasoning costs the same per token, you just burn more of them)
  • Cheapskate pick: DeepSeek V4 Pro (thinking mode) — rating 1479 — $0.87/M output
  • Δ rating: −36 points. Price ratio: ~17x cheaper.

OpenRouter slug: deepseek/deepseek-v4-pro with reasoning: { effort: "high" } or xhigh.
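Through the same OpenAI-compatible endpoint, that reasoning block rides along in the request body. A sketch assuming the same OpenRouter client setup as the earlier snippet; the prompt is a placeholder, and the effort value follows the config shape quoted above.

```python
# Same OpenRouter client as the earlier sketch; only the model and the reasoning block change.
resp = client.chat.completions.create(
    model="deepseek/deepseek-v4-pro",
    messages=[{"role": "user", "content": "Prove that the sum of the first n odd numbers is n^2."}],
    # Provider-specific options pass through extra_body;
    # the post cites effort "high" (it also mentions an "xhigh" setting).
    extra_body={"reasoning": {"effort": "high"}},
)
print(resp.choices[0].message.content)
```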

If you do math with an LLM and you are paying OpenAI prices, stop. DeepSeek V4 Pro with thinking enabled is 36 rating points behind on Arena math, which is roughly 2.4% of the scale, for one-seventeenth the cost. The math category was the one where the price gap most embarrassed the leader.

Conservative runner-up: Gemini 3 Flash at rating 1476 / $3/M output. Five times cheaper than the leader, more conservative than V4 Pro Thinking, multimodal if you need to feed it diagrams.

Instruction Following: MiMo V2.5 Pro

  • Leader: claude-opus-4-6-thinking — rating 1518 — $25/M output
  • Cheapskate pick: MiMo V2.5 Pro (Xiaomi) — rating 1468 — $3/M output
  • Δ rating: −50 points. Price ratio: 8.3x cheaper.

OpenRouter slug: xiaomi/mimo-v2.5-pro.

Yes… the phone company. Their LLM team has been quietly competitive for two product cycles now and MiMo V2.5 Pro lands right at the band edge for instruction following at one-eighth the price. If “deploying a Xiaomi model in production” makes the security team start asking questions, the honorable mention is Claude Sonnet 4.6 at rating 1476 / $15/M output: only 1.7x cheaper than the leader, but you keep your name brand.

This is the category where the band was tightest: only the top 12 models fit in the 50-point window, which means MiMo squeaked in at the edge. That’s a structural note: in the categories where the top is more spread out, the cheapskate pick has more cushion. Instruction Following had the smallest cushion this week.

Hard Prompts: Gemini 3 Flash, Again

  • Leader: claude-opus-4-6-thinking — rating 1535 — $25/M output
  • Cheapskate pick: Gemini 3 Flash Preview — rating 1493 — $3/M output
  • Δ rating: −42 points. Price ratio: 8.3x cheaper.

Same story as Overall and Creative Writing. The Hard Prompts leader has the highest absolute rating of any category (1535), but Gemini 3 Flash still sits comfortably in the band 42 points back. MiMo V2.5 Pro is essentially tied at rating 1492 / $3: pick by ecosystem preference.

Multi-Turn: Gemini 3 Flash, Again Again

  • Leader: claude-opus-4-7-thinking — rating 1529 — $25/M output
  • Cheapskate pick: Gemini 3 Flash Preview — rating 1484 — $3/M output
  • Δ rating: −45 points. Price ratio: 8.3x cheaper.

The conservative pick here is Claude Sonnet 4.6 at rating 1482 / $15/M output. If you specifically want Anthropic’s multi-turn glue (the way Claude tracks state across long conversations), Sonnet is the cheapest Anthropic option in the band. But Gemini 3 Flash is two rating points higher for one-fifth the price, so unless you have a brand-loyalty reason, the math says Flash.

The Quick-Reference Table

| Category | Leader | Leader $ (out/M) | Cheapskate pick | Pick $ (out/M) | Δ rating | Price ratio |
| --- | --- | --- | --- | --- | --- | --- |
| Overall | claude-opus-4-7-thinking | $25 | Gemini 3 Flash | $3.00 | −29 | 8.3x |
| Coding | claude-opus-4-7-thinking | $25 | GLM 5.1 | $3.50 | −44 | 7.1x |
| Creative Writing | claude-opus-4-6-thinking | $25 | Gemini 3 Flash | $3.00 | −35 | 8.3x |
| Math | gpt-5.4-high | ~$15 | DeepSeek V4 Pro (thinking) | $0.87 | −36 | ~17x |
| Instruction Following | claude-opus-4-6-thinking | $25 | MiMo V2.5 Pro | $3.00 | −50 | 8.3x |
| Hard Prompts | claude-opus-4-6-thinking | $25 | Gemini 3 Flash | $3.00 | −42 | 8.3x |
| Multi-Turn | claude-opus-4-7-thinking | $25 | Gemini 3 Flash | $3.00 | −45 | 8.3x |

The pattern: Gemini 3 Flash wins the cheapskate slot in 4 of 7 Arena categories at $0.50 input / $3 output (Overall, Creative Writing, Hard Prompts, Multi-Turn). It’s the boring correct answer. The interesting picks are where it doesn’t win: Coding (GLM 5.1 because it actually beats the leader on SWE-Bench Pro), Math (DeepSeek V4 Pro Thinking because the price gap is absurd), and Instruction Following (MiMo V2.5 Pro, on a band edge, from Xiaomi).

And none of the seven categories needed a “you’re paying for quality here” caveat. Every category had a $3.50/M-or-less output option in the band. As of this week, you can pay $3.50/M output or less and stay within 50 rating points of the category leader on every major Arena category.

GLM 5.1: The SOTA Nobody’s Pricing In

Z.ai released GLM 5.1 on April 7, 2026. Mixture-of-experts, 744B total parameters, 40B active per token. MIT license. Weights on Hugging Face. The reviews you can find on it are all the same shape: “wait, this thing is what on coding?”

The numbers from the Renovate QR review:

  • SWE-Bench Pro: 58.4 — beats Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro
  • CyberGym: 68.7 — about 20 points above GLM-5
  • 8-hour autonomous coding runs with ~1,700 reasoning steps
  • API pricing on OpenRouter: $1.05 input / $3.50 output — 6 to 10x cheaper than Opus 4.6

Anthropic dominates the Arena leaderboard. Eleven of the top 20 in Instruction Following are Claude variants. Seven of the top 20 in Multi-Turn. The brand wins the popularity contest. But on a benchmark that has to map to “did the model actually fix the bug,” an open-weight model from a Chinese lab is the new state of the art, and it’s almost an order of magnitude cheaper.

This is the under-sold value pick the cheapskate framing rewards. It’s not in the noise of “every new model claims a benchmark win.” It’s tied with the most expensive frontier model on the benchmark closest to the actual job, and the community hasn’t priced in what this means yet.

Tencent’s Hy3 Free Cliff Hits

Last week’s lead story was Tencent’s Hy3 Preview running away with #1 on OpenRouter at +1,356% week-over-week. The catch was that the entire spike was driven by Tencent giving the model away free until May 8 to seed adoption.

If you built a workflow on Hy3’s free tier, you hit the paywall. Migration window: zero. Some of you may have woken up to a billing surprise.

What I’ll be watching next week is the size of the cliff. If Hy3 holds top-five even at paid pricing, the free run was a successful seeding strategy. If it craters out of the top ten the moment the meter starts running, the entire spike was a free-period mirage and the model’s real value was lower all along.

For what to use instead if you got caught flat-footed: Hy3’s nearest like-for-like by price after the cliff is DeepSeek V4 Flash at $0.14/$0.28, which is actually slightly cheaper. And V4 Flash has the agent-default chorus behind it that Hy3 never built. Migration target if you need one: V4 Flash.

The Asterisks (Or: Cheap Is Fine If You Know What You’re Losing)

Gemini 3 Flash MRCR retrieval cliff. This is the one that bit me earlier this year. The Cybernews review confirms it numerically: MRCR retrieval drops from 60.1% accuracy at 128K context to 12.3% at 1M. If you’re running RAG-heavy workflows and pumping the full million-token context window full of documents, the cheapskate pick falls off a cliff at long context. Cap your context at 128K for retrieval-shaped work, or accept the hallucinations. Don’t say I didn’t warn you.
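If you want a guard rail for that, here's a rough sketch that caps retrieval context before it hits Flash. The four-characters-per-token estimate is a crude assumption of mine, not a number from the review.

```python
# Rough guard rail: keep retrieval-shaped prompts under ~128K tokens for Gemini 3 Flash.
# Assumes roughly 4 characters per token, a crude English-text heuristic.
MAX_CONTEXT_TOKENS = 128_000
CHARS_PER_TOKEN = 4

def cap_retrieved_context(chunks: list[str], budget_tokens: int = MAX_CONTEXT_TOKENS) -> str:
    """Pack retrieved chunks in order until the rough token budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        est_tokens = len(chunk) // CHARS_PER_TOKEN + 1
        if used + est_tokens > budget_tokens:
            break
        kept.append(chunk)
        used += est_tokens
    return "\n\n".join(kept)
```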

DeepSeek V4 Flash factual recall hole. Artificial Analysis shows V4 Flash scoring 34.1% on SimpleQA versus V4 Pro’s 57.9%. The 25x output savings come with a “won’t reliably know facts” asterisk. V4 Flash is great for agent loops where you’re feeding it grounded context anyway. It’s bad as a free-recall question-answerer. Pair it with retrieval. Don’t ask it to remember.
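What “pair it with retrieval” looks like in practice: the facts ride in the prompt instead of being recalled. A sketch under the same OpenRouter assumptions as the earlier snippets; retrieve_chunks and the V4 Flash slug are stand-ins I’m assuming, not confirmed identifiers.

```python
def answer_with_grounding(question: str, retrieve_chunks) -> str:
    """Ask V4 Flash to answer from retrieved context instead of its own factual recall."""
    # retrieve_chunks is whatever retriever you already run (hypothetical stand-in).
    context = "\n\n".join(retrieve_chunks(question))
    resp = client.chat.completions.create(
        model="deepseek/deepseek-v4-flash",  # slug assumed from the naming pattern in this post
        messages=[
            {"role": "system", "content": "Answer only from the provided context. If it isn't there, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```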

The Hy3 “you built on a free tier” thing. Predictable, still happening to people today. If you have an LLM in a critical workflow and the only reason you picked it was “free,” that workflow’s billing model is broken by design. The fix is to pick a model where the paid pricing is still cheap enough to justify the workflow.

These are not reasons to not use the cheapskate picks. These are reasons to know what you’re picking. The model card for “I will hallucinate factual recall, but I cost a quarter” is fine if the workflow doesn’t depend on factual recall. It’s catastrophic if it does.

Coming Up: Google I/O May 19, the Gemini 4 Question

Google I/O 2026 is May 19–20 at Shoreline Amphitheatre. The big rumored announcement is Gemini 4 with a claimed 84.6% on ARC-AGI2, integrated image and video generation, and a new “Omni” video model replacing the internal Toucan tool. Rumors also include “Remy,” a 24/7 always-on agent, and a Proactive Assistant that pushes suggestions instead of waiting for prompts.

The reason this matters for the cheapskate analysis is that Google is already winning the cheapskate slot at the Flash tier. Gemini 3 Flash is the boring correct answer for four of seven categories at $3/M output. If Gemini 4 Pro lands at SOTA on the leader benchmarks, the gap from the top of the leaderboard closes downward. The cheapskate band stays the same; the leader’s value proposition gets squeezed harder.

If Gemini 4 doesn’t land well, the leaderboard stays compressed in roughly its current shape and the cheapskate pattern holds. Either way I’ll be writing about it. Either way, my OpenRouter bill is not going up.

The OpenRouter stealth slot is still occupied by Owl Alpha (April 28, free, 1.05M context) per the W18 issue. No fresh signal this week. Claude Mythos is still research-only with no public release update. GPT-6 “Spud” is still rumored for late 2026 with no fresh leaks.

For the full W18 context including the original Hy3 spike and the $300/month Grok 4.3 amnesiac story, see last week’s roundup.

The Receipts

The leaderboard is compressed. The prices aren’t. That’s the whole post.

Concrete numbers from the last week: the entire Arena Overall top 20 fits in 35 rating points. All seven Arena categories have a cheapskate pick at $3.50 per million output tokens or less. Three categories have a cheapskate pick that’s eight times cheaper than the leader for under 3% of the rating scale. One category, coding, has a cheapskate pick (GLM 5.1) that’s the new state of the art on SWE-Bench Pro, beating Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro at one-seventh the cost.

Anthropic charges 8x more for under 3% better. Here are the receipts.

The Cheapskate Picks methodology lives in this weekly blog post from now on. Next week we see what happens to OpenRouter rankings when Hy3’s rocket booster falls off. The week after, we see whether Google I/O makes any of this obsolete. Either way, I am not paying $25 per million output tokens for a 2% rating bump. Neither should you.
