One AI Model Scored 99. I Still Voted for the One That Scored 95.

Claude scored higher. Llama felt better in the browser. The harder part was figuring out which one actually mattered.

One AI model scored 99. I still voted for the one that scored 95.

That should have made no sense.

The higher-scoring build was technically cleaner, passed almost every automated evaluation check, and looked like the obvious winner on paper. The lower-scoring one came back with flagged quality issues, accessibility deductions, and enough small implementation compromises that it should have been easy to dismiss.

And yet after using both side by side, I trusted the lower-scoring app more.

That contradiction ended up being the most useful part of the exercise, because it exposed something developers are going to run into increasingly often as AI-generated software becomes easier and easier to produce: “looks good,” “scores good,” and “feels right” are three different judgments, and they do not always point to the same winner.

I found that out while running a blind Claude 3 Haiku vs Llama-4-Scout coding duel on VibeCode Arena by HackerEarth using a deceptively simple prompt — build a Regex Translator.

The brief was intentionally small: one input box where a user pastes any regex pattern, one “Explain This” button, and then a plain-English explanation simple enough that even someone non-technical could understand what the expression is checking for. Frontend only. No backend logic. No technical jargon. Just a clean little utility.
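To make the brief concrete, here is roughly what it amounts to in code. This is my own minimal sketch, not either model's output; the element IDs (regex-input, explain-btn, explanation) and the plain-English wording are assumptions, and the token table only covers the most common regex constructs.

```typescript
// Minimal sketch of the brief (my own illustration, not either model's output).
// Element IDs and the plain-English wording are assumptions.

const EXPLANATIONS: Array<[RegExp, string]> = [
  [/^\^/, "the match must start at the beginning of the text"],
  [/^\$/, "the match must end at the end of the text"],
  [/^\\d/, "a digit (0-9)"],
  [/^\\w/, "a letter, digit, or underscore"],
  [/^\\s/, "a whitespace character"],
  [/^\./, "any single character"],
  [/^\*/, "repeated zero or more times"],
  [/^\+/, "repeated one or more times"],
  [/^\?/, "optional (zero or one time)"],
  [/^\[([^\]]*)\]/, "any one of these characters: $1"],
];

function explain(pattern: string): string {
  const parts: string[] = [];
  let rest = pattern;
  while (rest.length > 0) {
    const hit = EXPLANATIONS.find(([re]) => re.test(rest));
    if (hit) {
      const match = rest.match(hit[0])!;
      parts.push(hit[1].replace("$1", match[1] ?? ""));
      rest = rest.slice(match[0].length);
    } else {
      // Anything the table does not recognize is read as a literal character.
      parts.push(`the literal character "${rest[0]}"`);
      rest = rest.slice(1);
    }
  }
  // Quantifiers are described after the token they follow rather than
  // attached to it, which is good enough for a sketch.
  return parts.join(", then ");
}

// One input box, one "Explain This" button, one plain-English output.
document.querySelector("#explain-btn")?.addEventListener("click", () => {
  const input = document.querySelector<HTMLInputElement>("#regex-input");
  const output = document.querySelector("#explanation");
  if (input && output) output.textContent = explain(input.value);
});
```

Running `explain("\\d+")` through that sketch yields "a digit (0-9), then repeated one or more times", which is the level of output the prompt was asking both models to produce.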

Exactly the kind of prompt that should produce one clearly stronger build and one weaker one.

Instead, both outputs came back annoyingly close.

Visually, there was no dramatic collapse on either side. Both rendered as plausible apps. Both captured the broad task correctly. Both looked functional enough that if I had judged the duel from screenshots alone, I probably would have shrugged, picked one casually, and moved on.

That is usually where a lot of AI app evaluation stops.

Increasingly, that is becoming a problem.

What makes VibeCode Arena more interesting than a standard side-by-side AI comparison is that it does not stop at the browser preview. Both generated builds also get pushed through an automated evaluation layer that scores security, code quality, correctness, performance, and accessibility, then surfaces implementation issues that do not reveal themselves just because the UI appears to work.

That second layer changed the duel immediately.

[Image: Side-by-side VibeCode Arena comparison showing Claude 3 Haiku and Llama-4-Scout generating a Regex Translator app from the same prompt. At first glance, both generated apps looked close enough that picking a winner felt almost arbitrary.]
Claude came back with a 99 overall score — essentially clean across the board. Security was perfect. Correctness was perfect. Performance was perfect. Accessibility was untouched. Code quality had only a minor deduction.

[Image: Automated VibeCode Arena evaluation report for Claude 3 Haiku. The browser previews looked similar; the evaluation report was where the hidden differences began to show up.]
Llama landed at 95.

[Image: Automated VibeCode Arena evaluation report for Llama-4-Scout, highlighting score differences, accessibility deductions, and implementation issues.]
More importantly, the evaluator surfaced three major issues inside Llama’s implementation: unnecessary character escapes, missing labels on form fields, and insufficient text contrast that would affect accessibility compliance. None of those are catastrophic enough to break the visible preview, but they are exactly the kind of hidden compromises that slip past people when AI-generated code gets judged too quickly.
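Sketching those three categories out makes it obvious why they survive a visual check. These are hypothetical reconstructions based on the report's categories, not Llama's actual source, and the specific color values are assumptions chosen to sit on either side of the WCAG threshold.

```typescript
// Hypothetical reconstructions of the three flagged issue classes.
// The Arena report lists categories, not source, so these are my own examples.

// 1. Unnecessary character escape: a "-" at the end of a character class
//    needs no backslash, so linters flag the escaped form as useless.
const flagged = /[a-z\-]/; // "\-" is an unnecessary escape
const clean = /[a-z-]/;    // same behavior, nothing to flag

// 2. Missing label on a form field: a bare input gives screen readers
//    nothing to announce; pairing a <label> via for/id fixes it.
const unlabeled = `<input type="text" id="regex-input">`;
const labeled = `
  <label for="regex-input">Paste a regex pattern</label>
  <input type="text" id="regex-input">`;

// 3. Insufficient text contrast: light gray on white falls well below
//    the WCAG AA minimum ratio of 4.5:1 for normal-size text.
const lowContrast = "color: #aaaaaa; background: #ffffff;"; // ~2.3:1, fails AA
const okContrast = "color: #595959; background: #ffffff;";  // ~7:1, passes AA
```

All three run, render, and look fine in a preview, which is exactly why a browser-only comparison never catches them.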

So the technical answer seemed straightforward.

Claude had produced the stronger implementation.

Then I clicked through both previews again.

And this is where the neatness of the score report started to break.

Llama’s result simply felt closer to the actual utility I had in mind. The regex explanation behavior was tighter, the visible response felt more aligned with the original ask, and the app gave me the immediate “yes, this is what I wanted this tool to do” reaction faster than Claude’s did. Claude was the cleaner report. Llama was the more convincing product experience.

So the final vote did not go where the higher number was.

That does not make the automated evaluator less valuable. If anything, it makes it more valuable because it exposes the fact that AI-generated software now has to be judged across several dimensions at once. There is technical cleanliness. There is hidden implementation quality. There is accessibility. And then there is the much messier but very real question of whether the thing actually feels correct when a human uses it.

Those layers overlap.

They do not overlap perfectly.

This duel forced that mismatch into the open.

Without the preview comparison, I would not have had a practical user preference. Without the automated report, I would not have seen the implementation compromises quietly sitting inside the lower-scoring build. And without being pushed to cast a blind vote after seeing both, I probably would have lazily accepted the 99 as the answer and missed the fact that my actual product instinct was pointing elsewhere.

That is what made this more interesting than a simple “which AI model writes better code?” contest.

It became a reminder that AI coding is no longer suffering from a shortage of talent. Models can generate plausible software all day now. The more interesting bottleneck is evaluation, and most developers are still doing far less of that than they think they are.

Try VibeCode Arena and sign up here: https://vibecodearena.ai/beattheheat?page=1&pageSize=10&sortBy=responses&sortOrder=desc&utm_source=external&utm_medium=vc5&utm_campaign=beattheheat

Because if an app can look functional, score imperfectly, still feel better in use, and leave you uncertain which dimension should dominate the decision, then the hard part is no longer getting AI to build something.

It is learning how to judge what AI has built.

Try the challenge here: https://vibecodearena.ai/share/6203f289-29cd-417f-b5ed-a0ecdfdaf017
