
Sukriti Singh


This AI App Looked Ready to Ship. It Was Hiding a Critical Security Flaw.

I ran a simple VibeCode Arena duel, voted confidently based on what I saw, and then watched the evaluation metrics expose how little of the important stuff I had actually checked.

I nearly trusted an AI-generated app this week, much faster than I should have, and what bothered me afterward was how normal that trust felt in the moment.

I opened the preview, clicked around, saw that everything responded the way it was supposed to, and in under a minute I had already mentally filed it under "yeah, this seems usable." There was no hesitation. No deeper inspection. Just a quick visual pass and a growing sense that the output was probably fine.

That confidence lasted until the evaluation metrics loaded and made it painfully obvious that I had reviewed the wallpaper while ignoring the foundation.

This came from a Duel I ran on VibeCode Arena. Same prompt, two models generating side by side, blind vote before the platform reveals what is actually going on under the hood.

The prompt itself was straightforward: build a Tech Debt Tracker using HTML, CSS, and vanilla JavaScript. Single-page app. Let the user add debt items, assign severity, and track overall technical risk through a Debt-O-Meter.

Nothing about that should have been especially dramatic. It is the kind of internal developer utility AI models are usually pretty comfortable producing, which is probably why both outputs looked surprisingly polished at first glance.

Both had dark interfaces, clean card layouts, working forms, severity labels, and a visible progress indicator. In other words, they both looked finished enough to make a fast reviewer feel safe.

I voted for Gemini 2.5 Flash.

Not because the other output looked terrible, but because this one felt slightly tighter in all the ways that create instant visual confidence. The spacing was cleaner, the Debt-O-Meter interaction felt smoother, and the whole thing gave off the impression that more care had gone into it.

So I voted pretty comfortably.

What I did not realize at that point was that I had not actually voted on software quality. I had voted on presentation quality, and those are not the same thing at all.

The first evaluation panel I opened was for the output I did not vote for. Overall score: 92 percent. Still sounds respectable until you scroll into the issue breakdown.

Critical security issue: Potential Cross-Site Scripting vulnerability in script.js.

Major correctness issue: Unhandled Exception in JSON Parsing.

Major accessibility issue: text and background contrast failure.

Then eight smaller code quality flags underneath all of that.
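The JSON parsing flag is the easiest one to picture. A single-page tracker like this almost certainly persists debt items in localStorage and reads them back with JSON.parse on load; that detail is my assumption rather than something the report spelled out, and the loadDebtItems name below is mine, but the failure mode and the guard look roughly the same either way:

```javascript
// Unguarded: if the stored value was ever corrupted or half-written,
// JSON.parse throws a SyntaxError and the whole app can fail to load.
// const items = JSON.parse(localStorage.getItem("debtItems"));

// Guarded: catch the parse failure and fall back to an empty list.
function loadDebtItems() {
  try {
    return JSON.parse(localStorage.getItem("debtItems")) ?? [];
  } catch (err) {
    console.warn("Stored debt items were unreadable, starting fresh.", err);
    return [];
  }
}
```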

I remember stopping at the XSS line because there was absolutely nothing in the rendered preview to suggest the app posed a browser-side injection risk. It looked like a normal, harmless utility tool. Functional. Stable. Completely unthreatening.

That was the disturbing part.

Cross-Site Scripting is not a cosmetic warning you casually wave away. If user input is handled lazily enough, that opens the door for injected scripts to run inside another user’s browser session. Session theft, forced redirects, malicious payload execution — all the usual things that become very real very quickly once unsafe rendering gets into the mix.
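I don't have the generated script.js in front of me anymore, but the pattern that usually earns this flag in a vanilla JS app is user input dropped straight into innerHTML. Here is a rough sketch of the risky shape versus the safe one; the renderItem name and the item fields are mine, purely for illustration:

```javascript
// Risky: the debt title goes straight into markup, so a title like
// <img src=x onerror="stealSession()"> would execute when rendered.
function renderItem(item) {
  const card = document.createElement("div");
  card.innerHTML = `<h3>${item.title}</h3><span>${item.severity}</span>`;
  return card;
}

// Safer: build elements and assign user input as text, never as HTML.
function renderItemSafe(item) {
  const card = document.createElement("div");
  const title = document.createElement("h3");
  title.textContent = item.title;
  const severity = document.createElement("span");
  severity.textContent = item.severity;
  card.append(title, severity);
  return card;
}
```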

And none of that was visible from the outside.

I keep repeating that because it is the whole lesson here: the preview looked fine. Completely fine.

So naturally, I opened the metrics for the output I voted for, expecting at least a cleaner result there. It did score 93 percent overall, and Security came back at a reassuring 100, which looked good for exactly one second.

Then the rest of the breakdown loaded.

There was a blocker-level CSS Syntax Error in style.css, three separate major accessibility issues, and a much weaker Code Quality score than I had expected from the output I had just felt so sure about.

So the “better” app was not actually clean either. It was simply failing in a different set of places that the preview had done a very good job of hiding.

That somehow made the whole thing worse, because now this was no longer about one flawed output slipping through. This was about me feeling confident after checking the wrong layer entirely.
