We open our IDE and let a model running somewhere in the cloud read our entire codebase to add a null check - and track our behaviour along the way. We open Google Docs and ask Gemini to fix a typo. We fire up GPT-class models to refine a Slack message, restructure a comment, generate a thumbnail. We're going to shove AI into every single hole that has data for it to be trained on.
I'm not saying we shouldn't - that's the nature of technological progress, and there isn't much choice in the matter. But somewhere along the way we stopped asking whether the scale of the model matches the scale of the task. And the answer, more often than we'd like to admit, is no.
This isn't a doom take. We're not being replaced. We're just still in the early adoption phase, when most people don't fully grasp what AI is not and where its limits are, and indulge in a bit too much wishful thinking about it. Which means we can still shape it - like we shaped radio, then the internet, then open source. We just need to find a more natural path for this technology, before the current default ossifies into the only option.
The numbers don't support the defaults
Take Qwen3-Coder-Next: 80B total parameters but only 3B active - performing on par with models that have 10-20x more active compute, runnable on high-end consumer hardware (think a 64GB+ Apple Silicon Mac, or a beefy workstation card) instead of a datacenter rack. Go smaller still and it gets more interesting. A Qwen3-4B fine-tuned for a specific task matches a 120B+ model on that task, deployable on consumer hardware. Or take Chandra - a 5B OCR model purpose-built for PDF and image conversion that outperforms both Gemini 2.5 Flash and GPT-5 Mini on multilingual document benchmarks. Not because it's smarter. Because it's focused.
And every major model release is announced like an earth-shattering event, destined to overshadow everything before it and boost everything tenfold. Then we actually start using the thing, and we find a modest improvement - mostly specific, mostly a derivative of what the model was trained on. Take the mysterious announcement of Anthropic's Mythos, supposedly "too dangerous to release" - we don't even know yet whether it justifies the hype. Meanwhile this experimental article from Aisle already suggests small models can match or outperform it in vulnerability scans - one early experiment, but telling.
This isn't new, either. Chinchilla challenged the "bigger is always better" orthodoxy back in 2022, and since then the evidence has only stacked up - small models trained on high-quality data for a dedicated task can match or beat their much larger cousins. We just kept defaulting to the biggest available thing anyway, partly out of habit, partly because the cloud paradigm is being pushed hard by everyone with a stake in keeping us there. The headline outpaces the reality, and the reality is that for most tasks, we're already past the point of useful returns from going bigger.
A different path
There's another path, and it doesn't look like Cyberpunk 2077. It doesn't require massive H200 clusters just to prettify your CV. It leads to a more equal distribution of AI, and it doesn't try to substitute for anybody.
That path consists of small, dedicated models trained to do one or a few specific things at most. Models that are just smart enough to fulfill their purpose, and small enough to avoid creating the false impression that they're replacing anyone. This is the mass AI of the future - a true symbiosis. Or to be more precise, it's proper tool use.
Because AI is not a being. It's a simulation of one: a very cleverly engineered statistical model that's good at approximation in a way that looks like adaptability. Treating it as a being is what gets us reaching for the largest possible model every time, as if we were asking a person for help. Treating it as a tool is what lets us match the model to the task - the way you don't use a chainsaw to slice bread.
What this looks like in practice is software built AI-native from the ground up, not bolted onto with MCPs and API calls to remote giants. A document editor with small models embedded or pluggable for grammar checks, restructuring, summarization, all running locally. An OCR pipeline that just does OCR, well - paired with a small RAG model that lets you actually search and query a shelf of scanned papers or PDFs locally. A video editor with a small model that clips and tags footage on your machine. An in-game AI that runs on the player's hardware. None of these require breakthroughs - the models already exist, or could be trained without a billion-dollar cluster if there's enough data available.
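To make "embedded or pluggable" concrete, here's a minimal sketch of what a local grammar-fix plugin could look like - assuming an Ollama server on its default port and a small model already pulled; the model name is illustrative, not a recommendation:

```python
# Minimal sketch: a local "fix grammar" task for an editor plugin.
# Assumes an Ollama server on the default port and a small model pulled locally.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen2.5:3b"  # illustrative; any small instruction-tuned model works

def fix_grammar(text: str) -> str:
    """Send one narrow task to a small local model and return its output."""
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": MODEL,
            "prompt": f"Fix grammar and spelling only. Return the corrected text.\n\n{text}",
            "stream": False,  # ask for one complete response, not a token stream
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()

print(fix_grammar("Their going too the store tomorow."))
```

Everything stays on your machine: no API key, no document leaving the laptop, and the model is swappable per task.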
What's missing is the software paradigm to host them properly - and the orchestration layer to chain them together. If general AI adoption is in its early phase, small-model orchestration is in its infancy: tooling, conventions, ecosystems, all still forming. ComfyUI already lets people chain specialized image and video models into local pipelines - the closest thing we have to a working blueprint, though it's fragile and leans heavily on Python venvs. LM Studio and Ollama make running local models trivial and stable, but they're runtimes more than orchestrators. These are embryos - but they prove the paradigm works. And it's the part worth building out further.
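For a feel of what that orchestration layer could look like, here's a hedged sketch of chaining two specialized local models, each doing one job - again assuming a local Ollama runtime, with illustrative model names:

```python
# Sketch of a tiny orchestration layer: each stage is one small local model
# doing one job, and stages are chained by piping text between them.
import requests

def run_local(model: str, prompt: str) -> str:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    r.raise_for_status()
    return r.json()["response"]

# A pipeline is just an ordered list of (model, instruction) stages.
PIPELINE = [
    ("qwen2.5:3b", "Extract every date, name, and amount from this text as a bullet list:"),
    ("llama3.2:3b", "Summarize the extracted facts in two sentences:"),
]

def run_pipeline(text: str) -> str:
    for model, instruction in PIPELINE:
        text = run_local(model, f"{instruction}\n\n{text}")  # output feeds the next stage
    return text
```

ComfyUI does essentially this with a graph instead of a list. The chaining logic is trivial - what's missing is the stable, shared layer around it.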
Where the big ones still belong
Large models aren't a dead end. They're the right tool for genuinely hard, open-ended problems - complex coding across unfamiliar codebases, in-depth analysis, anything that truly requires reasoning across a wide context. The argument isn't "small models for everything." It's "stop using a trillion-parameter model to fix a typo."
The honest version of the AI future is mixed: large models where their capabilities are actually needed, and small specialized models for the long tail of focused tasks - which is most of them. Treating those two cases the same way is what's wasteful. Not the technology itself.
Why this matters
Using large models for everything is the dead end. Not because it doesn't work, but because of what it costs and where it leads. Every "fix this typo" routed through a frontier model is a small vote for the centralization of compute, the centralization of data, and the centralization of who gets to decide what AI does next. Multiply that by a billion daily prompts and you get the bubble we're currently inflating - one where the only viable AI is the kind that requires a hyperscaler to run.
The small-model path isn't just more efficient. It's more honest about what most AI tasks actually need, and it leaves room for AI to be something other than a service we rent from a handful of hyperscalers.
We can still take that path. Many of the models are already there, others are still to be explored and trained. The hardware is there. What's missing is the will to stop assuming bigger is always better - and the software to make small the new default.
Top comments (25)
The cheapest option - not using AI for a task at all - is always there, but we're lazy. Too lazy to select the perfect model for a specific task. So your idea could be delegated to a layer above the AI API: a first layer where an AI decides which model should answer.
That's one way to address this: having a router model decide which dedicated model to use. That router model can conveniently be a small one too :) The other part is that there's no mature infrastructure this router can delegate execution to yet. Maybe it can be done the way OpenRouter does it across hosted providers. I hope we'll accelerate in this direction.
The "wrong scale" framing nails it. There's a pattern emerging in production systems: teams start with GPT-4 for everything, then realize 80% of their calls are simple classification or extraction tasks that a 7B model handles fine. The economic math is brutal — a $0.03/call API for a task a local model does in 200ms for free. The real question isn't "cloud vs local" but "what's the minimum viable model for each task in your pipeline?" Most production agents should probably be running 3-4 different sized models routed by task complexity, not one giant model for everything.
I like this take because it’s not anti-AI, it’s anti-waste. Matching the tool to the job should be the default.
Solo dev on a voice AI app here, this hits a nerve. We hit the same wall around month 4: Claude or GPT-class for every voice intent, every UUID resolution, every "did the user mean tomorrow morning or this morning". Latency was fine, the bill was not.
What unblocked it: routing per task. Tiny model for intent classification and tokenized search, mid-tier (DeepSeek V4 Flash family) for multi-turn function calling chains, big model only when the user actually needs reasoning. OpenRouter as the swap layer so we can change the routing without touching app code.
The harder part isn't the routing, it's the eval. A small model that's 95% as good is a small model that's 5% catastrophic on the long tail. We had to log every degraded answer and bump the routing tier when patterns repeated. Static "small for cheap, big for hard" doesn't survive contact with users who ask weird things.
The framing in the post is right though. The default of "shove the biggest model into every hole" stops scaling well past a certain volume of requests. The interesting work is in the boring middle layer where you decide which call gets which brain.
That's very useful experience - it shows it's often not a simple replacement in loosely deterministic cases. Very cool that you went this way to optimize things. Can I ask whether you tried an in-place replacement with smaller models, or fine-tuned them first? Small models have their limits, though you can often sharpen them if there's data available.
Good question. Yes, I tested both paths.
In-place smaller model: tried Llama 3.1 8B and Qwen 2.5 7B locally via Ollama for the simpler intents (date parsing, simple "create task X" with no group context). Worked maybe 75% of the time, but the failure mode was bad. Wrong UUIDs for "remind me about Sylvie's appointment" when there are 3 Sylvies in the user's memory. The 25% failure rate destroyed trust faster than the latency win was worth, because each wrong UUID = a task created in the wrong folder, which is harder to spot than no task at all.
Fine-tuning: I haven't gone full SFT yet because I don't have enough labeled data per user (TAMSIV only launched in alpha mid-March, around 100 users actively using the voice). I started collecting (audio, intent, correct UUID) triplets via a thumbs-down feedback loop in the app, with the idea of fine-tuning a 7B once I have a couple thousand corrected pairs per intent class. That's probably 2-3 months out.
What unlocked the cost cut without sacrificing accuracy was routing per-task. DeepSeek V4 Flash for the disambiguation-heavy stuff (multi-Sylvie, fuzzy date), GPT-4o-mini for simple stuff, full Claude only on the rare ambiguous cases. The router itself is a tiny prompt that tags the difficulty class, runs on the cheapest tier, and the rest follows. CPC dropped 60% on inference without quality regression measured on a 200-sample eval set.
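Roughly the shape of that router, heavily simplified (the tags and model names here are illustrative placeholders, not my actual setup):

```python
# The cheapest model tags each request with a difficulty class; the tag picks the tier.
import requests

TIERS = {"simple": "small-model", "ambiguous": "mid-model", "hard": "large-model"}

def classify_difficulty(user_request: str) -> str:
    prompt = (
        "Tag this request as exactly one of: simple, ambiguous, hard.\n"
        f"Request: {user_request}\nTag:"
    )
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "qwen2.5:3b", "prompt": prompt, "stream": False},
        timeout=30,
    )
    r.raise_for_status()
    tag = r.json()["response"].strip().lower()
    return tag if tag in TIERS else "hard"  # fail closed: unknown tags escalate

def pick_model(user_request: str) -> str:
    return TIERS[classify_difficulty(user_request)]
```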
Curious about your "loosely deterministic" framing. Is that a term from your team or borrowed? It captures something I've struggled to name when explaining why some tasks feel safe to downsize and others don't.
Thanks for the insight! Yeah, sounds like routing is the right way to go.
I think fine-tuning could help with small-model issues, but there are also some fundamental limits to the capacity small models can work with - though again, these limits can be stretched if you have an abundance of data. In cases where data is lacking, distillation from large models could help - I once had a semi-synthetic dataset for parsing and structuring CVs that worked really well for an 8B model, but I'm not sure if this is applicable in your case.
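For illustration, the distillation loop is roughly this shape - a sketch, with the large-model labeling call left as a placeholder for whatever API you'd use:

```python
# Use a large model once, offline, to label raw examples; then fine-tune a
# small model on the resulting (prompt, completion) pairs.
import json

def big_model_label(raw_cv_text: str) -> str:
    # Placeholder: a single call to a large hosted model that returns
    # the structured version of the CV (e.g. JSON fields).
    raise NotImplementedError

def build_dataset(raw_examples: list[str], out_path: str) -> None:
    with open(out_path, "w") as f:
        for raw in raw_examples:
            label = big_model_label(raw)
            # JSONL in the instruction-tuning format most SFT toolchains accept.
            f.write(json.dumps({"prompt": raw, "completion": label}) + "\n")
```

You spot-check the synthetic labels, then run standard SFT on the small model - the large model's cost is paid once at dataset-build time, not per request.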
As for "loosely deterministic" framing it's just a term I used to describe the complexity, entropy of the environment model needs to work in.
If you think you’re “asking a smart assistant,” you reach for the smartest one.
If you think you’re “invoking a tool,” you pick the right tool.
Most users are still in the “assistant” mental model.
There's a real tension here between scale and trust surface. Sending your entire codebase to a cloud model to add a null check is the kind of thing that would fail a security review at any company handling sensitive data — and yet that's the default workflow most people adopt without thinking. At Lakaut we work with identity validation and digital signatures, so the 'what exactly goes into the context window' question became a hard constraint, not a nice-to-have. Local or sandboxed models aren't just a performance choice at that point.
The chainsaw to slice bread analogy is the most honest framing of this I've seen. We didn't end up here because anyone made a deliberate decision to over-engineer everything. We ended up here because the cloud paradigm is being pushed hard by everyone with a financial stake in keeping us there, and 'using the biggest available model' became the default before most people thought to question it.
A 5B model, purpose-built for one task, outperforming GPT-class generalists on that task isn't a surprise if you think about it clearly. It's what happens when you match the tool to the job. We just stopped doing that somewhere along the way. The missing piece you're pointing at, the orchestration layer for small-model pipelines, is genuinely the most compelling engineering problem right now. ComfyUI proved the paradigm works for image and video. The equivalent for language tasks, something stable, composable, and not held together by Python, is still waiting to be built. I think that feels like the actual frontier, not the next 100B-parameter announcement.
This resonates a lot. Building AI tools for financial data analysis, I keep seeing the same pattern — teams defaulting to frontier models for tasks a fine-tuned 7B would handle better and cheaper. The "AI as a being" vs "AI as a tool" framing is spot on. Right-sizing models to tasks is probably the most underrated skill in the field right now.
Yeah, that's the thing - you don't need to chase the latest, largest models. You can stick with a small one, properly fine-tuned, and you possibly wouldn't need to touch it for years.
Same here. Once people see AI as a thinking amplifier instead of a shortcut machine, the outputs get a lot better.
The weird part is that this lesson is simple, but most people only learn it after wasting a lot of time on shallow prompts.
The article nails the AI scale problem. In health, generic models lack cultural depth. A US-trained AI, despite its size, won't reliably interpret 'kaaichal' (Tamil for fever) in an Ayurvedic context. Its board forbids "desi ilaaj" (traditional remedies) cross-verification. This is a structural moat.
True utility demands culturally relevant data focus, not just raw parameters. I'm building GoDavaii to tackle these deep contextual challenges.
we hit this with agent orchestration - running heavy models for tasks a small classifier could handle. latency and cost stack fast. the tricky part is most teams don’t notice until they’re scaling.