Running local LLMs is surprisingly easy to get wrong.
Just because a model technically fits into VRAM does not mean the setup is actually usable. Context length, KV cache growth, runtime overhead, quantization, and coding-agent workflows can completely change the hardware requirements.
That is why I built the Local AI VRAM Calculator & GPU Planner (Beta).
The planner helps estimate:
- local LLM VRAM requirements
- GPU fit for different models
- Apple Silicon viability
- coding-agent workloads
- context-length scaling
- CPU-only inference tradeoffs
After more real-world testing, I updated the tool to better handle coding agents, Apple Silicon systems, and long-context workloads.
Coding Models vs Coding Agents
One thing I realized while dogfooding the planner is that “Coding” and “Coding Agent” are completely different workloads.
A lightweight coding assistant can get away with much smaller models and shorter context windows. Coding agents running through workflows like OpenCode, Claude Code, or Codex-style harnesses are much more demanding.
Once you start introducing:
- tool calls
- agent loops
- repository-wide reasoning
- long-context sessions
- structured outputs
the model requirements change pretty dramatically.
Some models that feel perfectly usable for autocomplete or small coding tasks become frustrating very quickly in agent-style workflows.
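To make that concrete, here is a minimal back-of-the-envelope sketch of how an agent session's context grows across tool-call rounds. The token counts are assumptions I picked for illustration, not measurements from any particular harness.

```python
# Illustrative only: rough context growth across agent tool-call rounds.
# All token counts below are assumed for the sake of the example.

SYSTEM_PROMPT_TOKENS = 2_000    # harness instructions and tool schemas
FILE_CONTEXT_TOKENS = 6_000     # repository snippets pulled in for the task
TOKENS_PER_TOOL_ROUND = 1_500   # tool call + tool result + model reasoning

def agent_context_tokens(tool_rounds: int) -> int:
    """Estimate prompt size after a number of tool-call rounds."""
    return SYSTEM_PROMPT_TOKENS + FILE_CONTEXT_TOKENS + tool_rounds * TOKENS_PER_TOOL_ROUND

for rounds in (1, 10, 30):
    print(f"{rounds:>2} tool rounds ≈ {agent_context_tokens(rounds):,} tokens of context")
# 1 round stays under 10K tokens; 30 rounds is already past 50K,
# which is where KV cache growth starts to dominate memory use.
```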
That distinction is now reflected in the planner recommendations.
Apple Silicon vs NVIDIA vs CPU-Only
The original version of the planner mostly assumed a desktop discrete GPU setup.
That turned out to be too limiting.
The planner now supports switching between:
- Discrete GPU
- Apple Silicon
- No GPU
Those environments behave very differently in practice.
Apple Silicon systems benefit from unified memory, which changes how memory pressure and model loading behave. CPU-only inference has very different latency and usability constraints. Discrete GPUs still dominate larger local inference workloads, especially for coding agents and long-context reasoning.
Separating those compute types made the recommendations much more realistic.
How Much VRAM Do You Actually Need for Local LLMs?
This is still the question people search for the most, and the answer is more complicated than it should be.
A quantized 7B model running at 8K context behaves very differently from a coding model running at 128K context with large KV cache growth.
That is why the planner breaks estimates into:
- model weights
- KV cache
- runtime overhead
- estimated total VRAM
In practice, context length is where many local setups start breaking down.
A model may technically fit while still becoming slow, unstable, or frustrating to use.
That becomes especially noticeable with coding agents, tool use, and larger repositories.
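As a rough sketch, here is how those pieces can be added up. The architecture constants assume a generic Llama-2-7B-style model with full multi-head attention (newer models with grouped-query attention carry a much smaller KV cache), and the bytes-per-parameter figure for Q4-style quantization is an approximation, so treat the output as a ballpark rather than a benchmark.

```python
# Minimal sketch of the weights + KV cache + overhead breakdown.
# Architecture constants assume a Llama-2-7B-style model (32 layers,
# 32 KV heads, head_dim 128, fp16 cache); GQA models shrink the KV cache a lot.

GiB = 1024 ** 3

def weight_bytes(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param

def kv_cache_bytes(context_len: int, n_layers: int = 32, n_kv_heads: int = 32,
                   head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    # 2 = one K tensor and one V tensor per layer, per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

def estimate_total_gib(params_billion: float, bytes_per_param: float,
                       context_len: int, overhead_gib: float = 1.5) -> float:
    total = weight_bytes(params_billion, bytes_per_param) + kv_cache_bytes(context_len)
    return total / GiB + overhead_gib

# A Q4-ish 7B (≈0.6 bytes/param) at 8K context vs the same model at 128K:
print(f"7B @   8K ≈ {estimate_total_gib(7, 0.6, 8_192):5.1f} GiB")    # ≈  9.4 GiB
print(f"7B @ 128K ≈ {estimate_total_gib(7, 0.6, 131_072):5.1f} GiB")  # ≈ 69.4 GiB
```

The weights barely change between the two runs; the KV cache grows from roughly 4 GiB to roughly 64 GiB, which is exactly the "technically fits but unusable" failure mode.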
Why Most VRAM Calculators Feel Wrong
Most VRAM calculators treat local inference as if it were just static model weights loaded into memory.
That is only part of the story.
The actual experience depends heavily on:
- context length
- quantization
- runtime backend
- memory bandwidth
- offloading strategy
- KV cache growth
- storage speed
Two systems with similar VRAM can behave completely differently depending on the workload.
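One rough illustration of why: for batch-1 decoding, a common rule of thumb is that throughput is bounded by how fast the weights (and the growing KV cache) can be streamed from memory for each generated token, so memory bandwidth often matters more than raw VRAM capacity. The numbers below are hypothetical and ignore compute, overlap, and prompt processing.

```python
# Rough upper-bound heuristic, not a benchmark: batch-1 decode is usually
# memory-bound, so tokens/sec ≲ bandwidth / bytes read per generated token.

def decode_tok_per_sec_upper_bound(bandwidth_gb_s: float,
                                   model_gb: float,
                                   kv_cache_gb: float = 0.0) -> float:
    return bandwidth_gb_s / (model_gb + kv_cache_gb)

# Hypothetical: the same ~4 GB quantized 7B model on a ~1000 GB/s discrete GPU
# versus a ~100 GB/s unified-memory machine, both with plenty of free memory.
print(f"~1000 GB/s GPU : ≤ {decode_tok_per_sec_upper_bound(1000, 4):.0f} tok/s")
print(f"~100 GB/s SoC  : ≤ {decode_tok_per_sec_upper_bound(100, 4):.0f} tok/s")
```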
That is why I stopped trying to make the planner behave like a benchmark.
It works better as a planning tool that helps visualize constraints before buying hardware or wasting time debugging unrealistic local AI setups.
Try the Updated Planner
If you are trying to figure out:
- how much VRAM you need for local LLMs
- whether your GPU can run a model
- whether Apple Silicon is viable for local AI
- what models work best for coding agents
- how context length affects VRAM usage
then you can try the updated tool here:
Local AI VRAM Calculator & GPU Planner (Beta)
The estimates are still heuristic in places, but they are much closer to real-world local inference behavior than the original version.
Top comments (5)
Quantization and context length are the two levers everyone underestimates. I've seen people buy a $2000 GPU for a 70B at FP16 when Q4_K_M on a 3090 would have been fine for their actual use case. Your breakdown makes that painfully visible.
Great article! Really appreciate how you explained the concepts in a simple and structured way. This was both insightful and easy to follow. Thanks for sharing 🙌
Appreciate the support! My goal was to make things comprehensive, but not too esoteric. Glad to hear it resonated with you.
What quantization level does the tool assume by default? Q4_K_M or Q5? I've noticed Q8 is basically placebo for most workflows - double the VRAM for maybe 2% better perplexity. Might be worth adding a footnote about diminishing returns.
It assumes Q4_K_M by default. And thank you for the footnote suggestion! I'm planning some updates in the near future, so I'll try to get that added, too.