Running local LLMs is surprisingly easy to get wrong.
Just because a model technically fits into VRAM does not mean the setup is actually usable. Context length, KV cache growth, runtime overhead, quantization, and coding-agent workflows can completely change the hardware requirements.
That is why I built the Local AI VRAM Calculator & GPU Planner (Beta).
The planner helps estimate:
- local LLM VRAM requirements
- GPU fit for different models
- Apple Silicon viability
- coding-agent workloads
- context-length scaling
- CPU-only inference tradeoffs
After more real-world testing, I updated the tool to better handle coding agents, Apple Silicon systems, and long-context workloads.
Coding Models vs Coding Agents
One thing I realized while dogfooding the planner is that “Coding” and “Coding Agent” are completely different workloads.
A lightweight coding assistant can get away with much smaller models and shorter context windows. Coding agents running through workflows like OpenCode, Claude Code, or Codex-style harnesses are much more demanding.
Once you start introducing:
- tool calls
- agent loops
- repository-wide reasoning
- long-context sessions
- structured outputs
the model requirements change pretty dramatically.
Some models that feel perfectly usable for autocomplete or small coding tasks become frustrating very quickly in agent-style workflows.
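To make that concrete, here is a minimal back-of-the-envelope sketch of how an agent session's context grows across tool-call rounds. The token counts are assumptions I picked for illustration, not measurements from any particular harness.

```python
# Illustrative only: rough context growth across agent tool-call rounds.
# All token counts below are assumed for the sake of the example.

SYSTEM_PROMPT_TOKENS = 2_000    # harness instructions and tool schemas
FILE_CONTEXT_TOKENS = 6_000     # repository snippets pulled in for the task
TOKENS_PER_TOOL_ROUND = 1_500   # tool call + tool result + model reasoning

def agent_context_tokens(tool_rounds: int) -> int:
    """Estimate prompt size after a number of tool-call rounds."""
    return SYSTEM_PROMPT_TOKENS + FILE_CONTEXT_TOKENS + tool_rounds * TOKENS_PER_TOOL_ROUND

for rounds in (1, 10, 30):
    print(f"{rounds:>2} tool rounds ≈ {agent_context_tokens(rounds):,} tokens of context")
# 1 round stays under 10K tokens; 30 rounds is already past 50K,
# which is where KV cache growth starts to dominate memory use.
```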
That distinction is now reflected in the planner recommendations.
Apple Silicon vs NVIDIA vs CPU-Only
The original version of the planner mostly assumed a desktop discrete GPU setup.
That turned out to be too limiting.
The planner now supports switching between:
- Discrete GPU
- Apple Silicon
- No GPU
Those environments behave very differently in practice.
Apple Silicon systems benefit from unified memory, which changes how memory pressure and model loading behave. CPU-only inference has very different latency and usability constraints. Discrete GPUs still dominate larger local inference workloads, especially for coding agents and long-context reasoning.
Separating those compute types made the recommendations much more realistic.
How Much VRAM Do You Actually Need for Local LLMs?
This is still the question people search for the most, and the answer is more complicated than it should be.
A quantized 7B model running at 8K context behaves very differently from a coding model running at 128K context with large KV cache growth.
That is why the planner breaks estimates into:
- model weights
- KV cache
- runtime overhead
- estimated total VRAM
In practice, context length is where many local setups start breaking down.
A model may technically fit while still becoming slow, unstable, or frustrating to use.
That becomes especially noticeable with coding agents, tool use, and larger repositories.
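As a rough sketch, here is how those pieces can be added up. The architecture constants assume a generic Llama-2-7B-style model with full multi-head attention (newer models with grouped-query attention carry a much smaller KV cache), and the bytes-per-parameter figure for Q4-style quantization is an approximation, so treat the output as a ballpark rather than a benchmark.

```python
# Minimal sketch of the weights + KV cache + overhead breakdown.
# Architecture constants assume a Llama-2-7B-style model (32 layers,
# 32 KV heads, head_dim 128, fp16 cache); GQA models shrink the KV cache a lot.

GiB = 1024 ** 3

def weight_bytes(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param

def kv_cache_bytes(context_len: int, n_layers: int = 32, n_kv_heads: int = 32,
                   head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    # 2 = one K tensor and one V tensor per layer, per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

def estimate_total_gib(params_billion: float, bytes_per_param: float,
                       context_len: int, overhead_gib: float = 1.5) -> float:
    total = weight_bytes(params_billion, bytes_per_param) + kv_cache_bytes(context_len)
    return total / GiB + overhead_gib

# A Q4-ish 7B (≈0.6 bytes/param) at 8K context vs the same model at 128K:
print(f"7B @   8K ≈ {estimate_total_gib(7, 0.6, 8_192):5.1f} GiB")    # ≈  9.4 GiB
print(f"7B @ 128K ≈ {estimate_total_gib(7, 0.6, 131_072):5.1f} GiB")  # ≈ 69.4 GiB
```

The weights barely change between the two runs; the KV cache grows from roughly 4 GiB to roughly 64 GiB, which is exactly the "technically fits but unusable" failure mode.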
Why Most VRAM Calculators Feel Wrong
Most VRAM calculators treat local inference as if it were just static model weights loaded into memory.
That is only part of the story.
The actual experience depends heavily on:
- context length
- quantization
- runtime backend
- memory bandwidth
- offloading strategy
- KV cache growth
- storage speed
Two systems with similar VRAM can behave completely differently depending on the workload.
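One rough illustration of why: for batch-1 decoding, a common rule of thumb is that throughput is bounded by how fast the weights (and the growing KV cache) can be streamed from memory for each generated token, so memory bandwidth often matters more than raw VRAM capacity. The numbers below are hypothetical and ignore compute, overlap, and prompt processing.

```python
# Rough upper-bound heuristic, not a benchmark: batch-1 decode is usually
# memory-bound, so tokens/sec ≲ bandwidth / bytes read per generated token.

def decode_tok_per_sec_upper_bound(bandwidth_gb_s: float,
                                   model_gb: float,
                                   kv_cache_gb: float = 0.0) -> float:
    return bandwidth_gb_s / (model_gb + kv_cache_gb)

# Hypothetical: the same ~4 GB quantized 7B model on a ~1000 GB/s discrete GPU
# versus a ~100 GB/s unified-memory machine, both with plenty of free memory.
print(f"~1000 GB/s GPU : ≤ {decode_tok_per_sec_upper_bound(1000, 4):.0f} tok/s")
print(f"~100 GB/s SoC  : ≤ {decode_tok_per_sec_upper_bound(100, 4):.0f} tok/s")
```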
That is why I stopped trying to make the planner behave like a benchmark.
It works better as a planning tool that helps visualize constraints before buying hardware or wasting time debugging unrealistic local AI setups.
Try the Updated Planner
If you are trying to figure out:
- how much VRAM you need for local LLMs
- whether your GPU can run a model
- whether Apple Silicon is viable for local AI
- what models work best for coding agents
- how context length affects VRAM usage
then you can try the updated tool here:
Local AI VRAM Calculator & GPU Planner (Beta)
The estimates are still heuristic in places, but they are much closer to real-world local inference behavior than the original version.
Top comments (5)
Quantization and context length are the two levers everyone underestimates. I've seen people buy a $2000 GPU for a 70B at FP16 when Q4_K_M on a 3090 would have been fine for their actual use case. Your breakdown makes that painfully visible.
Great article! Really appreciate how you explained the concepts in a simple and structured way. This was both insightful and easy to follow. Thanks for sharing 🙌
Appreciate the support! My goal was to make things comprehensive, but not too esoteric. Glad to hear it resonated with you.
What quantization level does the tool assume by default? Q4_K_M or Q5? I've noticed Q8 is basically placebo for most workflows - double the VRAM for maybe 2% better perplexity. Might be worth adding a footnote about diminishing returns.
It assumes Q4_K_M by default. And thank you for the footnote suggestion! I'm planning some updates in the near future, so I'll try to get that added, too.