How Much GPU Memory Does NexusQuant Actually Save?

João André Gomes Marques

KV cache compression numbers like "10x" sound impressive in a paper. But what does that mean in practice, for a real GPU, serving real users? Let me give you a concrete memory calculator so you can answer this for your own setup.

The KV Cache Formula

For any transformer, the KV cache size is:

KV_bytes = 2 × num_layers × num_heads × head_dim × seq_len × bytes_per_element

The 2 is for keys and values; bytes_per_element is 2 for FP16, 4 for FP32. For models with grouped-query attention (GQA), num_heads means the number of KV heads, not attention heads: Mistral-7B has 32 attention heads but only 8 KV heads.

For Mistral-7B (32 layers, 8 KV heads, head_dim=128, FP16):

KV_bytes = 2 × 32 × 8 × 128 × seq_len × 2
         = 131,072 × seq_len bytes
         ≈ 128 KB per token

At 128K tokens: 128,000 × 131,072 bytes ≈ 16.8 GB just for the KV cache.
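You can pull these architecture numbers straight from a model's Hugging Face config instead of hunting through papers. A minimal sketch, assuming you have transformers installed (the field names below are the standard config attributes for Llama/Mistral-style models):

from transformers import AutoConfig

# Read the layer/head geometry from the published config
cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
head_dim = cfg.hidden_size // cfg.num_attention_heads            # 4096 / 32 = 128
bytes_per_token = 2 * cfg.num_hidden_layers * cfg.num_key_value_heads * head_dim * 2
print(f"{bytes_per_token:,} bytes per cached token")             # 131,072 ≈ 128 KB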

GPU Memory Table

Here's what that means on real hardware:

| GPU | VRAM | Max KV tokens (FP16, no NQ) | With NexusQuant 10x | With NexusQuant 17x | With NexusQuant 33x |
| --- | --- | --- | --- | --- | --- |
| RTX 3090 | 24 GB | ~150K | ~1.5M | ~2.6M | ~5M |
| A10G | 24 GB | ~150K | ~1.5M | ~2.6M | ~5M |
| A100 40GB | 40 GB | ~256K | ~2.6M | ~4.4M | ~8.5M |
| A100 80GB | 80 GB | ~512K | ~5.1M | ~8.7M | ~17M |
| H100 80GB | 80 GB | ~512K | ~5.1M | ~8.7M | ~17M |

These numbers leave 8 GB of headroom for model weights and activations on 7B-class models.
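If your GPU or headroom budget isn't in the table, the relationship behind each row is simple: usable VRAM divided by per-token cost, times the compression ratio. A quick sketch (exact cutoffs depend on your headroom assumption, so these won't match the table's rounded values to the digit):

def max_kv_tokens(vram_gb: float, compression: float = 1.0,
                  headroom_gb: float = 8.0, bytes_per_token: int = 131_072) -> int:
    """Tokens whose KV cache fits in usable VRAM at a given compression ratio."""
    return int((vram_gb - headroom_gb) * 1e9 / bytes_per_token * compression)

for vram in (24, 40, 80):
    print(vram, "GB:", [max_kv_tokens(vram, c) for c in (1, 10, 17, 33)])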

Practical Scenarios

Scenario A: A10G serving 128K context

Without NexusQuant: 128,000 tokens × 131,072 bytes ≈ 16.8 GB. With only 24 GB total and ~14 GB for the 7B model's weights, you're OOM before the first token generates.

With the NexusQuant 17x preset: 16.8 / 17 ≈ 0.99 GB for KV. Fits comfortably. You can now serve 128K context on a single A10G.

Scenario B: A100 80GB, maximize throughput

A single request at 128K takes ~16.8 GB of KV. With 80 GB total and ~14 GB for weights, you have ~66 GB for KV - enough for about 4 concurrent sessions at 128K.

At 17x compression, each session uses ~1 GB of KV. You can serve ~66 concurrent sessions instead of 4. That's 16x more users per GPU.

Scenario C: Long context research

You want a 1M-token context on Mistral-7B. KV cache = 1,000,000 × 131,072 bytes ≈ 131 GB. Impossible on a single GPU.

At 17x: ~7.7 GB. Fits on a single A100 40GB alongside the model weights.
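All three scenarios are the same per-token arithmetic. Here it is in a few lines, so you can swap in your own context lengths and ratios:

PER_TOKEN_GB = 131_072 / 1e9  # Mistral-7B, FP16: ~128 KB per cached token

kv_a = 128_000 * PER_TOKEN_GB                                   # Scenario A: ~16.8 GB
print(f"A: {kv_a:.1f} GB -> {kv_a / 17:.2f} GB at 17x")
print(f"B: {int(66 / (kv_a / 17))} concurrent 128K sessions in 66 GB at 17x")
print(f"C: {1_000_000 * PER_TOKEN_GB / 17:.1f} GB for 1M tokens at 17x")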

Python Calculator

def kv_cache_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    dtype_bytes: int = 2,  # FP16
) -> float:
    """Returns KV cache size in GB."""
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes
    return total_bytes / (1024 ** 3)


def savings_report(model_name: str, num_layers: int, num_kv_heads: int,
                   head_dim: int, seq_len: int, gpu_vram_gb: float,
                   model_weights_gb: float) -> None:
    base_gb = kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len)
    available = gpu_vram_gb - model_weights_gb

    print(f"
=== {model_name} @ {seq_len:,} tokens ===")
    print(f"Base KV cache: {base_gb:.1f} GB")
    print(f"GPU available for KV (after weights): {available:.0f} GB")

    for ratio, label in [(10, "high quality"), (17, "balanced"), (33, "max")]:
        compressed_gb = base_gb / ratio
        concurrent = int(available / compressed_gb)
        fits = "YES" if base_gb / ratio < available else "NO"
        print(f"  {ratio}x ({label}): {compressed_gb:.2f} GB | fits={fits} | concurrent sessions={concurrent}")


# Mistral-7B examples
savings_report("Mistral-7B", 32, 8, 128, 128_000, 80, 14)
savings_report("Mistral-7B", 32, 8, 128, 1_000_000, 80, 14)

# Llama-3-8B
savings_report("Llama-3-8B", 32, 8, 128, 128_000, 40, 16)

# Llama-3-70B (multi-GPU: assume 160 GB total)
savings_report("Llama-3-70B", 80, 8, 128, 128_000, 160, 140)

Sample output (first three calls):

=== Mistral-7B @ 128,000 tokens ===
Base KV cache: 16.8 GB
GPU available for KV (after weights): 66 GB
  10x (high quality): 1.68 GB | fits=YES | concurrent sessions=39
  17x (balanced): 0.99 GB | fits=YES | concurrent sessions=66
  33x (max): 0.51 GB | fits=YES | concurrent sessions=129

=== Mistral-7B @ 1,000,000 tokens ===
Base KV cache: 131.1 GB
GPU available for KV (after weights): 66 GB
  10x (high quality): 13.11 GB | fits=YES | concurrent sessions=5
  17x (balanced): 7.71 GB | fits=YES | concurrent sessions=8
  33x (max): 3.97 GB | fits=YES | concurrent sessions=16

=== Llama-3-8B @ 128,000 tokens ===
Base KV cache: 16.8 GB
GPU available for KV (after weights): 24 GB
  10x (high quality): 1.68 GB | fits=YES | concurrent sessions=14
  17x (balanced): 0.99 GB | fits=YES | concurrent sessions=24
  33x (max): 0.51 GB | fits=YES | concurrent sessions=47

The Model Weight Overhead

One thing papers often gloss over: the model itself takes memory. Here are real-world numbers:

| Model | Weights (BF16/FP16) |
| --- | --- |
| TinyLlama-1.1B | 2.2 GB |
| Mistral-7B | 14 GB |
| Llama-3-8B | 16 GB |
| Llama-3-70B | 140 GB (2× A100 80GB) |
| Llama-3-405B | ~810 GB |
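The weights column is essentially parameter count × bytes per parameter (2 for BF16/FP16). A rough sketch using advertised parameter counts; real deployments add buffers, KV cache, and activation memory on top:

def weights_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Rough weight memory in GB: parameter count times bytes per parameter."""
    return params_billions * bytes_per_param

for name, params_b in [("TinyLlama-1.1B", 1.1), ("Mistral-7B", 7.2),
                       ("Llama-3-8B", 8.0), ("Llama-3-70B", 70.6)]:
    print(f"{name}: ~{weights_gb(params_b):.1f} GB")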

At 70B+, KV cache compression becomes less about making the workload fit at all and more about fitting the context. By the formula above, a 70B model (80 layers, 8 KV heads) at 128K context holds ~42 GB of KV cache, less than a third of its own weight footprint. NexusQuant is still valuable here, but the math changes - you're extending context headroom, not rescuing a deployment that would otherwise not load.

What We Measured

All numbers in NexusQuant are validated with torch.cuda.memory_allocated() before and after cache materialization. No estimate-only ratios. The paper's compression column includes all overhead: scales (FP16), lattice indices (packed uint8), and metadata. If it's not in memory, it's not in the ratio.
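If you want to reproduce that kind of check on your own model, the pattern is a before/after diff of allocated bytes. A minimal sketch (it measures everything left allocated by the call, not the KV cache in isolation):

import torch

def allocated_growth_gb(fn) -> float:
    """Run fn() and return the growth in allocated GPU memory, in GB."""
    torch.cuda.synchronize()
    torch.cuda.empty_cache()
    before = torch.cuda.memory_allocated()
    fn()
    torch.cuda.synchronize()
    return (torch.cuda.memory_allocated() - before) / 1e9

# e.g. allocated_growth_gb(lambda: model.generate(input_ids, max_new_tokens=512))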

Repo: github.com/jagmarques/nexusquant

pip install nexusquant-kv

from nexusquant import nexusquant_evict

with nexusquant_evict(model, quality="balanced"):  # 17x
    output = model.generate(input_ids, max_new_tokens=512)

Best regards, João Marques
