Alankrit Verma
A Smaller KV Cache Did Not Make Transformers Faster

Long-context generation makes the KV cache hard to ignore.

I wanted to answer one question:

Why can a KV cache become much smaller while generation gets slower?

The short answer:

storage compression and attention execution are different problems.

TL;DR

  • I measured KV-cache compression as a systems problem, not just a storage problem.
  • quanto cut the cache footprint from 50.911 MiB to 0.913 MiB, but generation latency increased from 2.250 s to 3.912 s.
  • That result was useful: it separated storage compression from execution compression.
  • The rest of the work followed from that distinction. If attention still consumes dense tensors, smaller cache storage alone will not make decode faster.

Evidence

I put the detailed benchmark notes in a public evidence repo.

The first trap: storage is not execution

The first hypothesis sounded reasonable:

fewer cache bytes should mean faster generation.

But that bundles two different claims together:

  1. The cache stores fewer bytes.
  2. The attention step does less work.

Those are not the same claim.

The better engineering question was:

does the attention hot path consume less work, or did I only store the same work in a smaller format?

That is the lens for the rest of the post.

Every generated token reuses keys and values from previous tokens. As the context grows, those cached tensors grow with it. So the natural first idea is simple:

Compress the KV cache, store fewer bytes, and get faster generation.

I tested that idea while exploring TurboQuant-style cache compression in a Hugging Face transformers fork.

Important scope note:

This is not a claim that the official TurboQuant research idea "does not work."

The external context is the official TurboQuant research. What I tested was narrower:

Can I make a TurboQuant-style compressed-attention path useful inside a local eager transformers implementation?

The first useful result was not that a particular backend won.

It was this:

Storage compression and attention execution are different problems.

A cache can become dramatically smaller while generation gets slower.

That single distinction changed the rest of the project.

The mental model

In decoder-only generation, each new token uses cached keys and values from previous tokens.

Simplified for one attention head:

```
a = softmax(q K^T)
o = a V
```

Where:

  • q is the current query.
  • K is the historical key cache.
  • V is the historical value cache.
  • a is the attention distribution.
  • o is the output contribution from history.

Keys decide where to attend. Values provide the information that gets mixed.
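In code, one decode step for a single head looks roughly like this. It is an illustrative NumPy sketch, not the transformers implementation: no batching, no 1/sqrt(d) scaling, and the `decode_step` name is mine.

```python
import numpy as np

def decode_step(q, K_cache, V_cache, k_new, v_new):
    """One cached-attention decode step for a single head (scaling omitted)."""
    # The cache grows by one row per generated token.
    K = np.vstack([K_cache, k_new])
    V = np.vstack([V_cache, v_new])

    # a = softmax(q K^T): one dot product per historical token.
    logits = K @ q
    a = np.exp(logits - logits.max())
    a /= a.sum()

    # o = a V: mix cached values with the attention weights.
    return a @ V, K, V

rng = np.random.default_rng(0)
d, T = 64, 128
q, k_new, v_new = rng.standard_normal((3, d))
K0, V0 = rng.standard_normal((2, T, d))
o, K, V = decode_step(q, K0, V0, k_new, v_new)
```

Both the `K @ q` dot products and the `a @ V` mix walk the full history every step, which is why each half matters separately later.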

When context length grows, both K and V grow.

So compression can target at least two different things:

  1. Store the cache in fewer bytes.
  2. Execute attention without reconstructing dense historical tensors.

Those sound related. In practice, they are different engineering targets.

The first measurement

I started with existing cache behavior in transformers.

The baselines were:

  • DynamicCache: dense eager execution.
  • quanto: a strong storage-compression baseline.
  • hqq: another quantized-cache baseline.

The benchmark below used HuggingFaceTB/SmolLM2-135M-Instruct in a roughly 2048-token context generation case.

I measured more than just stored bytes:

  • generation latency
  • stored cache footprint
  • cache bytes per token
  • sampled runtime memory
  • whether generated outputs matched the dense baseline in simple cases
| Backend | What It Represents | Mean Latency | Cache Footprint | Cache Bytes / Token | Runtime Peak Delta |
|---|---|---|---|---|---|
| dynamic | dense eager baseline | 2.250 s | 50.911 MiB | 23040.0 | 0.102 GB |
| quanto | strong storage-compression baseline | 3.912 s | 0.913 MiB | 413.3 | 0.048 GB |
| hqq | alternative quantized-cache baseline | 9.770 s | 19.133 MiB | 8658.6 | 0.040 GB |
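As a sanity check, the dense bytes-per-token figure falls straight out of the model shape. Assuming SmolLM2-135M's config values (30 layers, 3 KV heads under grouped-query attention, head dimension 64; worth double-checking against the actual config) and fp16 storage, a two-line calculation reproduces the 23040.0 in the dynamic row:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes):
    # Keys and values each store n_kv_heads * head_dim elements per layer,
    # hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Assumed SmolLM2-135M shape: 30 layers, 3 KV heads, head_dim 64, fp16 (2 bytes).
dense_bytes = kv_bytes_per_token(n_layers=30, n_kv_heads=3, head_dim=64, dtype_bytes=2)
print(dense_bytes)  # 23040
```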

The important row is quanto.

It reduced stored cache footprint from:

```
50.911 MiB -> 0.913 MiB
```

That is an excellent cache-size result.

But latency went from:

```
2.250 s -> 3.912 s
```

So cache storage got much smaller, while generation got slower.

That is not a paradox. It shows what the backend is optimizing.

Why smaller storage did not mean faster attention

The current generic quantized-cache shape in transformers is roughly:

  1. Produce new dense keys and values.
  2. Quantize them for storage.
  3. Keep compressed tensors in the cache.
  4. Later dequantize cached tensors.
  5. Return dense keys and values to normal attention.

So the attention implementation still consumes dense tensors.

That means the architecture is:

compressed storage + dense execution

not:

compressed attention

Storage compression versus execution compression

The first design can save cache bytes.

The second design is needed if the goal is to make attention itself faster.

This distinction became the first real output of the project.

Why I still looked at compressed attention

TurboQuant-style work was interesting because the bigger promise is not simply "store the KV cache with fewer bits."

The stronger target is:

  • store historical keys in a compressed representation
  • compute attention logits using that compressed representation
  • avoid reconstructing every dense historical key each decode step

The ordinary dense key path computes:

```
logits_t = q . k_t
```

for every historical token t.

The compressed-key target is closer to:

```
logits_t ~= compressed_dot(q, code(k_t), residual(k_t))
```

without materializing every full k_t.
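One concrete way to get that shape is product-quantization-style asymmetric computation: store each key as small integer codes, build one tiny table of query-centroid dot products per query, and read the logits out of that table, with a residual term as the correction path. This is an illustrative sketch of the idea, not the TurboQuant algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, M, C = 256, 64, 8, 16  # history length, head dim, subspaces, codes
ds = d // M                  # dimensions per subspace

K = rng.standard_normal((T, d)).astype(np.float32)  # historical keys
q = rng.standard_normal(d).astype(np.float32)       # current query

# "Train" codebooks: C random key subvectors per subspace.
# (A real system would run k-means; random centroids keep the sketch short.)
codebooks = (K[rng.choice(T, C, replace=False)]
             .reshape(C, M, ds).transpose(1, 0, 2))  # (M, C, ds)

# Encode: code(k_t) = nearest centroid index in each subspace.
Ksub = K.reshape(T, M, ds)
codes = np.empty((T, M), dtype=np.int64)
for m in range(M):
    dists = ((Ksub[:, m, None, :] - codebooks[m][None]) ** 2).sum(-1)
    codes[:, m] = dists.argmin(1)

# residual(k_t) = what the codes failed to capture.
recon = np.stack([codebooks[m][codes[:, m]] for m in range(M)], axis=1)
residual = Ksub - recon

# Per-query work: one small (M, C) table of q-subvector/centroid dots.
qsub = q.reshape(M, ds)
table = np.einsum('ms,mcs->mc', qsub, codebooks)

# Logits come from table lookups, so no dense k_t is rebuilt here...
logits_code = table[np.arange(M), codes].sum(axis=1)
# ...and the residual term supplies the correction.
logits = logits_code + np.einsum('ms,tms->t', qsub, residual)
```

The point of the sketch is the split: the code term is a lookup per token, and all the remaining cost lives in how the residual correction is represented.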

That is an execution-path change.

It requires a different shape than a normal storage-only QuantizedCache backend.

That is why the project became less about "add another cache backend" and more about "change what attention actually consumes."

The stable compressed-key baseline

I built a stable compressed-key baseline to test that direction.

Internally, I called it reference. For a public reader, the better name is:

the stable compressed-key baseline

Its job was not to be the final optimized system. Its job was to prove that an end-to-end compressed-key attention path could exist in a Llama-style eager stack and provide a consistent comparison point for later experiments.

It kept:

  • compressed historical keys
  • compressed-key attention-logit computation
  • residual correction behavior
  • a full value path so correctness and fidelity stayed interpretable

That baseline survived the project better than the later value-path experiments.

The key lesson was:

The compressed-key path was not where most failures came from.

The failures came from values.

I also saw some directional evidence that compressed-key work might become more interesting as model/context size changes. But that evidence was not clean enough to be the headline result. The safe claim was narrower:

keep the compressed-key baseline as an internal anchor, but do not call it the final system.

Why values became the hard part

Attention has two major pieces:

  1. Compute attention weights from keys.
  2. Mix values using those weights.

Even if keys are compressed, the output still requires:

```
o = sum_t a_t v_t
```

If the implementation still reconstructs or processes values across most of history, the value path remains expensive.

That is exactly what happened.
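A back-of-envelope sketch makes the cost concrete: however cheaply the attention weights were produced, the mixing step still streams every historical value row. The sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4096, 64                         # history length, head dim (illustrative)
a = rng.random(T).astype(np.float32)
a /= a.sum()                            # weights, however the key path made them
V = rng.standard_normal((T, d)).astype(np.float32)

# The mix touches every historical value row, independent of the key path.
o = a @ V

# Per decode token, this step alone costs roughly:
flops = 2 * T * d        # one multiply-add over the whole history
bytes_read = V.nbytes    # every cached value byte is streamed
```

That linear-in-history cost is exactly what a compressed key path does not remove.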

The project shifted from:

Can I compress the cache?

to:

Can I keep the compressed-key path and make historical value participation structurally cheaper?

That question led to the second half of the work: multiple value-path approximations, most of which failed.

What I learned

This is the architecture lesson that shaped the rest of the work.

I learned:

  • Existing quantized cache backends can be very good at reducing stored cache footprint.
  • Stored-cache size is not the same as runtime attention cost.
  • Dense eager execution is a serious baseline because it has a simple hot path.
  • TurboQuant-style compressed-key attention is a different target from storage-only cache compression.
  • The stable compressed-key path was useful enough to keep as an internal baseline.
  • The next bottleneck was historical value mixing.

That is where the next technical question came from:

Can I make historical value mixing cheaper without destroying quality?

That question is more brutal than cache compression, because it is no longer enough to store fewer bytes. The compressed representation also has to be cheap to use.

Scope

These measurements came from one local fork, one benchmark setup, and a small-model-first workflow. The goal was not to claim universal results for every model and GPU.

The goal was to answer a systems question:

Am I actually reducing attention execution cost, or only cache storage?

For this phase, the answer was clear.

I had reduced storage.

I had not yet won execution.

This distinction changed the next question. Once the key path had a stable compressed baseline, the remaining bottleneck was not:

can I store fewer bytes?

It was:

can I mix historical values cheaply enough without breaking quality?
