Running LLMs on-device means fighting two constraints simultaneously: memory and latency. The KV-cache — the buffer that stores past token representations so the model does not recompute them — is often the bottleneck on both fronts.
A paper published in April 2026 by researchers at Clemson University proposes a solution: stop using the same bit-width for every token's KV cache entry, and instead assign precision dynamically based on how important each token actually is.
The paper is titled "Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs" (arXiv 2604.04722). The results on SmolLM-360M are specific: 17.75% reduction in decoding latency compared to static KV quantization, with accuracy within 0.30 points of full FP16 inference.
The Problem with Static KV Quantization
Before getting into the paper's approach, it helps to understand what is being improved.
When an LLM decodes tokens autoregressively, it must attend to every previous token in the context. The key-value pairs for those tokens are stored in the KV cache. On mobile and edge devices, this cache grows linearly with context length and can dominate both memory usage and memory bandwidth.
The standard mitigation is quantization — storing cache entries at lower bit-width (e.g., 4-bit or 8-bit instead of FP16). This reduces memory footprint proportionally and can speed up the memory-bandwidth-bound operations that dominate decoding on small devices.
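To put numbers on that, here is a quick back-of-the-envelope calculation of KV cache size at a few bit-widths. The model dimensions below are illustrative placeholders rather than SmolLM's actual configuration; the point is that the footprint scales linearly with both context length and bits per value.

```python
# Rough KV cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes per value.
# Dimensions are illustrative, not SmolLM-360M's real configuration.
def kv_cache_bytes(seq_len, n_layers=24, n_kv_heads=8, head_dim=64, bits=16):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits // 8

for bits in (16, 8, 4):
    mb = kv_cache_bytes(seq_len=4096, bits=bits) / 1e6
    print(f"{bits:>2}-bit KV cache at 4k context: {mb:.1f} MB")
# 16-bit: ~201 MB, 8-bit: ~101 MB, 4-bit: ~50 MB for these illustrative dimensions
```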
The problem is that static quantization treats every token the same. A high-information token at the start of a complex reasoning chain gets the same 4-bit treatment as a repeated filler word. The paper's core insight is that this wastes bits in the wrong places: it quantizes important tokens too aggressively while spending full precision on trivial ones.
The Approach: Token-Level Precision Control
The paper's framework has two stages: feature extraction and precision selection.
Feature extraction computes four lightweight signals per token:
- Token frequency — how common the token is in the sequence. Frequent tokens tend to carry less unique information.
- Quality score — a measure of how well the token fits the local context, derived from the hidden state.
- Attention variance — how much the attention weights for this token vary across heads. High variance suggests the token is important to multiple attention patterns.
- Entropy-based uncertainty — the entropy of the attention distribution over this token. High entropy means the model is uncertain about how to weigh this token.
These four numbers form a compact feature vector for each token in the KV cache.
Precision selection feeds that feature vector into a small data-driven controller — a learned mapping from features to bit-width. The controller chooses from {2-bit, 4-bit, 8-bit, FP16} for each token's KV cache entry.
The controller itself is small enough that running it adds negligible overhead to decoding. The real cost is in computing the features, which the paper keeps deliberately lightweight.
Reproducing the Core Concept
The paper does not release code (as of the April 2026 arXiv submission), but the mechanism is clear enough from the paper's description to implement the core concept. Here is a minimal reproduction of the feature extraction and precision selection logic:
```python
import torch
import torch.nn.functional as F


def compute_token_importance_features(
    hidden_state: torch.Tensor,       # [seq_len, hidden_dim]
    attention_weights: torch.Tensor,  # [num_heads, seq_len, seq_len]
    token_counts: dict,
    token_ids: list,
) -> torch.Tensor:
    """
    Compute the 4 token-level features from the paper.
    Returns: [seq_len, 4] feature tensor
    """
    seq_len = hidden_state.shape[0]
    features = torch.zeros(seq_len, 4)
    mean_hidden = hidden_state.mean(dim=0)  # sequence-level reference for the quality score

    for t in range(seq_len):
        # Feature 1: token frequency (normalized); frequent tokens tend to be less informative
        tok_id = token_ids[t]
        freq = token_counts.get(tok_id, 0) / max(len(token_ids), 1)
        features[t, 0] = freq

        # Feature 2: quality score, i.e. cosine similarity to the sequence mean hidden state
        quality = F.cosine_similarity(
            hidden_state[t].unsqueeze(0),
            mean_hidden.unsqueeze(0),
        ).item()
        features[t, 1] = quality

        # Feature 3: variance across heads of the attention this token receives
        attn_for_token = attention_weights[:, :, t]  # [num_heads, seq_len]
        attn_var = attn_for_token.var(dim=0).mean().item()
        features[t, 2] = attn_var

        # Feature 4: entropy of the head-averaged attention distribution for this token
        attn_dist = attention_weights[:, t, :].mean(dim=0)     # [seq_len], already a probability distribution
        attn_dist = attn_dist / attn_dist.sum().clamp(min=1e-9)  # renormalize for numerical safety
        entropy = -torch.sum(attn_dist * torch.log(attn_dist + 1e-9)).item()
        features[t, 3] = entropy

    return features


def select_kv_precision(
    features: torch.Tensor,  # [seq_len, 4]
    thresholds: tuple = (0.2, 0.5, 0.8),
) -> list:
    """
    Map token features to KV cache precision levels.
    Lower overall score -> more aggressive quantization.
    """
    # Combine features into a single importance score.
    # Weights can be learned; these are illustrative. Frequency is weighted
    # negatively because frequent tokens tend to carry less unique information.
    weights = torch.tensor([-0.15, 0.35, 0.30, 0.20])
    scores = (features * weights).sum(dim=-1)  # [seq_len]
    scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-9)

    precisions = []
    for score in scores:
        s = score.item()
        if s < thresholds[0]:
            precisions.append("int2")
        elif s < thresholds[1]:
            precisions.append("int4")
        elif s < thresholds[2]:
            precisions.append("int8")
        else:
            precisions.append("fp16")
    return precisions
```
The key behavior in this code: fp16 is reserved for the highest-importance tokens (normalized scores above 0.8), while the least important (scores below 0.2) are compressed to 2-bit. Most tokens land in the int4 and int8 ranges in between.
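To see the two functions working together, here is a toy invocation on random tensors standing in for real model outputs; the token ids and shapes are made up for illustration.

```python
# Continues from the block above (torch, F, and the two functions are already defined).
seq_len, hidden_dim, num_heads = 16, 64, 4
hidden_state = torch.randn(seq_len, hidden_dim)
attention_weights = torch.softmax(torch.randn(num_heads, seq_len, seq_len), dim=-1)
token_ids = [1, 5, 5, 9, 2, 5, 7, 1, 3, 5, 5, 2, 8, 4, 5, 6]
token_counts = {tid: token_ids.count(tid) for tid in set(token_ids)}

features = compute_token_importance_features(
    hidden_state, attention_weights, token_counts, token_ids
)
precisions = select_kv_precision(features)
print(list(zip(token_ids, precisions)))  # e.g. [(1, 'int4'), (5, 'int2'), ...]
```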
In practice, the paper's controller is a small learned network rather than a fixed threshold function, but the feature vector is the same.
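The paper does not spell out the controller's architecture or release training code, so the sketch below is only a guess at what a minimal learned selector could look like: a tiny MLP from the 4-dimensional feature vector to the four precision classes. The layer sizes and class ordering are assumptions, not the paper's.

```python
import torch
import torch.nn as nn

PRECISION_CLASSES = ["int2", "int4", "int8", "fp16"]

class PrecisionController(nn.Module):
    """Tiny MLP: 4 token features -> 4 precision classes (illustrative, not the paper's architecture)."""
    def __init__(self, n_features: int = 4, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, len(PRECISION_CLASSES)),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)  # [seq_len, 4] logits, one score per precision class

# Inference: pick the highest-scoring precision per token.
controller = PrecisionController()
# features comes from compute_token_importance_features(...)
# precisions = [PRECISION_CLASSES[i] for i in controller(features).argmax(dim=-1)]
```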
Results
The paper tests on SmolLM-135M, SmolLM-360M, and SmolLM-1.7B — the HuggingFace family of small language models designed for on-device deployment. Benchmarks include HellaSwag and other commonsense reasoning tasks.
For SmolLM-360M on HellaSwag, compared to static KV quantization:
| Metric | Static KV quant (baseline) | Adaptive (paper) | FP16 |
|---|---|---|---|
| Decoding latency | baseline | 17.75% lower | — |
| HellaSwag accuracy (vs. static) | baseline | +7.60 points | +7.90 points |
The adaptive method closes most of the accuracy gap between static quantization and full precision, while also decoding faster than static quantization, because the lowest-importance tokens are stored at even lower bit-widths and so consume less memory bandwidth.
The paper also shows that compared to FP16, the adaptive approach loses only 0.30 accuracy points on HellaSwag while delivering significantly lower latency.
Why This Matters for On-Device AI
The SmolLM family is a realistic target for on-device deployment — phones, laptops, edge inference hardware. The paper's results are on that class of model.
The practical implication is that developers targeting on-device inference do not have to choose between full precision (quality but slow) and static quantization (fast but degraded). Dynamic precision selection offers a third path that is faster than static quantization and closer to full precision in accuracy.
The technique is also composable with other KV cache optimizations. It does not require changes to the model architecture — it is applied during the decoding loop, as a post-processing step on cache writes.
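To make "a post-processing step on cache writes" concrete, the sketch below fake-quantizes each new K/V entry to the bit-width chosen for its token before storing it. This simulates mixed precision (values are rounded and scaled back to float rather than actually packed), and the function names are mine, not the paper's.

```python
import torch

def fake_quantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-tensor quantize/dequantize to `bits` (simulation only, no bit packing)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-9) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

def write_kv_entry(k: torch.Tensor, v: torch.Tensor, precision: str):
    """Quantize a single token's K/V vectors on cache write, per its selected precision."""
    bits = {"int2": 2, "int4": 4, "int8": 8}.get(precision)
    if bits is None:  # "fp16": store as-is
        return k, v
    return fake_quantize(k, bits), fake_quantize(v, bits)
```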
Limitations and What to Watch
The paper has some important caveats:
The controller requires training data. The learned precision selector must be trained on representative sequences. The paper does not release the training pipeline, so reproducing the full learned version requires writing that infrastructure from scratch.
Results are on small models. SmolLM-1.7B is the largest model tested. Whether the technique scales to 7B+ models, and whether the latency gains hold on GPU-accelerated hardware (versus mobile CPUs where memory bandwidth is the bottleneck), is not addressed in this paper.
The 17.75% latency reduction is on CPU. On devices where compute, rather than memory bandwidth, is the bottleneck, the gains would likely be smaller.
No code release. As of April 2026, the paper does not link to a public implementation. Reproducing the full system requires building from the paper's description.
Implementation Path
If you want to experiment with this approach:
- Start with SmolLM-360M from HuggingFace — the smallest model with clear results in the paper.
- Implement the feature extraction as shown above, using the model's attention outputs (available via `output_attentions=True` in HuggingFace Transformers); a minimal loading sketch follows this list.
- Train a small precision selector on a representative dataset: a simple logistic regression or small MLP mapping the 4 features to 4 precision classes.
- Evaluate on HellaSwag using the `lm-eval` library to compare against the paper's numbers.
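For the first two steps, a minimal starting point might look like the sketch below. The model id is assumed to be `HuggingFaceTB/SmolLM-360M` on the Hub, and eager attention is requested so that attention weights are actually returned; this path is slower than an optimized attention kernel, which is fine for prototyping the features but not for deployment.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model id assumed; adjust to the checkpoint you actually want to profile.
model_id = "HuggingFaceTB/SmolLM-360M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="eager")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True, output_hidden_states=True)

# Last layer's tensors, batch dim dropped, feeding the feature extractor defined earlier.
attn = out.attentions[-1][0]        # [num_heads, seq_len, seq_len]
hidden = out.hidden_states[-1][0]   # [seq_len, hidden_dim]
ids = inputs["input_ids"][0].tolist()
counts = {i: ids.count(i) for i in set(ids)}
features = compute_token_importance_features(hidden, attn, counts, ids)
```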
The paper's approach is not a drop-in library — it is a technique that requires integration into the decoding loop. But the core idea is implementable from the paper's description, and the feature vector is the right place to start.
The Paper
"Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs"
Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods, Gabriel Hillesheim, Abolfazl Razi
Clemson University — arXiv:2604.04722, April 6, 2026
https://arxiv.org/abs/2604.04722
The gap between static and adaptive KV quantization turns out to be both measurable and meaningful: a 17.75% reduction in decoding latency with near-FP16 accuracy is a real improvement on the class of hardware where on-device LLMs actually run. The technique is straightforward enough to reproduce and adapt; if you are building inference infrastructure for small models on constrained hardware, this paper is worth reading carefully.