---
title: "KV Cache Quantization for On-Device Android LLM Inference"
published: true
description: "A hands-on guide to fitting a 7B LLM into 4GB Android RAM using INT4 KV cache quantization, sliding window eviction, and ashmem memory mapping."
tags: android, kotlin, mobile, architecture
canonical_url: https://mvpfactory.co/blog/kv-cache-quantization-on-device-android-llm-inference
---
## What We Are Building
By the end of this tutorial, you will understand how to run a 7B parameter LLM on a 4GB Android device without getting OOM-killed. We will walk through three techniques that work together: quantizing attention key-value caches from FP16 to INT4, implementing a sliding window eviction policy with anchor tokens, and using Android-specific `ashmem` memory mapping with `madvise` hints to keep your app's memory footprint safe.
Let me show you a pattern I use in every project that involves on-device inference. This is the memory architecture that separates apps that ship from apps that crash after 30 seconds.
## Prerequisites
- Familiarity with transformer attention and KV caches
- A working Android project with NDK support (for native memory management)
- Basic understanding of Android memory management (`PSS`, `LowMemoryKiller`)
## Step 1: Understand the KV Cache Problem
Every transformer layer maintains key and value tensors for each generated token. For a 7B model with 32 layers and 32 attention heads at a head dimension of 128, a single token's KV cache in FP16 costs:
2 (K+V) × 32 layers × 32 heads × 128 dim × 2 bytes = 524,288 bytes ≈ 0.5 MB/token
At a 2048-token context window, that is 1 GB of KV cache alone — before model weights even load. On a device with 4GB total RAM and maybe 2GB available to your app, that is a dead end. We need to compress this aggressively.
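To make the arithmetic concrete, here is a quick sketch of that calculation in Kotlin. The constants default to the 7B configuration above; swap in your own model's numbers.

```kotlin
// Per-token KV cache cost in bytes for FP16 storage: one K and one V tensor per layer
fun kvBytesPerToken(layers: Int = 32, heads: Int = 32, headDim: Int = 128, bytesPerElem: Int = 2): Long =
    2L * layers * heads * headDim * bytesPerElem

fun main() {
    val perToken = kvBytesPerToken()      // 524,288 bytes ≈ 0.5 MB
    val fullWindow = perToken * 2048      // ≈ 1,024 MB for a 2048-token context
    println("per token: $perToken B, 2048-token cache: ${fullWindow / (1024 * 1024)} MB")
}
```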
## Step 2: Apply INT4 Group-Wise Quantization
Quantizing KV caches from FP16 to INT4 with group-wise scaling (groups of 32 or 64 elements sharing a single FP16 scale factor) compresses the cache to roughly 25% of its original size. Here is what the numbers look like:
| Format | Bits/Element | Scale Overhead | Effective Bits | Cache for 2048 Tokens |
|--------|-------------|----------------|----------------|-----------------------|
| FP16 | 16 | 0 | 16.0 | ~1,024 MB |
| INT8 | 8 | ~0.5 (g=32) | 8.5 | ~544 MB |
| INT4 (g=32) | 4 | ~0.5 | 4.5 | ~288 MB |
| INT4 (g=64) | 4 | ~0.25 | 4.25 | ~272 MB |
INT4 with group size 32 is the sweet spot in my experience. Perplexity degradation stays under 0.3 points on most benchmarks compared to FP16, while the g=64 variant introduces noticeable quality drops in multi-turn conversations. That 0.25-bit savings is not worth the trade.
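The table values fall out of a simple formula: each element carries its payload bits plus a share of one FP16 scale per group. A minimal sketch of that bookkeeping (the 1,024 MB baseline is the FP16 figure from Step 1):

```kotlin
// Effective bits per element = payload bits + amortized FP16 scale (16 bits shared across the group)
fun effectiveBits(bits: Int, groupSize: Int? = null): Double =
    bits + if (groupSize != null) 16.0 / groupSize else 0.0

// KV cache size in MB for a 2048-token window, scaled from the FP16 baseline
fun cacheMb(bits: Int, groupSize: Int? = null, fp16BaselineMb: Double = 1024.0): Double =
    fp16BaselineMb * effectiveBits(bits, groupSize) / 16.0

// cacheMb(4, 32) -> 288.0, cacheMb(4, 64) -> 272.0, cacheMb(8, 32) -> 544.0
```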
Here is the minimal setup to get this working in your inference loop:
```kotlin
import kotlin.math.abs
import kotlin.math.roundToInt

class QuantizedTensor(val data: ByteArray, val scales: FloatArray)

// Per-layer KV cache quantization: FP16 values (held as Kotlin floats) -> packed INT4 + per-group scales
fun quantizeKVCache(fp16Tensor: FloatArray, groupSize: Int = 32): QuantizedTensor {
    require(fp16Tensor.size % groupSize == 0) { "tensor length must be a multiple of groupSize" }
    val numGroups = fp16Tensor.size / groupSize
    val scales = FloatArray(numGroups)
    val quantized = ByteArray(fp16Tensor.size / 2) // two INT4 values packed per byte
    for (g in 0 until numGroups) {
        val offset = g * groupSize
        val absMax = (0 until groupSize).maxOf { abs(fp16Tensor[offset + it]) }
        scales[g] = if (absMax > 0f) absMax / 7.0f else 1.0f // INT4 range: [-8, 7]; avoid div-by-zero
        // Pack two INT4 values per byte: low nibble = even index, high nibble = odd index
        for (i in 0 until groupSize step 2) {
            val q0 = (fp16Tensor[offset + i] / scales[g]).roundToInt().coerceIn(-8, 7)
            val q1 = (fp16Tensor[offset + i + 1] / scales[g]).roundToInt().coerceIn(-8, 7)
            quantized[(offset + i) / 2] = ((q0 and 0x0F) or ((q1 and 0x0F) shl 4)).toByte()
        }
    }
    return QuantizedTensor(quantized, scales)
}
```
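The inverse runs on every attention step, so keep it allocation-light in production; this sketch materializes the whole tensor for clarity and assumes the packing layout from the snippet above.

```kotlin
// Unpack two INT4 values per byte and rescale with the per-group scale
fun dequantizeKVCache(t: QuantizedTensor, groupSize: Int = 32): FloatArray {
    val out = FloatArray(t.data.size * 2)
    for (idx in t.data.indices) {
        val packed = t.data[idx].toInt()
        // Sign-extend each 4-bit nibble back to a signed value in [-8, 7]
        val q0 = (packed shl 28) shr 28
        val q1 = (packed shl 24) shr 28
        val i = idx * 2
        val scale = t.scales[i / groupSize]
        out[i] = q0 * scale
        out[i + 1] = q1 * scale
    }
    return out
}
```

In a real kernel you would dequantize group by group into a reused scratch buffer instead of expanding the full tensor.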
## Step 3: Implement Sliding Window Eviction
Even with INT4 quantization, unbounded context growth eventually exhausts memory. A sliding window eviction policy with a fixed budget keeps memory deterministic. I have found 512 recent tokens plus 64 "anchor" tokens from the conversation start works well in practice.
The architecture breaks into three zones:
- **Tokens 0–63** are the anchor zone. Never evicted. This preserves the system prompt and initial context.
- **The last 512 tokens** are the active window with full INT4 KV cache retained.
- **Everything between token 64 and the start of the active window** is evicted FIFO as new tokens are generated.
This gives you a fixed ceiling of ~82 MB for the KV cache regardless of conversation length. Even budget Android devices can handle that.
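Here is a sketch of the bookkeeping under those assumptions. The `evictRange` callback is hypothetical; Step 4 covers what that release should actually do to physical memory.

```kotlin
// Fixed-budget KV cache index: anchors are never evicted, the middle zone goes FIFO
class KvCacheWindow(
    private val anchorTokens: Int = 64,
    private val activeTokens: Int = 512,
    private val evictRange: (firstToken: Int, lastToken: Int) -> Unit, // hypothetical page-release hook
) {
    private var totalTokens = 0
    private var evictedUpTo = anchorTokens // first token after the anchor zone not yet evicted

    // Call once per generated token; returns the (anchor, active) ranges still resident
    fun onTokenAppended(): Pair<IntRange, IntRange> {
        totalTokens++
        val windowStart = maxOf(anchorTokens, totalTokens - activeTokens)
        if (windowStart > evictedUpTo) {
            // Everything between the anchors and the active window is dropped oldest-first
            evictRange(evictedUpTo, windowStart - 1)
            evictedUpTo = windowStart
        }
        return (0 until minOf(anchorTokens, totalTokens)) to (windowStart until totalTokens)
    }
}
```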
## Step 4: Use ashmem + madvise for Memory Mapping
Here is the gotcha that will save you hours. Most teams allocate KV cache on the Java heap or via standard `malloc`, then wonder why Android's `LowMemoryKiller` terminates their app during generation. The docs do not mention this, but Android's anonymous shared memory (`ashmem`) regions with explicit `madvise` hints are what actually works:
- **`MADV_SEQUENTIAL`** on the active generation window so the kernel prefetches efficiently
- **`MADV_DONTNEED`** on evicted KV cache pages, immediately releasing physical memory without unmapping virtual address space
- **`MADV_MERGEABLE`** on anchor zone pages across sessions, enabling KSM deduplication when multiple conversations share the same system prompt
This keeps your app's PSS (Proportional Set Size), the metric Android actually uses for OOM decisions, well below the per-app threshold, even on devices that report 4GB total RAM while real available memory hovers around 1.8–2.2 GB.
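From Kotlin, `android.os.SharedMemory` (the framework wrapper over ashmem since API 27) covers the allocation side; the `madvise` calls themselves need a small NDK bridge. The sketch below assumes a hypothetical `nativeMadvise` external function that your JNI code implements as a thin wrapper over `madvise(2)` on the buffer's address; the class and constant names are illustrative.

```kotlin
import android.os.SharedMemory
import java.nio.ByteBuffer

// Hypothetical JNI bridge: resolve the direct buffer address and call madvise(2) on the given range
external fun nativeMadvise(buffer: ByteBuffer, offset: Long, length: Long, advice: Int): Int

const val ADVICE_SEQUENTIAL = 2 // MADV_SEQUENTIAL
const val ADVICE_DONTNEED = 4   // MADV_DONTNEED

class AshmemKvCache(sizeBytes: Int) {
    private val region = SharedMemory.create("kv_cache", sizeBytes) // ashmem-backed region
    val buffer: ByteBuffer = region.mapReadWrite()

    // Active generation window: hint the kernel to prefetch ahead of the read cursor
    fun markActive(offset: Long, length: Long) =
        nativeMadvise(buffer, offset, length, ADVICE_SEQUENTIAL)

    // Evicted pages: physical memory is released immediately, the virtual mapping stays valid
    fun markEvicted(offset: Long, length: Long) =
        nativeMadvise(buffer, offset, length, ADVICE_DONTNEED)

    fun close() {
        SharedMemory.unmap(buffer)
        region.close()
    }
}
```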
## The Full Memory Budget
Here is what the final breakdown looks like with everything in place:
| Component | Memory (INT4 strategy) |
|-----------|----------------------|
| Model weights (Q4_K_M) | ~3.8 GB (mmap, demand-paged) |
| KV cache (INT4, 576 tokens) | ~82 MB |
| Activation buffers | ~150 MB |
| Runtime overhead | ~120 MB |
| **App total PSS** | **~350–400 MB** |
The model weights use `mmap` with `MAP_PRIVATE`, so Android demand-pages them and can reclaim clean pages under pressure. Your actual resident memory stays within safe limits.
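The Kotlin side of that mapping looks roughly like this, using `android.system.Os` (error handling omitted; the function name is illustrative):

```kotlin
import android.system.Os
import android.system.OsConstants
import java.io.File
import java.io.FileInputStream

// Map the weights file read-only and private: clean pages can be reclaimed and re-read on demand
fun mapModelWeights(file: File): Long =
    FileInputStream(file).use { stream ->
        Os.mmap(
            0L,                    // let the kernel choose the address
            file.length(),         // map the whole file
            OsConstants.PROT_READ,
            OsConstants.MAP_PRIVATE,
            stream.fd,
            0L                     // offset
        )
    }
```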
## Gotchas
- **INT8 is not enough on mobile.** The memory savings over FP16 look decent on paper, but in practice INT4 with group size 32 is the threshold that makes multi-turn generation viable on 4GB devices.
- **Never use the Java heap for KV cache.** This is the single most common mistake. The GC pressure alone will stall your generation, and `LowMemoryKiller` will terminate you before the GC even catches up.
- **Profile PSS, not VSS.** Use `dumpsys meminfo` and watch the PSS column, or read it programmatically as in the sketch after this list. Virtual memory size is misleading on Android because of mmap'd model weights.
- **Design eviction around conversation semantics, not just recency.** The 512+64 anchor strategy preserves system prompt context that pure FIFO eviction would destroy.
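If you want the generation loop to react at runtime rather than only watching `dumpsys`, the same PSS figure is available in-process. A minimal sketch:

```kotlin
import android.os.Debug

// Total PSS of the current process in kB, the same figure dumpsys meminfo reports
fun currentPssKb(): Int {
    val info = Debug.MemoryInfo()
    Debug.getMemoryInfo(info)
    return info.totalPss
}

// Example: tighten eviction if PSS creeps toward your budget
// if (currentPssKb() > 400 * 1024) { /* shrink the active window */ }
```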
## Conclusion
On-device inference is a memory architecture problem. Quantize KV caches to INT4 with group size 32 for a real 75% memory reduction with negligible perplexity cost. Cap your context with a fixed-budget sliding window using anchor tokens. And use `ashmem` regions with explicit `madvise` hints — never the Java heap. Teams that treat this as a memory architecture problem are shipping. Teams that bolt it on after the model works "in theory" are still debugging OOM crashes.