Why E8 lattice quantization beats scalar quantization for KV caches

Most KV cache quantization methods treat each number independently: round each float to the nearest 2-bit or 4-bit value. This works, but it wastes bits.

The E8 lattice quantizes 8 numbers at once, exploiting correlations between dimensions. The result: 3x better compression under entropy coding compared to scalar quantization at the same distortion.

The problem with scalar quantization

Given a 128-dimensional KV vector, scalar INT2 quantization rounds each of the 128 values independently. Each value gets mapped to one of 4 levels. The indices are near-uniformly distributed, so entropy coding (zstd, Huffman) barely helps - maybe 1.2x reduction.
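
For contrast, here is roughly what that scalar baseline looks like. This is a minimal sketch with NumPy; the helper names are mine, not part of nexusquant.

import numpy as np

def scalar_int2_quantize(v, eps=1e-8):
    """Per-vector 2-bit scalar quantization: each value mapped independently to one of 4 levels."""
    lo = v.min()
    scale = (v.max() - lo) / 3 + eps              # 4 levels span [lo, hi]
    idx = np.clip(np.round((v - lo) / scale), 0, 3).astype(np.uint8)
    return idx, lo, scale

def scalar_int2_dequantize(idx, lo, scale):
    return idx.astype(np.float32) * scale + lo

v = np.random.randn(128).astype(np.float32)       # stand-in for one KV vector
idx, lo, scale = scalar_int2_quantize(v)
# Each index is chosen without looking at neighbouring dimensions,
# so any cross-dimension structure in the vector is left on the table.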

E8: quantize 8 at a time

The E8 root lattice is the densest sphere packing in 8 dimensions. Instead of rounding each number to {-1, 0, 1, 2}, we split each 128-dim vector into 16 groups of 8, and snap each group to the nearest E8 lattice point.

from nexusquant.core.e8_lattice import E8Lattice

# Quantize 8D groups
groups = vector.reshape(-1, 8)                    # 128 dims -> 16 groups of 8
lattice_points = E8Lattice.nearest_point(groups)  # snap each group to its nearest E8 point

The key insight: E8 nearest-neighbor assignment is non-uniform. Certain lattice points are hit far more often than others because real KV data clusters in specific regions of 8D space. This skew creates a highly compressible distribution.
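
For intuition about what nearest_point has to do, the standard way to decode E8 is the Conway–Sloane construction: E8 is the union of D8 (integer vectors with even coordinate sum) and the shifted coset D8 + 1/2. The sketch below is that textbook decoder in NumPy; it is not necessarily how nexusquant implements it.

import numpy as np

def _nearest_d8(x):
    """Nearest point of D8 = {integer vectors with even coordinate sum}."""
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        # Wrong parity: re-round the coordinate where flipping costs the least.
        k = int(np.argmax(np.abs(x - f)))
        f[k] += 1.0 if x[k] >= f[k] else -1.0
    return f

def nearest_e8(x):
    """Nearest E8 point: E8 = D8 union (D8 + 1/2); keep whichever coset is closer."""
    a = _nearest_d8(x)
    b = _nearest_d8(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b

group = np.random.randn(8)
print(nearest_e8(group))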

The entropy advantage

Quantizer   | zstd compression on indices
------------|----------------------------
Scalar INT2 | 1.23x
E8 2-bit    | 3.74x

That's roughly a 3x advantage from the lattice structure alone. It comes from E8's parity constraints and peaked shell occupancy, mathematical properties of the lattice that happen to align with how KV cache data is distributed.
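
If you want to reproduce this kind of measurement on your own index streams, a minimal sketch using the zstandard package is below. It assumes the indices fit in one byte; the exact ratios will of course depend on the model and data.

import numpy as np
import zstandard as zstd

def bits_per_symbol(indices):
    """Empirical Shannon entropy of an index stream, in bits per symbol."""
    _, counts = np.unique(indices, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def zstd_ratio(indices):
    """Compression ratio of the raw index bytes under zstd."""
    raw = np.asarray(indices, dtype=np.uint8).tobytes()
    packed = zstd.ZstdCompressor(level=19).compress(raw)
    return len(raw) / len(packed)

# idx_scalar, idx_e8 = index streams emitted by the two quantizers
# print(bits_per_symbol(idx_scalar), zstd_ratio(idx_scalar))
# print(bits_per_symbol(idx_e8), zstd_ratio(idx_e8))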

But there's a catch: outliers

Raw KV vectors are heavy-tailed. One outlier dimension inflates the quantization scale for all 8 dimensions in its group. Fix: apply a Hadamard rotation first.

from nexusquant.core.hadamard import hadamard_matrix

H = hadamard_matrix(128)  # Hadamard matrix (assumed scaled so the rotation is orthonormal)
rotated = vector @ H.T    # spread each component's energy across all dimensions
# now quantize rotated vector with E8

Hadamard rotation is orthogonal, so there is no information loss. It just spreads each component's energy across all dimensions, making the distribution near-isotropic. After rotation, E8 quantization at 2 bits/dim causes less than 0.1% perplexity (PPL) degradation.
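
The orthogonality claim is easy to check yourself with SciPy's Sylvester Hadamard construction (the library's hadamard_matrix helper may differ in scaling):

import numpy as np
from scipy.linalg import hadamard

n = 128
H = hadamard(n) / np.sqrt(n)              # orthonormal: H @ H.T == I
x = np.random.standard_t(df=3, size=n)    # heavy-tailed stand-in for a raw KV vector
y = x @ H.T                                # rotate before quantizing
assert np.allclose(np.linalg.norm(x), np.linalg.norm(y))   # energy preserved
assert np.allclose(y @ H, x)               # exact inverse: rotate back after dequantizing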

The combination of Hadamard + E8 is what makes NexusQuant work. Removing either one degrades quality significantly.

GitHub | Paper

Best regards,
João Marques
