The Position Encoding Problem Nobody Solved Until 2021
LLaMA handles 32K-token contexts with 8% lower perplexity than MPT while using half the parameters. The secret isn't model size or training-data volume; it's how each model tells tokens where they sit in the sequence.
Position encodings are the unsung bottleneck of long-context language models. Transformers are permutation-invariant by design: without explicit position information, "I ate the cake" and "the cake ate I" look identical. The original sinusoidal encodings of Vaswani et al. (2017) worked for 512-token contexts, but extrapolating to 32K+ tokens caused perplexity explosions in practice. You can read the RoPE paper at arXiv:2104.09864 and the ALiBi paper at arXiv:2108.12409.
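To make that baseline concrete, here is a minimal NumPy sketch of the sinusoidal scheme as defined in Vaswani et al. (2017). The function name and array shapes are my own illustration, not from any particular codebase:

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal position encodings (Vaswani et al., 2017):
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2)
    angles = positions / (10000.0 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added once to the token embeddings before the first layer:
# x = token_embeddings + sinusoidal_encoding(seq_len, d_model)
```

Note the design: position enters the model additively, once, at the input. Every attention layer afterward has to reconstruct positional relationships from that signal, which is part of why it degrades beyond the trained length.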
This post compares two position encoding methods that claim to fix long-context degradation: Rotary Position Embedding (RoPE) from Su et al. (2021) and Attention with Linear Biases (ALiBi) from Press et al. (2021). LLaMA uses RoPE. MPT uses ALiBi. Both models train on similar data, but the perplexity gap at 32K tokens tells a different story.
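Before the comparison, it helps to see what each method actually does inside attention. First, a minimal NumPy sketch of RoPE in the pairwise-rotation formulation of Su et al. (2021); the base of 10000 matches the paper's default, but the function itself is an illustration, not LLaMA's implementation:

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotary position embedding (Su et al., 2021) for x of shape
    (seq_len, d), d even. Each dimension pair (x[2i], x[2i+1]) at
    position pos is rotated by angle pos * theta_i, where
    theta_i = base^(-2i/d)."""
    seq_len, d = x.shape
    theta = base ** (-np.arange(0, d, 2) / d)           # (d/2,)
    angles = np.arange(seq_len)[:, None] * theta[None]  # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# RoPE is applied to queries and keys (not values) in every attention
# layer, so q·k depends only on the relative offset between positions.
```

ALiBi skips rotation entirely and instead adds a static, distance-proportional penalty to the attention scores. Here is a sketch of the bias matrix, again with illustrative names; the slope schedule is the one Press et al. (2021) give for head counts that are powers of two:

```python
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """ALiBi (Press et al., 2021): penalize each attention score by
    -m * (i - j), where i is the query position, j <= i the key
    position, and m a fixed per-head slope. Slopes form a geometric
    sequence: 2^(-8/n), 2^(-16/n), ..., for n heads."""
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    distance = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
    bias = -slopes[:, None, None] * distance[None]      # (heads, q, k)
    # Causal mask: future positions (j > i) get -inf before softmax.
    return np.where(distance[None] >= 0, bias, -np.inf)

# Per head h: scores = q @ k.T / np.sqrt(d_head) + alibi_bias(L, H)[h]
```

The contrast these sketches make visible is the crux of the comparison: RoPE folds position multiplicatively into the queries and keys themselves, while ALiBi leaves the vectors untouched and imposes a fixed recency bias on the scores, which is what lets it extrapolate past its training length.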