Every time you hand a long document to an LLM and ask it to summarise or answer a question, something quietly goes wrong. The model reads the whole thing — or appears to — but its answers disproportionately reflect what was at the beginning and the end. Whatever sat in the middle? Largely ignored.
This isn't a rumour. It was rigorously documented in a 2023 paper titled "Lost in the Middle: How Language Models Use Long Contexts" (Liu et al., Stanford/UC Berkeley), and it remains one of the most practically important — and underappreciated — findings in applied LLM research.
The Shape of the Problem
The researchers ran a controlled experiment: they placed the answer to a multi-document QA question inside a set of retrieved documents, then varied which position the relevant document occupied — first, middle, or last. Performance dropped sharply when the relevant document was positioned in the middle of the context, even when the total context length was well within the model's stated window.
The performance curve is U-shaped: high accuracy when the answer is near position 1 (beginning) or the final position, with a pronounced dip for everything in between. On some configurations, accuracy in the middle fell by 20+ percentage points compared to the edges.
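To make the setup concrete, here's a minimal sketch of that kind of positional probe. It isn't the paper's code: `ask_llm` is a stand-in for whatever client you use, and the substring check is a deliberate simplification of answer scoring.

```python
from typing import Callable, Dict, List


def build_prompt(question: str, documents: List[str]) -> str:
    # Number the documents the way multi-document QA prompts usually do.
    numbered = "\n\n".join(
        f"Document [{i + 1}]: {doc}" for i, doc in enumerate(documents)
    )
    return (
        "Answer the question using the documents below.\n\n"
        f"{numbered}\n\nQuestion: {question}\nAnswer:"
    )


def positional_probe(
    ask_llm: Callable[[str], str],  # thin wrapper around whatever LLM API you use
    question: str,
    gold_doc: str,
    gold_answer: str,
    distractors: List[str],
) -> Dict[int, bool]:
    """Slide the gold document through every slot and record whether the
    model's answer still contains the expected string."""
    results = {}
    for position in range(len(distractors) + 1):
        docs = distractors[:position] + [gold_doc] + distractors[position:]
        answer = ask_llm(build_prompt(question, docs))
        results[position] = gold_answer.lower() in answer.lower()
    return results
```

Run this over many questions and plot accuracy against position, and you get the U-curve described above.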
Why Does This Happen?
The mechanism is rooted in how attention distributes across long sequences. Two forces pull in opposite directions:
Recency bias — The final tokens are closest to where the model is generating its next token. In autoregressive transformers, recent tokens tend to receive higher attention weights, both because they're positionally proximate and because many training tasks (next-token prediction, instruction following) implicitly reward sensitivity to recent context.
Primacy bias — The very first tokens in a prompt — especially system instructions — receive unusually high attention during pre-training and fine-tuning because they set the conversational frame. Instruction-tuned models are heavily conditioned on treating the beginning of context as authoritative.
The middle gets neither benefit. It's not recent enough to ride the recency bias, and it doesn't occupy the instruction-bearing opening positions the model has learned to treat as authoritative. In the attention score distribution, middle-sequence tokens often receive lower aggregate attention than their informational value warrants.
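You can get a crude look at this with any open model that exposes attention weights. The sketch below assumes the Hugging Face transformers library and uses GPT-2 purely as a small stand-in; pooling over layers and heads (and the causal mask, which inherently gives earlier keys more chances to be attended to) makes this a rough diagnostic, not a replication of anything in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)
model.eval()

# A long, deliberately uniform input so position is the main thing varying.
inputs = tok("word " * 400, return_tensors="pt", truncation=True)

with torch.no_grad():
    out = model(**inputs)

# out.attentions: one (batch, heads, query, key) tensor per layer.
att = torch.stack(out.attentions)               # (layers, batch, heads, q, k)
received = att.mean(dim=(0, 2))[0].sum(dim=0)   # attention mass each position receives

for pos in range(0, received.shape[0], 50):
    print(f"position {pos:4d}: {received[pos].item():.3f}")
```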
What This Means in Practice
If you're building a RAG pipeline, this has concrete implications for your retrieval and context construction strategy:
Reranking matters more than retrieval order. A retriever that returns the most relevant chunk in position 3 of 5 will underperform the same retriever that puts that chunk at position 1 — even if the model technically "sees" all five. Getting retrieval order right isn't just aesthetics; it's accuracy.
Don't bury your needle. If you're doing something like "here are 10 excerpts, answer based on them," and the answer lives in excerpt 6, you're playing against the model's attention distribution. Front-load or back-load your most relevant context (see the reordering sketch after this list).
The "more context = better" assumption breaks down. Adding more retrieved chunks to a prompt can actually reduce accuracy if it pushes the relevant chunk deeper into a crowded middle. Precision of retrieval often beats recall of retrieval for this reason.
Longer context windows don't fix it. The effect persists in models with 128K+ context windows. The architecture hasn't changed; only the capacity has. A 1M-token model still has a U-shaped attention distribution. The middle is just a much bigger valley.
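One simple way to act on the reranking and placement points above is to reorder chunks after retrieval so the strongest ones sit at the edges of the context and the weakest fall into the middle. A minimal sketch, assuming your retriever or reranker already attaches a relevance score to each chunk (the chunk names and scores here are made up):

```python
from typing import List, Tuple


def reorder_for_edges(chunks: List[Tuple[str, float]]) -> List[str]:
    """Alternate the highest-scoring chunks between the front and the back,
    so the least relevant material ends up in the middle of the prompt."""
    ranked = sorted(chunks, key=lambda c: c[1], reverse=True)
    front, back = [], []
    for i, (text, _score) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]


chunks = [
    ("chunk A", 0.91),
    ("chunk B", 0.42),
    ("chunk C", 0.77),
    ("chunk D", 0.15),
    ("chunk E", 0.63),
]
print(reorder_for_edges(chunks))
# ['chunk A', 'chunk E', 'chunk D', 'chunk B', 'chunk C']
# -> the two best chunks at the edges, the weakest one in the middle
```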
Does Instruction Tuning Help?
Somewhat. Models fine-tuned explicitly to use their entire context (e.g., with long-document QA training data) show a shallower U-curve, and instruction tuning that emphasises reading the provided material carefully pushes some improvement. But the effect doesn't disappear; it attenuates.
There's also a prompt-engineering mitigation: explicitly instructing the model to "carefully consider all documents before answering" or to "pay equal attention to all parts of the provided context" can partially counteract primacy/recency bias, presumably because instruction-tuned models condition strongly on what the prompt asks them to do. It's not a fix, but it's free.
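As a purely illustrative example, a prompt builder applying that kind of instruction might look like the following; the exact wording is just one variant of the phrasing above:

```python
from typing import List


def build_careful_prompt(question: str, documents: List[str]) -> str:
    # The instruction up front asks the model to weight all positions,
    # which partially offsets the primacy/recency skew described above.
    numbered = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(documents))
    return (
        "Carefully consider ALL of the documents below before answering. "
        "Relevant information may appear anywhere, including the middle.\n\n"
        f"{numbered}\n\n"
        f"Question: {question}\n"
        "Answer, citing the document number you relied on:"
    )
```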
The Deeper Lesson
This finding is a good reminder that LLMs don't "read" context the way a human reads a document — scanning linearly, building a uniform mental model. They process all tokens in parallel, but the attention mechanism creates an implicit weighting that's shaped by training distribution, position, and architecture. The model's nominal context window is the ceiling; its effective context is shaped by where you put things.
If your application depends on the model reliably using specific information — legal clauses, numerical specs, code sections — position is not neutral. Put critical content at the edges, rerank your retrieval results before injection, and don't assume that visible-in-context means attended-to.