DEV Community

Papers Mache


Entropy of first token predicts hallucinations

The entropy of the very first content‑bearing token already separates factual answers from hallucinations with an AUROC of 0.82. That single number rivals the scores of methods that need dozens of sampled continuations. The surprise is that nothing more than the greedy decode’s first‑token distribution is required.

Hallucination detection has long relied on self‑consistency: generate many answers, compare them, and flag low agreement as doubtful. Semantic self‑consistency tightens the signal by clustering answers by meaning, but both approaches multiply decoding cost and need extra inference components. Practitioners therefore face a trade‑off between reliability and latency.

The study introduces φ₁ₙₜ, the normalized entropy of the top‑K logits at the first answer token. Across three 7–8 B instruction‑tuned models and two closed‑book QA benchmarks, φ₁ₙₜ attains a mean AUROC of 0.820, surpassing semantic self‑consistency (0.793) and surface‑form self‑consistency (0.791) [1]. Correlation analysis shows the signal is not independent of those baselines: first‑token confidence and semantic agreement correlate at a mean Pearson r of 0.67 [1]. Moreover, adding semantic self‑consistency on top of φ₁ₙₜ yields only a marginal AUROC lift, confirming that most of the useful uncertainty is already present in the initial token distribution.
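If you have the raw logits at the first answer token, the score is a few lines of code. A minimal sketch in plain Python, assuming a dense logit vector; the function name and the K = 10 default are illustrative choices, not values taken from the paper:

```python
import math

def topk_normalized_entropy(logits, k=10):
    """Shannon entropy of the softmax over the top-k logits,
    divided by log(k) so the score lies in [0, 1]."""
    top = sorted(logits, reverse=True)[:k]
    m = max(top)                                # subtract max for stability
    exps = [math.exp(x - m) for x in top]
    z = sum(exps)
    probs = [e / z for e in exps]
    h = -sum(p * math.log(p) for p in probs)    # entropy in nats
    return h / math.log(k)
```

A uniform top-K distribution scores 1.0 (maximal uncertainty), while a sharply peaked one scores near 0, so a single calibrated threshold on this value is all the detector needs.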

The method hinges on correctly locating the first content token, which depends on the chat template and tokenizer used. The authors note, “The method requires logits at the first answer-token position; reliable identification of that position depends on the chat template and tokenizer.” [1] Consequently, the approach is tied to greedy decoding and short‑answer factual QA; its behavior on chain‑of‑thought prompts, multi‑turn dialogue, or generation beyond the first token remains untested. Scaling to models larger than 18 B or to multilingual settings is also an open question.

If you can expose the first‑token logits, the extra compute is negligible compared with sampling‑based uncertainty estimates. A practical pipeline could compute φ₁ₙₜ on every answer, reject or flag those above a calibrated entropy threshold, and fall back to retrieval, tool use, or a human‑in‑the‑loop. Because the signal is cheap and model‑agnostic, it serves as a low‑cost baseline before deploying more expensive metacognitive strategies. Watching how φ₁ₙₜ behaves on your own prompt templates will reveal whether a single‑decode confidence check is enough to raise the factuality bar in production.
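The gating pipeline above can be sketched as follows. This is a hypothetical harness, not the paper's code: `decode_fn` stands in for your model call (it must return the greedy answer plus the first-answer-token logit vector), and the 0.6 threshold is a placeholder you would calibrate on held-out data:

```python
import math

def topk_normalized_entropy(logits, k=10):
    # Entropy of the softmax over the top-k logits, scaled to [0, 1].
    top = sorted(logits, reverse=True)[:k]
    m = max(top)
    exps = [math.exp(x - m) for x in top]
    probs = [e / sum(exps) for e in exps]
    return -sum(p * math.log(p) for p in probs) / math.log(k)

def answer_with_guard(prompt, decode_fn, threshold=0.6):
    """Greedy-decode once, score first-token entropy, and flag
    high-entropy answers for a fallback (retrieval, tools, a human)."""
    answer, first_token_logits = decode_fn(prompt)
    score = topk_normalized_entropy(first_token_logits)
    return {"answer": answer,
            "entropy": score,
            "confident": score <= threshold}
```

Because the check reuses logits the model already produced, the only added cost over a plain greedy decode is one sort and one softmax per answer.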

References

  1. The First Token Knows: Single-Decode Confidence for Hallucination Detection
