AI Papers: A Deep Dive
WHEN MODELS KNOW THE ANSWER BUT SAY THE WRONG THING ANYWAY

Source: Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer [https://arxiv.org/abs/2605.22007]
Paper was published on May 21, 2026.

This episode was AI-generated on May 22, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A new paper shows that up to 47% of hallucinations from large instruction-tuned LLMs happen when the model already has the correct answer sitting in its probability distribution; it just commits to a wrong token instead. Even stranger: the bigger the model, the more often this happens, and the mechanism that causes it is the same one that makes these models feel helpful and decisive in the first place.

KEY TAKEAWAYS
* Why 'Saint,' 'St,' and 'C-athedral' splitting the vote can hand the wrong answer a plurality win under greedy decoding
* How P-mass (summing probability across spellings of the same concept) reframes hallucinations from a knowledge problem to a commitment problem
* Why the fraction of these 'commitment failure' hallucinations climbs monotonically with scale, from 16% to 47%
* The Instruct-vs-base comparison showing instruction tuning, not scale itself, is what sharpens confidence in wrong tokens
* Why uncertainty-based hallucination detection has a structural ceiling it cannot cross
* Why the obvious fix, concept-aware decoding, only recovers about 2% of failures in the Instruct models we actually deploy

* 01:20 - The Saint Basil's example: A concrete case where the model's belief is concentrated on the right answer but greedy decoding picks the wrong token anyway.
* 02:41 - From token entropy to P-mass: Reframing the analysis by pooling probability across surface forms of the same concept, and defining 'commitment failure' hallucinations.
* 05:23 - The dispersion mechanism: Same total belief in the right concept, but in failures it's spread across spellings: 26% vs 78%, a clean three-to-one signature.
* 08:04 - Instruction tuning as the causal variable: Comparing Instruct and base models of identical size shows the sharpening of wrong tokens is driven by helpfulness training, not scale.
* 10:46 - The decisive witness and the high-gain microphone: Two analogies for why confident correctness and confident hallucination are the same disposition viewed from different angles.
* 13:27 - Limitations and scope: P-mass requires ground truth, the 0.2 threshold is a choice, and the setup is short-form QA with greedy decoding, the cleanest regime for the effect.
* 16:09 - A unifying account of the alignment tax: How commitment sharpening offers a single mechanism behind accuracy loss, calibration loss, mode collapse, and confident hallucination.
* 18:50 - Phrase-level commitment and what comes next: Evidence that sharpening cascades across tokens, why simple concept-clustering won't fix it, and what a meaning-level training objective might require.

RECOMMENDED READING
* Training language models to follow instructions with human feedback [https://arxiv.org/abs/2203.02155]: The Ouyang et al. paper that introduced the 'alignment tax' framing the episode invokes when connecting instruction tuning to sharpened, sometimes-wrong commitments.
* Language Models (Mostly) Know What They Know [https://arxiv.org/abs/2207.05221]: Anthropic's investigation of LLM self-knowledge and calibration, a natural counterpoint to the episode's claim that uncertainty-based detection has a structural ceiling.
* Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation [https://arxiv.org/abs/2302.09664]: Kuhn, Gal, and Farquhar's proposal to cluster generations by meaning before measuring uncertainty, the closest existing analog to the 'count meanings, not tokens' move at the heart of P-mass.
* How Language Model Hallucinations Can Snowball [https://arxiv.org/abs/2305.13534]: Zhang et al. on how early committed tokens lock models into wrong continuations, directly relevant to the episode's discussion of phrase-level commitment cascading past the first token.
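
The vote-splitting idea from the key takeaways can be illustrated with a small Python sketch. This is a toy reading of P-mass, not the paper's implementation. The probabilities are invented for illustration, the wrong candidate "Kremlin" is hypothetical, and the function names are made up for this example. Treating the 0.2 threshold as a minimum P-mass on the correct concept is also an assumption, since these notes do not spell out how the threshold is applied.

```python
from typing import Dict, Set


def greedy_token(next_token_probs: Dict[str, float]) -> str:
    """Return the single most probable token, as greedy decoding would."""
    return max(next_token_probs, key=next_token_probs.get)


def p_mass(next_token_probs: Dict[str, float], surface_forms: Set[str]) -> float:
    """Sum probability over every surface form of one concept."""
    return sum(next_token_probs.get(tok, 0.0) for tok in surface_forms)


def is_commitment_failure(
    next_token_probs: Dict[str, float],
    correct_forms: Set[str],
    threshold: float = 0.2,  # assumed meaning: minimum P-mass on the right concept
) -> bool:
    """True when greedy decoding picks a wrong token even though the
    correct concept holds at least `threshold` of the probability mass."""
    top = greedy_token(next_token_probs)
    return top not in correct_forms and p_mass(next_token_probs, correct_forms) >= threshold


# Invented toy distribution: the correct concept is split across three
# spellings, so no single correct token beats the wrong candidate.
toy_probs = {
    "Saint": 0.25,
    "St": 0.20,
    "Cathedral": 0.15,
    "Kremlin": 0.30,  # hypothetical wrong answer
    "The": 0.10,
}
correct_forms = {"Saint", "St", "Cathedral"}

print("greedy pick:", greedy_token(toy_probs))
print(f"P-mass on correct concept: {p_mass(toy_probs, correct_forms):.2f}")
print("commitment failure:", is_commitment_failure(toy_probs, correct_forms))
```

In this toy case the correct concept holds most of the probability when its spellings are pooled, yet the wrong token wins the per-token plurality. Note that `p_mass` needs the set of correct surface forms, which matches the limitation listed in the chapter notes that P-mass requires ground truth.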