When Models Know the Answer But Say the Wrong Thing Anyway

Description

Source: Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer [https://arxiv.org/abs/2605.22007]

Paper was published on May 21, 2026. This episode was AI-generated on May 22, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

A new paper shows that up to 47% of hallucinations from large instruction-tuned LLMs happen when the model already has the correct answer sitting in its probability distribution and simply commits to a wrong token instead. Even stranger: the bigger the model, the more often this happens, and the mechanism that causes it is the same one that makes these models feel helpful and decisive in the first place.

KEY TAKEAWAYS

* Why 'Saint,' 'St,' and 'C-athedral' splitting the vote can hand the wrong answer a plurality win under greedy decoding
* How P-mass (summing probability across spellings of the same concept) reframes hallucinations from a knowledge problem to a commitment problem
* Why the fraction of these 'commitment failure' hallucinations climbs monotonically with scale, from 16% to 47%
* The Instruct-vs-base comparison showing that instruction tuning, not scale itself, is what sharpens confidence in wrong tokens
* Why uncertainty-based hallucination detection has a structural ceiling it cannot cross
* Why the obvious fix, concept-aware decoding, only recovers about 2% of failures in the Instruct models we actually deploy

CHAPTERS

* 01:20 - The Saint Basil's example: A concrete case where the model's belief is concentrated on the right answer but greedy decoding picks the wrong token anyway.
* 02:41 - From token entropy to P-mass: Reframing the analysis by pooling probability across surface forms of the same concept, and defining 'commitment failure' hallucinations.
* 05:23 - The dispersion mechanism: Same total belief in the right concept, but in failures it's spread across spellings, 26% vs 78%, a clean three-to-one signature.
* 08:04 - Instruction tuning as the causal variable: Comparing Instruct and base models of identical size shows the sharpening of wrong tokens is driven by helpfulness training, not scale.
* 10:46 - The decisive witness and the high-gain microphone: Two analogies for why confident correctness and confident hallucination are the same disposition viewed from different angles.
* 13:27 - Limitations and scope: P-mass requires ground truth, the 0.2 threshold is a choice, and the setup is short-form QA with greedy decoding, the cleanest regime for the effect.
* 16:09 - A unifying account of the alignment tax: How commitment sharpening offers a single mechanism behind accuracy loss, calibration loss, mode collapse, and confident hallucination.
* 18:50 - Phrase-level commitment and what comes next: Evidence that sharpening cascades across tokens, why simple concept-clustering won't fix it, and what a meaning-level training objective might require.

RECOMMENDED READING

* Training language models to follow instructions with human feedback [https://arxiv.org/abs/2203.02155]: The Ouyang et al. paper that introduced the 'alignment tax' framing the episode invokes when connecting instruction tuning to sharpened, sometimes-wrong commitments.
* Language Models (Mostly) Know What They Know [https://arxiv.org/abs/2207.05221]: Anthropic's investigation of LLM self-knowledge and calibration, a natural counterpoint to the episode's claim that uncertainty-based detection has a structural ceiling.
* Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation [https://arxiv.org/abs/2302.09664]: Kuhn, Gal, and Farquhar's proposal to cluster generations by meaning before measuring uncertainty, the closest existing analog to the 'count meanings, not tokens' move at the heart of P-mass.
* How Language Model Hallucinations Can Snowball [https://arxiv.org/abs/2305.13534]: Zhang et al. on how early committed tokens lock models into wrong continuations, directly relevant to the episode's discussion of phrase-level commitment cascading past the first token.
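
The vote-splitting idea behind P-mass in this episode can be made concrete with a small sketch. Everything below is illustrative: the probabilities are made up, the wrong answer ('Kremlin') and the token-to-concept mapping are hypothetical, and treating the 0.2 threshold as a cutoff on the correct concept's P-mass is an assumption about how the paper uses it.

```python
from collections import defaultdict

# Illustrative next-token probabilities (made-up numbers, not from the paper).
next_token_probs = {
    "Saint": 0.22,
    "St": 0.20,
    "Cathedral": 0.18,
    "Kremlin": 0.30,  # hypothetical wrong answer
    "The": 0.10,      # token that maps to no tracked concept
}

# Hypothetical mapping from a surface token to the concept it spells.
token_to_concept = {
    "Saint": "Saint Basil's Cathedral",
    "St": "Saint Basil's Cathedral",
    "Cathedral": "Saint Basil's Cathedral",
    "Kremlin": "Kremlin",
}


def greedy_token(probs):
    """Greedy decoding: pick the single most probable token."""
    return max(probs, key=probs.get)


def p_mass(probs, mapping):
    """Sum probability over all surface forms of each concept."""
    mass = defaultdict(float)
    for token, p in probs.items():
        concept = mapping.get(token)
        if concept is not None:
            mass[concept] += p
    return dict(mass)


def is_commitment_failure(probs, mapping, correct_concept, threshold=0.2):
    """Wrong greedy pick even though the correct concept holds enough mass.

    The threshold semantics here are an assumption for illustration.
    """
    chosen_concept = mapping.get(greedy_token(probs))
    correct_mass = p_mass(probs, mapping).get(correct_concept, 0.0)
    return chosen_concept != correct_concept and correct_mass >= threshold


print(greedy_token(next_token_probs))
print(p_mass(next_token_probs, token_to_concept))
print(is_commitment_failure(next_token_probs, token_to_concept,
                            "Saint Basil's Cathedral"))
```

In this toy distribution the correct concept holds most of the probability, but no single spelling of it beats the wrong token, so greedy decoding commits to the wrong answer.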

Finding Millions of Readable Concepts Inside a Real, Deployed AI Model

Source: Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet [https://arxiv.org/abs/2605.29358]

Paper was published on May 28, 2026. This episode was AI-generated on May 29, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

Researchers reached into Claude's internals, found the single thread that means 'Golden Gate Bridge,' and turned it up until the model believed it was the bridge. This episode unpacks the paper that proved interpretability works on a real commercial model and is unusually honest about everything it still can't do.

KEY TAKEAWAYS

* Why individual neurons mean nothing, and how the 'superposition' idea (concepts as blended directions, like mixing paint) explains it
* How sparse autoencoders un-mix those directions into millions of human-readable features, and how scaling laws turned 'how big a dictionary' into an engineering decision
* The crucial difference between a feature that merely correlates with a concept (a thermometer) and one you can pull to change behavior (a thermostat)
* Why the reasoning that actually mattered in the Kobe Bryant trivia chain was the seventieth-loudest signal: loudness and importance turn out to be different things
* Why finding a 'deception' or 'bioweapon' feature is not an alarm bell, and what the authors say the real safety signal would be
* Where the paper is weakest: no ground truth, circular Claude-grades-Claude evaluation, off-distribution steering, cherry-picked reasoning chains, and dictionaries that miss most of what's there

CHAPTERS

* 00:00 - Golden Gate Claude and the question of where concepts live: The opening demo sets up the central puzzle: what is a nameable 'thread' inside a pile of numbers, and why can't you just read it off the neurons?
* 03:05 - Superposition and dictionary learning: The paint-mixing intuition for why concepts are directions rather than neurons, and how sparse autoencoders recover those directions by reconstructing the model's state from a tiny handful of features.
* 06:10 - From toy models to a real one: Why scaling this to Claude 3 Sonnet (and deriving Chinchilla-style scaling laws to pick a 34-million-feature dictionary) was an existential test for the whole field.
* 09:15 - Are the features real? Abstraction and causation: Features that fire across languages and even images, the 'bug in code' detector, and the thermometer-versus-thermostat distinction that the paper's credibility rests on.
* 12:20 - Watching the model reason: the Kobe Bryant chain: How knocking out features one at a time revealed a causal hop from Kobe to Lakers to LA to California to Sacramento, and why the load-bearing features were buried deep in the noise.
* 14:05 - The periodic-table finding: How concept frequency predicts when a concept gets its own feature, why a one-in-a-billion concept needs a billion-feature dictionary, and how features split as the microscope gets sharper.
* 18:30 - Safety-relevant features, carefully framed: Deception, secrecy, hate, and self-concept features exist, but the authors argue the real question is when they fire, not that they exist, illustrated with honesty-lever and forced-screed demos.
* 19:55 - Where the paper is weakest: The authors' own reservations: no ground truth, the circular Claude-grades-Claude evaluation, the sensitivity gap, extreme off-distribution steering, cherry-picked chains, and demonstrably incomplete dictionaries.
* 24:41 - What it actually settled: The technique survived contact with a real model and made unsupervised, one-time-cost interpretability credible, while leaving the safety payoff an explicit aspiration rather than a result.

RECOMMENDED READING

* Toy Models of Superposition [https://arxiv.org/abs/2209.10652]: The earlier Anthropic work that introduced the superposition hypothesis the episode leans on (the paint-mixing intuition for why single neurons are polysemantic), though demonstrated only on toy models, which this paper had to show would scale.
* Towards Monosemanticity: Decomposing Language Models With Dictionary Learning [https://transformer-circuits.pub/2023/monosemantic-features/index.html]: The one-layer 'sandbox' study whose skeptical reception ('cute, but does it scale?') is the exact existential question this episode says the Sonnet paper was built to answer.
* Training Compute-Optimal Large Language Models (Chinchilla) [https://arxiv.org/abs/2203.15556]: The scaling-law paper the episode name-checks as the template for deciding how big the 34-million-feature dictionary should be, turning a gamble into a curve you can read off.
* Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task (Othello-GPT) [https://arxiv.org/abs/2210.13382]: The Othello cautionary tale the hosts cite (researchers assumed the wrong board representation), illustrating why the episode prizes unsupervised dictionary learning over hand-built detectors.
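
As a rough illustration of the sparse-autoencoder idea discussed in this episode, here is a toy forward pass. It is a minimal sketch under loud assumptions: the weights are hand-set rather than trained, the activation space is 2-D with 3 feature directions, the encoder simply reuses the decoder directions, and the names `encode`, `decode`, `loss`, and `steer` are hypothetical. It is not the paper's architecture or training code.

```python
# Toy dictionary (assumption): 3 unit-length feature directions in a 2-D
# activation space, so there are more features than dimensions.
DECODER = [
    (1.0, 0.0),    # feature 0
    (0.0, 1.0),    # feature 1
    (-0.6, -0.8),  # feature 2
]
ENCODER_BIAS = -0.5  # negative bias keeps weakly matching features at zero


def encode(x):
    """ReLU(direction . x + bias) for each feature; most come out as zero."""
    return [max(0.0, d[0] * x[0] + d[1] * x[1] + ENCODER_BIAS) for d in DECODER]


def decode(features):
    """Rebuild the activation as a weighted sum of feature directions."""
    return tuple(
        sum(f * d[i] for f, d in zip(features, DECODER)) for i in range(2)
    )


def loss(x, l1_coeff=0.1):
    """Reconstruction error plus an L1 penalty that rewards sparsity."""
    features = encode(x)
    x_hat = decode(features)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return mse + l1_coeff * sum(features)


def steer(features, index, value):
    """'Thermostat' use: clamp one feature to a value, then decode."""
    clamped = list(features)
    clamped[index] = value
    return decode(clamped)


activation = (2.0, 0.0)
features = encode(activation)
print(features)                  # only one feature is active
print(decode(features))          # approximate reconstruction
print(steer(features, 2, 3.0))   # activation after forcing feature 2 on
```

Reading `features` is the 'thermometer' use of a feature; `steer` is the 'thermostat' use, where clamping a feature changes the activation that downstream computation would see.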

May 30, 2026, 27 min
