How an Open-Book Trick Teaches a Model to Catch Its Own Mistakes

Description

Source: Self-Trained Verification for Training- and Test-Time Self-Improvement [https://arxiv.org/abs/2605.30290]
Paper was published on May 28, 2026.

This episode was AI-generated on May 29, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

The same AI critic that's supposed to make reasoning models smarter keeps talking itself out of correct verdicts, and that broken critic is the hidden bottleneck the whole field leans on. This episode unpacks a clever fix: let a model peek at the answer key, learn what good error-spotting looks like, then take the key away. The payoff includes an 8-billion-parameter model that, guided by a trained critic, beats one thirty times its size on hard problems.

KEY TAKEAWAYS

* Why test-time refinement and self-training are secretly bottlenecked on the same weak component, the verifier, which gets more confident over rounds while accuracy stays flat
* The asymmetry that powers the method: diagnosing a flawed solution with an answer key in hand is far easier than spotting the error cold, and that gap becomes a free training signal
* Why plainly copying the teacher's feedback fails completely, while on-policy distillation (practicing and being corrected on your own attempts) works
* The headline results: roughly doubling pass rates on hard math, going from 1.5% to 21% on the hardest science problems, and an 8B model beating a 235B one
* The surprise the authors didn't expect: training inside the verification loop improved the model's solo, no-critic-present first attempts past a ceiling that standard training couldn't budge
* Where the work is soft: results rest on one model family and two domains that both have verifiable answers, the flywheel is demonstrated for only one turn, and small base rates inflate the multiplicative gains

* 00:00 - A critic that overturns its own correct verdict. A worked example of a verifier correctly judging a solution wrong and then arguing itself into the wrong answer, illustrating the structural failure at the heart of the paper.
* 03:01 - Two recipes, one shared bottleneck. How both test-time refinement loops and self-training depend entirely on a critic, and why models are structurally bad at catching their own subtle errors.
* 06:02 - The open-book asymmetry and self-trained verification. The core insight that grading with an answer key is far easier than without one, and how the same model is run with and without the key to generate a training signal.
* 09:04 - Why copying the teacher fails. The finding that supervised imitation of good feedback flatly doesn't work, while on-policy distillation (practicing on your own trajectory) does.
* 12:05 - Ruling out the alternatives and the science results. The experiments that show simpler critic-training approaches stall, the 14x improvement on the hardest science problems, and the small model beating one thirty times its size.
* 15:06 - Turning the critic back on the generator. Training a plateaued generator inside the verification loop, and the unexpected jump in its solo, unassisted first-attempt performance.
* 18:08 - Limitations and what actually shifts. An honest accounting of the method's narrow scope, reliance on verifiable answers, small base rates, and the undemonstrated flywheel, plus the reframe of self-improvement as a verification problem.

RECOMMENDED READING

* Large Language Models Cannot Self-Correct Reasoning Yet [https://arxiv.org/abs/2310.01798]: The empirical case that models confidently bless their own wrong answers without external help, the exact failure mode this episode's verifier is built to overcome.
* STaR: Bootstrapping Reasoning With Reasoning [https://arxiv.org/abs/2203.14465]: The foundational self-training recipe (keep the model's good attempts and train on them) that the episode names as one of the two bottlenecked recipes depending on a critic.
* On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes [https://arxiv.org/abs/2306.13649]: Goes deeper on the practice-and-correct training method the episode credits for working where plain imitation of the teacher's feedback flatly failed.
* Let's Verify Step by Step [https://arxiv.org/abs/2305.20050]: The influential argument that training a strong verifier (process reward model) drives reasoning gains, directly supporting the episode's 'intelligence lives in the critic, not the solver' reframe.
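
The caveat in the takeaways that small base rates inflate multiplicative gains can be made concrete with a little arithmetic. The sketch below uses the 1.5% and 21% figures quoted above. The helper name `gain_summary` and the 40% to 59.5% comparison are hypothetical, chosen only so that both cases share the same absolute gain; they are not results from the paper.

```python
def gain_summary(before_pct: float, after_pct: float) -> dict:
    """Describe a pass-rate change both as a ratio and in percentage points."""
    return {
        "ratio": after_pct / before_pct,   # multiplicative gain ("Nx")
        "points": after_pct - before_pct,  # absolute gain in percentage points
    }


# Figures quoted in the episode notes: 1.5% -> 21% on the hardest science problems.
hard_science = gain_summary(1.5, 21.0)

# Hypothetical comparison (NOT from the paper): the same 19.5-point gain
# starting from a much higher base rate.
hypothetical = gain_summary(40.0, 59.5)

print(f"quoted:       {hard_science['ratio']:.1f}x, +{hard_science['points']:.1f} points")
print(f"hypothetical: {hypothetical['ratio']:.2f}x, +{hypothetical['points']:.1f} points")
```

Both cases gain 19.5 percentage points, but only the low-base-rate case produces a double-digit multiplier, which is why the ratio alone can overstate an improvement.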

Finding Millions of Readable Concepts Inside a Real, Deployed AI Model

Source: Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet [https://arxiv.org/abs/2605.29358]
Paper was published on May 28, 2026.

This episode was AI-generated on May 29, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Researchers reached into Claude's internals, found the single thread that means 'Golden Gate Bridge,' and turned it up until the model believed it was the bridge. This episode unpacks the paper that proved interpretability works on a real commercial model, and is unusually honest about everything it still can't do.

KEY TAKEAWAYS

* Why individual neurons mean nothing, and how the 'superposition' idea (concepts as blended directions, like mixing paint) explains it
* How sparse autoencoders un-mix those directions into millions of human-readable features, and how scaling laws turned 'how big a dictionary' into an engineering decision
* The crucial difference between a feature that merely correlates with a concept (a thermometer) and one you can pull to change behavior (a thermostat)
* Why the reasoning that actually mattered in the Kobe Bryant trivia chain was the seventieth-loudest signal: loudness and importance turn out to be different things
* Why finding a 'deception' or 'bioweapon' feature is not an alarm bell, and what the authors say the real safety signal would be
* Where the paper is weakest: no ground truth, circular Claude-grades-Claude evaluation, off-distribution steering, cherry-picked reasoning chains, and dictionaries that miss most of what's there

* 00:00 - Golden Gate Claude and the question of where concepts live. The opening demo sets up the central puzzle: what is a nameable 'thread' inside a pile of numbers, and why can't you just read it off the neurons?
* 03:05 - Superposition and dictionary learning. The paint-mixing intuition for why concepts are directions rather than neurons, and how sparse autoencoders recover those directions by reconstructing the model's state from a tiny handful of features.
* 06:10 - From toy models to a real one. Why scaling this to Claude 3 Sonnet, and deriving Chinchilla-style scaling laws to pick a 34-million-feature dictionary, was an existential test for the whole field.
* 09:15 - Are the features real? Abstraction and causation. Features that fire across languages and even images, the 'bug in code' detector, and the thermometer-versus-thermostat distinction that the paper's credibility rests on.
* 12:20 - Watching the model reason: the Kobe Bryant chain. How knocking out features one at a time revealed a causal hop from Kobe to Lakers to LA to California to Sacramento, and why the load-bearing features were buried deep in the noise.
* 14:05 - The periodic-table finding. How concept frequency predicts when a concept gets its own feature, why a one-in-a-billion concept needs a billion-feature dictionary, and how features split as the microscope gets sharper.
* 18:30 - Safety-relevant features, carefully framed. Deception, secrecy, hate, and self-concept features exist, but the authors argue the real question is when they fire, not that they exist, illustrated with honesty-lever and forced-screed demos.
* 19:55 - Where the paper is weakest. The authors' own reservations: no ground truth, the circular Claude-grades-Claude evaluation, the sensitivity gap, extreme off-distribution steering, cherry-picked chains, and demonstrably incomplete dictionaries.
* 24:41 - What it actually settled. The technique survived contact with a real model and made unsupervised, one-time-cost interpretability credible, while leaving the safety payoff an explicit aspiration rather than a result.

RECOMMENDED READING

* Toy Models of Superposition [https://arxiv.org/abs/2209.10652]: The earlier Anthropic work that introduced the superposition hypothesis the episode leans on (the paint-mixing intuition for why single neurons are polysemantic), but only on the toy models this paper had to prove scalable.
* Towards Monosemanticity: Decomposing Language Models With Dictionary Learning [https://transformer-circuits.pub/2023/monosemantic-features/index.html]: The one-layer 'sandbox' study whose skeptical reception ('cute, but does it scale?') is the exact existential question this episode says the Sonnet paper was built to answer.
* Training Compute-Optimal Large Language Models (Chinchilla) [https://arxiv.org/abs/2203.15556]: The scaling-law paper the episode name-checks as the template for deciding how big the 34-million-feature dictionary should be, turning a gamble into a curve you can read off.
* Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task (Othello-GPT) [https://arxiv.org/abs/2210.13382]: The Othello cautionary tale the hosts cite (researchers assumed the wrong board representation), illustrating why the episode prizes unsupervised dictionary learning over hand-built detectors.
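
The notes above describe sparse autoencoders as reconstructing the model's state from a tiny handful of features. A minimal sketch of that idea follows, assuming a generic textbook formulation (ReLU encoder, linear decoder, squared reconstruction error plus an L1 sparsity penalty). The function names, the toy 2-dimensional state, and the hand-picked weights are all illustrative assumptions; this is not the paper's exact objective, architecture, or training code.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, l1_coeff):
    """One forward pass of a generic sparse autoencoder (illustrative only).

    features = ReLU(W_enc x + b_enc)   # mostly zeros when the SAE is sparse
    recon    = W_dec features          # rebuild the state from active features
    loss     = ||x - recon||^2 + l1_coeff * sum(|features|)
    """
    pre_activations = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre_activations]
    recon = matvec(w_dec, features)
    reconstruction_error = sum((a - b) ** 2 for a, b in zip(x, recon))
    sparsity_penalty = sum(abs(f) for f in features)
    return features, recon, reconstruction_error + l1_coeff * sparsity_penalty


# Toy setup (assumed): a 2-dimensional "model state" and a 3-feature dictionary,
# i.e. more features than dimensions, as in the superposition picture.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # one row per feature
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]     # one column per feature direction

features, recon, loss = sae_forward([2.0, 0.0], w_enc, b_enc, w_dec, l1_coeff=0.1)
active = [i for i, f in enumerate(features) if f > 0.0]
print("active features:", active, "reconstruction:", recon, "loss:", loss)
```

In this toy case a single feature is enough to rebuild the input exactly, so the only remaining loss is the sparsity penalty on that one active feature. Real dictionaries are learned by gradient descent over many activations rather than set by hand.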

May 30, 2026 · 27 min
