AI Papers: A Deep Dive
HOW TWO TOKENS REOPENED A REASONING METHOD THE FIELD HAD GIVEN UP ON

Source: Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning [https://arxiv.org/abs/2606.13106]
Paper was published on June 11, 2026.

This episode was AI-generated on June 13, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A year ago, AI researchers decided that silent, in-your-head reasoning was incompatible with the reinforcement learning that powers modern reasoning models. This paper argues that wall was never a law of nature, just a framing error fixable with two ordinary tokens, and then turns its own microscope on the result until the headline shrinks to something quieter and stranger.

KEY TAKEAWAYS
* Why on-policy RL only ever needed a probability at the moments the model actually decides something, and how two boundary tokens supply exactly that, leaving the deterministic latent steps trainable after all
* How the SWITCH framework trains a model to think silently, including the counterintuitive trick of converting all reasoning to latent at once instead of one span at a time
* An elegant causal-intervention experiment (dead silence versus matched-volume noise) that shows the silent step does specific, load-bearing computation rather than acting as inert filler
* Why the analysis quietly deflates its own premise: the 'recurrence' is really one consequential step plus a forced timer, and on real test problems you can rip the whole mechanism out with no effect
* What reinforcement learning actually changed: not the computation itself, but the model's judgment about when to deploy it
* Where the honest result lands: a tie with normal visible reasoning at modest token savings, not the 26-point blowout the headline number suggests

* 00:00 - The calculator distinction. Sets up the core idea that RL only needs probabilities at the moments you make a choice, not for the deterministic machinery in between.
* 02:56 - Why models think out loud, and the dream of thinking silently. Explains the cost of token-by-token reasoning and the Coconut idea of looping a model's thought vector back in without converting it to words.
* 05:53 - The wall: why RL seemed incompatible with hidden-state recurrence. Lays out the two problems (latent steps are untrainable and uninspectable) that led the field to abandon the approach.
* 08:49 - The fix: two boundary tokens and the SWITCH framework. Shows how adding discrete enter/exit tokens makes RL well-defined again by attaching probabilities only to the decision points.
* 11:46 - Training in three phases. Walks through supervised tagging of high-entropy spans, the all-at-once conversion to latent reasoning, and the Switch-GRPO reinforcement learning setup.
* 14:42 - The results, and what the headline number hides. Examines the 79% MATH-500 score and argues the honest framing is parity with visible reasoning at modestly fewer tokens, not a blowout over older latent methods.
* 17:39 - Turning the boundary tokens into a microscope. Uses three nested questions and a causal silence-versus-static intervention to show the switch is a real learned decision and the latent step carries specific computation.
* 20:35 - Where the recurrence deflates. Reveals that the work happens almost entirely in the first latent step and that on the full test set the mechanism can be removed with no effect.
* 23:32 - What RL actually changed, and how it eventually breaks. Shows RL recalibrated when to use latent reasoning rather than improving it, and documents the reward-hacking collapse the authors early-stop to avoid.
* 26:28 - Honest scope and the two-checkpoint concern. Weighs the constructive contribution against the limited testing, the deferred comparison to rival methods, and the fact that the analyzed model differs from the headlined one.
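
The core idea from the opening chapter, that on-policy RL needs a probability only where something was actually sampled, can be pictured with a small sketch. This is not the paper's Switch-GRPO code: the function names, the `is_decision` mask, and the clip value are all hypothetical, and the objective is a generic clipped probability-ratio surrogate with GRPO-style group-standardized rewards. Deterministic latent steps are represented by `None` log-probabilities and are simply skipped.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Standardize each reward within its sampled group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # Every rollout scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def masked_surrogate(new_logps, old_logps, is_decision, advantage, clip=0.2):
    """Clipped probability-ratio objective averaged over decision positions.

    Positions where is_decision is False stand in for deterministic latent
    steps: nothing was sampled there, so they contribute no ratio term.
    """
    terms = []
    for new_lp, old_lp, decided in zip(new_logps, old_logps, is_decision):
        if not decided:
            continue  # latent step: no probability needed or defined
        ratio = math.exp(new_lp - old_lp)
        clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
        terms.append(min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms) if terms else 0.0


# Hypothetical rollout: visible token, enter token, two latent steps,
# exit token, visible token.
is_decision = [True, True, False, False, True, True]
old_logps = [-0.5, -1.0, None, None, -0.7, -0.2]
new_logps = [-0.4, -0.9, None, None, -0.7, -0.3]

advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(masked_surrogate(new_logps, old_logps, is_decision, advantages[0]))
```

The two boundary tokens appear in the mask as ordinary sampled positions, which is why the ratio stays well-defined even though the steps between them have no distribution attached.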
RECOMMENDED READING
* Training Large Language Models to Reason in a Continuous Latent Space (Coconut) [https://arxiv.org/abs/2412.06769]: The hidden-state recurrence method this episode builds on, the 'feed the thought vector back in instead of a word' idea that SWITCH reopens for RL.
* DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [https://arxiv.org/abs/2402.03300]: Introduces GRPO, the on-policy RL algorithm whose probability-ratio requirement is the exact 'every position must be a choice' wall that the episode argues was a framing error.
* DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [https://arxiv.org/abs/2501.12948]: The R1-style RL-for-reasoning recipe the episode repeatedly invokes as the engine SWITCH has to stay compatible with.
* Think Before You Speak: Training Language Models With Pause Tokens [https://arxiv.org/abs/2310.02226]: The pause/filler-token line of work behind the episode's 'inert placeholder' fear that the causal silence-versus-static intervention is designed to test.
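
The silence-versus-static intervention mentioned above can be pictured as two replacements for a latent vector. The notes do not say how the noise was matched, so this minimal sketch assumes 'matched-volume' means matching the L2 norm of the original vector; the function names and the plain-list representation are hypothetical.

```python
import math
import random


def silence(hidden):
    """Replace the latent vector with zeros ('dead silence')."""
    return [0.0] * len(hidden)


def matched_static(hidden, rng):
    """Replace the latent vector with Gaussian noise of the same L2 norm.

    Assumption: 'matched volume' is read here as equal L2 norm.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in hidden]
    target = math.sqrt(sum(h * h for h in hidden))
    norm = math.sqrt(sum(n * n for n in noise))
    if norm == 0.0:
        return [0.0] * len(hidden)
    scale = target / norm
    return [n * scale for n in noise]


# Hypothetical latent vector; a real one would come from the model.
hidden = [3.0, 4.0]
rng = random.Random(0)
print(silence(hidden))
print(matched_static(hidden, rng))
```

Comparing the two conditions separates "any vector of the right size works" from "this particular vector matters": noise keeps the magnitude but destroys the content, while silence removes both.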