AI Papers: A Deep Dive

Why Training Only on Perfect Solutions Cripples a Model's Reasoning

22 min · Yesterday

Description

Source: Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently [https://arxiv.org/abs/2606.22938]
Paper was published on June 22, 2026.
This episode was AI-generated on June 23, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Everyone assumes clean, flawless examples are the best reasoning data, and a new theory paper proves that intuition is backwards. By formalizing reasoning as path-finding through a maze, two researchers show imitation learning provably can't teach backtracking, while reinforcement learning learns it for free from the model's own failures. The result is a clean, exponential gap that reframes what 'high-quality reasoning data' even means.

KEY TAKEAWAYS
* Why training on clean, backtracking-free solutions provably freezes a model's ability to retreat from dead ends: there's no gradient signal where there's no data
* How modeling reasoning as path-finding through a maze turns 'backtracking' into something you can prove theorems about
* The headline result: RL scales linearly with reasoning depth (W·K) while imitation blows up exponentially (W·L^K), from the identical starting model
* Why bolting a clever search wrapper onto a weak imitation model helps a lot but still can't fully close the gap
* The steelman critique: the central theorem is close to true by construction, and the exponential drama leans on a chosen graph topology and a deliberately pessimistic definition of SFT
* The practical payoff: why distilling from an RL-trained model works precisely because you inherit its messy recoveries, not just its answers

* 00:03 - Is clean data secretly the problem? The provocative claim that flawless solutions are the wrong training data, and why a new theory paper makes it more than a vibe.
* 01:28 - Two ways to train, one key difference. Setting up the fight between supervised fine-tuning and RLVR, with the crucial distinction that RL learns from the model's own failures.
* 03:41 - Turning reasoning into a maze. How the authors recast reasoning as path-finding through corridors with parallel lanes, making backtracking a measurable, provable quantity.
* 06:04 - No examples, no nudge. The simple gradient fact that dooms imitation learning: perfect solutions contain no dead ends, so backward-facing states never get any signal.
* 09:03 - Linear versus falling off a cliff. The exponential blowup of imitation versus the linear scaling of RL, and what that gap means concretely as reasoning gets deeper.
* 10:15 - How RL escapes the trap. Why reinforcement learning visits the exact dead-end states imitation never sees, and how its learning rule turns failure into the gradient that matters.
* 13:12 - Does it survive a real algorithm? Confirming the predicted optimum with PPO, a transformer, and asymmetric graphs, and why search scaffolding helps but still can't fully close the gap.
* 15:19 - How true by construction is this? The steelman critique: a pessimistic strawman SFT, a target-blind model, and an exponential that leans on a chosen topology and idealized RL analysis.
* 18:39 - The dead ends are the curriculum. The distillation fix and the big takeaway: quality reasoning data isn't clean data; it's data that keeps the struggle and the recoveries in.

RECOMMENDED READING
* Tree of Thoughts: Deliberate Problem Solving with Large Language Models [https://arxiv.org/abs/2305.10601] - The search-scaffolding approach the episode critiques; the authors show external orchestration helps but can't fully replace backtracking baked into the weights.
* DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [https://arxiv.org/abs/2501.12948] - The 'folklore' the episode says this theory paper finally formalizes: a flagship demonstration that RL with verifiable rewards produces genuine reasoning and backtracking behavior.
* Proximal Policy Optimization Algorithms [https://arxiv.org/abs/1707.06347] - The actual RL algorithm the paper uses to confirm its toy-model predictions on a real transformer; worth reading to understand the machinery behind the W-times-K result.
* Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [https://arxiv.org/abs/2201.11903] - The reasoning paradigm the paper models as path-finding through a graph; useful context for judging the gap between the episode's blind-search sandbox and chain-of-thought as actually practiced.
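
To get a feel for the headline scaling gap quoted in the key takeaways, W·K for RL versus W·L^K for imitation, the two expressions can simply be evaluated side by side. This is only an arithmetic sketch: the description does not define W or L, and K is assumed here to be the reasoning depth, so the values W = 4 and L = 3 and the function names below are hypothetical choices for illustration, not figures from the paper.

```python
def rl_cost(W: int, K: int) -> int:
    """Linear expression W*K quoted for RL (units not specified in the notes)."""
    return W * K


def imitation_cost(W: int, L: int, K: int) -> int:
    """Exponential expression W*L**K quoted for imitation learning."""
    return W * L ** K


if __name__ == "__main__":
    W, L = 4, 3  # hypothetical values, chosen only to show the growth rates
    print(f"{'K':>3} {'W*K':>6} {'W*L^K':>8}")
    for K in (1, 2, 4, 8):
        print(f"{K:>3} {rl_cost(W, K):>6} {imitation_cost(W, L, K):>8}")
```

Doubling K doubles the first column but squares the L^K factor in the second, which is the sense in which the gap is exponential in depth.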
