AI Papers: A Deep Dive
HOW AN AGENT GOT 44 POINTS BETTER BY MINING ITS OWN SCRATCH PAPER

Source: Inducing Reasoning Primitives from Agent Traces [https://arxiv.org/abs/2606.02994]
Paper was published on June 02, 2026.

This episode was AI-generated on June 3, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

An AI agent that solved a hard legal-reasoning task only 30% of the time jumped to 74%, using nothing but its own past successful transcripts, with zero retraining. This episode unpacks why that isn't a free lunch, the control experiment showing the gain is not just extra compute, and the honest places where the whole method falls apart.

KEY TAKEAWAYS

* Why mining an agent's own successful 'thoughts', rather than its actions, can convert inconsistent competence into consistent competence without changing a single weight
* The 'implicit aggregation' mechanism: how a stable consensus recipe of the agent's best behavior dissolves the apparent paradox of beating its own teacher
* Why the Self-Consistency control (20x more compute via majority vote) fails to close the gap, showing that the gain comes from better-organized reasoning, not just more thinking
* Where the method breaks: arithmetic-heavy tasks where language-model 'pseudo-tools' compound small errors and drop below plain chain-of-thought
* The honest caveats: a curated benchmark, 'surpasses' meaning 'matches' on most tasks, and the headline +44 partly reflecting how broken the baseline was
* Why human-readable induced tools make the agent's reasoning vocabulary auditable and editable, unlike invisible fine-tuning

* 00:00 - The 30-to-74 jump that looks like a free lunch
  The opening puzzle: an agent more than doubles its score on an NBA contract-legality task using only its own previous successful transcripts.
* 03:24 - The scratch paper problem
  How ReAct agents reinvent the same reasoning moves on every problem and discard the valuable method along with the answer.
* 06:48 - The four-stage induction pipeline
  Walking through the deliberately minimal recipe: run a generic agent, keep only thoughts from successful runs, label and cluster the reasoning moves, and name the top five.
* 10:12 - Pseudo-tools and the colleague-down-the-hall trick
  Why the induced 'tools' contain no real code, and how routing a request to an improvising model bridges callable names and fuzzy judgment.
* 13:36 - Implicit aggregation: why it beats its own source
  The chef-and-recipe analogy explaining how a corpus-level specification locks in the agent's best behavior and converts high-variance competence into reliability.
* 17:00 - The compute objection and the Self-Consistency control
  Testing whether the gains are just extra thinking budget, and why 20x more compute via majority vote fails to reproduce the lift.
* 20:25 - Where it breaks: arithmetic, curation, and modest gains
  The honest limitations: deterministic computation killing the method, a favorable curated benchmark, and 'surpasses' that's really 'matches' on most tasks.
* 23:49 - Auditable competence and the bigger reframe
  Why human-readable induced tools beat invisible fine-tuning, plus the statistical due diligence and the closing picture of self-improvement without new capability.

RECOMMENDED READING

* ReAct: Synergizing Reasoning and Acting in Language Models [https://arxiv.org/abs/2210.03629]
  Introduces the thought-action-observation agent loop that this episode's induced agent runs on and mines for reusable reasoning moves.
* Agent Workflow Memory [https://arxiv.org/abs/2409.07429]
  The 'nearest cousin' that the episode explicitly contrasts with the paper. It mines whole multi-step workflows from traces, whereas this paper extracts atomic reasoning primitives.
* Self-Consistency Improves Chain of Thought Reasoning in Language Models [https://arxiv.org/abs/2203.11171]
  The majority-vote-over-samples baseline the episode highlights as the crucial control showing that 20x more compute does not reproduce the library's gains.
* Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [https://arxiv.org/abs/2201.11903]
  The plain single-pass reasoning baseline the induced agent is measured against, including the arithmetic-heavy tasks where the pseudo-tool approach actually falls below it.
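
For listeners who want something concrete, the four-stage recipe from the 06:48 chapter (run a generic agent, keep only thoughts from successful runs, label and cluster the reasoning moves, name the top five) can be sketched roughly as below. This is a minimal illustration, not the paper's implementation. The `Trace` class, the `label_move` function, and its keyword table are all hypothetical names invented for this sketch. In particular, the episode does not say how labels are produced, so plain keyword matching stands in for that step, and stage 1 is assumed to have already happened (the traces are given as input).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trace:
    """One agent run: its free-text thoughts and whether the final answer was correct."""
    thoughts: list[str]
    success: bool


def label_move(thought: str) -> str:
    """Hypothetical stand-in for the labeling step: map a thought to a coarse move name.

    Keyword matching is an assumption made only so the sketch runs without a model.
    """
    keywords = {
        "rule": "identify_rule",
        "compare": "compare_values",
        "exception": "check_exception",
        "conclude": "draw_conclusion",
    }
    lowered = thought.lower()
    for keyword, label in keywords.items():
        if keyword in lowered:
            return label
    return "other"


def induce_primitives(traces: list[Trace], top_k: int = 5) -> list[str]:
    """Return the names of the top_k most frequent reasoning moves in successful runs."""
    # Stage 2: keep only thoughts from successful runs (actions are ignored entirely).
    thoughts = [t for trace in traces if trace.success for t in trace.thoughts]
    # Stage 3: label each thought; thoughts sharing a label form one cluster.
    clusters = Counter(label_move(t) for t in thoughts)
    # Stage 4: the largest clusters become the named primitives.
    return [label for label, _ in clusters.most_common(top_k)]
```

Calling `induce_primitives(traces)` on a handful of `Trace` objects returns move names ordered by how often they appear in successful runs. Thoughts from failed runs never enter the count, which is the filtering step the episode emphasizes.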