How Teaching an AI to Predict, Not Act, Made It a Better Actor

Description

Source: Qwen-AgentWorld: Language World Models for General Agents [https://arxiv.org/abs/2606.24597]
Paper was published on June 23, 2026.
This episode was AI-generated on June 24, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Researchers trained a model to do one thing, guess what a computer would say back, with zero acting, no tool calls, no clicking. Then it got better at every multi-step agent task they threw at it, including a function-calling benchmark whose data it had never seen. The bet: prediction and action are the same muscle, and the field has only been training one side of it.

KEY TAKEAWAYS
* Why a model trained only to predict environment responses, never to act, transfers measurably into better agent behavior, with prediction accuracy rising from 70% to 78%
* The three-stage recipe (pre-train injects, fine-tune activates, RL sharpens) and how the reward function had to be redesigned to stop the model from flattering its own AI judge
* How a steered simulator beat a live search engine for training (50.3% vs 45.6%) by deliberately handing back partial answers: the 'stingy teacher' effect
* Why training agents inside entirely fictional worlds (a 2030 Mars colony) made them better at real search without contaminating their knowledge
* Where the marketing outruns the evidence: a sub-half-point frontier win, a fifth-place GUI ranking, an AI judge with a documented exploit, and a 'beats reality' claim resting on a single comparison
* Why environments, not model size, are the real bottleneck in agent training, and how a learnable simulator could unshackle it

* 00:00 - Two muscles or one? Sets up the central puzzle: a model trained only to predict, never to act, becoming a better actor across every task.
* 01:09 - The half of the loop nobody trained. Explains the policy/world-model split, the theory that general agents must contain a world model, and why environments are the field's real bottleneck.
* 03:02 - Turning seven worlds into one problem. How representing terminals, phones, and web pages all as text lets one model learn to be any environment under a single objective.
* 04:39 - Outsmarting a model that cheats the grader. Walks through the three-stage training pipeline, the self-praise reward hack, and the clever loss-masking trick for boilerplate turns.
* 10:08 - Is the headline as big as it sounds? Examines the benchmark results: a razor-thin frontier margin versus a clean eight-point win over their own base model, plus the cross-domain transfer effect.
* 13:42 - When a fake world beats the real one. The decoupled paradigm: training agents inside fictional worlds and against a steered simulator that beat a live search engine.
* 17:38 - Prediction with no acting in it. The unified paradigm: a single-turn, tool-free warm-up that lifts agent performance on all seven multi-turn benchmarks, demonstrated with the Postfix mail server case.
* 20:59 - Where the marketing runs ahead. Finn's three-part critique: the thin headline win, the gameable AI judge, and the 'beats reality' claim resting on a single narrow comparison.
* 24:14 - What survives the harshest read. The lasting contribution (prediction as a trainable foundation skill that transfers to action) and what it could change about agent-training economics.

RECOMMENDED READING
* Robust agents learn causal world models [https://arxiv.org/abs/2402.10877]: The Richens et al. result the episode cites as its theoretical spine, proving that any agent generalizing across enough tasks must have learned a world model.
* A Path Towards Autonomous Machine Intelligence [https://openreview.net/forum?id=BZ5a1r-kVsf]: LeCun's manifesto for predict-before-you-act agents, the 'old vision' the episode invokes when explaining the unified paradigm, where the agent simulates consequences before committing to an action.

When Turning Experience Into Code Makes Your AI Agent Dumber

Source: Metis: Bridging Text and Code Memory for Self-Evolving Agents [https://arxiv.org/abs/2606.24151]
Paper was published on June 23, 2026.
This episode was AI-generated on June 24, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

An AI agent that distilled its hard-won experience into reusable code scored ten points worse than an agent with no memory at all. This episode unpacks why the sophisticated-looking move, freezing lessons into callable tools, is also the fragile one, and what the right fix turns out to be. You'll come away understanding the single most basic decision in building agents that learn on the job: when a lesson should stay as soft advice, and when it's earned the right to become code.

KEY TAKEAWAYS
* Why storing an agent's experience as callable code can drop it below an agent with no memory at all: a 22-point collapse the moment it has to generalize
* The 'injection asymmetry': text is consumed as adaptable advice you filter through reality, while code is a trusted black box whose flaws propagate to every caller and suppress the agent's own recovery behavior
* Metis's 'text first, code earned' policy: sorting experience into plans, facts, and pitfalls, and crystallizing only recurring plans into tools using the desire-path principle
* Why the codifier deliberately never reads the messy trajectory, building tools from the clean query pattern instead, and how that lets even failed runs safely count toward codification
* The ablation that proves the recurrence gate: an 'Eager' version cost 47% more to build, scored worse, and left over half its tools never invoked
* Where the clean story has a seam: the headline result is really about ungated, trajectory-trained, unvalidated code on a single benchmark, not a law that 'code memory is bad'

* 01:57 - The brilliant employee with amnesia. Frames the core problem: stateless agents lose everything they figure out, and the field hasn't examined how lessons should be stored.
* 03:01 - Text advice or a black-box tool? Lays out the fork between storing lessons as adaptable text versus callable code, and why the real difference is how the agent consumes each.
* 04:50 - The experiment that fixed every variable. Describes the clean diagnostic on AppWorld, splitting executor and reflector models, and measuring construction cost, execution efficiency, and transfer reliability.
* 08:43 - The 22-point collapse. Reveals the headline reversal: code memory looks great in-sample but collapses 22 points under realistic streaming, dropping below the no-memory baseline.
* 10:06 - Why the confident tool fails hard. Explains the injection asymmetry through the coworker analogy and why trusted code suppresses an agent's own self-correction.
* 13:07 - Paving only the paths people walk. Walks through Metis's three design choices (the plans/facts/pitfalls taxonomy, the recurrence gate, and query-only codification) using the desire-path analogy.
* 18:13 - Does the machinery actually pay off? Tests the predictions: Metis is more accurate and cheaper at once, and the Eager ablation proves the recurrence gate is a quality filter.
* 21:41 - The seam in the clean story. The steelman critique: the real claim is about ungated, trajectory-trained code on a single benchmark, with the genuine edge limited to distribution shift.
* 24:30 - Don't pour the concrete too early. Draws out the durable lesson (store knowledge in a form that follows its properties) and poses the closing question to listeners.

RECOMMENDED READING
* ReAct: Synergizing Reasoning and Acting in Language Models [https://arxiv.org/abs/2210.03629]: The think-act-observe loop the episode names as the baseline floor every memory variant in Metis is measured against.
* AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents [https://arxiv.org/abs/2407.18901]: The exact 457-API simulated benchmark all of the episode's accuracy and token numbers are run on.
* Voyager: An Open-Ended Embodied Agent with Large Language Models [https://arxiv.org/abs/2305.16291]: The canonical 'agent builds a reusable skill library of callable code' approach this episode's text-first-code-earned policy is implicitly arguing against.
* Generative Agents: Interactive Simulacra of Human Behavior [https://arxiv.org/abs/2304.03442]: A contrasting take where agent experience is stored and retrieved as natural-language memory, the 'soft advice' side of the episode's text-versus-code fork.
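
For readers who want a concrete picture of the 'text first, code earned' idea from the Metis episode, here is a minimal Python sketch of a recurrence gate. It is an illustration under stated assumptions, not the paper's implementation: the class name `RecurrenceGate`, the threshold of 3, and the example query patterns are all hypothetical. It reflects only two points from the notes above: the gate looks at the query pattern rather than the trajectory, and every run counts toward recurrence whether it succeeded or failed.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RecurrenceGate:
    """Hypothetical gate: a plan stays text until its query pattern recurs."""

    threshold: int = 3  # assumed value for illustration, not from the paper
    counts: Counter = field(default_factory=Counter)
    codified: set = field(default_factory=set)

    def observe(self, query_pattern: str) -> str:
        """Record one run (success or failure) and return the storage form."""
        self.counts[query_pattern] += 1
        if query_pattern in self.codified:
            return "code"  # already promoted to a tool
        if self.counts[query_pattern] >= self.threshold:
            self.codified.add(query_pattern)  # the pattern has earned code
            return "code"
        return "text"  # keep as adaptable advice for now


# Example with made-up query patterns.
gate = RecurrenceGate(threshold=3)
stream = ["pay_bill", "pay_bill", "share_playlist", "pay_bill", "pay_bill"]
decisions = [gate.observe(pattern) for pattern in stream]
print(decisions)
```

An 'Eager' variant in this sketch would correspond to `threshold=1`, where every pattern becomes a tool on first sight, including ones that never recur.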

Yesterday · 26 min

How Teaching an AI to Predict, Not Act, Made It a Better Actor
