Terminal Agents Get Free Supervision From The Tokens We've Been Throwing Away

Description

Source: ECHO: Terminal Agents Learn World Models for Free [https://arxiv.org/abs/2605.24517]
Paper was published on May 23, 2026. This episode was AI-generated on May 26, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Standard agent RL throws away 85% of rollouts because the task didn't succeed, but the terminal's responses inside those failed runs contain dense, gradable supervision that nobody was using. A new Microsoft Research paper shows that adding a simple next-token loss on environment outputs roughly doubles task success, recovers most of the value of expensive expert demonstrations, and in some cases lets models improve with no reward signal at all.

KEY TAKEAWAYS
* Why agent RL's reward sparsity is partly an artifact of which tokens we compute loss on, not a property of the task
* How ECHO's one-line addition (cross-entropy on terminal output tokens) roughly doubles TerminalBench 2.0 pass rates at 8B and 14B scale
* The lambda=0.2 collapse: when the auxiliary weight is too high, models learn to issue boring commands whose outputs are easy to predict
* Why ECHO can substitute for the 'interaction prior' half of expert demonstrations but not the 'strategy prior' half
* The verifier-free result (improvement with no reward signal on some held-out tasks, active regression on others) and what it tells us about when prediction-as-learning works
* Honest limits: small absolute numbers, untested at higher base capability, and a 'world model' claim that rests on a single transfer experiment

* 00:00 - The supervision that was already in the rollout. Framing the core observation: failed agent trajectories contain thousands of environment tokens whose gradients GRPO masks out.
* 03:13 - What ECHO actually changes. The one-line addition of next-token loss on terminal outputs, and the chess-student analogy for why predicting the environment forces understanding.
* 06:27 - The headline numbers, honestly. Roughly doubled pass rates at 8B and 14B on TerminalBench 2.0, from a baseline of 2-5%, with timeouts cut in half and faster convergence.
* 09:41 - Which tokens to predict, and the lambda collapse. Why warning messages had to be excluded, and how setting the auxiliary loss weight too high causes models to game the prediction objective with trivial commands.
* 12:55 - Substituting for expert demonstrations. ECHO from a raw base model recovers most of the value of 15,000 GLM-4.6 demonstrations, but only the interaction-prior half, not the strategy half.
* 16:09 - Transfer evidence and the world-modeling claim. ECHO models predict Qwen3-32B's trajectories far better than GRPO baselines, suggesting transferable knowledge of terminal dynamics, though what specifically transferred isn't probed.
* 19:23 - The verifier-free experiment. Turning off the reward signal entirely and letting environment prediction alone drive improvement, which works on PyTerm, fails on TBLite, and reveals when the method needs action-linked feedback.
* 22:36 - Steelman, limits, and what to test next. Five honest caveats about the result and the open question of whether ECHO generalizes beyond terminals and beyond low-capability base models.

RECOMMENDED READING
* Group Relative Policy Optimization (DeepSeekMath) [https://arxiv.org/abs/2402.03300]: Introduces the GRPO algorithm that ECHO modifies; essential background for understanding what 'masking out the terminal tokens' actually means in the baseline.
* Curiosity-driven Exploration by Self-supervised Prediction [https://arxiv.org/abs/1705.05363]: The canonical prior work on learning from prediction error as an intrinsic signal, which the episode's verifier-free result echoes in a language-model setting.
* Reinforcement Learning with Unsupervised Auxiliary Tasks (UNREAL) [https://arxiv.org/abs/1611.05397]: A foundational example of adding auxiliary prediction losses to RL agents, useful for contextualizing ECHO against the deeper history of dense-supervision methods the paper doesn't directly compare to.
* SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [https://arxiv.org/abs/2310.06770]: Sets the benchmark context for the kind of terminal-agent task ECHO is trying to improve, and frames why doubling a 5% pass rate matters even though the absolute numbers stay small.
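As a rough illustration of the 'one-line addition' described in these notes, the sketch below adds a weighted cross-entropy term over terminal-output tokens to an existing policy loss. It is not the paper's implementation: the function name `echo_style_loss`, its arguments, and the choice of a per-token mean are assumptions made for illustration, and the policy (GRPO) loss is treated as an opaque number computed elsewhere.

```python
from typing import Sequence


def echo_style_loss(
    token_logprobs: Sequence[float],
    is_env_token: Sequence[bool],
    policy_loss: float,
    lam: float,
) -> float:
    """Hypothetical sketch: policy loss plus lam * mean NLL on environment tokens.

    token_logprobs: model log-probability of each observed token in the rollout.
    is_env_token:   True where the token came from the terminal, not the agent.
    policy_loss:    RL loss already computed on the agent's own tokens.
    lam:            auxiliary weight (illustrative; no default is implied).
    """
    if len(token_logprobs) != len(is_env_token):
        raise ValueError("token_logprobs and is_env_token must align")

    # Negative log-likelihood of the tokens the environment actually produced.
    env_nll = [-lp for lp, is_env in zip(token_logprobs, is_env_token) if is_env]

    # Rollouts with no environment tokens contribute no auxiliary term.
    aux = sum(env_nll) / len(env_nll) if env_nll else 0.0
    return policy_loss + lam * aux


# Example rollout: two agent tokens and two terminal tokens.
total = echo_style_loss(
    token_logprobs=[-0.5, -2.0, -1.0, -3.0],
    is_env_token=[False, True, False, True],
    policy_loss=1.0,
    lam=0.1,
)
```

With `lam` set to 0 the function returns the policy loss unchanged, which corresponds to the baseline where terminal tokens are masked out. The weight is left as a required argument because the notes report that too large a value leads to collapse.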

Why Frozen-Weight Agents Still Get Worse Over Time

Source: Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems [https://arxiv.org/abs/2605.26302]
Paper was published on May 25, 2026. This episode was AI-generated on May 27, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A deployed AI agent's model weights never change, but the agent itself ages, and it does so in at least four mechanistically distinct ways. A new paper introduces a diagnostic ladder that can tell those failure modes apart, and shows that a one-paragraph change to how an agent summarizes its own memory can extend its useful lifespan by more than four times.

KEY TAKEAWAYS
* Agent reliability is a lifespan property, not a benchmark snapshot: the memory store, retrieval, and compaction around a frozen model keep changing every session
* Four named failure modes: compression, interference, revision, and maintenance aging, split into accumulation-driven and event-driven families
* The counterfactual ladder: a three-rung diagnostic that isolates write failures, read failures, and utilization failures without needing model internals
* Three models with nearly identical error rates can have completely different underlying diseases, and 'add more memory' is the wrong fix for two of them
* A one-paragraph 'careful' compaction prompt that names what to preserve verbatim yields roughly a 4.5x lifespan improvement on the same system
* Production monitoring tends to track constraint compliance while missing silent precision decay: the agent stops violating rules but also stops knowing the specifics
* Scale doesn't fix structural problems: a small typed-state sidecar cuts running-balance error 25–50% with no model change

* 00:00 - Four vignettes, one puzzle. Four deployed-agent failures that the standard 'frozen weights = frozen system' mental model can't explain.
* 02:05 - Reframing reliability as a lifespan property. Why the apparatus around the model (memory, retrieval, compaction) is what actually changes over time.
* 04:10 - The four aging mechanisms. Compression, interference, revision, and maintenance aging, and why they split into accumulation-driven and event-driven families.
* 06:30 - The counterfactual ladder. A three-rung diagnostic that isolates write, read, and utilization failures by progressively swapping in oracle components.
* 08:20 - Same score, different disease. Empirical results showing models with near-identical error rates can have completely different failure breakdowns under the ladder.
* 10:25 - The 4.5x compaction-prompt result. How a one-paragraph change to summarization instructions extends agent half-life dramatically on the same underlying system.
* 12:30 - Silent precision decay. Why constraint-violation monitoring stays green while the agent quietly forgets the specifics it was supposed to remember.
* 14:35 - Why scale doesn't save the running budget. A small and a large model both drift on arithmetic over a session history because the failure is representational, not capacity-bound.
* 16:41 - Honest critique. Synthetic scenarios, simple memory architectures, and short session horizons: what the paper's numbers can and can't tell us.
* 18:46 - Production CLI agents and re-reading. Findings from Claude Code and OpenHands on why correct answers correlate with more retrieval, and why flagship models can write lower-fidelity artifacts.
* 20:51 - The sticky note fix. A small typed-state overlay alongside normal memory that cuts accumulator error substantially without changing the model.

RECOMMENDED READING
* MemGPT: Towards LLMs as Operating Systems [https://arxiv.org/abs/2310.08560]: Proposes a hierarchical memory system with explicit paging between context and external storage; directly relevant to the episode's argument that the fix for agent aging is structural, not bigger models.
* Lost in the Middle: How Language Models Use Long Contexts [https://arxiv.org/abs/2307.03172]: Empirical evidence that models fail to utilize information even when it's present in context, the 'utilization failure' rung of the episode's counterfactual ladder.
* Generative Agents: Interactive Simulacra of Human Behavior [https://arxiv.org/abs/2304.03442]: The Park et al. paper that popularized reflection-and-summarization memory architectures, exactly the kind of compaction-based stack whose aging dynamics this episode dissects.
* Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks [https://arxiv.org/abs/2005.11401]: The original RAG paper, useful background for the episode's distinction between write failures, retrieval failures, and utilization failures in memory-augmented agents.
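To make the 'typed-state sidecar' idea from these notes more concrete, here is a minimal sketch of one way such an overlay could look: a running balance held in an exact numeric field and updated by code, instead of being re-derived from free-text summaries each session. The class `BudgetSidecar`, its methods, and the rendered line format are all hypothetical; the paper's actual design is not described in these notes.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List


@dataclass
class BudgetSidecar:
    """Hypothetical typed state kept next to an agent's free-text memory."""

    balance: Decimal
    history: List[Decimal] = field(default_factory=list)

    def apply(self, amount: str) -> Decimal:
        """Record a signed change (e.g. "-19.99") and return the new balance."""
        delta = Decimal(amount)  # exact decimal arithmetic, no float drift
        self.history.append(delta)
        self.balance += delta
        return self.balance

    def render(self) -> str:
        """Produce a short line that could be prepended to the agent's context."""
        return (
            f"[state] running_balance={self.balance} "
            f"(n_updates={len(self.history)})"
        )


# Example: the arithmetic is done by code, not by the language model.
sidecar = BudgetSidecar(balance=Decimal("100.00"))
sidecar.apply("-19.99")
sidecar.apply("5.50")
summary_line = sidecar.render()
```

The design choice being illustrated is that compaction can rewrite or drop prose freely while the accumulator survives unchanged, since it never passes through summarization.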
