When a One-Liner Beats Your Agent's Clever Verification Logic

Description

Source: Bayesian control for coding agents [https://arxiv.org/abs/2606.24453]
Paper was published on June 23, 2026
This episode was AI-generated on June 24, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Your coding agent has to decide whether to pay for an eleven-minute test or just ship, and a new paper turns that gut call into a single computable number. But the surprising part is how much effort it spends telling you exactly when its own Bayesian machinery is dead weight. We map out the three regimes that decide whether careful reasoning beats a dumb if-statement.

KEY TAKEAWAYS
* The exact break-even line for running an expensive verifier: verify only when your belief that the code is correct crosses cost divided by reward
* Why a syntax checker carries zero signal, and how the Bayesian update figures that out on its own without hand-tuning
* The three-region map: verify everything when checking is cheap, gate on one near-oracle test in the middle, and reason carefully only when verification is expensive and critics are imperfect
* Why the headline 'plus sixty-two over always-verify' is soft: it's measured against a known-bad baseline, in a replay (not live) evaluation, and ignores the upfront cost of calibrating from oracle calls
* How the controller's running belief doubles as a portable confidence score (0.87 ranking, rising to 0.91 on hard problems) you can bolt onto any agent
* The whole gain comes from frozen models and a smarter control layer: no training, no fine-tuning

* 01:42 - The agent that's really a toolbox: Reframes a coding agent as a generator wrapped in a menu of tools, from a free syntax check to an eleven-minute oracle, with wildly lopsided costs and reliabilities.
* 03:06 - Why fixed rules ignore what matters: Argues that always-verify, best-of-N, and hard-coded refinement loops all ignore uncertainty, and proposes treating the control layer like a diagnostician ordering tests.
* 04:10 - The whole idea in one breath: Lays out the core move: carry a running belief that the code will pass, let cheap critics nudge it, and act to maximize reward minus the costs you rack up.
* 06:36 - The one equation worth doing: Derives the break-even threshold (verify when belief crosses cost-over-reward) and shows how that ratio plus the prior pass rate become the two axes of the map.
* 08:25 - How a critic moves the needle: Explains via Bayes' rule why a critic's value is the gap between how it treats correct versus broken code, why syntax checks are useless, and how mediocre critics compose.
* 11:17 - Three regions, and only one is interesting: Walks through the two-axis map: verify everything when checking is cheap, gate on a near-oracle test in the middle, and reason carefully only in the costly top-left corner.
* 15:40 - How much of plus-sixty-two is real? The steelman critique: the headline margin beats a known-bad baseline, the evaluation replays pre-collected patches rather than generating live, and calibration hides an upfront oracle bill.
* 20:42 - A confidence score you can bolt on anywhere: Shows the belief state works as a well-calibrated, training-free confidence signal that beats sequence probability and perplexity, and gets better on hard problems.

RECOMMENDED READING
* Self-Refine: Iterative Refinement with Self-Feedback [https://arxiv.org/abs/2303.17651]: One of the named refinement agents the episode benchmarks against; it formalizes the generate-critique-regenerate loop the paper argues ignores uncertainty.
* Reflexion: Language Agents with Verbal Reinforcement Learning [https://arxiv.org/abs/2303.11366]: The verbal-memory refinement agent that, in the episode's expensive-verification regime, actually went negative against doing nothing clever.
* SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [https://arxiv.org/abs/2310.06770]: The real-GitHub-issue benchmark whose patches gave the episode its eleven-minute test-suite telemetry and the region-A headline numbers.
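
The break-even rule and the critic update described in the takeaways above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the function names, the binary pass/fail critic model, and the example pass rates are all hypothetical. The threshold follows the rule quoted in the notes (verify when belief exceeds cost divided by reward); the update is plain Bayes' rule for a critic that passes correct code at one rate and broken code at another.

```python
def update_belief(prior: float, critic_passed: bool,
                  pass_rate_if_correct: float,
                  pass_rate_if_broken: float) -> float:
    """Bayes' rule for one binary critic verdict (hypothetical critic model)."""
    if critic_passed:
        like_correct = pass_rate_if_correct
        like_broken = pass_rate_if_broken
    else:
        like_correct = 1.0 - pass_rate_if_correct
        like_broken = 1.0 - pass_rate_if_broken
    evidence = prior * like_correct + (1.0 - prior) * like_broken
    if evidence == 0.0:
        # The verdict had zero probability under both hypotheses;
        # leave the belief unchanged rather than divide by zero.
        return prior
    return prior * like_correct / evidence


def should_verify(belief: float, verify_cost: float, reward: float) -> bool:
    """Break-even rule from the notes: verify when belief exceeds cost / reward."""
    if reward <= 0:
        raise ValueError("reward must be positive")
    return belief > verify_cost / reward


# Assumed numbers, for illustration only.
prior = 0.5

# A critic that passes correct and broken code equally often (like a syntax
# check on code that always parses) leaves the belief where it started.
after_syntax = update_belief(prior, True, 0.9, 0.9)

# A critic that treats correct and broken code differently moves the belief.
after_tests = update_belief(prior, True, 0.9, 0.3)

print(after_syntax, after_tests)
print(should_verify(after_tests, verify_cost=11.0, reward=20.0))
```

The size of a critic's effect depends only on the gap between its two pass rates, which is why the equal-rate critic carries no signal. Several imperfect critics can be chained by feeding each posterior back in as the next prior, assuming their verdicts are independent given whether the code is correct.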

When Turning Experience Into Code Makes Your AI Agent Dumber

Source: Metis: Bridging Text and Code Memory for Self-Evolving Agents [https://arxiv.org/abs/2606.24151]
Paper was published on June 23, 2026
This episode was AI-generated on June 24, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

An AI agent that distilled its hard-won experience into reusable code scored ten points worse than an agent with no memory at all. This episode unpacks why the sophisticated-looking move, freezing lessons into callable tools, is also the fragile one, and what the right fix turns out to be. You'll come away understanding the single most basic decision in building agents that learn on the job: when a lesson should stay as soft advice, and when it's earned the right to become code.

KEY TAKEAWAYS
* Why storing an agent's experience as callable code can drop it below an agent with no memory at all, a 22-point collapse the moment it has to generalize
* The 'injection asymmetry': text is consumed as adaptable advice you filter through reality, while code is a trusted black box whose flaws propagate to every caller and suppress the agent's own recovery behavior
* Metis's 'text first, code earned' policy: sorting experience into plans, facts, and pitfalls, and crystallizing only recurring plans into tools using the desire-path principle
* Why the codifier deliberately never reads the messy trajectory, building tools from the clean query pattern instead, and how that lets even failed runs safely count toward codification
* The ablation that proves the recurrence gate: an 'Eager' version cost 47% more to build, scored worse, and left over half its tools never invoked
* Where the clean story has a seam: the headline result is really about ungated, trajectory-trained, unvalidated code on a single benchmark, not a law that 'code memory is bad'

* 01:57 - The brilliant employee with amnesia: Frames the core problem: stateless agents lose everything they figure out, and the field hasn't examined how lessons should be stored.
* 03:01 - Text advice or a black-box tool? Lays out the fork between storing lessons as adaptable text versus callable code, and why the real difference is how the agent consumes each.
* 04:50 - The experiment that fixed every variable: Describes the clean diagnostic on AppWorld, splitting executor and reflector models, and measuring construction cost, execution efficiency, and transfer reliability.
* 08:43 - The 22-point collapse: Reveals the headline reversal: code memory looks great in-sample but collapses 22 points under realistic streaming, dropping below the no-memory baseline.
* 10:06 - Why the confident tool fails hard: Explains the injection asymmetry through the coworker analogy and why trusted code suppresses an agent's own self-correction.
* 13:07 - Paving only the paths people walk: Walks through Metis's three design choices (the plans/facts/pitfalls taxonomy, the recurrence gate, and query-only codification) using the desire-path analogy.
* 18:13 - Does the machinery actually pay off? Tests the predictions: Metis is more accurate and cheaper at once, and the Eager ablation proves the recurrence gate is a quality filter.
* 21:41 - The seam in the clean story: The steelman critique: the real claim is about ungated, trajectory-trained code on a single benchmark, with the genuine edge limited to distribution shift.
* 24:30 - Don't pour the concrete too early: Draws out the durable lesson (store knowledge in a form that follows its properties) and poses the closing question to listeners.

RECOMMENDED READING
* ReAct: Synergizing Reasoning and Acting in Language Models [https://arxiv.org/abs/2210.03629]: The think-act-observe loop the episode names as the baseline floor every memory variant in Metis is measured against.
* AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents [https://arxiv.org/abs/2407.18901]: The exact 457-API simulated benchmark all of the episode's accuracy and token numbers are run on.
* Voyager: An Open-Ended Embodied Agent with Large Language Models [https://arxiv.org/abs/2305.16291]: The canonical 'agent builds a reusable skill library of callable code' approach this episode's text-first-code-earned policy is implicitly arguing against.
* Generative Agents: Interactive Simulacra of Human Behavior [https://arxiv.org/abs/2304.03442]: A contrasting take where agent experience is stored and retrieved as natural-language memory, the 'soft advice' side of the episode's text-versus-code fork.
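
The recurrence gate mentioned in the takeaways above can be pictured as a counter over query patterns that only releases a pattern for codification once it has shown up often enough. The sketch below is a guess at the shape of such a gate, not Metis's actual mechanism: the class name, the threshold of three, and the use of a plain string as the "query pattern" are all assumptions made for illustration. Because it counts queries rather than outcomes, failed runs count toward the threshold too, in line with the notes.

```python
from collections import Counter


class RecurrenceGate:
    """Hypothetical gate: a pattern earns codification only after recurring."""

    def __init__(self, min_recurrences: int = 3) -> None:
        # The threshold value is an assumption; the notes do not give one.
        if min_recurrences < 1:
            raise ValueError("min_recurrences must be at least 1")
        self.min_recurrences = min_recurrences
        self.counts: Counter[str] = Counter()
        self.codified: set[str] = set()

    def observe(self, query_pattern: str) -> bool:
        """Record one occurrence of a query pattern.

        Returns True exactly once per pattern: on the observation where the
        pattern first reaches the threshold. Success or failure of the run
        is deliberately not an input.
        """
        self.counts[query_pattern] += 1
        if query_pattern in self.codified:
            return False
        if self.counts[query_pattern] >= self.min_recurrences:
            self.codified.add(query_pattern)
            return True
        return False


gate = RecurrenceGate(min_recurrences=3)
for pattern in ["pay_bill", "play_song", "pay_bill", "pay_bill", "pay_bill"]:
    if gate.observe(pattern):
        print(f"codify a tool for: {pattern}")
```

An 'Eager' variant in this picture would be a gate with the threshold set to one, which builds a tool for every pattern it sees, including the ones that never come back.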

Yesterday, 26 min
