Why Better Bug Reports Can Make AI Coding Agents Worse

Description

Source: SHERLOC: Structured Diagnostic Localization for Code Repair Agents [https://arxiv.org/abs/2606.24820]
Paper was published on June 23, 2026.
This episode was AI-generated on June 24, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Hand a capable AI coding agent a more accurate report of where a bug lives, and it can fix fewer bugs than with nothing at all. This episode digs into SHERLOC, a paper arguing the field has been scoring localization like a search engine when what actually matters is the diagnosis, and shows where the impressive numbers stop being deployable.

KEY TAKEAWAYS
* Why AI coding agents spend roughly 48% of their turns and over 320,000 tokens just locating a bug before writing any fix
* How SHERLOC reframes localization from 'find the right file' to a structured five-field diagnostic case file
* Why a single setting (thinking mode off) collapses the same model from 74% recall to 10%, with 87% of runs producing no valid output
* The capability-dependent transfer finding: weak repair agents gain 8-12 points, while strong agents can lose ground when fed findings indiscriminately
* Why a low-quality diagnosis (20% resolve rate) drags an agent below the 62% baseline of having no report at all
* The two honest limits: the quality filter relies on the ground-truth patch and isn't deployable, and ~58% of recall may come from memorized famous libraries

* 00:00 - The taxi meter that never stops: Sets up the counterintuitive finding and the headline cost: agents burn roughly half their compute just locating bugs before fixing anything.
* 02:47 - Red circle versus the written report: Introduces the core reframe, that a bare file path is underspecified and SHERLOC instead emits a structured five-field diagnostic finding.
* 05:12 - One setting flips everything: Explains SHERLOC's training-free design, its four-tool menu and self-recovery layer, and the dramatic collapse when reasoning mode is turned off.
* 09:09 - Can the underdog beat the specialists?: Covers SHERLOC's state-of-the-art benchmark results and how structure substitutes for both scale and specialized training.
* 10:25 - Does it just remember Django?: Introduces the contamination worry and the masking gauntlet used to estimate how much performance comes from real exploration versus memorization.
* 12:12 - The map that distracts the cabbie: Presents capability-dependent transfer and the result that bad diagnoses drag agents below their no-report baseline.
* 16:43 - The filter you can't actually ship: The steelman critique: the quality filter peeks at the ground-truth patch, contamination remains unresolved, and the best numbers come at heavy serving cost in one language.
* 21:25 - What actually survives the critique: Lands on the durable reframe that diagnosis quality, not location accuracy, predicts repair success, and poses the closing question to listeners.

RECOMMENDED READING
* SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [https://arxiv.org/abs/2310.06770]: The benchmark this episode's results are measured on, the real-GitHub-bug-plus-fixing-PR dataset whose Lite and Verified splits SHERLOC tops and whose contamination problems the hosts dwell on.
* SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [https://arxiv.org/abs/2405.15793]: The agent-framework lineage behind the repair agents SHERLOC injects case files into, and the source of the 'don't let models run arbitrary shells or they derail' design lesson the episode cites.

When Turning Experience Into Code Makes Your AI Agent Dumber

Source: Metis: Bridging Text and Code Memory for Self-Evolving Agents [https://arxiv.org/abs/2606.24151]
Paper was published on June 23, 2026.
This episode was AI-generated on June 24, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

An AI agent that distilled its hard-won experience into reusable code scored ten points worse than an agent with no memory at all. This episode unpacks why the sophisticated-looking move, freezing lessons into callable tools, is also the fragile one, and what the right fix turns out to be. You'll come away understanding the single most basic decision in building agents that learn on the job: when a lesson should stay as soft advice, and when it's earned the right to become code.

KEY TAKEAWAYS
* Why storing an agent's experience as callable code can drop it below an agent with no memory at all, a 22-point collapse the moment it has to generalize
* The 'injection asymmetry': text is consumed as adaptable advice you filter through reality, while code is a trusted black box whose flaws propagate to every caller and suppress the agent's own recovery behavior
* Metis's 'text first, code earned' policy: sorting experience into plans, facts, and pitfalls, and crystallizing only recurring plans into tools using the desire-path principle
* Why the codifier deliberately never reads the messy trajectory, building tools from the clean query pattern instead, and how that lets even failed runs safely count toward codification
* The ablation that proves the recurrence gate: an 'Eager' version cost 47% more to build, scored worse, and left over half its tools never invoked
* Where the clean story has a seam: the headline result is really about ungated, trajectory-trained, unvalidated code on a single benchmark, not a law that 'code memory is bad'

* 01:57 - The brilliant employee with amnesia: Frames the core problem: stateless agents lose everything they figure out, and the field hasn't examined how lessons should be stored.
* 03:01 - Text advice or a black-box tool?: Lays out the fork between storing lessons as adaptable text versus callable code, and why the real difference is how the agent consumes each.
* 04:50 - The experiment that fixed every variable: Describes the clean diagnostic on AppWorld, splitting executor and reflector models, and measuring construction cost, execution efficiency, and transfer reliability.
* 08:43 - The 22-point collapse: Reveals the headline reversal: code memory looks great in-sample but collapses 22 points under realistic streaming, dropping below the no-memory baseline.
* 10:06 - Why the confident tool fails hard: Explains the injection asymmetry through the coworker analogy and why trusted code suppresses an agent's own self-correction.
* 13:07 - Paving only the paths people walk: Walks through Metis's three design choices (the plans/facts/pitfalls taxonomy, the recurrence gate, and query-only codification) using the desire-path analogy.
* 18:13 - Does the machinery actually pay off?: Tests the predictions: Metis is more accurate and cheaper at once, and the Eager ablation proves the recurrence gate is a quality filter.
* 21:41 - The seam in the clean story: The steelman critique: the real claim is about ungated, trajectory-trained code on a single benchmark, with the genuine edge limited to distribution shift.
* 24:30 - Don't pour the concrete too early: Draws out the durable lesson (store knowledge in a form that follows its properties) and poses the closing question to listeners.

RECOMMENDED READING
* ReAct: Synergizing Reasoning and Acting in Language Models [https://arxiv.org/abs/2210.03629]: The think-act-observe loop the episode names as the baseline floor every memory variant in Metis is measured against.
* AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents [https://arxiv.org/abs/2407.18901]: The exact 457-API simulated benchmark all of the episode's accuracy and token numbers are run on.
* Voyager: An Open-Ended Embodied Agent with Large Language Models [https://arxiv.org/abs/2305.16291]: The canonical 'agent builds a reusable skill library of callable code' approach this episode's text-first-code-earned policy is implicitly arguing against.
* Generative Agents: Interactive Simulacra of Human Behavior [https://arxiv.org/abs/2304.03442]: A contrasting take where agent experience is stored and retrieved as natural-language memory, the 'soft advice' side of the episode's text-versus-code fork.

Yesterday, 26 min
