AI Papers: A Deep Dive
WHY BETTER BUG REPORTS CAN MAKE AI CODING AGENTS WORSE

Source: SHERLOC: Structured Diagnostic Localization for Code Repair Agents [https://arxiv.org/abs/2606.24820]
Paper was published on June 23, 2026.

This episode was AI-generated on June 24, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Hand a capable AI coding agent a more accurate report of where a bug lives, and it can fix fewer bugs than with nothing at all. This episode digs into SHERLOC, a paper arguing the field has been scoring localization like a search engine when what actually matters is the diagnosis, and shows where the impressive numbers stop being deployable.

KEY TAKEAWAYS

* Why AI coding agents spend roughly 48% of their turns and over 320,000 tokens just locating a bug before writing any fix
* How SHERLOC reframes localization from 'find the right file' to a structured five-field diagnostic case file
* Why a single setting (thinking mode off) collapses the same model from 74% recall to 10%, with 87% of runs producing no valid output
* The capability-dependent transfer finding: weak repair agents gain 8-12 points, while strong agents can lose ground when fed findings indiscriminately
* Why a low-quality diagnosis (20% resolve rate) drags an agent below the 62% baseline of having no report at all
* The two honest limits: the quality filter relies on the ground-truth patch and isn't deployable, and ~58% of recall may come from memorized famous libraries

* 00:00 - The taxi meter that never stops: Sets up the counterintuitive finding and the headline cost: agents burn roughly half their compute just locating bugs before fixing anything.
* 02:47 - Red circle versus the written report: Introduces the core reframe, that a bare file path is underspecified, and SHERLOC instead emits a structured five-field diagnostic finding.
* 05:12 - One setting flips everything: Explains SHERLOC's training-free design, its four-tool menu and self-recovery layer, and the dramatic collapse when reasoning mode is turned off.
* 09:09 - Can the underdog beat the specialists? Covers SHERLOC's state-of-the-art benchmark results and how structure substitutes for both scale and specialized training.
* 10:25 - Does it just remember Django? Introduces the contamination worry and the masking gauntlet used to estimate how much performance comes from real exploration versus memorization.
* 12:12 - The map that distracts the cabbie: Presents capability-dependent transfer and the result that bad diagnoses drag agents below their no-report baseline.
* 16:43 - The filter you can't actually ship: The steelman critique: the quality filter peeks at the ground-truth patch, contamination remains unresolved, and the best numbers come at heavy serving cost in one language.
* 21:25 - What actually survives the critique: Lands on the durable reframe that diagnosis quality, not location accuracy, predicts repair success, and poses the closing question to listeners.

RECOMMENDED READING

* SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [https://arxiv.org/abs/2310.06770]: The benchmark this episode's results are measured on, the real-GitHub-bug-plus-fixing-PR dataset whose Lite and Verified splits SHERLOC tops and whose contamination problems the hosts dwell on.
* SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [https://arxiv.org/abs/2405.15793]: The agent-framework lineage behind the repair agents SHERLOC injects case files into, and the source of the 'don't let models run arbitrary shells or they derail' design lesson the episode cites.