AI Papers: A Deep Dive
HOW CODING AGENTS CAN MINE THEIR OWN FAILURES INTO A SELF-TARGETING CURRICULUM

Source: Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills [https://arxiv.org/abs/2606.07412]
Paper was published on June 05, 2026.

This episode was AI-generated on June 9, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Almost every pipeline that trains AI coding agents records a detailed diary of how the agent worked through a bug, then throws all of it away, keeping only pass or fail. This paper recycles those discarded traces into a library of the agent's specific weaknesses and manufactures practice problems aimed straight at them. The payoff is a self-improvement loop that keeps accelerating where rival methods quietly regress.

KEY TAKEAWAYS

* Why the field's habit of compressing a rich agent trace down to a single pass/fail bit throws away its best map of where the agent is weak
* How 'skill cards' work: concrete, written diagnoses of an agent's repeated mistakes (like confusing a nonce and a timestamp in an OAuth library), not vague 'be better at edge cases' advice
* The shift from difficulty to gradient alignment: selecting practice problems by whether training on them measurably points toward the goal, not by how hard they are
* Why this loop accelerates across three rounds to just over 50% on SWE-bench Verified while competing self-play methods peak early and one ends up worse than where it started
* Why the win comes from the framework (real traces, real test validation), not the eloquence of the extractor: using a small model or a frontier model as the extractor each moved results by only about half a point
* Where the authors and the hosts push back: the validation set may quietly teach to the test, 'starting from zero tasks' overstates the case, and the method saturates around round five because it operates in a closed pool of repositories

CHAPTERS

* 00:00 - The diary you throw away. Every agent run leaves a detailed trace of its problem-solving, but training pipelines keep only the final pass/fail bit and delete the rest.
* 02:38 - The data bottleneck and static synthetic bugs. Why RL for coding agents starves for verifiable problems, and why fixed-rule synthetic bug generators keep drilling weaknesses the agent doesn't actually have.
* 05:17 - Skill cards: turning traces into named weaknesses. A concrete look at how the system distills repeated, specific failures (like the OAuthLib examples) into reusable diagnostic playbooks.
* 07:56 - The self-evolving loop and its validation gates. How one model wears Solver and Generator hats behind an answer-key blindfold, and the four cheap-first checkpoints that filter out malformed problems.
* 10:35 - From difficulty to gradient alignment. The flashcard reframing: keeping problems whose training direction lines up with the goal, measured with cosine similarity, instead of selecting by difficulty.
* 13:14 - Results: accelerating gains while rivals regress. Just over 50% on SWE-bench Verified, gains that speed up across three rounds, and competing self-play methods that peak early or get worse.
* 15:53 - Where the skeptic pushes back. The hosts probe teaching-to-the-test risks in the validation set, the overstated 'zero tasks' framing, the five-round saturation ceiling, and single-model-size limitations.
* 18:32 - The idea that outlives the system. Why recycling an agent's own behavior into a self-targeting curriculum reshapes the economics of training, even within a closed world.

RECOMMENDED READING

* SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [https://arxiv.org/abs/2310.06770] - The benchmark of real GitHub issues on which this episode's headline 'just over fifty percent' result is measured (via its Verified subset), and the standard for verifiable coding-agent evaluation.
* SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [https://arxiv.org/abs/2405.15793] - Establishes the tool-calling agent loop (search, edit, run tests) whose discarded traces this episode argues are the most valuable training signal.
* Automatic Curriculum Learning through Value Disagreement [https://arxiv.org/abs/2006.09641] - A foundational take on curriculum design that selects goals by how much an ensemble of value estimates disagrees about them, a proxy for what is learnable next, rather than by raw difficulty. That is close to the reframing this episode credits as the paper's deeper contribution.
* Self-Rewarding Language Models [https://arxiv.org/abs/2401.10020] - A prominent self-improvement loop where one model generates and judges its own training data, the kind of self-play approach this episode contrasts with the paper's loop for peaking early or regressing across rounds.
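
A SMALL SKETCH OF GRADIENT ALIGNMENT

For listeners who want to see the 'gradient alignment' idea from the 10:35 chapter in code, here is a minimal sketch. It is an illustration under stated assumptions, not the paper's implementation: these notes only say that problems are kept when their training direction lines up with the goal, measured with cosine similarity. The function names, the zero threshold, and the toy gradient vectors are all hypothetical, and in a real system the vectors would come from backpropagation through the model rather than being typed in by hand.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_aligned(
    candidate_grads: Dict[str, List[float]],
    goal_grad: Sequence[float],
    threshold: float = 0.0,  # hypothetical cutoff, not taken from the paper
) -> List[str]:
    """Keep candidates whose gradient points the same way as the goal gradient.

    Returns the kept problem names, best aligned first.
    """
    scored = [
        (name, cosine_similarity(grad, goal_grad))
        for name, grad in candidate_grads.items()
    ]
    kept = [(name, score) for name, score in scored if score > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in kept]


# Made-up gradients for illustration only.
goal_grad = [1.0, 0.0, 1.0]
candidate_grads = {
    "hard_but_off_target": [-1.0, 2.0, -1.0],  # large gradient, wrong direction
    "easy_but_aligned": [0.5, 0.1, 0.4],       # small gradient, right direction
    "orthogonal": [0.0, 1.0, 0.0],             # neither helps nor hurts the goal
}

selected = select_aligned(candidate_grads, goal_grad)
print(selected)  # ['easy_but_aligned']
```

Cosine similarity ignores the length of each gradient, so the sketch ranks problems by direction alone. That matches the episode's point that a hard problem is not automatically a useful one.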