Two Levers for Self-Improving AI: When Rewriting Code Isn't Enough

Description

Source: SIA: Self Improving AI with Harness & Weight Updates [https://arxiv.org/abs/2605.27276]
Paper was published on May 26, 2026.
This episode was AI-generated on May 27, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

An AI agent spent many iterations rewriting its own scaffolding to denoise genomic data and hit a wall. Then it was allowed to retrain its own weights, and on the first try it added two trivial lines of code that any biologist would have spotted, cutting error by twenty percent. A new paper argues that scaffold edits and weight updates reach fundamentally different places, and that no self-improvement loop touching only one is going to be enough.

KEY TAKEAWAYS
* Why scaffold rewrites and weight updates are not interchangeable: they change different things (how the agent searches vs. what the model knows)
* How SIA's Feedback-Agent reads full agent trajectories to decide which lever to pull, and even picks which RL algorithm to use
* Concrete results across three deliberately different domains: Chinese legal classification, CUDA kernel optimization on H100s, and single-cell RNA-seq denoising
* Why the headline 502% improvement is real but misleading: the mechanism claim is closer to a 20% gain over the harness-only ceiling
* The 'coupled co-evolutionary Goodhart' failure mode the authors themselves flag: two optimizers converging on a verifier rather than the underlying problem
* What the paper does and doesn't prove: a credible proof of concept, not a settled result, with clean verifiers doing more work than the framing admits

* 00:00 - The two-line fix that broke a plateau. An opening case study where a weight update found a trivial biological invariant that endless scaffold iteration had missed.
* 03:08 - Two camps that haven't been talking. Framing the field's split between scaffold-evolution work (Darwin Gödel Machine, AI Scientist) and test-time-training work, and the obvious question each camp's silence implies.
* 06:17 - Inside the SIA architecture. How the Meta-Agent, task agent, and Feedback-Agent fit together, and why giving the Feedback-Agent the full trajectory matters.
* 09:26 - Three benchmarks, three shapes of expertise. Walking through LawBench, CUDA kernel optimization, and RNA-seq denoising, and what each result implies about the harness ceiling.
* 12:34 - Picking the RL algorithm on the fly. Why the Feedback-Agent chooses between methods like GRPO and entropic advantage weighting based on the reward landscape, and what that automation does and doesn't prove.
* 16:23 - The skeptic pass. Where the ablations fall short, why the benchmark selection flatters the method, and how the abstract's biggest number answers a different question than the mechanism claim.
* 18:53 - Coupled co-evolutionary Goodhart. The deeper failure mode the authors themselves raise: two optimizers fitting each other rather than the underlying problem.
* 22:00 - What this would mean if it generalizes. Where the human role moves if specifying a task and a verifier is enough, and why that 'if' is still load-bearing.

RECOMMENDED READING
* Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents [https://arxiv.org/abs/2505.22954]: A leading example of the scaffold-evolution camp the episode contrasts with weight updates, in which the AI rewrites the code around a frozen model.
* The Surprising Effectiveness of Test-Time Training for Abstract Reasoning [https://arxiv.org/abs/2411.07279]: Akyürek et al.'s test-time-training work, representing the opposite camp SIA tries to unify: leave the scaffolding alone and adapt the weights at inference.
* DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [https://arxiv.org/abs/2402.03300]: Introduces GRPO, the RL algorithm the Feedback-Agent picks for the LawBench task. Useful background for the algorithm-selection discussion.
* The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery [https://arxiv.org/abs/2408.06292]: Another reference point in the scaffold-iteration lineage SIA positions itself against, where an LLM orchestrates research without touching its own weights.
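
For listeners who want a concrete picture of GRPO, the algorithm mentioned in the algorithm-selection chapter, the core idea from the DeepSeekMath paper is that each sampled answer is scored relative to the other answers drawn for the same prompt, so no separate value network is needed. The snippet below is a minimal sketch of that normalization step only, not the SIA authors' code. The function name, the `eps` guard, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer against its own group (illustrative helper).

    rewards: verifier scores for several answers to the SAME prompt.
    Returns one advantage per answer: positive means better than the
    group average, negative means worse.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids divide-by-zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers with a pass/fail verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every answer gets the same score carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

In the flat case every advantage is zero, which is one way to see why the shape of the reward landscape matters when choosing an RL method.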

Why Frozen-Weight Agents Still Get Worse Over Time

Source: Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems [https://arxiv.org/abs/2605.26302]
Paper was published on May 25, 2026.
This episode was AI-generated on May 27, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A deployed AI agent's model weights never change, but the agent itself ages, and it ages in at least four mechanistically distinct ways. A new paper introduces a diagnostic ladder that can tell those failure modes apart, and shows that a one-paragraph change to how an agent summarizes its own memory can extend its useful lifespan by more than four times.

KEY TAKEAWAYS
* Agent reliability is a lifespan property, not a benchmark snapshot: the memory store, retrieval, and compaction around a frozen model keep changing every session
* Four named failure modes: compression, interference, revision, and maintenance aging, split into accumulation-driven and event-driven families
* The counterfactual ladder: a three-rung diagnostic that isolates write failures, read failures, and utilization failures without needing model internals
* Three models with nearly identical error rates can have completely different underlying diseases, and 'add more memory' is the wrong fix for two of them
* A one-paragraph 'careful' compaction prompt that names what to preserve verbatim yields roughly a 4.5x lifespan improvement on the same system
* Production monitoring tends to track constraint compliance while missing silent precision decay: the agent stops violating rules but also stops knowing the specifics
* Scale doesn't fix structural problems: a small typed-state sidecar cuts running-balance error 25–50% with no model change

* 00:00 - Four vignettes, one puzzle. Four deployed-agent failures that the standard 'frozen weights = frozen system' mental model can't explain.
* 02:05 - Reframing reliability as a lifespan property. Why the apparatus around the model (memory, retrieval, compaction) is what actually changes over time.
* 04:10 - The four aging mechanisms. Compression, interference, revision, and maintenance aging, and why they split into accumulation-driven and event-driven families.
* 06:30 - The counterfactual ladder. A three-rung diagnostic that isolates write, read, and utilization failures by progressively swapping in oracle components.
* 08:20 - Same score, different disease. Empirical results showing models with near-identical error rates can have completely different failure breakdowns under the ladder.
* 10:25 - The 4.5x compaction-prompt result. How a one-paragraph change to summarization instructions extends agent half-life dramatically on the same underlying system.
* 14:30 - Silent precision decay. Why constraint-violation monitoring stays green while the agent quietly forgets the specifics it was supposed to remember.
* 14:35 - Why scale doesn't save the running budget. A small and a large model both drift on arithmetic over a session history because the failure is representational, not capacity-bound.
* 16:41 - Honest critique. Synthetic scenarios, simple memory architectures, and short session horizons: what the paper's numbers can and can't tell us.
* 18:46 - Production CLI agents and re-reading. Findings from Claude Code and OpenHands on why correct answers correlate with more retrieval, and why flagship models can write lower-fidelity artifacts.
* 20:51 - The sticky note fix. A small typed-state overlay alongside normal memory that cuts accumulator error substantially without changing the model.

RECOMMENDED READING
* MemGPT: Towards LLMs as Operating Systems [https://arxiv.org/abs/2310.08560]: Proposes a hierarchical memory system with explicit paging between context and external storage. Directly relevant to the episode's argument that the fix for agent aging is structural, not bigger models.
* Lost in the Middle: How Language Models Use Long Contexts [https://arxiv.org/abs/2307.03172]: Empirical evidence that models fail to utilize information even when it's present in context, which is the 'utilization failure' rung of the episode's counterfactual ladder.
* Generative Agents: Interactive Simulacra of Human Behavior [https://arxiv.org/abs/2304.03442]: The Park et al. paper that popularized reflection-and-summarization memory architectures, exactly the kind of compaction-based stack whose aging dynamics this episode dissects.
* Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks [https://arxiv.org/abs/2005.11401]: The original RAG paper, useful background for the episode's distinction between write failures, retrieval failures, and utilization failures in memory-augmented agents.
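
To make the 'sticky note fix' a little more concrete, here is a minimal sketch of what a typed-state sidecar for a running balance could look like. It is not the paper's implementation: the class name `BudgetSidecar`, its fields, and the `render` format are all hypothetical. The point it illustrates is that the number lives in exact, typed storage outside the free-text memory, so summarization can't blur it.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class BudgetSidecar:
    """Hypothetical typed overlay kept next to an agent's free-text memory."""

    balance: Decimal
    entries: list = field(default_factory=list)

    def record(self, label: str, amount: str) -> Decimal:
        """Apply one signed transaction exactly and return the new balance."""
        delta = Decimal(amount)  # exact decimal arithmetic, no float drift
        self.entries.append((label, delta))
        self.balance += delta
        return self.balance

    def render(self) -> str:
        """One line to inject into the prompt verbatim, never summarized."""
        return f"[typed state] balance={self.balance} after {len(self.entries)} entries"


# Illustrative session: the values are made up for the example.
sidecar = BudgetSidecar(balance=Decimal("100.00"))
sidecar.record("coffee", "-4.50")
sidecar.record("refund", "12.25")
note = sidecar.render()
```

The design choice here is that the model only reads `note`; the arithmetic itself is done by ordinary code, which fits the episode's framing of the failure as representational rather than capacity-bound.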
