Agents That Rewrite Their Own Weights Instead of Just Taking Notes

Description

Source: Scaling Self-Evolving Agents via Parametric Memory [https://arxiv.org/abs/2606.04536]
Paper published June 3, 2026.

This episode was AI-generated on June 4, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Almost every AI agent with 'memory' is like a cyclist who reads notes about balance mid-ride instead of learning to ride: the notebook gets thicker, but the rider never changes. A new paper from Peking University and Alibaba flips that: the agent distills its experience into flashcards and trains them directly into a small writable slice of its own weights mid-conversation. We unpack how the loop works, why flashcards beat summaries, and where the results hold up and where they don't.

KEY TAKEAWAYS

* Why prompt-space memory (summaries and retrieval) leaves the model's actual decision-making machinery frozen, and why writing into a small set of 'fast weights' is a fundamentally different channel
* The mid-episode loop in detail: the agent hits a context budget, writes question-and-answer flashcards, bakes them into a tiny LoRA adapter with a few gradient steps, then clears its context
* Why flashcards crush the alternatives: raw transcript training scores ~10 F1, summaries ~35, and QA flashcards ~41; the structure of what you write matters more than the fact that you write it
* How reinforcement learning makes memory-writing an action the agent gets good at, using a stop-gradient trick that avoids differentiating through the optimizer
* The counterintuitive cost result: the no-memory baseline is the heaviest (~78GB) because context grows quadratically, while the parametric approach sits comfortably in the middle
* The honest limits: the headline 10-point gain is a best case (some benchmarks are ties), the context-learning claim rests on a filtered subset, the SVD theorem proves a better start, not a better finish, and everything is tested only at 4B–8B scale

* 00:00 The cyclist with the notebook
  The framing problem: today's agents can write things down and look them up, but the part that actually thinks never changes.
* 03:17 Three places an agent keeps what it knows
  Working context, retrieval/summaries, and the new third channel: a small writable set of weights the agent can edit mid-conversation.
* 06:35 The mid-episode memory-writing loop
  How the agent triggers on a context budget, distills flashcards, trains them into a tiny adapter, clears its desk, and accumulates changes across the episode.
* 09:52 Why the SVD initialization makes few-step learning possible
  Starting the adapter in the model's already-important directions, instead of randomly, so a handful of gradient steps actually make progress.
* 13:10 Flashcards beat summaries beat raw transcripts
  The experiment showing that the structure of what you write into memory drives the result, with a dramatic spread across the three options.
* 16:27 Training the agent to write good flashcards
  The reinforcement learning pillar, where memory-writing becomes an action rewarded by outcome via a stop-gradient shortcut around the optimizer.
* 19:45 A case study and the surprising cost result
  Watching the agent compose distilled facts into a new answer, plus the finding that no-memory is the most expensive approach, not the cheapest.
* 23:03 Where it doesn't hold up
  A candid skeptical pass through the modest average gains, the filtered benchmark, the limits of the theorem, the approximate credit assignment, and the small-scale-only testing.

RECOMMENDED READING

* LoRA: Low-Rank Adaptation of Large Language Models [https://arxiv.org/abs/2106.09685]
  The low-rank adapter method that the episode's 'fast weights' are built on; the rank-6 adapter the agent edits mid-episode is exactly this construction.
* Learning to (Learn at Test Time): RNNs with Expressive Hidden States [https://arxiv.org/abs/2407.04620]
  A test-time-training approach that, like this episode's third channel, treats the model's own parameters as a place to write experience during inference rather than freezing them.
* MemGPT: Towards LLMs as Operating Systems [https://arxiv.org/abs/2310.08560]
  A canonical example of the 'filing cabinet' memory paradigm (context paging and retrieval around a frozen brain) that the episode positions as the special case where the parametric channel is switched off.
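The mid-episode loop from the key takeaways (hit a context budget, write flashcards, train them into an adapter, clear the context) can be sketched as plain control flow. This is a minimal illustration, not the paper's implementation: the `Agent` class, the character-count budget, and the `distill` and `train_adapter` callbacks are all hypothetical names, and the "training" step is a stub that only records flashcards where a real system would take a few gradient steps on a LoRA adapter.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Flashcard = Tuple[str, str]  # (question, answer)


@dataclass
class Agent:
    """Toy agent that models only the control flow, not a language model."""
    context_budget: int  # hypothetical budget, measured in characters
    distill: Callable[[List[str]], List[Flashcard]]
    train_adapter: Callable[[List[Flashcard]], None]
    context: List[str] = field(default_factory=list)
    writes: int = 0  # number of memory-writing rounds so far

    def observe(self, text: str) -> None:
        """Append to working context; write memory once the budget is exceeded."""
        self.context.append(text)
        if sum(len(t) for t in self.context) > self.context_budget:
            self.write_memory()

    def write_memory(self) -> None:
        cards = self.distill(self.context)  # 1. distill context into QA flashcards
        self.train_adapter(cards)           # 2. update the adapter (stubbed below)
        self.context.clear()                # 3. clear the working context
        self.writes += 1


# Stand-ins for the learned components (assumptions for illustration only).
absorbed: List[Flashcard] = []


def toy_distill(context: List[str]) -> List[Flashcard]:
    return [(f"What happened at step {i}?", text) for i, text in enumerate(context)]


def toy_train(cards: List[Flashcard]) -> None:
    # A real system would run gradient steps here; this stub just records the cards.
    absorbed.extend(cards)


agent = Agent(context_budget=20, distill=toy_distill, train_adapter=toy_train)
for obs in ["opened file", "ran tests", "fixed bug"]:
    agent.observe(obs)

print(agent.writes, len(absorbed), agent.context)
```

In this toy run the third observation pushes the context past the budget, so one memory-writing round fires and the context is emptied. Because adapter updates persist across rounds while the context does not, changes accumulate over the episode, which is the property the episode highlights.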

What If a Prompt Injection Never Left? Attacks That Wait in Agent Memory

Source: What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems [https://arxiv.org/abs/2606.04425]
Paper published June 3, 2026.

This episode was AI-generated on June 4, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Once an AI agent gains durable memory, the attacker no longer has to be in the room when the damage happens: they plant an instruction once and the agent's own startup routine pulls the trigger days later, in a totally different user's session. Drawing a sharp parallel to the stored-versus-reflected XSS attacks that haunted the web for a decade, a new paper measures exactly how often these cross-session attacks survive end to end. The answer, and the surprising split between which attacks work and which fail, is worth your attention if you build anything with agent memory.

KEY TAKEAWAYS

* Why classic prompt injection is contained to one session, while 'cross-session stored prompt injection' decouples the moment of injection from the moment it fires, reframing the problem from malicious input to state contamination
* The clean experimental trick of wiping conversation history while leaving the environment intact, which isolates the persistent-state effect from ordinary in-session memory carryover
* Why attack success multiplies across three independent gates (write, reload, and activation) and how that explains the counterintuitive result that the model with the lowest write rate ends up the most exploitable
* The standout finding: injecting a false fact activates essentially 100% of the time across all three models, while overriding a user's stated preference almost never works, because facts swim with the model's instinct to trust its context and preference overrides swim against it
* Why disguising a payload as a legitimate business policy dramatically boosts the write rate but barely moves end-to-end success, revealing that the write gate and the execution gate are two genuinely different checks
* The honest limits: it's a hand-built benchmark on three models, the headline 32–42% success rate depends heavily on the sandbox's write policy, and the paper tests exactly zero defenses

* 00:00 The attacker who isn't in the room
  Sets up the unsettling premise that an instruction planted in an agent's memory can wait and fire long after the attacker is gone, and introduces the paper and its central question.
* 02:41 Why language models can't tell orders from data
  Explains the root of prompt injection using the contractor-and-sticky-note analogy, and why classic injection was historically contained to a single session.
* 05:23 The cross-site scripting parallel
  Maps reflected versus stored XSS onto prompt injection, naming the new threat 'cross-session stored prompt injection' and framing it as state contamination rather than malicious input.
* 08:04 Context as a pipeline, and which channels persist
  Reframes the agent as a pipeline that assembles a prompt, and distinguishes auto-loaded 'note on the monitor' channels from conditionally retrieved 'note in the drawer' channels as the primary risk surface.
* 10:46 The session-reset experiment
  Walks through the methodological core (wiping conversation history but leaving the environment intact) that isolates persistent-state influence from ordinary memory carryover.
* 13:27 The three-leg relay race
  Breaks attack success into the write, reload, and activation gates whose rates multiply, explaining why the least-writable model ends up the most exploitable overall.
* 16:09 Why false facts win and preference overrides lose
  Presents the paper's standout result (fact injection activates almost always while preference overrides almost never do) and explains it as swimming with versus against the model's instincts.
* 18:50 Disguise, and which gate it fools
  Shows that dressing a payload as a business policy boosts the write rate sharply but barely changes end-to-end success, distinguishing write-gate tricks from execution-gate tricks.
* 21:32 Honest limits and what's left open
  Pressure-tests the headline numbers as artifacts of the sandbox's write policy, flags the thin evidence on the most dangerous harm category, and notes that no defenses are tested.
* 24:13 Why it matters now
  Argues that agentic systems are at the same fork the web faced with stored XSS, and lays out actionable takeaways for hardening the write and incorporation gates before the threat becomes endemic.

RECOMMENDED READING

* Universal and Transferable Adversarial Attacks on Aligned Language Models [https://arxiv.org/abs/2307.15043]
  Connects to the episode's 'swimming against the current' insight about activation by showing how attacks that fight a model's alignment can still be engineered to succeed, sharpening the question of why preference overrides mostly failed here.
* Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection [https://arxiv.org/abs/2302.12173]
  The foundational indirect prompt injection paper that establishes the 'reflected' baseline this episode contrasts against its stored, cross-session threat model.
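The three-gate point from the key takeaways is simple arithmetic: if write, reload, and activation are independent, end-to-end success is their product, so a low write rate can be outweighed by high rates at the later gates. The sketch below illustrates this with made-up numbers. The model names and every rate are hypothetical and are not figures from the paper.

```python
def end_to_end(write: float, reload: float, activation: float) -> float:
    """End-to-end attack success as the product of three independent gate rates."""
    for name, rate in (("write", write), ("reload", reload), ("activation", activation)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} rate must be in [0, 1], got {rate}")
    return write * reload * activation


# Hypothetical (write, reload, activation) rates, invented for illustration only.
models = {
    "model_a": (0.9, 0.8, 0.3),  # writes easily, rarely acts on what it reloads
    "model_b": (0.5, 0.9, 0.9),  # writes less often, but acts on most of it
}

lowest_write = min(models, key=lambda m: models[m][0])
most_exploitable = max(models, key=lambda m: end_to_end(*models[m]))

for name, rates in models.items():
    print(f"{name}: end-to-end = {end_to_end(*rates):.3f}")
print("lowest write rate:", lowest_write)
print("most exploitable:", most_exploitable)
```

With these invented rates, `model_b` has the lowest write rate yet the highest product (about 0.405 versus about 0.216), which is the shape of the counterintuitive result the episode describes. It also shows why a trick that raises only the write rate moves the end-to-end number little when a later gate stays near zero.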

Jun 5, 2026 · 26 min
