Why Frozen-Weight Agents Still Get Worse Over Time

Beschrijving

WHY FROZEN-WEIGHT AGENTS STILL GET WORSE OVER TIME Source: Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems [https://arxiv.org/abs/2605.26302] Paper was published on May 25, 2026 This episode was AI-generated on May 27, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs. A deployed AI agent's model weights never change — but the agent itself ages, and it ages in at least four mechanistically distinct ways. A new paper introduces a diagnostic ladder that can tell those failure modes apart, and shows that a one-paragraph change to how an agent summarizes its own memory can extend its useful lifespan by more than four times. KEY TAKEAWAYS * Agent reliability is a lifespan property, not a benchmark snapshot — the memory store, retrieval, and compaction around a frozen model keep changing every session * Four named failure modes: compression, interference, revision, and maintenance aging — split into accumulation-driven and event-driven families * The counterfactual ladder: a three-rung diagnostic that isolates write failures, read failures, and utilization failures without needing model internals * Three models with nearly identical error rates can have completely different underlying diseases — and 'add more memory' is the wrong fix for two of them * A one-paragraph 'careful' compaction prompt that names what to preserve verbatim yields roughly a 4.5x lifespan improvement on the same system * Production monitoring tends to track constraint compliance while missing silent precision decay — the agent stops violating rules but also stops knowing the specifics * Scale doesn't fix structural problems: a small typed-state sidecar cuts running-balance error 25–50% with no model change * 00:00 — Four vignettes, one puzzle Four deployed-agent failures that the standard 'frozen weights = frozen system' mental model can't explain. * 02:05 — Reframing reliability as a lifespan property Why the apparatus around the model — memory, retrieval, compaction — is what actually changes over time. * 04:10 — The four aging mechanisms Compression, interference, revision, and maintenance aging — and why they split into accumulation-driven and event-driven families. * 06:30 — The counterfactual ladder A three-rung diagnostic that isolates write, read, and utilization failures by progressively swapping in oracle components. * 08:20 — Same score, different disease Empirical results showing models with near-identical error rates can have completely different failure breakdowns under the ladder. * 10:25 — The 4.5x compaction-prompt result How a one-paragraph change to summarization instructions extends agent half-life dramatically on the same underlying system. * 14:30 — Silent precision decay Why constraint-violation monitoring stays green while the agent quietly forgets the specifics it was supposed to remember. * 14:35 — Why scale doesn't save the running budget A small and a large model both drift on arithmetic over a session history because the failure is representational, not capacity-bound. * 16:41 — Honest critique Synthetic scenarios, simple memory architectures, and short session horizons — what the paper's numbers can and can't tell us. * 18:46 — Production CLI agents and re-reading Findings from Claude Code and OpenHands on why correct answers correlate with more retrieval, and why flagship models can write lower-fidelity artifacts. * 20:51 — The sticky note fix A small typed-state overlay alongside normal memory that cuts accumulator error substantially without changing the model. RECOMMENDED READING * MemGPT: Towards LLMs as Operating Systems [https://arxiv.org/abs/2310.08560] — Proposes a hierarchical memory system with explicit paging between context and external storage — directly relevant to the episode's argument that the fix for agent aging is structural, not bigger models. * Lost in the Middle: How Language Models Use Long Contexts [https://arxiv.org/abs/2307.03172] — Empirical evidence that models fail to utilize information even when it's present in context — the 'utilization failure' rung of the episode's counterfactual ladder. * Generative Agents: Interactive Simulacra of Human Behavior [https://arxiv.org/abs/2304.03442] — The Park et al. paper that popularized reflection-and-summarization memory architectures — exactly the kind of compaction-based stack whose aging dynamics this episode dissects. * Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks [https://arxiv.org/abs/2005.11401] — The original RAG paper, useful background for the episode's distinction between write failures, retrieval failures, and utilization failures in memory-augmented agents.

What If a Prompt Injection Never Left? Attacks That Wait in Agent Memory

WHAT IF A PROMPT INJECTION NEVER LEFT? ATTACKS THAT WAIT IN AGENT MEMORY Source: What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems [https://arxiv.org/abs/2606.04425] Paper was published on June 03, 2026 This episode was AI-generated on June 4, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs. Once an AI agent gains durable memory, the attacker no longer has to be in the room when the damage happens — they plant an instruction once and the agent's own startup routine pulls the trigger days later, in a totally different user's session. Drawing a sharp parallel to the stored-versus-reflected XSS attacks that haunted the web for a decade, a new paper measures exactly how often these cross-session attacks survive end to end. The answer, and the surprising split between which attacks work and which fail, is worth your attention if you build anything with agent memory. KEY TAKEAWAYS * Why classic prompt injection is contained to one session, while 'cross-session stored prompt injection' decouples the moment of injection from the moment it fires — reframing the problem from malicious input to state contamination * The clean experimental trick of wiping conversation history while leaving the environment intact, which isolates the persistent-state effect from ordinary in-session memory carryover * Why attack success multiplies across three independent gates — write, reload, and activation — and how that explains the counterintuitive result that the model with the lowest write rate ends up the most exploitable * The jewel finding: injecting a false fact activates essentially 100% of the time across all three models, while overriding a user's stated preference almost never works — because facts swim with the model's instinct to trust its context and preference overrides swim against it * Why disguising a payload as a legitimate business policy dramatically boosts the write rate but barely moves end-to-end success — revealing that the write gate and the execution gate are two genuinely different checks * The honest limits: it's a hand-built benchmark on three models, the headline 32–42% success rate depends heavily on the sandbox's write policy, and the paper tests exactly zero defenses * 00:00 — The attacker who isn't in the room Sets up the unsettling premise that an instruction planted in an agent's memory can wait and fire long after the attacker is gone, and introduces the paper and its central question. * 02:41 — Why language models can't tell orders from data Explains the root of prompt injection using the contractor-and-sticky-note analogy, and why classic injection was historically contained to a single session. * 05:23 — The cross-site scripting parallel Maps reflected versus stored XSS onto prompt injection, naming the new threat 'cross-session stored prompt injection' and framing it as state contamination rather than malicious input. * 08:04 — Context as a pipeline, and which channels persist Reframes the agent as a pipeline that assembles a prompt, and distinguishes auto-loaded 'note on the monitor' channels from conditionally retrieved 'note in the drawer' channels as the primary risk surface. * 10:46 — The session-reset experiment Walks through the methodological core — wiping conversation history but leaving the environment intact — that isolates persistent-state influence from ordinary memory carryover. * 13:27 — The three-leg relay race Breaks attack success into the write, reload, and activation gates whose rates multiply, explaining why the least-writeable model ends up the most exploitable overall. * 16:09 — Why false facts win and preference overrides lose Presents the paper's standout result — fact injection activates almost always while preference overrides almost never do — and explains it as swimming with versus against the model's instincts. * 18:50 — Disguise, and which gate it fools Shows that dressing a payload as a business policy boosts the write rate sharply but barely changes end-to-end success, distinguishing write-gate tricks from execution-gate tricks. * 21:32 — Honest limits and what's left open Pressure-tests the headline numbers as artifacts of the sandbox's write policy, flags the thin evidence on the most dangerous harm category, and notes that no defenses are tested. * 24:13 — Why it matters now Argues that agentic systems are at the same fork the web faced with stored XSS, and lays out actionable takeaways for hardening the write and incorporation gates before the threat becomes endemic. RECOMMENDED READING * Universal and Transferable Adversarial Attacks on Aligned Language Models [https://arxiv.org/abs/2307.15043] — Connects to the episode's 'swimming against the current' insight about activation by showing how attacks that fight a model's alignment can still be engineered to succeed, sharpening the question of why preference overrides mostly failed here. * Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection [https://arxiv.org/abs/2302.12173] — The foundational indirect prompt injection paper that establishes the 'reflected' baseline this episode contrasts against its stored, cross-session threat model.

5 jun 202626 min

Why Frozen-Weight Agents Still Get Worse Over Time

Beschrijving

Reacties

Probeer 14 dagen gratis

Alle afleveringen