Why Frozen-Weight Agents Still Get Worse Over Time

Description

WHY FROZEN-WEIGHT AGENTS STILL GET WORSE OVER TIME

Source: Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems [https://arxiv.org/abs/2605.26302]
Paper was published on May 25, 2026.

This episode was AI-generated on May 27, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A deployed AI agent's model weights never change, but the agent itself ages, and it ages in at least four mechanistically distinct ways. A new paper introduces a diagnostic ladder that can tell those failure modes apart, and shows that a one-paragraph change to how an agent summarizes its own memory can extend its useful lifespan by more than four times.

KEY TAKEAWAYS
* Agent reliability is a lifespan property, not a benchmark snapshot: the memory store, retrieval, and compaction around a frozen model keep changing every session
* Four named failure modes (compression, interference, revision, and maintenance aging), split into accumulation-driven and event-driven families
* The counterfactual ladder: a three-rung diagnostic that isolates write failures, read failures, and utilization failures without needing model internals
* Three models with nearly identical error rates can have completely different underlying diseases, and 'add more memory' is the wrong fix for two of them
* A one-paragraph 'careful' compaction prompt that names what to preserve verbatim yields roughly a 4.5x lifespan improvement on the same system
* Production monitoring tends to track constraint compliance while missing silent precision decay: the agent stops violating rules but also stops knowing the specifics
* Scale doesn't fix structural problems: a small typed-state sidecar cuts running-balance error 25–50% with no model change

* 00:00 - Four vignettes, one puzzle
  Four deployed-agent failures that the standard 'frozen weights = frozen system' mental model can't explain.
* 02:05 - Reframing reliability as a lifespan property
  Why the apparatus around the model (memory, retrieval, compaction) is what actually changes over time.
* 04:10 - The four aging mechanisms
  Compression, interference, revision, and maintenance aging, and why they split into accumulation-driven and event-driven families.
* 06:30 - The counterfactual ladder
  A three-rung diagnostic that isolates write, read, and utilization failures by progressively swapping in oracle components.
* 08:20 - Same score, different disease
  Empirical results showing models with near-identical error rates can have completely different failure breakdowns under the ladder.
* 10:25 - The 4.5x compaction-prompt result
  How a one-paragraph change to summarization instructions extends agent half-life dramatically on the same underlying system.
* 14:30 - Silent precision decay
  Why constraint-violation monitoring stays green while the agent quietly forgets the specifics it was supposed to remember.
* 14:35 - Why scale doesn't save the running budget
  A small and a large model both drift on arithmetic over a session history because the failure is representational, not capacity-bound.
* 16:41 - Honest critique
  Synthetic scenarios, simple memory architectures, and short session horizons: what the paper's numbers can and can't tell us.
* 18:46 - Production CLI agents and re-reading
  Findings from Claude Code and OpenHands on why correct answers correlate with more retrieval, and why flagship models can write lower-fidelity artifacts.
* 20:51 - The sticky note fix
  A small typed-state overlay alongside normal memory that cuts accumulator error substantially without changing the model.

RECOMMENDED READING
* MemGPT: Towards LLMs as Operating Systems [https://arxiv.org/abs/2310.08560]
  Proposes a hierarchical memory system with explicit paging between context and external storage. Directly relevant to the episode's argument that the fix for agent aging is structural, not bigger models.
* Lost in the Middle: How Language Models Use Long Contexts [https://arxiv.org/abs/2307.03172]
  Empirical evidence that models fail to utilize information even when it's present in context: the 'utilization failure' rung of the episode's counterfactual ladder.
* Generative Agents: Interactive Simulacra of Human Behavior [https://arxiv.org/abs/2304.03442]
  The Park et al. paper that popularized reflection-and-summarization memory architectures, exactly the kind of compaction-based stack whose aging dynamics this episode dissects.
* Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks [https://arxiv.org/abs/2005.11401]
  The original RAG paper, useful background for the episode's distinction between write failures, retrieval failures, and utilization failures in memory-augmented agents.
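
The counterfactual ladder from the takeaways can be read as simple bookkeeping: rerun the same tasks while swapping in oracle components one rung at a time, then attribute each drop in failures to the component that was just replaced. The sketch below is one possible reading, not the paper's implementation. The function name `attribute`, the `LadderResult` type, the exact definition of each rung, and all of the counts are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LadderResult:
    write: int        # failures that vanish once the memory store is perfect
    read: int         # further failures that vanish once retrieval is perfect
    utilization: int  # failures that remain with the right facts in context


def attribute(baseline: int, oracle_write: int, oracle_write_read: int) -> LadderResult:
    """Split a failure count by which oracle swap removed it.

    All arguments are failure counts on the same task set (hypothetical setup):
      baseline          - the agent as deployed
      oracle_write      - memory store replaced with ground-truth contents
      oracle_write_read - ground-truth store plus ground-truth retrieval
    """
    # Simplifying assumption: adding an oracle never increases failures.
    if not baseline >= oracle_write >= oracle_write_read >= 0:
        raise ValueError("expected failures to shrink as oracles are added")
    return LadderResult(
        write=baseline - oracle_write,
        read=oracle_write - oracle_write_read,
        utilization=oracle_write_read,
    )


# Made-up counts: both agents fail 30 tasks, but for different reasons.
agent_a = attribute(30, 8, 6)   # dominated by write failures
agent_b = attribute(30, 28, 5)  # dominated by read failures
print(agent_a)
print(agent_b)
```

Under these invented numbers the two agents share a headline error count, yet only the first would benefit from better memory writes, which is the 'same score, different disease' point in miniature.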

The Trojan Is Your Agent's Memory: Why Single-Step Defenses Miss Persistent Attacks

THE TROJAN IS YOUR AGENT'S MEMORY: WHY SINGLE-STEP DEFENSES MISS PERSISTENT ATTACKS

Source: From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors [https://arxiv.org/abs/2605.31042]
Paper was published on May 29, 2026.

This episode was AI-generated on June 1, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

The famous prompt-injection attack barely works against frontier models anymore, so why does a multi-step version succeed 95% of the time against the very same model? It's because the danger moved from the chat box into the agent's persistent memory, and a new paper argues the entire deployed safety industry is defending the wrong moment. The fix flips the question from 'is this action dangerous?' to 'where did this instruction come from?'

KEY TAKEAWAYS
* Why classic prompt injection now succeeds at a near-zero rate, yet a slow attack smeared across files and sessions succeeds about 95% of the time against the same frontier model
* The core reframe: the dangerous moment isn't the harmful action, it's the earlier innocent step when untrusted text quietly becomes a future instruction
* How DASGuard's chain-of-custody provenance tracking, together with its draft-vs-sent-email distinction between sanitizing files and blocking irreversible actions, cuts attack success from 95% to under 16%
* The ablation that proves the insight is the contribution: remove just the source labels and the whole defense collapses, with attack success back at 92.7%, even with detection and memory intact
* Why the 16% number deserves a grain of salt: no adaptive attacker, a benchmark and defense from the same team, a thin clean-task set, and a 13% false-positive rate
* Why the reframe outlasts the benchmark: provenance tracking is portable across agent harnesses, but recovery from an already-poisoned workspace remains wide open

* 00:00 - The attack with no visible moment
  An opening scenario where a planted policy line graduates into a trusted runbook rule and triggers harm days later, with no single step that looks dangerous.
* 02:54 - Why classic prompt injection stopped working
  The authors run AgentDojo and InjecAgent against undefended frontier models and find single-shot injection now succeeds at a near-zero rate, making the field think the problem is half-solved.
* 05:48 - The agentic harness and the persistence problem
  How memory that survives across sessions creates a brand-new place for attackers to hide, and why the right question shifts from 'is this safe?' to 'where did this come from?'
* 08:43 - Relocating the trojan to the workspace
  Borrowing the backdoor concept from classic security and pointing the trigger at persistent workspace state rather than a secret token or pixel pattern.
* 11:37 - ClawTrojan and the 95% number
  How the benchmark builds runnable sandboxes and validates full multi-step attack chains, including fragmented payloads, that succeed roughly 95% of the time.
* 14:32 - How DASGuard works: detect, attribute, sanitize
  A walkthrough of the three gates, the content-source graph that propagates suspicion across steps, and the shadow workspace that cleans files instead of just blocking.
* 17:26 - The results and the ablation that proves the point
  DASGuard drops attack success to under 16% while nine baselines barely move the needle, and removing provenance alone reverts the defense to near-undefended.
* 20:21 - Where the numbers deserve skepticism
  A steelman critique covering the same-team benchmark, the absence of an adaptive attacker, the thin clean-task set, false positives, and adapted baselines.
* 23:15 - What survives the paper
  Why the conceptual relocation (treat the workspace as something to defend, and never let a stranger's note become your agent's rule) outlasts the provisional metrics.

RECOMMENDED READING
* Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection [https://arxiv.org/abs/2302.12173]
  The foundational treatment of indirect prompt injection, the single-shot attack this episode argues frontier models now shrug off, which sets up the persistence reframe.
* Defeating Prompt Injections by Design (CaMeL) [https://arxiv.org/abs/2503.18813]
  The data-flow defense the episode singles out as the strongest baseline, whose notion of provenance gets it 'halfway to the right idea' but stops short of persistent state.
* AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents [https://arxiv.org/abs/2406.13352]
  One of the two standard benchmarks the authors run to show single-shot injection now fails, motivating their multi-step ClawTrojan chains.
* InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents [https://arxiv.org/abs/2403.02691]
  The second benchmark used as the near-zero baseline, illustrating the gap between obvious single-context injection and the smeared-across-time attack this episode centers on.
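
The provenance idea in the takeaways (label where text came from, carry that label forward as text is copied into new files, and treat irreversible actions more strictly than reversible ones) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DASGuard's actual design: the `Trust` levels, the `Item` type, the `derive` and `gate` functions, and the file names are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable


class Trust(IntEnum):
    UNTRUSTED = 0  # e.g. a fetched web page or an inbound email
    USER = 1       # typed directly by the operator


@dataclass(frozen=True)
class Item:
    name: str
    trust: Trust


def derive(name: str, sources: Iterable[Item]) -> Item:
    """A derived workspace item inherits the lowest trust among its sources."""
    return Item(name, min(s.trust for s in sources))


def gate(instruction_source: Item, irreversible: bool) -> str:
    """Decide based on where the instruction came from, not only what it does."""
    if instruction_source.trust == Trust.USER:
        return "allow"
    # Untrusted origin: clean up reversible writes, refuse irreversible actions.
    return "block" if irreversible else "sanitize"


web_page = Item("vendor_page.html", Trust.UNTRUSTED)
user_note = Item("user_request", Trust.USER)

# A runbook assembled partly from the web page carries the untrusted label,
# even though the step that wrote it looked harmless.
runbook = derive("runbook.md", [web_page, user_note])

print(gate(runbook, irreversible=True))    # sending an email
print(gate(runbook, irreversible=False))   # saving a draft
print(gate(user_note, irreversible=True))
```

Dropping the `trust` field from `Item` leaves `gate` with nothing to decide on, which is a toy version of the ablation in which removing source labels undoes the defense.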
