AI Papers: A Deep Dive
WHAT IF A PROMPT INJECTION NEVER LEFT? ATTACKS THAT WAIT IN AGENT MEMORY

Source: What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems [https://arxiv.org/abs/2606.04425]
Paper was published on June 03, 2026.

This episode was AI-generated on June 4, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Once an AI agent gains durable memory, the attacker no longer has to be in the room when the damage happens: they plant an instruction once and the agent's own startup routine pulls the trigger days later, in a totally different user's session. Drawing a sharp parallel to the stored-versus-reflected XSS attacks that haunted the web for a decade, a new paper measures exactly how often these cross-session attacks survive end to end. The answer, and the surprising split between which attacks work and which fail, is worth your attention if you build anything with agent memory.
KEY TAKEAWAYS
* Why classic prompt injection is contained to one session, while 'cross-session stored prompt injection' decouples the moment of injection from the moment it fires, reframing the problem from malicious input to state contamination
* The clean experimental trick of wiping conversation history while leaving the environment intact, which isolates the persistent-state effect from ordinary in-session memory carryover
* Why attack success multiplies across three independent gates (write, reload, and activation) and how that explains the counterintuitive result that the model with the lowest write rate ends up the most exploitable
* The jewel finding: injecting a false fact activates essentially 100% of the time across all three models, while overriding a user's stated preference almost never works, because facts swim with the model's instinct to trust its context and preference overrides swim against it
* Why disguising a payload as a legitimate business policy dramatically boosts the write rate but barely moves end-to-end success, revealing that the write gate and the execution gate are two genuinely different checks
* The honest limits: it's a hand-built benchmark on three models, the headline 32–42% success rate depends heavily on the sandbox's write policy, and the paper tests exactly zero defenses

* 00:00 - The attacker who isn't in the room. Sets up the unsettling premise that an instruction planted in an agent's memory can wait and fire long after the attacker is gone, and introduces the paper and its central question.
* 02:41 - Why language models can't tell orders from data. Explains the root of prompt injection using the contractor-and-sticky-note analogy, and why classic injection was historically contained to a single session.
* 05:23 - The cross-site scripting parallel. Maps reflected versus stored XSS onto prompt injection, naming the new threat 'cross-session stored prompt injection' and framing it as state contamination rather than malicious input.
* 08:04 - Context as a pipeline, and which channels persist. Reframes the agent as a pipeline that assembles a prompt, and distinguishes auto-loaded 'note on the monitor' channels from conditionally retrieved 'note in the drawer' channels as the primary risk surface.
* 10:46 - The session-reset experiment. Walks through the methodological core (wiping conversation history but leaving the environment intact) that isolates persistent-state influence from ordinary memory carryover.
* 13:27 - The three-leg relay race. Breaks attack success into the write, reload, and activation gates whose rates multiply, explaining why the least-writeable model ends up the most exploitable overall.
* 16:09 - Why false facts win and preference overrides lose. Presents the paper's standout result (fact injection activates almost always while preference overrides almost never do) and explains it as swimming with versus against the model's instincts.
* 18:50 - Disguise, and which gate it fools. Shows that dressing a payload as a business policy boosts the write rate sharply but barely changes end-to-end success, distinguishing write-gate tricks from execution-gate tricks.
* 21:32 - Honest limits and what's left open. Pressure-tests the headline numbers as artifacts of the sandbox's write policy, flags the thin evidence on the most dangerous harm category, and notes that no defenses are tested.
* 24:13 - Why it matters now. Argues that agentic systems are at the same fork the web faced with stored XSS, and lays out actionable takeaways for hardening the write and incorporation gates before the threat becomes endemic.
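
For readers who want the three-gate arithmetic spelled out, here is a minimal Python sketch. It assumes, as the episode describes, that the write, reload, and activation gates are independent, so their rates multiply. The `GateRates` class and every number in it are hypothetical illustrations, not figures from the paper; they only show how a model that writes less often can still come out more exploitable end to end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateRates:
    """Per-gate success rates for one model (all values used here are hypothetical)."""

    write: float       # chance the payload is written to persistent state
    reload: float      # chance a later session loads that state into context
    activation: float  # chance the model then acts on the loaded payload

    def __post_init__(self) -> None:
        # Each rate is a probability, so reject anything outside [0, 1].
        for name in ("write", "reload", "activation"):
            rate = getattr(self, name)
            if not 0.0 <= rate <= 1.0:
                raise ValueError(f"{name} rate must be in [0, 1], got {rate}")

    def end_to_end(self) -> float:
        # Assumes independent gates: the attack succeeds only if all three pass.
        return self.write * self.reload * self.activation


# Made-up numbers, chosen only to illustrate the multiplication effect.
cautious_writer = GateRates(write=0.5, reload=1.0, activation=0.8)
eager_writer = GateRates(write=0.9, reload=1.0, activation=0.3)

for label, rates in (("cautious writer", cautious_writer), ("eager writer", eager_writer)):
    print(f"{label}: end-to-end = {rates.end_to_end():.2f}")
```

With these invented rates, the cautious writer lands at 0.40 and the eager writer at 0.27: a low write rate does not help much if whatever does get written activates reliably.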
RECOMMENDED READING
* Universal and Transferable Adversarial Attacks on Aligned Language Models [https://arxiv.org/abs/2307.15043]: Connects to the episode's 'swimming against the current' insight about activation by showing how attacks that fight a model's alignment can still be engineered to succeed, sharpening the question of why preference overrides mostly failed here.
* Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection [https://arxiv.org/abs/2302.12173]: The foundational indirect prompt injection paper that establishes the 'reflected' baseline this episode contrasts against its stored, cross-session threat model.