How to Catch an AI Attack That No Single Conversation Reveals

Description

Source: Stateful Online Monitoring Catches Distributed Agent Attacks [https://arxiv.org/abs/2605.31593]
Paper was published on May 29, 2026.

This episode was AI-generated on June 1, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

An attacker can split a dangerous task into pieces so innocent that no safety monitor watching any one conversation will ever flag it, and a real espionage campaign already used exactly that trick. This episode digs into a paper that builds that distributed attack for real, then engineers a monitor that catches it live across millions of conversations, and discovers it accidentally catches ordinary jailbreaks too. You'll come away understanding why per-conversation monitoring is structurally blind, and what it takes to watch the whole building instead of one room.

KEY TAKEAWAYS
* Why a per-transcript safety monitor is architecturally unable to catch a distributed attack: it's not a weak lock, it's a lock on a door the attacker walks around
* How a model that scores zero percent on hard cyber exploits jumps to ~37-40 percent by coordinating stronger 'subagents' who each see only a disguised sliver of the task
* The stream-clustering and 'cheap glance' design that makes cross-conversation monitoring feasible at datacenter scale, catching ~85 percent of campaigns, ~30 percent earlier, with fewer false alarms
* The two failure modes a six-person, two-month red team found: renaming files to scatter evidence across clusters, and innocent-but-ambiguous traffic lingering as false positives
* The unplanned discovery that the same monitor catches ordinary jailbreaks for free, because real attackers iterate across many accounts and leave a cross-session fingerprint
* Why the defense and its deepest weakness are the same fact: it works because disguised pieces smell alike, and fails the moment a patient attacker changes the smell

* 00:00 - The espionage campaign nobody could see in real time
  A real-world attack ran through Claude Code with the task split into innocent-looking pieces, and it was only reconstructed by hand after the operation was finished.
* 02:57 - Sequential vs. distributed attacks
  The core distinction between breaking a harmful task into steps in one conversation (easily caught) and spreading it across many conversations (structurally invisible to per-transcript monitors).
* 05:55 - Proving the attack actually does harm
  The authors show a model too weak to solve hard exploits on its own reaches ~37-40 percent by coordinating compartmentalized subagents, and that standard monitors catch it only about 20 percent of the time.
* 08:53 - The scale constraint that breaks the naive defense
  Why re-clustering every conversation against every other one is computationally impossible, and why detection has to fire synchronously while the attack is still in progress.
* 11:51 - Stream clustering and the cheap-glance design
  How running cluster summaries, embeddings, and a rarely-fired expensive judge let the monitor pool scattered evidence cheaply and reason over groups of conversations.
* 14:49 - The results: more attacks, caught sooner, fewer false alarms
  On simulated datacenter traffic the stateful monitor improves recall, timing, and false-positive rate simultaneously, a rare result that isn't a trade-off.
* 17:46 - Red-teaming and the accidental jailbreak discovery
  A two-month red team finds two real failure modes, and the monitor turns out to catch ordinary jailbreaks because attackers iterate across many accounts.
* 20:44 - Limits, honest caveats, and what it reframes
  Where the detection advantage narrows as benign traffic grows, the simulated-data and single-model caveats, and the closing thesis of monitoring populations of users rather than isolated transcripts.

RECOMMENDED READING
* Sabotage Evaluations for Frontier Models [https://arxiv.org/abs/2410.21514]: Anthropic's framework for evaluating whether models can subvert oversight, directly relevant to the episode's theme of attacks that hide from monitors.
* AI Control: Improving Safety Despite Intentional Subversion [https://arxiv.org/abs/2312.06942]: The paper that formalized using monitors and protocols to catch misbehavior even from adversarial models, the conceptual backdrop to this episode's monitor-vs-attacker arms race.

The Trojan Is Your Agent's Memory: Why Single-Step Defenses Miss Persistent Attacks

Source: From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors [https://arxiv.org/abs/2605.31042]
Paper was published on May 29, 2026.

This episode was AI-generated on June 1, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

The famous prompt-injection attack barely works against frontier models anymore, so why does a multi-step version succeed 95% of the time against the very same model? It's because the danger moved from the chat box into the agent's persistent memory, and a new paper argues the entire deployed safety industry is defending the wrong moment. The fix flips the question from 'is this action dangerous?' to 'where did this instruction come from?'

KEY TAKEAWAYS
* Why classic prompt injection now succeeds at a near-zero rate, yet a slow attack smeared across files and sessions succeeds about 95% of the time against the same frontier model
* The core reframe: the dangerous moment isn't the harmful action, it's the earlier innocent step when untrusted text quietly becomes a future instruction
* How DASGuard's chain-of-custody provenance tracking, together with its draft-vs-sent-email distinction between sanitizing files and blocking irreversible actions, cuts attack success from 95% to under 16%
* The ablation that proves the insight is the contribution: remove just the source labels and attack success climbs back to 92.7%, even with detection and memory intact
* Why the 16% number deserves a grain of salt: no adaptive attacker, a benchmark and defense from the same team, a thin clean-task set, and a 13% false-positive rate
* Why the reframe outlasts the benchmark: provenance tracking is portable across agent harnesses, but recovery from an already-poisoned workspace remains wide open

* 00:00 - The attack with no visible moment
  An opening scenario where a planted policy line graduates into a trusted runbook rule and triggers harm days later, with no single step that looks dangerous.
* 02:54 - Why classic prompt injection stopped working
  The authors run AgentDojo and InjecAgent against undefended frontier models and find single-shot injection now succeeds at a near-zero rate, making the field think the problem is half-solved.
* 05:48 - The agentic harness and the persistence problem
  How memory that survives across sessions creates a brand-new place for attackers to hide, and why the right question shifts from 'is this safe?' to 'where did this come from?'
* 08:43 - Relocating the trojan to the workspace
  Borrowing the backdoor concept from classic security and pointing the trigger at persistent workspace state rather than a secret token or pixel pattern.
* 11:37 - ClawTrojan and the 95% number
  How the benchmark builds runnable sandboxes and validates full multi-step attack chains, including fragmented payloads, that succeed roughly 95% of the time.
* 14:32 - How DASGuard works: detect, attribute, sanitize
  A walkthrough of the three gates, the content-source graph that propagates suspicion across steps, and the shadow workspace that cleans files instead of just blocking.
* 17:26 - The results and the ablation that proves the point
  DASGuard drops attack success to under 16% while nine baselines barely move the needle, and removing provenance alone reverts the defense to near-undefended.
* 20:21 - Where the numbers deserve skepticism
  A steelman critique covering the same-team benchmark, the absence of an adaptive attacker, the thin clean-task set, false positives, and adapted baselines.
* 23:15 - What survives the paper
  Why the conceptual relocation (treat the workspace as something to defend, and never let a stranger's note become your agent's rule) outlasts the provisional metrics.

RECOMMENDED READING
* Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection [https://arxiv.org/abs/2302.12173]: The foundational treatment of indirect prompt injection, the single-shot attack this episode argues frontier models now shrug off, setting up the persistence reframe.
* Defeating Prompt Injections by Design (CaMeL) [https://arxiv.org/abs/2503.18813]: The data-flow defense the episode singles out as the strongest baseline, whose notion of provenance gets it 'halfway to the right idea' but stops short of persistent state.
* AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents [https://arxiv.org/abs/2406.13352]: One of the two standard benchmarks the authors run to show single-shot injection now fails, motivating their multi-step ClawTrojan chains.
* InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents [https://arxiv.org/abs/2403.02691]: The second benchmark used as the near-zero baseline, illustrating the gap between obvious single-context injection and the smeared-across-time attack this episode centers on.

Yesterday · 26 min
