When Three LLMs Talk to Each Other, Their Ideas Quietly Stop Moving

Descripción

WHEN THREE LLMS TALK TO EACH OTHER, THEIR IDEAS QUIETLY STOP MOVING Source: Multi-LLM Systems Exhibit Robust Semantic Collapse [https://arxiv.org/abs/2605.17193] Paper was published on May 16, 2026 This episode was AI-generated on May 23, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs. Put three large language models in a room with no task and let them talk for a thousand rounds, and something striking happens: their vocabulary keeps growing, but the meaning of what they're saying barely moves. A new paper runs that experiment, tries twelve different ways to break the pattern, fails every time, and traces the cause to specific circuits inside the models — with real consequences for anyone betting on autonomous AI research pipelines. KEY TAKEAWAYS * Why multi-LLM conversations grow new vocabulary while their semantic content stays anchored near the starting point — about three times more anchored than human Reddit threads * How twelve intervention categories (temperature, prompts, personas, model mixing, removing safety training, reducing sycophancy, scaling agents, external shocks) all failed to produce more semantic diversity * The counterintuitive RL result: training models to be diverse made independent runs look more like each other, not less * The induction-head mechanism — look-back-and-copy circuits that get louder as conversations lengthen, while rare tokens get systematically forgotten * Why the Data Processing Inequality explains, in principle, why no closed-loop intervention can recover lost semantic diversity * Where the paper's claims are strong (empirical collapse, mechanistic story in Llama) and where they overreach (civilizational implications, single RL recipe) * 00:00 — Lovelace's question, reframed as an experiment How an 1843 worry about whether machines can originate anything becomes a concrete test you can run on modern LLMs. * 03:30 — The setup and the headline result Three LLMs talking with no task, measured on lexical versus semantic diversity — and the gap between the two curves. * 07:00 — Twelve ways to break the pattern, all failing A tour of every plausible escape hatch the authors tested, from temperature and prompts to uncensored models and direct reinforcement learning. * 10:30 — Opening up the model: induction heads and a vanishing tail What teacher-forcing replay on Llama-3.1-8B reveals about the circuits driving the collapse and the rare tokens that disappear along the way. * 13:31 — The Data Processing Inequality and why closed loops can't recover The information-theoretic argument that connects the empirical finding to a much older intuition about closed channels. * 17:30 — Caveats: the embedding model, the no-task setup, and the single architecture Where a careful skeptic should push back on the paper's measurements, scope, and mechanistic generalization. * 21:00 — Different models, different basins Why collapse doesn't dissolve model identity — it sharpens it, with a classifier reaching 94% accuracy at telling models apart late in conversations. * 24:30 — What this means for autonomous AI science and model collapse The implications for closed-loop research pipelines, the compounding of inference-time and training-time collapse, and the more speculative epistemic worries. RECOMMENDED READING * The Curse of Recursion: Training on Generated Data Makes Models Forget [https://arxiv.org/abs/2305.17493] — The Shumailov et al. paper on training-side model collapse that this episode positions as the upstream counterpart to inference-time semantic collapse. * In-context Learning and Induction Heads [https://arxiv.org/abs/2209.11895] — The Anthropic paper characterizing the induction-head circuits that the episode identifies as the mechanistic culprit behind LLMs echoing their own conversational history. * The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery [https://arxiv.org/abs/2408.06292] — A flagship example of the autonomous closed-loop AI research pipeline whose feasibility this episode's findings most directly challenge.

The Trojan Is Your Agent's Memory: Why Single-Step Defenses Miss Persistent Attacks

THE TROJAN IS YOUR AGENT'S MEMORY: WHY SINGLE-STEP DEFENSES MISS PERSISTENT ATTACKS Source: From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors [https://arxiv.org/abs/2605.31042] Paper was published on May 29, 2026 This episode was AI-generated on June 1, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs. The famous prompt-injection attack barely works against frontier models anymore — so why does a multi-step version succeed 95% of the time against the very same model? It's because the danger moved from the chat box into the agent's persistent memory, and a new paper argues the entire deployed safety industry is defending the wrong moment. The fix flips the question from 'is this action dangerous?' to 'where did this instruction come from?' KEY TAKEAWAYS * Why classic prompt injection now fails at near-zero, yet a slow attack smeared across files and sessions succeeds about 95% of the time against the same frontier model * The core reframe: the dangerous moment isn't the harmful action, it's the earlier innocent step when untrusted text quietly becomes a future instruction * How DASGuard's chain-of-custody provenance tracking — and its draft-vs-sent-email distinction between sanitizing files and blocking irreversible actions — cuts attack success from 95% to under 16% * The ablation that proves the insight is the contribution: remove just the source labels and the whole defense collapses back to 92.7%, even with detection and memory intact * Why the 16% number deserves grains of salt — no adaptive attacker, a benchmark and defense from the same team, a thin clean-task set, and a 13% false-positive rate * Why the reframe outlasts the benchmark: provenance tracking is portable across agent harnesses, but recovery from an already-poisoned workspace remains wide open * 00:00 — The attack with no visible moment An opening scenario where a planted policy line graduates into a trusted runbook rule and triggers harm days later, with no single step that looks dangerous. * 02:54 — Why classic prompt injection stopped working The authors run AgentDojo and InjecAgent against undefended frontier models and find single-shot injection now fails at near-zero — making the field think the problem is half-solved. * 05:48 — The agentic harness and the persistence problem How memory that survives across sessions creates a brand-new place for attackers to hide, and why the right question shifts from 'is this safe?' to 'where did this come from?' * 08:43 — Relocating the trojan to the workspace Borrowing the backdoor concept from classic security and pointing the trigger at persistent workspace state rather than a secret token or pixel pattern. * 11:37 — ClawTrojan and the 95% number How the benchmark builds runnable sandboxes and validates full multi-step attack chains — including fragmented payloads — that succeed roughly 95% of the time. * 14:32 — How DASGuard works: detect, attribute, sanitize A walkthrough of the three gates, the content-source graph that propagates suspicion across steps, and the shadow workspace that cleans files instead of just blocking. * 17:26 — The results and the ablation that proves the point DASGuard drops attack success to under 16% while nine baselines barely move the needle, and removing provenance alone reverts the defense to near-undefended. * 20:21 — Where the numbers deserve skepticism A steelman critique covering the same-team benchmark, the absence of an adaptive attacker, the thin clean-task set, false positives, and adapted baselines. * 23:15 — What survives the paper Why the conceptual relocation — treat the workspace as something to defend, and never let a stranger's note become your agent's rule — outlasts the provisional metrics. RECOMMENDED READING * Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection [https://arxiv.org/abs/2302.12173] — The foundational treatment of indirect prompt injection — the single-shot attack this episode argues frontier models now shrug off, setting up the persistence reframe. * Defeating Prompt Injections by Design (CaMeL) [https://arxiv.org/abs/2503.18813] — The data-flow defense the episode singles out as the strongest baseline, whose notion of provenance gets it 'halfway to the right idea' but stops short of persistent state. * AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents [https://arxiv.org/abs/2406.13352] — One of the two standard benchmarks the authors run to show single-shot injection now fails, motivating their multi-step ClawTrojan chains. * InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents [https://arxiv.org/abs/2403.02691] — The second benchmark used as the near-zero baseline, illustrating the gap between obvious single-context injection and the smeared-across-time attack this episode centers on.

2 de jun de 202626 min

When Three LLMs Talk to Each Other, Their Ideas Quietly Stop Moving

Descripción

Comentarios

Empieza 7 días de prueba

Todos los episodios