Mamba's Memory Problem

Description

State space models like Mamba promised linear scaling and constant memory. They delivered on efficiency, but researchers kept hitting the same wall: ask Mamba to recall something specific from early in a long context, and performance drops. Three papers at ICLR 2026 independently attacked this limitation. That convergence tells you how fundamental the problem is.

This podcast breaks down:
- Why Mamba's fixed-size state causes "lossy compression" of context
- How Mixture of Memories (MoM) adds multiple internal memory banks
- How Log-Linear Attention finds a middle ground between SSM and full attention
- Why one paper proves SSMs fundamentally can't solve certain tasks without external tools

The pattern across all three: you can add more state, but you have to pay somewhere. Parameters, mechanism complexity, or system infrastructure. No free lunch.

📄 Papers covered:
- MoM: Linear Sequence Modeling with Mixture-of-Memories https://arxiv.org/abs/2502.13685
- Log-Linear Attention https://openreview.net/forum?id=mOJgZWkXKW
- To Infinity and Beyond: Tool-Use Unlocks Length Generalization in SSMs https://openreview.net/forum?id=sSfep4udCb

📬 Newsletter: https://llmsresearch.substack.com
🐦 Twitter/X: https://x.com/llmsresearch
💻 GitHub: https://github.com/llmsresearch

#Mamba #SSM #StateSpaceModels #ICLR2026 #LLM #MachineLearning #AIResearch #Transformers #DeepLearning

Chapters
0:00 Mamba's secret weakness
0:42 The promise: linear scaling, constant memory
1:14 The catch: forgetting specific details
1:34 Memory bottleneck explained
1:43 Attention = perfect recall filing cabinet
2:10 SSM = single notepad with fixed pages
2:49 The core tradeoff
2:57 Three solutions to fix it
3:00 Solution 1: Mixture of Memories (MoM)
3:51 Solution 2: Log-Linear Attention
4:48 Solution 3: External tool use
5:49 The "no free lunch" pattern
6:41 What wins for longer contexts?
7:04 Subscribe for more research deep dives

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit llmsresearch.substack.com [https://llmsresearch.substack.com?utm_medium=podcast&utm_campaign=CTA_1]
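
The filing cabinet versus notepad contrast in this episode can be made concrete with a rough count of stored numbers. The sketch below is illustrative only: the function names and the layer, width, and state sizes are assumptions chosen for the example, not figures from Mamba or from any of the three papers, and it ignores details such as attention heads and expansion factors.

```python
def kv_cache_floats(seq_len: int, n_layers: int, d_model: int) -> int:
    """Attention keeps one key and one value vector per past token, per layer."""
    return 2 * seq_len * n_layers * d_model


def ssm_state_floats(n_layers: int, d_inner: int, d_state: int) -> int:
    """An SSM keeps one fixed-size recurrent state per layer, whatever the length."""
    return n_layers * d_inner * d_state


if __name__ == "__main__":
    # Hypothetical model shape, used only to show the trend.
    for seq_len in (1_000, 10_000, 100_000):
        attention = kv_cache_floats(seq_len, n_layers=24, d_model=1024)
        ssm = ssm_state_floats(n_layers=24, d_inner=1024, d_state=16)
        print(f"tokens={seq_len:>7}  attention cache={attention:>13}  ssm state={ssm}")
```

The attention count grows with every token while the SSM count never changes, which is why a long context has to be squeezed into the same fixed state and early details can be lost.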

What ICLR 2026 Taught Us About Multi-Agent Failures

Episode Title: What ICLR 2026 Taught Us About Multi-Agent Failures

Episode Summary: We scanned ICLR 2026 accepted papers and found 14 that address real problems when building multi-agent systems: slow pipelines, expensive token bills, cascading errors, brittle topologies, and opaque agent coordination. This episode walks through five production problems and the research that provides concrete solutions.

Timestamps
00:00 Introduction: The gap between demos and production
01:29 Problem 1: Why is my agent system so slow?
04:44 Problem 2: My token bills are out of control
07:30 Problem 3: One agent hallucinates, the whole pipeline fails
10:45 Problem 4: My agent graph breaks when I swap a model
12:53 Problem 5: I have no idea what my agents are saying to each other
15:39 Recap: The practitioner's toolkit
16:33 What's still missing: Long-term stability and adversarial robustness
17:02 Closing

Papers Discussed

Problem 1: Latency
- Speculative Actions [https://openreview.net/forum?id=P0GOk5wslg] - Uses faster draft models to predict likely actions and execute API calls in parallel. Up to 30% speedup across web search and OS control tasks.
- Graph-of-Agents [https://openreview.net/forum?id=34cANdsHKV] - Uses model cards to filter agents by relevance. Beat a 6-agent baseline using only 3 selected agents.

Problem 2: Token Costs
- KVComm [https://openreview.net/forum?id=F7rUng23nw] - Shares KV cache directly instead of translating to English. 30% of KV layers achieves near-full performance.
- MEM1 [https://openreview.net/forum?id=XY8AaxDSLb] - Uses RL-based memory consolidation to maintain constant context size. 3.7x memory reduction, 3.5x performance improvement.

Problem 3: Error Cascades
- When Does Divide and Conquer Work [https://openreview.net/forum?id=ddQFUuHDDt] - Noise decomposition framework identifying task noise, model noise (superlinear growth), and aggregator noise.
- DoVer [https://openreview.net/forum?id=mrEK16Jy6h] - Intervention-driven debugging that edits message history to validate failure hypotheses. Flips 28% of failures to successes.

Problem 4: Brittle Topologies
- CARD [https://openreview.net/forum?id=JgvJdICc6P] - Conditional graph generation that adapts topology based on environmental signals.
- MAS² [https://openreview.net/forum?id=qumy27hMDY] - Generator-implementer-rectifier team that self-architects agent structures. 19.6% performance gain with cross-backbone generalization.
- Stochastic Self-Organization [https://openreview.net/forum?id=rS3Jb9AAej] - Decentralized approach using Shapley value approximations. Hierarchy emerges from competence without explicit design.

Problem 5: Observability
- GLC [https://openreview.net/forum?id=a3CUE06G5Y] - Autoencoder creates compressed symbols aligned with human concepts via contrastive learning. Speed of symbols, auditability of words.
- Emergent Coordination [https://openreview.net/forum?id=SRn1MtMPRq] - Information-theoretic metrics distinguishing real collaboration from "spurious temporal coupling." Key finding: you must prompt for theory of mind.
- ROTE / Modeling Others' Minds as Code [https://openreview.net/forum?id=vHXo7xIer6] - Models agent behavior as executable scripts. 50% improvement in prediction accuracy.

Key Concepts Explained
- Speculative execution: Borrowed from CPU architecture. Guess the next action, execute in parallel, discard if wrong.
- KV cache: The model's working memory of conversation context, stored as mathematical vectors.
- Model noise: Confusion that grows with context size. Grows superlinearly with input length.
- Shapley values: Game theory concept for assigning credit to players in cooperative games.
- Spurious temporal coupling: Agents appearing to collaborate but actually solving problems independently at the same time.
- Contrastive learning: Pushing similar things closer and different things further apart in vector space.
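
The "guess, execute in parallel, discard if wrong" idea behind speculative execution can be sketched in a few lines. This is a minimal illustration, not the method from the Speculative Actions paper: `speculative_step` and the toy `draft`, `planner`, and `execute` functions are hypothetical names, and the sketch assumes the speculated call has no side effects, so a wrong guess can simply be thrown away.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def speculative_step(
    state: str,
    draft: Callable[[str], str],
    planner: Callable[[str], str],
    execute: Callable[[str], str],
) -> Tuple[str, bool]:
    """Run one agent step and return (result, speculation_hit)."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        guess = draft(state)                       # cheap guess at the next action
        speculative = pool.submit(execute, guess)  # start the call before we are sure
        actual = planner(state)                    # slow, authoritative decision
        if actual == guess:
            return speculative.result(), True      # guess was right: reuse the work
    # Guess was wrong: leaving the with-block waited for the speculative call,
    # and its result is discarded here. Run the action the planner chose.
    return execute(actual), False


# Toy stand-ins (assumptions) for a draft model, a slow planner, and an API call.
def toy_draft(state: str) -> str:
    return "search:" + state


def toy_planner(state: str) -> str:
    return "search:" + state if state != "login" else "click:login"


def toy_execute(action: str) -> str:
    return "result of " + action


if __name__ == "__main__":
    print(speculative_step("weather", toy_draft, toy_planner, toy_execute))
    print(speculative_step("login", toy_draft, toy_planner, toy_execute))
```

Any speedup comes from the overlap between the planner's thinking time and the speculated call, so it depends on how often the draft guess matches the planner's choice.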

Key Quotes
- "English is a terrible data transfer protocol for machines. We're taking clean mathematical concepts, translating them into paragraphs, and then asking another machine to turn them back into math."
- "The hierarchy emerges from competence. You don't design it."
- "Did they solve it together, or did they just all happen to solve it at the same time by themselves?"
- "Treating minds as software is a pretty effective way to predict what software will do."

Links
- Newsletter: llmsresearch.substack.com [http://llmsresearch.substack.com]
- Twitter/X: @llmsresearch [https://twitter.com/llmsresearch]

Jan 31, 2026 · 17 min
