AI Papers: A Deep Dive

Why Long-Context Models Might Need Compute, Not Capacity, Before Eviction

24 min · May 26, 2026

Description

Source: Language Models Need Sleep [https://arxiv.org/abs/2605.26099]
Paper was published on May 25, 2026.
This episode was AI-generated on May 26, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

For two years the long-context modeling community has been arguing about how much information you can squeeze into a fixed-size memory. A new paper says that's the wrong axis entirely: the bottleneck isn't how big the whiteboard is, it's how much thinking happened while writing on it. The fix is a 'sleep' phase that loops compute over context right before the cache gets cleared, with no cost at answer time.

KEY TAKEAWAYS

* The reframe at the heart of the paper: a hybrid model's fast weight isn't a storage device, it's the residue of a one-pass computation, and shallow computation produces shallow residue regardless of capacity
* Why the Rule 110 cellular automaton experiment is unusually clean: it holds stored information constant while varying required computation, isolating compute-for-reasoning from memory-for-storage
* The deployment win: extra 'sleep' compute is paid during ingestion, not at answer time, so inference latency is unchanged while training cost scales linearly with loop count N
* Concrete gains: two-operation GSM-Infinite problems jump from ~60% to ~90% accuracy with four sleep loops in the sliding-window setting; harder six-operation problems on Ouro go from ~42% to ~62%
* The honest limits: the real-task gains tangle 'reasoning' with 'retrieval under constrained windows,' comparisons are mostly against the no-loop version of the same architecture, and the method needs careful two-stage training to work
* Why the conceptual contribution may outlast the specific mechanism: it splits inference into a compute-rich ingestion phase and a latency-constrained answer phase, a framing likely to show up in other architectures

CHAPTERS

* 00:00 - The Polaroid problem and the notebook-vs-whiteboard setup. A chess thought experiment introduces the gap between storing a position and computing forward from it, then frames how attention's exact notebook and SSMs' lossy whiteboard have been combined in hybrid models.
* 03:27 - The reframe: fast weights are computations, not storage. The authors' core move: that the community has been optimizing capacity when the real bottleneck is how much thinking went into producing the compressed state.
* 06:55 - Sleep as depth-recurrence at eviction time. How looping the network N times over a context chunk before clearing the cache buys reasoning depth, with hippocampal consolidation and kitchen-prep analogies for why offline work pays off.
* 10:23 - The Rule 110 experiment. A walkthrough of the cellular automaton test bed that holds storage requirements constant while varying required computation, and why the result is unusually clean for deep learning.
* 13:50 - Does the result transfer to real tasks? Graph traversal and GSM-Infinite results on Jet-Nemotron and Ouro show the same pattern, with a candid look at how 'reasoning gain' starts to blur with 'retrieval gain' outside synthetic settings.
* 17:18 - The skeptic's checklist. Where the evidence is weaker: tautology concerns on Rule 110, missing comparisons against alternative uses of the same compute budget, and a method that requires careful two-stage training warm-up.
* 20:46 - What changes about how we think about inference. Why the conceptual contribution, splitting inference into compute-rich ingestion and latency-bound answering, may outlive the specific mechanism, and how it connects to related sleep-time compute work.

RECOMMENDED READING

* Universal Transformers [https://arxiv.org/abs/1807.03819]: The canonical depth-recurrence paper the episode references. It loops transformer layers at inference time, which this episode contrasts with loops at ingestion time.
* Sleep-time Compute: Beyond Inference Scaling at Test-time [https://arxiv.org/abs/2504.13171]: The Lin et al. work Bella name-checks as a parallel 'do offline work before queries arrive' proposal with a totally different mechanism.
* Mamba: Linear-Time Sequence Modeling with Selective State Spaces [https://arxiv.org/abs/2312.00752]: Background on the state-space 'whiteboard' that the episode's hybrid models rely on, useful for understanding what the fast weight actually is.
* Deep Equilibrium Models [https://arxiv.org/abs/1909.01377]: Another point of reference for depth-recurrent architectures, helpful for situating the paper's loop-until-converged framing within a broader research lineage.
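
For listeners who have not met Rule 110 before, here is a minimal Python sketch of the automaton itself. It is not the paper's experimental setup. The function names and the fixed zero boundary are assumptions made for illustration. What it shows is the property the takeaways lean on: the input row stays the same size while the number of steps, and so the computation needed to reach the final row, can be raised freely.

```python
RULE = 110  # binary 01101110: bit k is the output for neighborhood value k


def rule110_step(cells):
    """Apply one Rule 110 update to a row of 0/1 cells.

    Cells beyond either edge are treated as 0 (an assumption for this sketch).
    """
    n = len(cells)
    out = []
    for i in range(n):
        left = cells[i - 1] if i > 0 else 0
        center = cells[i]
        right = cells[i + 1] if i < n - 1 else 0
        # Encode (left, center, right) as a 3-bit number and look up its bit in RULE.
        idx = (left << 2) | (center << 1) | right
        out.append((RULE >> idx) & 1)
    return out


def rule110_run(cells, steps):
    """Apply `steps` updates; the row length never changes, only the work done."""
    for _ in range(steps):
        cells = rule110_step(cells)
    return cells


start = [0, 0, 0, 0, 1]
print(rule110_run(start, 1))  # [0, 0, 0, 1, 1]
print(rule110_run(start, 3))  # [0, 1, 1, 0, 1]
```

Both calls start from the same five cells, so the amount of information to store is identical. Only the step count differs.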
