AI Post Transformers

Do Language Models Need Sleep?

1 h 0 min · Yesterday

Description

This episode explores a paper proposing that language models could handle long-context reasoning by periodically pausing, replaying soon-to-be-evicted context offline, and consolidating it into fixed-size fast-weight memory instead of carrying an ever-growing KV cache. It explains the core machinery behind the idea, including state space models and Gated Delta Networks, and clarifies why this is more than prompt summarization or retrieval: the model is rewriting its internal bounded memory during inference. The discussion highlights the paper’s central argument that extra compute may be better spent during these offline “sleep” passes, so later token prediction stays cheap while older information is metabolized into usable latent state. Listeners would find it interesting because it frames long-context scaling as a memory-systems problem, raises concrete questions about whether this consolidation actually improves reasoning, and connects the proposal to broader debates about how future LLMs should trade off memory, compute, and exact recall. Sources: 1. Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference — Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti, 2026 http://arxiv.org/abs/2605.26099 2. Replay in Deep Learning: Current Approaches and Missing Biological Elements — Tyler L. Hayes, Giri P. Krishnan, Maxim Bazhenov, Hava T. Siegelmann, Terrence J. Sejnowski, Christopher Kanan, 2021 https://scholar.google.com/scholar?q=Replay+in+Deep+Learning:+Current+Approaches+and+Missing+Biological+Elements 3. Can sleep protect memories from catastrophic forgetting? — Oscar C. Gonzalez, Yury Sokolov, Giri P. Krishnan, Jean Erik Delanois, Maxim Bazhenov, 2020 https://scholar.google.com/scholar?q=Can+sleep+protect+memories+from+catastrophic+forgetting? 4. Sleep-like unsupervised replay reduces catastrophic forgetting in artificial neural networks — Timothy Tadros, Giri P. 
Krishnan, Ramyaa Ramyaa, Maxim Bazhenov, 2022 https://scholar.google.com/scholar?q=Sleep-like+unsupervised+replay+reduces+catastrophic+forgetting+in+artificial+neural+networks 5. Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference — Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti, 2026 https://scholar.google.com/scholar?q=Do+Language+Models+Need+Sleep?+Offline+Recurrence+for+Improved+Online+Inference 6. Using Fast Weights to Attend to the Recent Past — Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, Catalin Ionescu, 2016 https://scholar.google.com/scholar?q=Using+Fast+Weights+to+Attend+to+the+Recent+Past 7. Linear Transformers Are Secretly Fast Weight Programmers — Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber, 2021 https://scholar.google.com/scholar?q=Linear+Transformers+Are+Secretly+Fast+Weight+Programmers 8. Fast weight programming and linear transformers: from machine learning to neurobiology — Kazuki Irie, Samuel J. Gershman, 2026 https://scholar.google.com/scholar?q=Fast+weight+programming+and+linear+transformers:+from+machine+learning+to+neurobiology 9. TRELLIS: Learning to Compress Key-Value Memory in Attention Models — Mahdi Karami, Ali Behrouz, Praneeth Kacham, Vahab Mirrokni, 2025 https://scholar.google.com/scholar?q=TRELLIS:+Learning+to+Compress+Key-Value+Memory+in+Attention+Models 10. Gated Delta Networks: Improving Mamba2 with Delta Rule — Songlin Yang, Jan Kautz, Ali Hatamizadeh, 2024 https://scholar.google.com/scholar?q=Gated+Delta+Networks:+Improving+Mamba2+with+Delta+Rule 11. Titans: Learning to Memorize at Test Time — Ali Behrouz, Peilin Zhong, Vahab Mirrokni, 2025 https://scholar.google.com/scholar?q=Titans:+Learning+to+Memorize+at+Test+Time 12. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach — Jonas Geiping, Sean McLeish, Neel Jain, et al., 2025 https://scholar.google.com/scholar?q=Scaling+up+Test-Time+Compute+with+Latent+Reasoning:+A+Recurrent+Depth+Approach 13. 
In-context Autoencoder for Context Compression in a Large Language Model — Tao Ge, Jing Hu, Lei Wang, Xun Wang, Si-Qing Chen, Furu Wei, 2023 https://scholar.google.com/scholar?q=In-context+Autoencoder+for+Context+Compression+in+a+Large+Language+Model 14. Cartridges: Lightweight and general-purpose long context representations via self-study — Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, et al., 2025 https://scholar.google.com/scholar?q=Cartridges:+Lightweight+and+general-purpose+long+context+representations+via+self-study 15. Repeat After Me: Transformers are Better than State Space Models at Copying — Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach, 2024 https://scholar.google.com/scholar?q=Repeat+After+Me:+Transformers+are+Better+than+State+Space+Models+at+Copying 16. End-to-End Test-Time Training for Long Context — Arnuv Tandon et al., 2025 https://scholar.google.com/scholar?q=End-to-End+Test-Time+Training+for+Long+Context 17. Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs — Rachit Bansal et al., 2025 https://scholar.google.com/scholar?q=Let's+(not)+just+put+things+in+Context:+Test-Time+Training+for+Long-Context+LLMs 18. Test-Time Training Done Right — Tianyuan Zhang et al., 2025 https://scholar.google.com/scholar?q=Test-Time+Training+Done+Right 19. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning — Yu Fu et al., 2024 https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning 20. Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning — Giulio Corallo et al., 2025 https://scholar.google.com/scholar?q=Beyond+RAG:+Task-Aware+KV+Cache+Compression+for+Comprehensive+Knowledge+Reasoning 21. SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning — Sanjay Kariyappa and G. 
Edward Suh, 2026 https://scholar.google.com/scholar?q=SideQuest:+Model-Driven+KV+Cache+Management+for+Long-Horizon+Agentic+Reasoning 22. Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers — Harsh Kohli et al., 2026 https://scholar.google.com/scholar?q=Loop,+Think,+&+Generalize:+Implicit+Reasoning+in+Recurrent-Depth+Transformers 23. AI Post Transformers: Titans: Learning to Memorize at Test Time — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-20-titans-learning-to-memorize-at-test-time-054662.mp3 24. AI Post Transformers: In-Place Test-Time Training for Transformers — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-in-place-test-time-training-for-transfor-d0b976.mp3 25. AI Post Transformers: Recursive Language Models for Arbitrarily Long Prompts — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-recursive-language-models-for-arbitraril-fbcd1c.mp3 26. AI Post Transformers: Explicit Information Transmission for Context Compression — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-explicit-information-transmission-for-co-24e3c2.mp3 27. AI Post Transformers: KVzip for Query-Agnostic KV Cache Compression — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-29-kvzip-for-query-agnostic-kv-cache-compre-72afe5.mp3 28. AI Post Transformers: Gated Linear Attention for Efficient Long Sequences — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-18-gated-linear-attention-for-efficient-lon-c858ab.mp3 29. AI Post Transformers: MiA-Signature and Global Activation for Long Context — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-13-mia-signature-and-global-activation-for-5ad62f.mp3
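
As a rough illustration of the fixed-size fast-weight memory described above, the sketch below implements a gated delta-rule write in plain Python and then replays a small set of soon-to-be-evicted key-value pairs for several passes. The update form follows the gated delta rule as it is commonly written for Gated Delta Networks. The `consolidate` helper, the toy keys and values, and the pass counts are assumptions made for illustration only and are not the paper's actual sleep procedure.

```python
def gated_delta_step(S, k, v, alpha=1.0, beta=1.0):
    """One write: S <- alpha*S + beta * (v - alpha*S k) k^T (S has shape d_v x d_k)."""
    d_k = len(k)
    # Decay the old memory (alpha is the gate), then read its prediction for key k.
    decayed = [[alpha * s for s in row] for row in S]
    pred = [sum(row[j] * k[j] for j in range(d_k)) for row in decayed]
    # Correct the prediction error along the direction of k (the delta rule).
    return [[row[j] + beta * (v[i] - pred[i]) * k[j] for j in range(d_k)]
            for i, row in enumerate(decayed)]


def read(S, q):
    """Query the memory: o = S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


def consolidate(S, evicted, passes):
    """Hypothetical offline replay: rewrite the evicted pairs into S several times."""
    for _ in range(passes):
        for k, v in evicted:
            S = gated_delta_step(S, k, v)
    return S


def recall_error(S, pairs):
    """Total absolute error when reading every stored key back."""
    return sum(abs(read(S, k)[i] - v[i]) for k, v in pairs for i in range(len(v)))


# Toy data (assumption): two unit-norm, non-orthogonal keys with scalar values.
evicted = [([1.0, 0.0], [1.0]), ([0.6, 0.8], [2.0])]
empty = [[0.0, 0.0]]  # the memory stays 1 x 2 no matter how much is replayed

err_one_pass = recall_error(consolidate(empty, evicted, 1), evicted)
err_many_passes = recall_error(consolidate(empty, evicted, 20), evicted)
print(err_one_pass, err_many_passes)
```

With two overlapping keys, a single pass leaves the first association partly overwritten, while repeated offline passes shrink the recall error and the memory keeps its fixed size. This is only a toy analogy for spending extra compute offline so that later reads stay cheap.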

All episodes

668 episodes

Dragonfly Topology for Scalable AI Networks

This episode explores the 2008 Dragonfly network topology paper and why its ideas suddenly matter again for large-scale AI systems in 2026. It explains how Dragonfly uses high-radix routers and router groups to keep most traffic to a local hop, a single global hop, and another local hop, reducing the number of expensive long-distance optical links compared with flattened butterfly and folded Clos designs. The discussion highlights the paper’s core argument that topology and routing must be co-designed around pin bandwidth, cable cost, power, and congestion, with the authors claiming roughly 20 percent lower cost than flattened butterfly and 52 percent lower cost than folded Clos beyond 16K nodes under their assumptions. Listeners would find it interesting because it connects an old supercomputing interconnect idea to modern TPU fabrics, mixture-of-experts traffic, all-to-all communication, and the growing reality that network design now directly shapes AI system performance. Sources: 1. Dragonfly Topology for Scalable AI Networks https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/34926.pdf 2. Technology-Driven, Highly-Scalable Dragonfly Topology — John Kim, William J. Dally, Steve Scott, Dennis Abts, 2008 https://scholar.google.com/scholar?q=Technology-Driven,+Highly-Scalable+Dragonfly+Topology 3. Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks — John Kim, William J. Dally, Dennis Abts, 2007 https://scholar.google.com/scholar?q=Flattened+Butterfly:+A+Cost-Efficient+Topology+for+High-Radix+Networks 4. Topological Characterization of Hamming and Dragonfly Networks and Its Implications on Routing — Cristobal Camarero, Enrique Vallejo, Ramon Beivide, 2014 https://scholar.google.com/scholar?q=Topological+Characterization+of+Hamming+and+Dragonfly+Networks+and+Its+Implications+on+Routing 5. 
Slim Fly: A Cost Effective Low-Diameter Network Topology — Maciej Besta, Torsten Hoefler, 2014 https://scholar.google.com/scholar?q=Slim+Fly:+A+Cost+Effective+Low-Diameter+Network+Topology 6. Microarchitecture of a High-Radix Router — John Kim, William J. Dally, Brian Towles, Amit K. Gupta, 2005 https://scholar.google.com/scholar?q=Microarchitecture+of+a+High-Radix+Router 7. The BlackWidow High-Radix Clos Network — Steve Scott, Dennis Abts, John Kim, William J. Dally, 2006 https://scholar.google.com/scholar?q=The+BlackWidow+High-Radix+Clos+Network 8. Scalable High-Radix Router Microarchitecture Using a Network Switch Organization — Jung Ho Ahn, Young Hoon Son, John Kim, 2013 https://scholar.google.com/scholar?q=Scalable+High-Radix+Router+Microarchitecture+Using+a+Network+Switch+Organization 9. A Scheme for Fast Parallel Communication — L. G. Valiant, 1982 https://scholar.google.com/scholar?q=A+Scheme+for+Fast+Parallel+Communication 10. Indirect Adaptive Routing on Large Scale Interconnection Networks — Nan Jiang, John Kim, William J. Dally, 2009 https://scholar.google.com/scholar?q=Indirect+Adaptive+Routing+on+Large+Scale+Interconnection+Networks 11. Rationale and Challenges for Optical Interconnects to Electronic Chips — David A. B. Miller, 2000 https://scholar.google.com/scholar?q=Rationale+and+Challenges+for+Optical+Interconnects+to+Electronic+Chips 12. Optical Interconnects for High-Performance Computing — Marc A. Taubenblatt, 2012 https://scholar.google.com/scholar?q=Optical+Interconnects+for+High-Performance+Computing 13. Optical Interconnects for Extreme Scale Computing Systems — Sebastien Rumley, Meisam Bahadori, Robert Polster, Simon D. Hammond, David M. Calhoun, Ke Wen, Arun Rodrigues, Keren Bergman, 2017 https://scholar.google.com/scholar?q=Optical+Interconnects+for+Extreme+Scale+Computing+Systems 14. 
Mission Apollo: Landing Optical Circuit Switching at Datacenter Scale — Ryohei Urata, Hong Liu, Kevin Yasumura, Erji Mao, Jill Berger, Xiang Zhou, Cedric Lam, Roy Bannon, Darren Hutchinson, Daniel Nelson, Leon Poutievski, Arjun Singh, Joon Ong, Amin Vahdat, 2022 https://scholar.google.com/scholar?q=Mission+Apollo:+Landing+Optical+Circuit+Switching+at+Datacenter+Scale 15. Adaptive Routing in High-Radix Clos Network — John Kim, William J. Dally, Dennis Abts, 2006 https://doi.org/10.1145/1188455.1188552 16. Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies — Prithwish Basu, Liangyu Zhao, Jason Fantl, Siddharth Pal, Arvind Krishnamurthy, Joud Khoury, 2024 https://doi.org/10.1145/3625549.3658656 17. Toward lower-diameter large-scale HPC and data center networks with co-packaged optics — Pavlos Maniotis, Laurent Schares, Benjamin G. Lee, Marc A. Taubenblatt, Daniel M. Kuchta, 2021 https://scholar.google.com/scholar?q=Toward+lower-diameter+large-scale+HPC+and+data+center+networks+with+co-packaged+optics 18. Toward higher-radix switches with co-packaged optics for improved network locality in data center and HPC networks [Invited] — Pavlos Maniotis, Laurent Schares, Daniel M. Kuchta, Bengi Karacali, 2022 https://scholar.google.com/scholar?q=Toward+higher-radix+switches+with+co-packaged+optics+for+improved+network+locality+in+data+center+and+HPC+networks+[Invited] 19. Exploring the benefits of using co-packaged optics in data center and AI supercomputer networks: a simulation-based analysis [Invited] — Pavlos Maniotis, Daniel M. Kuchta, 2024 https://scholar.google.com/scholar?q=Exploring+the+benefits+of+using+co-packaged+optics+in+data+center+and+AI+supercomputer+networks:+a+simulation-based+analysis+[Invited] 20. Enhanced UGAL Routing Schemes for Dragonfly Networks — Ram Sharan Chaulagain, Xin Yuan, 2024 https://scholar.google.com/scholar?q=Enhanced+UGAL+Routing+Schemes+for+Dragonfly+Networks 21. 
On Selection Functions in Adaptive Routing — Alejandro Cano, Cristobal Camarero, Carmen Martinez, 2025 https://scholar.google.com/scholar?q=On+Selection+Functions+in+Adaptive+Routing 22. Co-packaged optics (CPO): status, challenges, and solutions — Min Tan and coauthors, 2023 https://scholar.google.com/scholar?q=Co-packaged+optics+(CPO):+status,+challenges,+and+solutions 23. AI Post Transformers: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-computation-bandwidth-memory-trade-offs-a83f2b.mp3 24. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3 25. AI Post Transformers: Serving MoE Models with Disaggregated Expert Parallelism — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-19-serving-moe-models-with-disaggregated-ex-6979d2.mp3 26. AI Post Transformers: Lossless Sparse Deltas for RL Networks — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-lossless-sparse-deltas-for-rl-networks-84d676.mp3
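
The group structure and the local, global, local hop pattern described above can be sketched with a little arithmetic. The block below assumes the usual Dragonfly parameters (`p` terminals per router, `a` routers per group, `h` global channels per router) and a maximum-size network in which every pair of groups shares one global channel. The function names are hypothetical, and the sketch covers sizing and minimal routing only, not the cost model or adaptive routing.

```python
def dragonfly_size(p, a, h):
    """Size of a maximum-size Dragonfly built from the given per-router parameters."""
    groups = a * h + 1  # each group has a*h global channels, one to every other group
    return {
        "router_radix": p + a + h - 1,  # terminals + local peers + global channels
        "groups": groups,
        "routers": a * groups,
        "terminals": a * p * groups,
    }


def minimal_hops(same_router, same_group):
    """Worst-case sequence of channel types on a minimal route."""
    if same_router:
        return []
    if same_group:
        return ["local"]
    # Reach the router that owns the global channel, cross it, then reach the target.
    return ["local", "global", "local"]


small = dragonfly_size(p=2, a=4, h=2)
print(small)
print(minimal_hops(same_router=False, same_group=False))
```

For `p = h = 2` and `a = 4`, this gives radix-7 routers, 9 groups, and 72 terminals, and any minimal route between groups crosses at most one global channel.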

Snap's Microkernel Approach to Host Networking

This episode explores Google’s Snap system, which moves major host-networking functions out of the kernel and into isolated userspace services while trying to keep the performance benefits usually associated with kernel bypass. It examines why that shift mattered operationally at fleet scale: kernel networking changes could take one to two months to deploy, while Snap enabled roughly weekly releases and had already been adopted across more than half of Google’s machines. The discussion breaks down Snap’s architecture, including centralized host services, microkernel-style isolation, lock-free engine communication, the MicroQuanta scheduler design, latency-sensitive congestion control, and Pony Express as a flagship transport for reliable, asynchronous messaging. Listeners would find it interesting because it frames host networking as a platform-design problem, not just a packet-speed problem, and argues that upgradeability, policy control, and performance can be engineered together rather than traded off. Sources: 1. Snap's Microkernel Approach to Host Networking https://storage.googleapis.com/gweb-research2023-media/pubtools/5281.pdf 2. L4 Microkernels: The Lessons from 20 Years of Research and Deployment — Gernot Heiser, Kevin Elphinstone, 2016 https://trustworthy.systems/publications/nicta_full_text/8988.pdf 3. Arrakis: The Operating System is the Control Plane — Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, Timothy Roscoe, 2014 https://www.usenix.org/conference/osdi14/technical-sessions/presentation/peter 4. Snap: a Microkernel Approach to Host Networking — Michael Marty, Marc de Kruijf, Jacob Adriaens, Nandita Dukkipati, Amin Vahdat, et al., 2019 https://research.google/pubs/snap-a-microkernel-approach-to-host-networking/ 5. netmap: A Novel Framework for Fast Packet I/O — Luigi Rizzo, 2012 https://www.usenix.org/conference/atc12/technical-sessions/presentation/rizzo 6. 
mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems — EunYoung Jeong, Shinae Woo, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu Han, KyoungSoo Park, 2014 https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-jeong.pdf 7. IX: A Protected Dataplane Operating System for High Throughput and Low Latency — Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, Edouard Bugnion, 2014 https://csl.stanford.edu/~christos/publications/2014.ix.osdi.pdf 8. VL2: A Scalable and Flexible Data Center Network — Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, Dave Maltz, Parveen Patel, Sudipta Sengupta, 2009 https://www.microsoft.com/en-us/research/publication/vl2-a-scalable-and-flexible-data-center-network/ 9. Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization — Michael Dalton, David Schultz, Jacob Adriaens, Ahsan Arefin, Anshuman Gupta, Amin Vahdat, et al., 2018 https://www.usenix.org/conference/nsdi18/presentation/dalton 10. Carousel: Scalable Traffic Shaping at End-Hosts — Ahmed Saeed, Nandita Dukkipati, Valas Valancius, Terry Lam, Carlo Contavalli, Amin Vahdat, 2017 https://research.google/pubs/carousel-scalable-traffic-shaping-at-end-hosts/ 11. FaRM: Fast Remote Memory — Aleksandar Dragojevic, Dushyanth Narayanan, Orion Hodson, Miguel Castro, 2014 https://www.usenix.org/conference/nsdi14/technical-sessions/dragojevi%C4%87 12. Using RDMA Efficiently for Key-Value Services — Anuj Kalia, Michael Kaminsky, David G. Andersen, 2014 https://www.pdl.cmu.edu/PDL-FTP/Storage/herd-sigcomm2014.pdf 13. Datacenter RPCs can be General and Fast — Anuj Kalia, Michael Kaminsky, David Andersen, 2019 https://www.usenix.org/conference/nsdi19/presentation/kalia 14. 
Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads — Amy Ousterhout, Joshua Fried, Jonathan Behrens, Adam Belay, Hari Balakrishnan, 2019 https://www.usenix.org/conference/nsdi19/presentation/ousterhout 15. Caladan: Mitigating Interference at Microsecond Timescales — Joshua Fried, Zhenyuan Ruan, Amy Ousterhout, Adam Belay, 2020 https://www.usenix.org/conference/osdi20/presentation/fried 16. TAS: TCP Acceleration as an OS Service — Antoine Kaufmann, Tim Stamler, Simon Peter, Naveen Kr. Sharma, Arvind Krishnamurthy, and Thomas Anderson, 2019 https://scholar.google.com/scholar?q=TAS:+TCP+Acceleration+as+an+OS+Service 17. FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs — Anuj Kalia, Michael Kaminsky, and David G. Andersen, 2016 https://scholar.google.com/scholar?q=FaSST:+Fast,+Scalable+and+Simple+Distributed+Transactions+with+Two-Sided+(RDMA)+Datagram+RPCs 18. Implementing Network Protocols at User Level — C. A. Thekkath, T. D. Nguyen, E. Moy, and E. D. Lazowska, 1993 https://scholar.google.com/scholar?q=Implementing+Network+Protocols+at+User+Level 19. NetEdit: An Orchestration Platform for eBPF Network Functions at Scale — Theophilus A. Benson et al., 2024 https://doi.org/10.1145/3651890.3672227 20. Demystifying Performance of eBPF Network Applications — Farbod Shahinfar, Sebastiano Miano, Aurojit Panda, Gianni Antichi, 2025 https://cs.nyu.edu/~apanda/assets/papers/conext25.pdf 21. Unleashing Unprivileged eBPF Potential with Dynamic Sandboxing — Soo Yee Lim, Xueyuan Han, Thomas Pasquier, 2023 https://arxiv.org/abs/2308.01983 22. Efficient Scheduler Live Update for Linux Kernel with Modularization — Teng Ma et al., 2023 https://doi.org/10.1145/3582016.3582054 23. Communication Offloading on SmartNIC DPUs: A Quantitative Approach — Jacob Wahlgren et al., 2026 https://arxiv.org/abs/2605.04842

KVzap: Fast, Adaptive, Faithful KV Cache Pruning

This episode explores KVzap, a method for pruning transformer KV caches by learning a cheap surrogate for a much stronger oracle, with the goal of making cache eviction practical during both prompt prefilling and token-by-token decoding. It explains why KV caches dominate long-context inference costs, clarifies the difference between prefilling and decoding, and lays out why serving systems have favored quantization and paging over content-aware token deletion: removing the wrong token can quietly break later answers. The discussion places KVzap alongside KVzip, Expected Attention, and DMS, arguing that its key advance is a learned per-layer, per-head importance predictor trained to imitate a richer KVzip+ teacher that measures not just attention but actual contribution to the residual stream. Listeners would find it interesting because it ties together systems bottlenecks, adaptive eviction policies such as delayed eviction and sliding windows, and concrete training choices into a broader case for faster, more faithful long-context inference. Sources: 1. KVzap: Fast, Adaptive, Faithful KV Cache Pruning https://arxiv.org/pdf/2601.07891 2. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models — Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Beidi Chen, et al., 2023 https://scholar.google.com/scholar?q=H2O:+Heavy-Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models 3. SnapKV: LLM Knows What You are Looking for Before Generation — Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Patrick Lewis, et al., 2024 https://scholar.google.com/scholar?q=SnapKV:+LLM+Knows+What+You+are+Looking+for+Before+Generation 4. Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution — Alessio Devoto, Maximilian Jeblick, Simon Jegou, 2025 https://scholar.google.com/scholar?q=Expected+Attention:+KV+Cache+Compression+by+Estimating+Attention+from+Future+Queries+Distribution 5. 
KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction — Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song, 2025 https://scholar.google.com/scholar?q=KVzip:+Query-Agnostic+KV+Cache+Compression+with+Context+Reconstruction 6. Inference-Time Hyper-Scaling with KV Cache Compression — Adrian Lancucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti, 2025 https://scholar.google.com/scholar?q=Inference-Time+Hyper-Scaling+with+KV+Cache+Compression 7. Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores — Vivek Chari, Benjamin Van Durme, 2025 https://scholar.google.com/scholar?q=Compactor:+Calibrated+Query-Agnostic+KV+Cache+Compression+with+Approximate+Leverage+Scores 8. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads — Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han, 2024 https://scholar.google.com/scholar?q=DuoAttention:+Efficient+Long-Context+LLM+Inference+with+Retrieval+and+Streaming+Heads 9. Retrieval Head Mechanistically Explains Long-Context Factuality — Wenhao Wu et al., 2024 https://scholar.google.com/scholar?q=Retrieval+Head+Mechanistically+Explains+Long-Context+Factuality 10. Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking — Wuwei Zhang et al., 2025 https://scholar.google.com/scholar?q=Query-Focused+Retrieval+Heads+Improve+Long-Context+Reasoning+and+Re-ranking 11. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention — Huiqiang Jiang et al., 2024 https://scholar.google.com/scholar?q=MInference+1.0:+Accelerating+Pre-filling+for+Long-Context+LLMs+via+Dynamic+Sparse+Attention 12. KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head — Isaac Rehg, 2024 https://scholar.google.com/scholar?q=KV-Compress:+Paged+KV-Cache+Compression+with+Variable+Compression+Rates+per+Attention+Head 13. 
PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference — Krishna Teja Chitty-Venkata et al., 2025 https://scholar.google.com/scholar?q=PagedEviction:+Structured+Block-wise+KV+Cache+Pruning+for+Efficient+Large+Language+Model+Inference 14. KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows — Zaifeng Pan et al., 2025 https://scholar.google.com/scholar?q=KVFlow:+Efficient+Prefix+Caching+for+Accelerating+LLM-Based+Multi-Agent+Workflows 15. AI Post Transformers: Affordable Large-Scale Decoding Through Model-System Co-Design — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.mp3 16. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3 17. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3 18. AI Post Transformers: DeepSeek-V4 and Practical Million-Token Context — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-25-deepseek-v4-and-practical-million-token-6f4de1.mp3 19. AI Post Transformers: How Induction Heads Emerge in Transformers — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-03-how-induction-heads-emerge-in-transforme-a7bfcb.mp3 20. AI Post Transformers: When Many-Shot CoT Becomes Test-Time Learning — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-when-many-shot-cot-becomes-test-time-lea-c25bfe.mp3
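
The description above mentions a learned importance predictor combined with a sliding window over recent tokens. A minimal sketch of how those two pieces might fit together follows. It assumes the per-token scores have already been produced by some surrogate model, and it is not KVzap's actual implementation: the function name `keep_mask`, the fixed `threshold`, and the example scores are all hypothetical.

```python
def keep_mask(scores, threshold, window):
    """Decide which cached positions to keep for one layer/head.

    scores    -- hypothetical importance scores, one per cached token
    threshold -- assumed cutoff; tokens scoring below it are evicted
    window    -- number of most recent tokens that are always kept
    """
    n = len(scores)
    # A token survives if it is inside the recent window or scores high enough.
    return [i >= n - window or s >= threshold for i, s in enumerate(scores)]


scores = [0.9, 0.1, 0.5, 0.05, 0.2]
mask = keep_mask(scores, threshold=0.4, window=2)
print(mask)  # [True, False, True, True, True]
```

Because the cutoff is a score rather than a fixed count, the number of tokens kept varies with the input, which is one plausible reading of "adaptive" here.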

May 30, 2026 · 1 h 0 min

KVzip for Query-Agnostic KV Cache Compression

This episode explores KVzip, a query-agnostic method for compressing long-context KV caches so a model can reuse a shared document, codebase, or memory bank across many later questions without optimizing for just one query. It explains why KV cache has become a major systems bottleneck, including the striking example that a 120,000-token context for Qwen2.5-14B can require more memory for cache than for the model weights themselves. The discussion contrasts KVzip with exact prefix caching and query-aware pruning methods like SnapKV, then breaks down KVzip’s core idea: replay the original context, measure which cached states receive the most attention during reconstruction, and keep those as durable memory. Listeners would find it interesting because the paper ties a clean systems insight to concrete gains, reporting roughly 3x to 4x smaller decoding-time KV caches and about 2x lower FlashAttention latency across LLaMA, Qwen, and Gemma models on very long contexts. Sources: 1. KVzip for Query-Agnostic KV Cache Compression https://arxiv.org/pdf/2505.23416 2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, 2018 https://scholar.google.com/scholar?q=BERT:+Pre-training+of+Deep+Bidirectional+Transformers+for+Language+Understanding 3. SnapKV: LLM Knows What You are Looking for Before Generation — Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen, 2024 https://scholar.google.com/scholar?q=SnapKV:+LLM+Knows+What+You+are+Looking+for+Before+Generation 4. KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction — Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song, 2025 https://scholar.google.com/scholar?q=KVzip:+Query-Agnostic+KV+Cache+Compression+with+Context+Reconstruction 5.
Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving — Wei Gao, Xinyu Zhou, Peng Sun, Tianwei Zhang, Yonggang Wen, 2025 https://scholar.google.com/scholar?q=Rethinking+Key-Value+Cache+Compression+Techniques+for+Large+Language+Model+Serving 6. SCBench: A KV Cache-Centric Analysis of Long-Context Methods — Yudong Li, Hongkang Jiang, Qihui Wu, Xintong Luo, Sohee Ahn, Chen Zhang, and others, 2025 https://scholar.google.com/scholar?q=SCBench:+A+KV+Cache-Centric+Analysis+of+Long-Context+Methods 7. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads — Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han, 2025 https://scholar.google.com/scholar?q=DuoAttention:+Efficient+Long-Context+LLM+Inference+with+Retrieval+and+Streaming+Heads 8. Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores — Vivek Chari, Benjamin Van Durme, 2025 https://scholar.google.com/scholar?q=Compactor:+Calibrated+Query-Agnostic+KV+Cache+Compression+with+Approximate+Leverage+Scores 9. No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization — June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee, 2024 https://scholar.google.com/scholar?q=No+Token+Left+Behind:+Reliable+KV+Cache+Compression+via+Importance-Aware+Mixed+Precision+Quantization 10. Safety Alignment Should Be Made More Than Just a Few Tokens Deep — Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson, 2025 https://scholar.google.com/scholar?q=Safety+Alignment+Should+Be+Made+More+Than+Just+a+Few+Tokens+Deep 11. The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference — Kaleem Ullah Qasim et al., 2026 https://arxiv.org/abs/2603.19664 12. 
DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity — Jitai Hao et al., 2026 https://arxiv.org/abs/2602.08005 13. ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs — Yanlin Qi et al., 2026 https://arxiv.org/abs/2602.07721 14. HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference — Zhiyuan Shi et al., 2026 https://arxiv.org/abs/2601.13684 15. R-KV: Redundancy-aware KV Cache Compression for Reasoning Models — Zefan Cai et al., 2025 https://arxiv.org/abs/2505.24133 16. Hold Onto That Thought: Assessing KV Cache Compression On Reasoning — Minghui Liu et al., 2025 https://arxiv.org/abs/2512.12008 17. SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning — Sanjay Kariyappa and G. Edward Suh, 2026 https://arxiv.org/abs/2602.22603 18. AI Post Transformers: PackKV Lossy Compression for KV Caches — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-04-packkv-lossy-compression-for-kv-caches-b37bce.mp3 19. AI Post Transformers: Affordable Large-Scale Decoding Through Model-System Co-Design — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.mp3 20. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3 21. AI Post Transformers: DeepSeek-V4 and Practical Million-Token Context — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-25-deepseek-v4-and-practical-million-token-6f4de1.mp3 22. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3
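
The replay-and-score idea in the description above can be illustrated with a small sketch. It assumes the attention weights from the reconstruction pass are already available as a list of rows (one row per replay query, one column per cached position), and it aggregates them by taking the maximum per position, which is an assumption made for illustration. The function names and the toy numbers are hypothetical and not taken from the paper's code.

```python
def kv_importance(attn_rows):
    """Score each cached position by the largest attention weight it
    receives from any query during the (assumed) reconstruction pass."""
    n_keys = len(attn_rows[0])
    return [max(row[j] for row in attn_rows) for j in range(n_keys)]


def select_kept(scores, keep_ratio):
    """Return the indices of the highest-scoring positions, in original order."""
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy attention weights: 3 replay queries over 4 cached positions.
attn = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.1, 0.6, 0.1],
    [0.3, 0.2, 0.4, 0.1],
]
scores = kv_importance(attn)
print(select_kept(scores, keep_ratio=0.5))  # [0, 2]
```

Because the scores come from reconstructing the context itself rather than from any particular question, the kept positions can be reused across later queries, which is the query-agnostic property the episode describes.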

May 29, 2026 · 1 h 0 min