Do Language Models Need Sleep?

Description

This episode explores a paper proposing that language models could handle long-context reasoning by periodically pausing, replaying soon-to-be-evicted context offline, and consolidating it into fixed-size fast-weight memory instead of carrying an ever-growing KV cache. It explains the core machinery behind the idea, including state space models and Gated Delta Networks, and clarifies why this is more than prompt summarization or retrieval: the model is rewriting its internal bounded memory during inference. The discussion highlights the paper’s central argument that extra compute may be better spent during these offline “sleep” passes, so later token prediction stays cheap while older information is metabolized into usable latent state. Listeners would find it interesting because it frames long-context scaling as a memory-systems problem, raises concrete questions about whether this consolidation actually improves reasoning, and connects the proposal to broader debates about how future LLMs should trade off memory, compute, and exact recall.

Sources:
1. Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference - Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti, 2026 http://arxiv.org/abs/2605.26099
2. Replay in Deep Learning: Current Approaches and Missing Biological Elements - Tyler L. Hayes, Giri P. Krishnan, Maxim Bazhenov, Hava T. Siegelmann, Terrence J. Sejnowski, Christopher Kanan, 2021 https://scholar.google.com/scholar?q=Replay+in+Deep+Learning:+Current+Approaches+and+Missing+Biological+Elements
3. Can sleep protect memories from catastrophic forgetting? - Oscar C. Gonzalez, Yury Sokolov, Giri P. Krishnan, Jean Erik Delanois, Maxim Bazhenov, 2020 https://scholar.google.com/scholar?q=Can+sleep+protect+memories+from+catastrophic+forgetting?
4. Sleep-like unsupervised replay reduces catastrophic forgetting in artificial neural networks - Timothy Tadros, Giri P. Krishnan, Ramyaa Ramyaa, Maxim Bazhenov, 2022 https://scholar.google.com/scholar?q=Sleep-like+unsupervised+replay+reduces+catastrophic+forgetting+in+artificial+neural+networks
5. Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference - Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti, 2026 https://scholar.google.com/scholar?q=Do+Language+Models+Need+Sleep?+Offline+Recurrence+for+Improved+Online+Inference
6. Using Fast Weights to Attend to the Recent Past - Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, Catalin Ionescu, 2016 https://scholar.google.com/scholar?q=Using+Fast+Weights+to+Attend+to+the+Recent+Past
7. Linear Transformers Are Secretly Fast Weight Programmers - Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber, 2021 https://scholar.google.com/scholar?q=Linear+Transformers+Are+Secretly+Fast+Weight+Programmers
8. Fast weight programming and linear transformers: from machine learning to neurobiology - Kazuki Irie, Samuel J. Gershman, 2026 https://scholar.google.com/scholar?q=Fast+weight+programming+and+linear+transformers:+from+machine+learning+to+neurobiology
9. TRELLIS: Learning to Compress Key-Value Memory in Attention Models - Mahdi Karami, Ali Behrouz, Praneeth Kacham, Vahab Mirrokni, 2025 https://scholar.google.com/scholar?q=TRELLIS:+Learning+to+Compress+Key-Value+Memory+in+Attention+Models
10. Gated Delta Networks: Improving Mamba2 with Delta Rule - Songlin Yang, Jan Kautz, Ali Hatamizadeh, 2024 https://scholar.google.com/scholar?q=Gated+Delta+Networks:+Improving+Mamba2+with+Delta+Rule
11. Titans: Learning to Memorize at Test Time - Ali Behrouz, Peilin Zhong, Vahab Mirrokni, 2025 https://scholar.google.com/scholar?q=Titans:+Learning+to+Memorize+at+Test+Time
12. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach - Jonas Geiping, Sean McLeish, Neel Jain, et al., 2025 https://scholar.google.com/scholar?q=Scaling+up+Test-Time+Compute+with+Latent+Reasoning:+A+Recurrent+Depth+Approach
13. In-context Autoencoder for Context Compression in a Large Language Model - Tao Ge, Jing Hu, Lei Wang, Xun Wang, Si-Qing Chen, Furu Wei, 2023 https://scholar.google.com/scholar?q=In-context+Autoencoder+for+Context+Compression+in+a+Large+Language+Model
14. Cartridges: Lightweight and general-purpose long context representations via self-study - Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, et al., 2025 https://scholar.google.com/scholar?q=Cartridges:+Lightweight+and+general-purpose+long+context+representations+via+self-study
15. Repeat After Me: Transformers are Better than State Space Models at Copying - Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach, 2024 https://scholar.google.com/scholar?q=Repeat+After+Me:+Transformers+are+Better+than+State+Space+Models+at+Copying
16. End-to-End Test-Time Training for Long Context - Arnuv Tandon et al., 2025 https://scholar.google.com/scholar?q=End-to-End+Test-Time+Training+for+Long+Context
17. Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs - Rachit Bansal et al., 2025 https://scholar.google.com/scholar?q=Let's+(not)+just+put+things+in+Context:+Test-Time+Training+for+Long-Context+LLMs
18. Test-Time Training Done Right - Tianyuan Zhang et al., 2025 https://scholar.google.com/scholar?q=Test-Time+Training+Done+Right
19. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning - Yu Fu et al., 2024 https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning
20. Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning - Giulio Corallo et al., 2025 https://scholar.google.com/scholar?q=Beyond+RAG:+Task-Aware+KV+Cache+Compression+for+Comprehensive+Knowledge+Reasoning
21. SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning - Sanjay Kariyappa and G. Edward Suh, 2026 https://scholar.google.com/scholar?q=SideQuest:+Model-Driven+KV+Cache+Management+for+Long-Horizon+Agentic+Reasoning
22. Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers - Harsh Kohli et al., 2026 https://scholar.google.com/scholar?q=Loop,+Think,+&+Generalize:+Implicit+Reasoning+in+Recurrent-Depth+Transformers
23. AI Post Transformers: Titans: Learning to Memorize at Test Time - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-20-titans-learning-to-memorize-at-test-time-054662.mp3
24. AI Post Transformers: In-Place Test-Time Training for Transformers - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-in-place-test-time-training-for-transfor-d0b976.mp3
25. AI Post Transformers: Recursive Language Models for Arbitrarily Long Prompts - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-recursive-language-models-for-arbitraril-fbcd1c.mp3
26. AI Post Transformers: Explicit Information Transmission for Context Compression - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-explicit-information-transmission-for-co-24e3c2.mp3
27. AI Post Transformers: KVzip for Query-Agnostic KV Cache Compression - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-29-kvzip-for-query-agnostic-kv-cache-compre-72afe5.mp3
28. AI Post Transformers: Gated Linear Attention for Efficient Long Sequences - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-18-gated-linear-attention-for-efficient-lon-c858ab.mp3
29. AI Post Transformers: MiA-Signature and Global Activation for Long Context - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-13-mia-signature-and-global-activation-for-5ad62f.mp3
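
To make the "fixed-size fast-weight memory" idea from the description more concrete, here is a toy sketch in plain Python. It uses the gated delta rule in the form usually written for Gated Delta Networks, S <- alpha * S (I - beta k k^T) + beta v k^T. The `consolidate` loop, the `recall_error` helper, and the example keys and values are hypothetical names and data made up for illustration; this is not the paper's procedure, only a small picture of why replaying evicted items offline might help a bounded memory.

```python
def read(S, q):
    """Query the memory: returns S @ q for a d_v x d_k matrix S."""
    return [sum(s * x for s, x in zip(row, q)) for row in S]


def gated_delta_step(S, k, v, alpha=1.0, beta=1.0):
    """One gated delta-rule write: S <- alpha * S (I - beta k k^T) + beta v k^T."""
    pred = read(S, k)  # what the memory currently returns for key k
    return [
        [alpha * (S[i][j] - beta * pred[i] * k[j]) + beta * v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]


def consolidate(S, evicted, passes):
    """Hypothetical 'sleep' loop: replay evicted (key, value) pairs several times."""
    for _ in range(passes):
        for k, v in evicted:
            S = gated_delta_step(S, k, v)
    return S


def recall_error(S, pairs):
    """Sum of absolute differences between recalled and stored values."""
    return sum(abs(r - t) for k, v in pairs for r, t in zip(read(S, k), v))


# Two overlapping (non-orthogonal) unit-norm keys, one scalar value each.
evicted = [([1.0, 0.0], [1.0]), ([0.6, 0.8], [2.0])]
empty = [[0.0, 0.0]]  # the memory stays 1 x 2 no matter how much is replayed

one_pass = consolidate(empty, evicted, passes=1)
five_passes = consolidate(empty, evicted, passes=5)
print(recall_error(one_pass, evicted), recall_error(five_passes, evicted))
```

Because the two keys overlap, a single pass overwrites part of the first association when it stores the second. Replaying the same pairs a few more times shrinks that recall error while the memory matrix keeps its original size, which is the trade the episode describes: more compute offline, no extra state online.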

SmolLM2 and the Power of Better Data

This episode explores SmolLM2, a 1.7 billion parameter language model from Hugging Face that tries to compete with stronger small models not by changing the transformer architecture, but by radically improving the training data mix and sequencing across roughly 11 trillion tokens. It explains the distinction between pretraining and instruction tuning, then argues that for compact models, dataset quality and curriculum can function almost like part of the architecture itself. The discussion connects SmolLM2 to earlier work such as Chinchilla, TinyStories, Textbooks Are All You Need, FineWeb-Edu, and DataComp-LM to show why educational web text, curated math and code data, and staged rebalancing matter so much when model capacity is tight. Listeners would find it interesting because it frames a practical question with real deployment stakes: whether careful data design can make smaller, cheaper, lower-latency models genuinely useful without relying on giant-scale compute.

Sources:
1. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model - Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf, 2025 http://arxiv.org/abs/2502.02737
2. Training Compute-Optimal Large Language Models - Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Nalisnick, Daniel Yamins, Timothy Lillicrap, Oriol Vinyals, Jeff Dean, et al., 2022 https://scholar.google.com/scholar?q=Training+Compute-Optimal+Large+Language+Models
3. TinyStories: How Small Can Language Models Be and Still Speak Coherent English? - Ronen Eldan, Yuanzhi Li, 2023 https://scholar.google.com/scholar?q=TinyStories:+How+Small+Can+Language+Models+Be+and+Still+Speak+Coherent+English?
4. Textbooks Are All You Need - Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio C. T. Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sebastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, Yuanzhi Li, 2023 https://scholar.google.com/scholar?q=Textbooks+Are+All+You+Need
5. MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases - Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra, 2024 https://scholar.google.com/scholar?q=MobileLLM:+Optimizing+Sub-billion+Parameter+Language+Models+for+On-Device+Use+Cases
6. Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research - Luca Soldaini, Rodney Kinney, Dustin Schwenk, Siddharth Goyal, Alessandro Sordoni, Kyle Lo, Noah A. Smith, and collaborators, 2024 https://scholar.google.com/scholar?q=Dolma:+an+Open+Corpus+of+Three+Trillion+Tokens+for+Language+Model+Pretraining+Research
7. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale - Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, Thomas Wolf, 2024 https://scholar.google.com/scholar?q=The+FineWeb+Datasets:+Decanting+the+Web+for+the+Finest+Text+Data+at+Scale
8. Data-Centric AI in the Age of Large Language Models - Xinyi Xu, Zhaoxuan Wu, Rui Qiao, Arun Verma, Yao Shu, Jingtan Wang, Xinyuan Niu, Zhenfeng He, Jiangwei Chen, Zijian Zhou, Gregory Kang Ruey Lau, Hieu Dao, Lucas Agussurja, Rachael Hwee Ling Sim, Xiaoqiang Lin, Wenyang Hu, Zhongxiang Dai, Pang Wei Koh, Bryan Kian Hsiang Low, 2024 https://scholar.google.com/scholar?q=Data-Centric+AI+in+the+Age+of+Large+Language+Models
9. The Stack: 3 TB of permissively licensed source code - Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries, 2022 https://scholar.google.com/scholar?q=The+Stack:+3+TB+of+permissively+licensed+source+code
10. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations - Ning Ding, Yulin Chen, Bokai Xu, et al., 2023 https://scholar.google.com/scholar?q=Enhancing+Chat+Language+Models+by+Scaling+High-quality+Instructional+Conversations
11. OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data - Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, Igor Gitman, 2024 https://scholar.google.com/scholar?q=OpenMathInstruct-2:+Accelerating+AI+for+Math+with+Massive+Open-Source+Instruction+Data
12. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model - Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf, 2025 https://scholar.google.com/scholar?q=SmolLM2:+When+Smol+Goes+Big+--+Data-Centric+Training+of+a+Small+Language+Model
13. DataComp-LM: In search of the next generation of training sets for language models - Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, and many others, 2024 https://scholar.google.com/scholar?q=DataComp-LM:+In+search+of+the+next+generation+of+training+sets+for+language+models
14. OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text - Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, Jimmy Ba, 2023 https://scholar.google.com/scholar?q=OpenWebMath:+An+Open+Dataset+of+High-Quality+Mathematical+Web+Text
15. InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning - Xiaotian Han, Yiren Jian, Xuefeng Hu, Haogeng Liu, Yiqi Wang, Qihang Fan, Yuang Ai, Huaibo Huang, Ran He, Zhenheng Yang, Quanzeng You, 2024 https://scholar.google.com/scholar?q=InfiMM-WebMath-40B:+Advancing+Multimodal+Pre-Training+for+Enhanced+Mathematical+Reasoning
16. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models - Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo, 2024 https://scholar.google.com/scholar?q=DeepSeekMath:+Pushing+the+Limits+of+Mathematical+Reasoning+in+Open+Language+Models
17. 2 OLMo 2 Furious - Kyle Lo and the OLMo team, 2025 https://scholar.google.com/scholar?q=2+OLMo+2+Furious
18. Revisiting Scaling Laws for Language Models: The Role of Data Quality and Training Strategies - Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, Jingang Wang, 2025 https://scholar.google.com/scholar?q=Revisiting+Scaling+Laws+for+Language+Models:+The+Role+of+Data+Quality+and+Training+Strategies
19. GRAPE: Optimize Data Mixture for Group Robust Multi-target Adaptive Pretraining - Simin Fan, Maria Ios Glarou, Martin Jaggi, 2025 https://scholar.google.com/scholar?q=GRAPE:+Optimize+Data+Mixture+for+Group+Robust+Multi-target+Adaptive+Pretraining
20. Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models - Lior Belenki, Alekh Agarwal, Tianze Shi, Kristina Toutanova, 2025 https://scholar.google.com/scholar?q=Optimizing+Pre-Training+Data+Mixtures+with+Mixtures+of+Data+Expert+Models
21. Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies - Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong, 2024 https://scholar.google.com/scholar?q=Scaling+Laws+with+Vocabulary:+Larger+Models+Deserve+Larger+Vocabularies
22. Distilling Reasoning Capabilities into Smaller Language Models - Kumar Shridhar, Alessandro Stolfo, Mrinmaya Sachan, 2023 https://scholar.google.com/scholar?q=Distilling+Reasoning+Capabilities+into+Smaller+Language+Models
23. Teaching Small Language Models Reasoning through Counterfactual Distillation - Tao Feng, Yicheng Li, Chenglin Li, Hao Chen, Fei Yu, Yin Zhang, 2024 https://scholar.google.com/scholar?q=Teaching+Small+Language+Models+Reasoning+through+Counterfactual+Distillation
24. AI Post Transformers: Self-Improving Pretraining With Post-Trained Models - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-02-self-improving-pretraining-with-post-tra-e37460.mp3
25. AI Post Transformers: Scaling Laws for Multilingual Code Pretraining - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-scaling-laws-for-multilingual-code-pretr-7d220e.mp3
26. AI Post Transformers: Can Models Learn from Long Context? - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-can-models-learn-from-long-context-77533e.mp3
27. AI Post Transformers: ASI-Evolve for Data, Architectures, and RL - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-05-asi-evolve-for-data-architectures-and-rl-197b2b.mp3
28. AI Post Transformers: Muon Is Scalable for LLM Training - Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-25-muon-is-scalable-for-llm-training-587ed8.mp3
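
For a sense of scale behind the Chinchilla comparison in the description, the short calculation below puts the two figures quoted there (1.7 billion parameters, roughly 11 trillion tokens) next to the commonly cited reading of Chinchilla as about 20 training tokens per parameter. That 20x figure is a rule of thumb assumed here for illustration, not a number taken from the episode or from the SmolLM2 paper.

```python
params = 1.7e9              # SmolLM2 parameter count, from the description
train_tokens = 11e12        # "roughly 11 trillion tokens", from the description
TOKENS_PER_PARAM_RULE = 20  # assumption: common rule-of-thumb reading of Chinchilla

tokens_per_param = train_tokens / params
rule_of_thumb_tokens = TOKENS_PER_PARAM_RULE * params
overtraining_factor = train_tokens / rule_of_thumb_tokens

print(f"{tokens_per_param:,.0f} tokens per parameter")
print(f"{rule_of_thumb_tokens:.2e} tokens under the 20x rule of thumb")
print(f"{overtraining_factor:.0f}x that budget")
```

Under that assumption the model sees on the order of 6,500 tokens per parameter, a few hundred times the rule-of-thumb budget, which is one way to read the episode's point that at this size the data mix and its ordering carry much of the weight.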

Yesterday, 1 h 0 min
