Post-Trained MoE Skips Half Its Experts

Description

This episode explores a post-training method for making mixture-of-experts language models cheaper at inference time without retraining them from scratch. It explains how the paper converts a fully trained static MoE into a dynamic one by adding parameter-free zero experts, allowing some tokens to skip normal experts, and then uses self-distillation to preserve the original model’s behavior under this lower-compute routing scheme. The discussion highlights why this deployment-focused approach matters for real production systems, especially when pretraining, fine-tuning, and alignment are already complete and inference cost is the main bottleneck. Listeners will find it interesting for its clear breakdown of dynamic versus static MoE compute, its practical framing around latency and serving costs, and its focus on whether large post-trained models can cut expert FLOPs substantially without losing capability.

Sources:
1. Post-Trained MoE Can Skip Half Experts via Self-Distillation — Xingtai Lv, Li Sheng, Kaiyan Zhang, Yichen You, Siyan Gao, Xueheng Luo, Yuxin Zuo, Yuchen Fan, Junlin Yang, Ganqu Cui, Bingning Wang, Fan Yang, Youbang Sun, Ning Ding, Bowen Zhou, 2026 http://arxiv.org/abs/2605.18643
2. MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts — Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan, 2024 https://scholar.google.com/scholar?q=MoE++:+Accelerating+Mixture-of-Experts+Methods+with+Zero-Computation+Experts
3. Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models — Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, Hongsheng Li, 2024 https://scholar.google.com/scholar?q=Not+All+Experts+are+Equal:+Efficient+Expert+Pruning+and+Skipping+for+Mixture-of-Experts+Large+Language+Models
4. Task-Specific Expert Pruning for Sparse Mixture-of-Experts — Tianyu Chen, Shaohan Huang, Yuan Xie, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, Furu Wei, 2022 https://scholar.google.com/scholar?q=Task-Specific+Expert+Pruning+for+Sparse+Mixture-of-Experts
5. Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts — DeepSeek-AI et al., 2024 https://scholar.google.com/scholar?q=Auxiliary-Loss-Free+Load+Balancing+Strategy+for+Mixture-of-Experts
6. ST-MoE: Designing Stable and Transferable Sparse Expert Models — Barret Zoph, Noam Shazeer, William Fedus, et al., 2022 https://scholar.google.com/scholar?q=ST-MoE:+Designing+Stable+and+Transferable+Sparse+Expert+Models
7. AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models — Zihao Zeng, Yibo Miao, Hongcheng Gao, Hao Zhang, Zhijie Deng, 2024 https://scholar.google.com/scholar?q=AdaMoE:+Token-Adaptive+Routing+with+Null+Experts+for+Mixture-of-Experts+Language+Models
8. Harder Task Needs More Experts: Dynamic Routing in MoE Models — Quzhe Huang, Zhenwei An, Nan Zhuang, Mingxu Tao, Chen Zhang, Yang Jin, Kun Xu, Liwei Chen, Songfang Huang, Yansong Feng, 2024 https://scholar.google.com/scholar?q=Harder+Task+Needs+More+Experts:+Dynamic+Routing+in+MoE+Models
9. MoE Pathfinder: Trajectory-driven Expert Pruning — Xican Yang, Yuanhe Tian, Yan Song, 2025 https://scholar.google.com/scholar?q=MoE+Pathfinder:+Trajectory-driven+Expert+Pruning
10. Discovering Important Experts for Mixture-of-Experts Models Pruning Through a Theoretical Perspective — approximate only; title verified, authors not confidently recovered, 2025/2026 https://scholar.google.com/scholar?q=Discovering+Important+Experts+for+Mixture-of-Experts+Models+Pruning+Through+a+Theoretical+Perspective
11. MoEEdit: Efficient and Routing-Stable Knowledge Editing for Mixture-of-Experts LLMs — Yupu Gu, Rongzhe Wei, Andy Zhu, Pan Li, 2026 https://scholar.google.com/scholar?q=MoEEdit:+Efficient+and+Routing-Stable+Knowledge+Editing+for+Mixture-of-Experts+LLMs
12. ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning — Chao Jin, Xinming Wei, Yinmin Zhong, Chengxu Yang, Bingyang Wu, Ruidong Zhu, Zili Zhang, Yuliang Liu, Xin Jin, 2026 https://scholar.google.com/scholar?q=ReLibra:+Routing-Replay-Guided+Load+Balancing+for+MoE+Training+in+Reinforcement+Learning
13. Sparse MoE Students for Efficient Knowledge Distillation — approximate only; exact author list not confidently recovered, 2025 https://scholar.google.com/scholar?q=Sparse+MoE+Students+for+Efficient+Knowledge+Distillation
14. AI Post Transformers: Batch-Aware Expert Routing for Faster MoE Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-batch-aware-expert-routing-for-faster-mo-683ab6.mp3
15. AI Post Transformers: Serving MoE Models with Disaggregated Expert Parallelism — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-19-serving-moe-models-with-disaggregated-ex-6979d2.mp3
16. AI Post Transformers: Ministral 3: Cascade Distillation for Long-Context Multimodal Models — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-cascade-distillation-for-long-context-mu-0ebd1a.mp3
17. AI Post Transformers: Nemotron 3 Super Hybrid Mamba-Transformer MoE — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-19-nemotron-3-super-hybrid-mamba-transforme-31ac75.mp3
18. AI Post Transformers: LPU Chip for Low-Latency LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-20-lpu-chip-for-low-latency-llm-inference-be13c3.mp3
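
The routing idea in the description above can be made concrete with a toy sketch. This is not the paper's implementation: the function names, the unnormalized top-k weights, and the use of a KL term as the self-distillation loss are all assumptions chosen for illustration. It only shows how a router slot that points at a parameter-free zero expert lets a token skip real expert compute, and how the original model's outputs could serve as the distillation target.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def route(router_logits, k):
    """Return (index, weight) for the k highest-scoring router slots."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]

def moe_layer(x, experts, num_zero, router_logits, k):
    """Toy MoE layer: slots past len(experts) are zero experts (assumed design).

    Returns the mixed output and the number of real expert calls made.
    """
    assert len(router_logits) == len(experts) + num_zero
    out = [0.0] * len(x)
    calls = 0
    for idx, weight in route(router_logits, k):
        if idx >= len(experts):
            continue  # zero expert: contributes nothing, costs no expert FLOPs
        y = experts[idx](x)
        calls += 1
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, calls

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student), one common choice of distillation loss (assumed)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

if __name__ == "__main__":
    # Four hypothetical real experts (simple scalings) plus two zero experts.
    experts = [lambda x, s=s: [s * xi for xi in x] for s in (1.0, 2.0, 3.0, 4.0)]
    for logits in ([2.0, 1.0, 0.0, 0.0, -1.0, -1.0],   # both slots real
                   [2.0, 0.0, 0.0, 0.0, 1.5, -1.0],    # one real, one zero
                   [0.0, 0.0, 0.0, 0.0, 3.0, 2.5]):    # both slots zero
        _, calls = moe_layer([1.0, 1.0], experts, 2, logits, k=2)
        print("real expert calls:", calls)
```

With top-2 routing, the three example tokens make 2, 1, and 0 real expert calls, which is the sense in which compute becomes dynamic per token. In a self-distillation setup of this kind, the teacher logits would come from the original static model and the student logits from the same model with zero experts enabled.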

SmolLM2 and the Power of Better Data

This episode explores SmolLM2, a 1.7-billion-parameter language model from Hugging Face that tries to compete with stronger small models not by changing the transformer architecture, but by radically improving the training data mix and sequencing across roughly 11 trillion tokens. It explains the distinction between pretraining and instruction tuning, then argues that for compact models, dataset quality and curriculum can function almost like part of the architecture itself. The discussion connects SmolLM2 to earlier work such as Chinchilla, TinyStories, Textbooks Are All You Need, FineWeb-Edu, and DataComp-LM to show why educational web text, curated math and code data, and staged rebalancing matter so much when model capacity is tight. Listeners will find it interesting because it frames a practical question with real deployment stakes: whether careful data design can make smaller, cheaper, lower-latency models genuinely useful without relying on giant-scale compute.

Sources:
1. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model — Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf, 2025 http://arxiv.org/abs/2502.02737
2. Training Compute-Optimal Large Language Models — Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Nalisnick, Daniel Yamins, Timothy Lillicrap, Oriol Vinyals, Jeff Dean, et al., 2022 https://scholar.google.com/scholar?q=Training+Compute-Optimal+Large+Language+Models
3. TinyStories: How Small Can Language Models Be and Still Speak Coherent English? — Ronen Eldan, Yuanzhi Li, 2023 https://scholar.google.com/scholar?q=TinyStories:+How+Small+Can+Language+Models+Be+and+Still+Speak+Coherent+English?
4. Textbooks Are All You Need — Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio C. T. Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sebastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, Yuanzhi Li, 2023 https://scholar.google.com/scholar?q=Textbooks+Are+All+You+Need
5. MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases — Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra, 2024 https://scholar.google.com/scholar?q=MobileLLM:+Optimizing+Sub-billion+Parameter+Language+Models+for+On-Device+Use+Cases
6. Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research — Luca Soldaini, Rodney Kinney, Dustin Schwenk, Siddharth Goyal, Alessandro Sordoni, Kyle Lo, Noah A. Smith, and collaborators, 2024 https://scholar.google.com/scholar?q=Dolma:+an+Open+Corpus+of+Three+Trillion+Tokens+for+Language+Model+Pretraining+Research
7. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale — Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, Thomas Wolf, 2024 https://scholar.google.com/scholar?q=The+FineWeb+Datasets:+Decanting+the+Web+for+the+Finest+Text+Data+at+Scale
8. Data-Centric AI in the Age of Large Language Models — Xinyi Xu, Zhaoxuan Wu, Rui Qiao, Arun Verma, Yao Shu, Jingtan Wang, Xinyuan Niu, Zhenfeng He, Jiangwei Chen, Zijian Zhou, Gregory Kang Ruey Lau, Hieu Dao, Lucas Agussurja, Rachael Hwee Ling Sim, Xiaoqiang Lin, Wenyang Hu, Zhongxiang Dai, Pang Wei Koh, Bryan Kian Hsiang Low, 2024 https://scholar.google.com/scholar?q=Data-Centric+AI+in+the+Age+of+Large+Language+Models
9. The Stack: 3 TB of permissively licensed source code — Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries, 2022 https://scholar.google.com/scholar?q=The+Stack:+3+TB+of+permissively+licensed+source+code
10. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations — Ning Ding, Yulin Chen, Bokai Xu, et al., 2023 https://scholar.google.com/scholar?q=Enhancing+Chat+Language+Models+by+Scaling+High-quality+Instructional+Conversations
11. OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data — Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, Igor Gitman, 2024 https://scholar.google.com/scholar?q=OpenMathInstruct-2:+Accelerating+AI+for+Math+with+Massive+Open-Source+Instruction+Data
12. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model — Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf, 2025 https://scholar.google.com/scholar?q=SmolLM2:+When+Smol+Goes+Big+--+Data-Centric+Training+of+a+Small+Language+Model
13. DataComp-LM: In search of the next generation of training sets for language models — Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, and many others, 2024 https://scholar.google.com/scholar?q=DataComp-LM:+In+search+of+the+next+generation+of+training+sets+for+language+models
14. OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text — Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, Jimmy Ba, 2023 https://scholar.google.com/scholar?q=OpenWebMath:+An+Open+Dataset+of+High-Quality+Mathematical+Web+Text
15. InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning — Xiaotian Han, Yiren Jian, Xuefeng Hu, Haogeng Liu, Yiqi Wang, Qihang Fan, Yuang Ai, Huaibo Huang, Ran He, Zhenheng Yang, Quanzeng You, 2024 https://scholar.google.com/scholar?q=InfiMM-WebMath-40B:+Advancing+Multimodal+Pre-Training+for+Enhanced+Mathematical+Reasoning
16. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo, 2024 https://scholar.google.com/scholar?q=DeepSeekMath:+Pushing+the+Limits+of+Mathematical+Reasoning+in+Open+Language+Models
17. 2 OLMo 2 Furious — Kyle Lo and the OLMo team, 2025 https://scholar.google.com/scholar?q=2+OLMo+2+Furious
18. Revisiting Scaling Laws for Language Models: The Role of Data Quality and Training Strategies — Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, Jingang Wang, 2025 https://scholar.google.com/scholar?q=Revisiting+Scaling+Laws+for+Language+Models:+The+Role+of+Data+Quality+and+Training+Strategies
19. GRAPE: Optimize Data Mixture for Group Robust Multi-target Adaptive Pretraining — Simin Fan, Maria Ios Glarou, Martin Jaggi, 2025 https://scholar.google.com/scholar?q=GRAPE:+Optimize+Data+Mixture+for+Group+Robust+Multi-target+Adaptive+Pretraining
20. Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models — Lior Belenki, Alekh Agarwal, Tianze Shi, Kristina Toutanova, 2025 https://scholar.google.com/scholar?q=Optimizing+Pre-Training+Data+Mixtures+with+Mixtures+of+Data+Expert+Models
21. Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies — Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong, 2024 https://scholar.google.com/scholar?q=Scaling+Laws+with+Vocabulary:+Larger+Models+Deserve+Larger+Vocabularies
22. Distilling Reasoning Capabilities into Smaller Language Models — Kumar Shridhar, Alessandro Stolfo, Mrinmaya Sachan, 2023 https://scholar.google.com/scholar?q=Distilling+Reasoning+Capabilities+into+Smaller+Language+Models
23. Teaching Small Language Models Reasoning through Counterfactual Distillation — Tao Feng, Yicheng Li, Chenglin Li, Hao Chen, Fei Yu, Yin Zhang, 2024 https://scholar.google.com/scholar?q=Teaching+Small+Language+Models+Reasoning+through+Counterfactual+Distillation
24. AI Post Transformers: Self-Improving Pretraining With Post-Trained Models — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-02-self-improving-pretraining-with-post-tra-e37460.mp3
25. AI Post Transformers: Scaling Laws for Multilingual Code Pretraining — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-scaling-laws-for-multilingual-code-pretr-7d220e.mp3
26. AI Post Transformers: Can Models Learn from Long Context? — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-can-models-learn-from-long-context-77533e.mp3
27. AI Post Transformers: ASI-Evolve for Data, Architectures, and RL — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-05-asi-evolve-for-data-architectures-and-rl-197b2b.mp3
28. AI Post Transformers: Muon Is Scalable for LLM Training — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-25-muon-is-scalable-for-llm-training-587ed8.mp3
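
To make "staged rebalancing" of a data mix tangible, here is a small sketch of a stage-based mixture schedule. Only the rough 11 trillion token budget comes from the description above; the stage boundaries, source names, and weights are invented for illustration and are not SmolLM2's actual recipe.

```python
# Hypothetical schedule: (stage start in trillions of tokens, mixture weights).
# These numbers are made up for illustration, not taken from the paper.
STAGES = [
    (0.0, {"web": 0.90, "code": 0.10, "math": 0.00}),
    (6.0, {"web": 0.75, "code": 0.20, "math": 0.05}),
    (10.0, {"web": 0.60, "code": 0.25, "math": 0.15}),
]
TOTAL_TOKENS = 11.0  # trillions, the rough budget mentioned in the description

def mixture_at(tokens_seen):
    """Return the sampling weights in force after tokens_seen trillion tokens."""
    current = STAGES[0][1]
    for start, weights in STAGES:
        if tokens_seen >= start:
            current = weights
    return current

def tokens_per_source():
    """Integrate the schedule to get total tokens (trillions) drawn per source."""
    boundaries = [start for start, _ in STAGES] + [TOTAL_TOKENS]
    totals = {}
    for (start, weights), end in zip(STAGES, boundaries[1:]):
        for name, weight in weights.items():
            totals[name] = totals.get(name, 0.0) + weight * (end - start)
    return totals

if __name__ == "__main__":
    print(mixture_at(7.0))
    print(tokens_per_source())
```

The point of the sketch is the shape of the curriculum: scarce, high-value sources such as math get a small share of the total budget but are concentrated late in training, when the weights shift toward them.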

Yesterday, 1 h 0 min

Do Language Models Need Sleep?

This episode explores a paper proposing that language models could handle long-context reasoning by periodically pausing, replaying soon-to-be-evicted context offline, and consolidating it into fixed-size fast-weight memory instead of carrying an ever-growing KV cache. It explains the core machinery behind the idea, including state space models and Gated Delta Networks, and clarifies why this is more than prompt summarization or retrieval: the model is rewriting its internal bounded memory during inference. The discussion highlights the paper’s central argument that extra compute may be better spent during these offline “sleep” passes, so later token prediction stays cheap while older information is metabolized into usable latent state. Listeners will find it interesting because it frames long-context scaling as a memory-systems problem, raises concrete questions about whether this consolidation actually improves reasoning, and connects the proposal to broader debates about how future LLMs should trade off memory, compute, and exact recall.

Sources:
1. Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference — Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti, 2026 http://arxiv.org/abs/2605.26099
2. Replay in Deep Learning: Current Approaches and Missing Biological Elements — Tyler L. Hayes, Giri P. Krishnan, Maxim Bazhenov, Hava T. Siegelmann, Terrence J. Sejnowski, Christopher Kanan, 2021 https://scholar.google.com/scholar?q=Replay+in+Deep+Learning:+Current+Approaches+and+Missing+Biological+Elements
3. Can sleep protect memories from catastrophic forgetting? — Oscar C. Gonzalez, Yury Sokolov, Giri P. Krishnan, Jean Erik Delanois, Maxim Bazhenov, 2020 https://scholar.google.com/scholar?q=Can+sleep+protect+memories+from+catastrophic+forgetting?
4. Sleep-like unsupervised replay reduces catastrophic forgetting in artificial neural networks — Timothy Tadros, Giri P. Krishnan, Ramyaa Ramyaa, Maxim Bazhenov, 2022 https://scholar.google.com/scholar?q=Sleep-like+unsupervised+replay+reduces+catastrophic+forgetting+in+artificial+neural+networks
5. Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference — Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti, 2026 https://scholar.google.com/scholar?q=Do+Language+Models+Need+Sleep?+Offline+Recurrence+for+Improved+Online+Inference
6. Using Fast Weights to Attend to the Recent Past — Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, Catalin Ionescu, 2016 https://scholar.google.com/scholar?q=Using+Fast+Weights+to+Attend+to+the+Recent+Past
7. Linear Transformers Are Secretly Fast Weight Programmers — Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber, 2021 https://scholar.google.com/scholar?q=Linear+Transformers+Are+Secretly+Fast+Weight+Programmers
8. Fast weight programming and linear transformers: from machine learning to neurobiology — Kazuki Irie, Samuel J. Gershman, 2026 https://scholar.google.com/scholar?q=Fast+weight+programming+and+linear+transformers:+from+machine+learning+to+neurobiology
9. TRELLIS: Learning to Compress Key-Value Memory in Attention Models — Mahdi Karami, Ali Behrouz, Praneeth Kacham, Vahab Mirrokni, 2025 https://scholar.google.com/scholar?q=TRELLIS:+Learning+to+Compress+Key-Value+Memory+in+Attention+Models
10. Gated Delta Networks: Improving Mamba2 with Delta Rule — Songlin Yang, Jan Kautz, Ali Hatamizadeh, 2024 https://scholar.google.com/scholar?q=Gated+Delta+Networks:+Improving+Mamba2+with+Delta+Rule
11. Titans: Learning to Memorize at Test Time — Ali Behrouz, Peilin Zhong, Vahab Mirrokni, 2025 https://scholar.google.com/scholar?q=Titans:+Learning+to+Memorize+at+Test+Time
12. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach — Jonas Geiping, Sean McLeish, Neel Jain, et al., 2025 https://scholar.google.com/scholar?q=Scaling+up+Test-Time+Compute+with+Latent+Reasoning:+A+Recurrent+Depth+Approach
13. In-context Autoencoder for Context Compression in a Large Language Model — Tao Ge, Jing Hu, Lei Wang, Xun Wang, Si-Qing Chen, Furu Wei, 2023 https://scholar.google.com/scholar?q=In-context+Autoencoder+for+Context+Compression+in+a+Large+Language+Model
14. Cartridges: Lightweight and general-purpose long context representations via self-study — Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, et al., 2025 https://scholar.google.com/scholar?q=Cartridges:+Lightweight+and+general-purpose+long+context+representations+via+self-study
15. Repeat After Me: Transformers are Better than State Space Models at Copying — Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach, 2024 https://scholar.google.com/scholar?q=Repeat+After+Me:+Transformers+are+Better+than+State+Space+Models+at+Copying
16. End-to-End Test-Time Training for Long Context — Arnuv Tandon et al., 2025 https://scholar.google.com/scholar?q=End-to-End+Test-Time+Training+for+Long+Context
17. Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs — Rachit Bansal et al., 2025 https://scholar.google.com/scholar?q=Let's+(not)+just+put+things+in+Context:+Test-Time+Training+for+Long-Context+LLMs
18. Test-Time Training Done Right — Tianyuan Zhang et al., 2025 https://scholar.google.com/scholar?q=Test-Time+Training+Done+Right
19. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning — Yu Fu et al., 2024 https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning
20. Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning — Giulio Corallo et al., 2025 https://scholar.google.com/scholar?q=Beyond+RAG:+Task-Aware+KV+Cache+Compression+for+Comprehensive+Knowledge+Reasoning
21. SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning — Sanjay Kariyappa and G. Edward Suh, 2026 https://scholar.google.com/scholar?q=SideQuest:+Model-Driven+KV+Cache+Management+for+Long-Horizon+Agentic+Reasoning
22. Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers — Harsh Kohli et al., 2026 https://scholar.google.com/scholar?q=Loop,+Think,+&+Generalize:+Implicit+Reasoning+in+Recurrent-Depth+Transformers
23. AI Post Transformers: Titans: Learning to Memorize at Test Time — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-20-titans-learning-to-memorize-at-test-time-054662.mp3
24. AI Post Transformers: In-Place Test-Time Training for Transformers — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-in-place-test-time-training-for-transfor-d0b976.mp3
25. AI Post Transformers: Recursive Language Models for Arbitrarily Long Prompts — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-recursive-language-models-for-arbitraril-fbcd1c.mp3
26. AI Post Transformers: Explicit Information Transmission for Context Compression — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-explicit-information-transmission-for-co-24e3c2.mp3
27. AI Post Transformers: KVzip for Query-Agnostic KV Cache Compression — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-29-kvzip-for-query-agnostic-kv-cache-compre-72afe5.mp3
28. AI Post Transformers: Gated Linear Attention for Efficient Long Sequences — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-18-gated-linear-attention-for-efficient-lon-c858ab.mp3
29. AI Post Transformers: MiA-Signature and Global Activation for Long Context — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-13-mia-signature-and-global-activation-for-5ad62f.mp3
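
The "fixed-size fast-weight memory" in the description above can be illustrated with the gated delta rule used by Gated Delta Networks, where the state S is updated as S ← α·S·(I − β·k·kᵀ) + β·v·kᵀ and read with S·q. The sketch below is plain Python for readability. The `consolidate` helper, which replays evicted key/value pairs for several passes, is a hypothetical stand-in for an offline "sleep" pass and is not the paper's procedure.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta-rule write: S <- alpha * S (I - beta k k^T) + beta v k^T.

    S is a d_v x d_k matrix stored as a list of rows; its size never grows.
    """
    pred = [sum(s * kj for s, kj in zip(row, k)) for row in S]  # current S k
    return [
        [alpha * (s - beta * p * kj) + beta * vi * kj for s, kj in zip(row, k)]
        for row, p, vi in zip(S, pred, v)
    ]

def read(S, q):
    """Query the memory: returns S q."""
    return [sum(s * qj for s, qj in zip(row, q)) for row in S]

def consolidate(S, evicted, passes, alpha=1.0, beta=0.5):
    """Hypothetical offline pass: replay evicted (key, value) pairs into S."""
    for _ in range(passes):
        for k, v in evicted:
            S = gated_delta_step(S, k, v, alpha, beta)
    return S

if __name__ == "__main__":
    empty = [[0.0, 0.0], [0.0, 0.0]]
    evicted = [([1.0, 0.0], [3.0, 4.0])]  # one key/value pair about to be dropped
    once = consolidate(empty, evicted, passes=1)
    twice = consolidate(empty, evicted, passes=2)
    print(read(once, [1.0, 0.0]))   # [1.5, 2.0]
    print(read(twice, [1.0, 0.0]))  # [2.25, 3.0]
```

With a write strength of β = 0.5, each replay moves the stored value halfway toward the target [3.0, 4.0], so a second pass recalls it more faithfully than the first. That is a toy version of the trade the description points at: extra offline compute in exchange for better recall from a memory whose size stays constant regardless of context length.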

Yesterday, 1 h 0 min
