AI Post Transformers

Snap's Microkernel Approach to Host Networking

1 h 0 min · Yesterday

Description

This episode explores Google’s Snap system, which moves major host-networking functions out of the kernel and into isolated userspace services while trying to keep the performance benefits usually associated with kernel bypass. It examines why that shift mattered operationally at fleet scale: kernel networking changes could take one to two months to deploy, while Snap enabled roughly weekly releases and had already been adopted across more than half of Google’s machines. The discussion breaks down Snap’s architecture, including centralized host services, microkernel-style isolation, lock-free engine communication, the MicroQuanta scheduler design, latency-sensitive congestion control, and Pony Express as a flagship transport for reliable, asynchronous messaging. Listeners would find it interesting because it frames host networking as a platform-design problem, not just a packet-speed problem, and argues that upgradeability, policy control, and performance can be engineered together rather than traded off. Sources: 1. Snap's Microkernel Approach to Host Networking https://storage.googleapis.com/gweb-research2023-media/pubtools/5281.pdf 2. L4 Microkernels: The Lessons from 20 Years of Research and Deployment — Gernot Heiser, Kevin Elphinstone, 2016 https://trustworthy.systems/publications/nicta_full_text/8988.pdf 3. Arrakis: The Operating System is the Control Plane — Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, Timothy Roscoe, 2014 https://www.usenix.org/conference/osdi14/technical-sessions/presentation/peter 4. Snap: a Microkernel Approach to Host Networking — Michael Marty, Marc de Kruijf, Jacob Adriaens, Nandita Dukkipati, Amin Vahdat, et al., 2019 https://research.google/pubs/snap-a-microkernel-approach-to-host-networking/ 5. netmap: A Novel Framework for Fast Packet I/O — Luigi Rizzo, 2012 https://www.usenix.org/conference/atc12/technical-sessions/presentation/rizzo 6. 
mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems — EunYoung Jeong, Shinae Woo, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu Han, KyoungSoo Park, 2014 https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-jeong.pdf 7. IX: A Protected Dataplane Operating System for High Throughput and Low Latency — Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, Edouard Bugnion, 2014 https://csl.stanford.edu/~christos/publications/2014.ix.osdi.pdf 8. VL2: A Scalable and Flexible Data Center Network — Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, Dave Maltz, Parveen Patel, Sudipta Sengupta, 2009 https://www.microsoft.com/en-us/research/publication/vl2-a-scalable-and-flexible-data-center-network/ 9. Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization — Michael Dalton, David Schultz, Jacob Adriaens, Ahsan Arefin, Anshuman Gupta, Amin Vahdat, et al., 2018 https://www.usenix.org/conference/nsdi18/presentation/dalton 10. Carousel: Scalable Traffic Shaping at End-Hosts — Ahmed Saeed, Nandita Dukkipati, Valas Valancius, Terry Lam, Carlo Contavalli, Amin Vahdat, 2017 https://research.google/pubs/carousel-scalable-traffic-shaping-at-end-hosts/ 11. FaRM: Fast Remote Memory — Aleksandar Dragojevic, Dushyanth Narayanan, Orion Hodson, Miguel Castro, 2014 https://www.usenix.org/conference/nsdi14/technical-sessions/dragojevi%C4%87 12. Using RDMA Efficiently for Key-Value Services — Anuj Kalia, Michael Kaminsky, David G. Andersen, 2014 https://www.pdl.cmu.edu/PDL-FTP/Storage/herd-sigcomm2014.pdf 13. Datacenter RPCs can be General and Fast — Anuj Kalia, Michael Kaminsky, David Andersen, 2019 https://www.usenix.org/conference/nsdi19/presentation/kalia 14. 
Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads — Amy Ousterhout, Joshua Fried, Jonathan Behrens, Adam Belay, Hari Balakrishnan, 2019 https://www.usenix.org/conference/nsdi19/presentation/ousterhout 15. Caladan: Mitigating Interference at Microsecond Timescales — Joshua Fried, Zhenyuan Ruan, Amy Ousterhout, Adam Belay, 2020 https://www.usenix.org/conference/osdi20/presentation/fried 16. TAS: TCP Acceleration as an OS Service — Antoine Kaufmann, Tim Stamler, Simon Peter, Naveen Kr. Sharma, Arvind Krishnamurthy, and Thomas Anderson, 2019 https://scholar.google.com/scholar?q=TAS:+TCP+Acceleration+as+an+OS+Service 17. FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs — Anuj Kalia, Michael Kaminsky, and David G. Andersen, 2016 https://scholar.google.com/scholar?q=FaSST:+Fast,+Scalable+and+Simple+Distributed+Transactions+with+Two-Sided+(RDMA)+Datagram+RPCs 18. Implementing Network Protocols at User Level — C. A. Thekkath, T. D. Nguyen, E. Moy, and E. D. Lazowska, 1993 https://scholar.google.com/scholar?q=Implementing+Network+Protocols+at+User+Level 19. NetEdit: An Orchestration Platform for eBPF Network Functions at Scale — Theophilus A. Benson et al., 2024 https://doi.org/10.1145/3651890.3672227 20. Demystifying Performance of eBPF Network Applications — Farbod Shahinfar, Sebastiano Miano, Aurojit Panda, Gianni Antichi, 2025 https://cs.nyu.edu/~apanda/assets/papers/conext25.pdf 21. Unleashing Unprivileged eBPF Potential with Dynamic Sandboxing — Soo Yee Lim, Xueyuan Han, Thomas Pasquier, 2023 https://arxiv.org/abs/2308.01983 22. Efficient Scheduler Live Update for Linux Kernel with Modularization — Teng Ma et al., 2023 https://doi.org/10.1145/3582016.3582054 23. Communication Offloading on SmartNIC DPUs: A Quantitative Approach — Jacob Wahlgren et al., 2026 https://arxiv.org/abs/2605.04842

All episodes

670 episodes

Post-Trained MoE Skips Half Its Experts

This episode explores a post-training method for making mixture-of-experts language models cheaper at inference time without retraining them from scratch. It explains how the paper converts a fully trained static MoE into a dynamic one by adding parameter-free zero experts, allowing some tokens to skip normal experts, and then uses self-distillation to preserve the original model’s behavior under this lower-compute routing scheme. The discussion highlights why this deployment-focused approach matters for real production systems, especially when pretraining, fine-tuning, and alignment are already complete and inference cost is the main bottleneck. Listeners would find it interesting for its clear breakdown of dynamic versus static MoE compute, its practical framing around latency and serving costs, and its focus on whether large post-trained models can cut expert FLOPs substantially without losing capability. Sources: 1. Post-Trained MoE Can Skip Half Experts via Self-Distillation — Xingtai Lv, Li Sheng, Kaiyan Zhang, Yichen You, Siyan Gao, Xueheng Luo, Yuxin Zuo, Yuchen Fan, Junlin Yang, Ganqu Cui, Bingning Wang, Fan Yang, Youbang Sun, Ning Ding, Bowen Zhou, 2026 http://arxiv.org/abs/2605.18643 2. MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts — Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan, 2024 https://scholar.google.com/scholar?q=MoE++:+Accelerating+Mixture-of-Experts+Methods+with+Zero-Computation+Experts 3. Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models — Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, Hongsheng Li, 2024 https://scholar.google.com/scholar?q=Not+All+Experts+are+Equal:+Efficient+Expert+Pruning+and+Skipping+for+Mixture-of-Experts+Large+Language+Models 4. 
Task-Specific Expert Pruning for Sparse Mixture-of-Experts — Tianyu Chen, Shaohan Huang, Yuan Xie, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, Furu Wei, 2022 https://scholar.google.com/scholar?q=Task-Specific+Expert+Pruning+for+Sparse+Mixture-of-Experts 5. Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts — DeepSeek-AI et al., 2024 https://scholar.google.com/scholar?q=Auxiliary-Loss-Free+Load+Balancing+Strategy+for+Mixture-of-Experts 6. ST-MoE: Designing Stable and Transferable Sparse Expert Models — Barret Zoph, Noam Shazeer, William Fedus, et al., 2022 https://scholar.google.com/scholar?q=ST-MoE:+Designing+Stable+and+Transferable+Sparse+Expert+Models 7. AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models — Zihao Zeng, Yibo Miao, Hongcheng Gao, Hao Zhang, Zhijie Deng, 2024 https://scholar.google.com/scholar?q=AdaMoE:+Token-Adaptive+Routing+with+Null+Experts+for+Mixture-of-Experts+Language+Models 8. Harder Task Needs More Experts: Dynamic Routing in MoE Models — Quzhe Huang, Zhenwei An, Nan Zhuang, Mingxu Tao, Chen Zhang, Yang Jin, Kun Xu, Liwei Chen, Songfang Huang, Yansong Feng, 2024 https://scholar.google.com/scholar?q=Harder+Task+Needs+More+Experts:+Dynamic+Routing+in+MoE+Models 9. MoE Pathfinder: Trajectory-driven Expert Pruning — Xican Yang, Yuanhe Tian, Yan Song, 2025 https://scholar.google.com/scholar?q=MoE+Pathfinder:+Trajectory-driven+Expert+Pruning 10. Discovering Important Experts for Mixture-of-Experts Models Pruning Through a Theoretical Perspective — approximate only; title verified, authors not confidently recovered, 2025/2026 https://scholar.google.com/scholar?q=Discovering+Important+Experts+for+Mixture-of-Experts+Models+Pruning+Through+a+Theoretical+Perspective 11. 
MoEEdit: Efficient and Routing-Stable Knowledge Editing for Mixture-of-Experts LLMs — Yupu Gu, Rongzhe Wei, Andy Zhu, Pan Li, 2026 https://scholar.google.com/scholar?q=MoEEdit:+Efficient+and+Routing-Stable+Knowledge+Editing+for+Mixture-of-Experts+LLMs 12. ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning — Chao Jin, Xinming Wei, Yinmin Zhong, Chengxu Yang, Bingyang Wu, Ruidong Zhu, Zili Zhang, Yuliang Liu, Xin Jin, 2026 https://scholar.google.com/scholar?q=ReLibra:+Routing-Replay-Guided+Load+Balancing+for+MoE+Training+in+Reinforcement+Learning 13. Sparse MoE Students for Efficient Knowledge Distillation — approximate only; exact author list not confidently recovered, 2025 https://scholar.google.com/scholar?q=Sparse+MoE+Students+for+Efficient+Knowledge+Distillation 14. AI Post Transformers: Batch-Aware Expert Routing for Faster MoE Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-batch-aware-expert-routing-for-faster-mo-683ab6.mp3 15. AI Post Transformers: Serving MoE Models with Disaggregated Expert Parallelism — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-19-serving-moe-models-with-disaggregated-ex-6979d2.mp3 16. AI Post Transformers: Ministral 3: Cascade Distillation for Long-Context Multimodal Models — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-cascade-distillation-for-long-context-mu-0ebd1a.mp3 17. AI Post Transformers: Nemotron 3 Super Hybrid Mamba-Transformer MoE — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-19-nemotron-3-super-hybrid-mamba-transforme-31ac75.mp3 18. AI Post Transformers: LPU Chip for Low-Latency LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-20-lpu-chip-for-low-latency-llm-inference-be13c3.mp3

Yesterday · 1 h 0 min

SmolLM2 and the Power of Better Data

This episode explores SmolLM2, a 1.7 billion parameter language model from Hugging Face that tries to compete with stronger small models not by changing the transformer architecture, but by radically improving the training data mix and sequencing across roughly 11 trillion tokens. It explains the distinction between pretraining and instruction tuning, then argues that for compact models, dataset quality and curriculum can function almost like part of the architecture itself. The discussion connects SmolLM2 to earlier work such as Chinchilla, TinyStories, Textbooks Are All You Need, FineWeb-Edu, and DataComp-LM to show why educational web text, curated math and code data, and staged rebalancing matter so much when model capacity is tight. Listeners would find it interesting because it frames a practical question with real deployment stakes: whether careful data design can make smaller, cheaper, lower-latency models genuinely useful without relying on giant-scale compute. Sources: 1. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model — Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf, 2025 http://arxiv.org/abs/2502.02737 2. Training Compute-Optimal Large Language Models — Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Nalisnick, Daniel Yamins, Timothy Lillicrap, Oriol Vinyals, Jeff Dean, et al., 2022 https://scholar.google.com/scholar?q=Training+Compute-Optimal+Large+Language+Models 3. TinyStories: How Small Can Language Models Be and Still Speak Coherent English? 
— Ronen Eldan, Yuanzhi Li, 2023 https://scholar.google.com/scholar?q=TinyStories:+How+Small+Can+Language+Models+Be+and+Still+Speak+Coherent+English? 4. Textbooks Are All You Need — Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio C. T. Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sebastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, Yuanzhi Li, 2023 https://scholar.google.com/scholar?q=Textbooks+Are+All+You+Need 5. MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases — Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra, 2024 https://scholar.google.com/scholar?q=MobileLLM:+Optimizing+Sub-billion+Parameter+Language+Models+for+On-Device+Use+Cases 6. Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research — Luca Soldaini, Rodney Kinney, Dustin Schwenk, Siddharth Goyal, Alessandro Sordoni, Kyle Lo, Noah A. Smith, and collaborators, 2024 https://scholar.google.com/scholar?q=Dolma:+an+Open+Corpus+of+Three+Trillion+Tokens+for+Language+Model+Pretraining+Research 7. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale — Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, Thomas Wolf, 2024 https://scholar.google.com/scholar?q=The+FineWeb+Datasets:+Decanting+the+Web+for+the+Finest+Text+Data+at+Scale 8. 
Data-Centric AI in the Age of Large Language Models — Xinyi Xu, Zhaoxuan Wu, Rui Qiao, Arun Verma, Yao Shu, Jingtan Wang, Xinyuan Niu, Zhenfeng He, Jiangwei Chen, Zijian Zhou, Gregory Kang Ruey Lau, Hieu Dao, Lucas Agussurja, Rachael Hwee Ling Sim, Xiaoqiang Lin, Wenyang Hu, Zhongxiang Dai, Pang Wei Koh, Bryan Kian Hsiang Low, 2024 https://scholar.google.com/scholar?q=Data-Centric+AI+in+the+Age+of+Large+Language+Models 9. The Stack: 3 TB of permissively licensed source code — Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries, 2022 https://scholar.google.com/scholar?q=The+Stack:+3+TB+of+permissively+licensed+source+code 10. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations — Ning Ding, Yulin Chen, Bokai Xu, et al., 2023 https://scholar.google.com/scholar?q=Enhancing+Chat+Language+Models+by+Scaling+High-quality+Instructional+Conversations 11. OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data — Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, Igor Gitman, 2024 https://scholar.google.com/scholar?q=OpenMathInstruct-2:+Accelerating+AI+for+Math+with+Massive+Open-Source+Instruction+Data 12. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model — Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf, 2025 https://scholar.google.com/scholar?q=SmolLM2:+When+Smol+Goes+Big+--+Data-Centric+Training+of+a+Small+Language+Model 13. 
DataComp-LM: In search of the next generation of training sets for language models — Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, and many others, 2024 https://scholar.google.com/scholar?q=DataComp-LM:+In+search+of+the+next+generation+of+training+sets+for+language+models 14. OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text — Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, Jimmy Ba, 2023 https://scholar.google.com/scholar?q=OpenWebMath:+An+Open+Dataset+of+High-Quality+Mathematical+Web+Text 15. InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning — Xiaotian Han, Yiren Jian, Xuefeng Hu, Haogeng Liu, Yiqi Wang, Qihang Fan, Yuang Ai, Huaibo Huang, Ran He, Zhenheng Yang, Quanzeng You, 2024 https://scholar.google.com/scholar?q=InfiMM-WebMath-40B:+Advancing+Multimodal+Pre-Training+for+Enhanced+Mathematical+Reasoning 16. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo, 2024 https://scholar.google.com/scholar?q=DeepSeekMath:+Pushing+the+Limits+of+Mathematical+Reasoning+in+Open+Language+Models 17. 2 OLMo 2 Furious — Kyle Lo and the OLMo team, 2025 https://scholar.google.com/scholar?q=2+OLMo+2+Furious 18. Revisiting Scaling Laws for Language Models: The Role of Data Quality and Training Strategies — Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, Jingang Wang, 2025 https://scholar.google.com/scholar?q=Revisiting+Scaling+Laws+for+Language+Models:+The+Role+of+Data+Quality+and+Training+Strategies 19. 
GRAPE: Optimize Data Mixture for Group Robust Multi-target Adaptive Pretraining — Simin Fan, Maria Ios Glarou, Martin Jaggi, 2025 https://scholar.google.com/scholar?q=GRAPE:+Optimize+Data+Mixture+for+Group+Robust+Multi-target+Adaptive+Pretraining 20. Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models — Lior Belenki, Alekh Agarwal, Tianze Shi, Kristina Toutanova, 2025 https://scholar.google.com/scholar?q=Optimizing+Pre-Training+Data+Mixtures+with+Mixtures+of+Data+Expert+Models 21. Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies — Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong, 2024 https://scholar.google.com/scholar?q=Scaling+Laws+with+Vocabulary:+Larger+Models+Deserve+Larger+Vocabularies 22. Distilling Reasoning Capabilities into Smaller Language Models — Kumar Shridhar, Alessandro Stolfo, Mrinmaya Sachan, 2023 https://scholar.google.com/scholar?q=Distilling+Reasoning+Capabilities+into+Smaller+Language+Models 23. Teaching Small Language Models Reasoning through Counterfactual Distillation — Tao Feng, Yicheng Li, Chenglin Li, Hao Chen, Fei Yu, Yin Zhang, 2024 https://scholar.google.com/scholar?q=Teaching+Small+Language+Models+Reasoning+through+Counterfactual+Distillation 24. AI Post Transformers: Self-Improving Pretraining With Post-Trained Models — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-02-self-improving-pretraining-with-post-tra-e37460.mp3 25. AI Post Transformers: Scaling Laws for Multilingual Code Pretraining — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-scaling-laws-for-multilingual-code-pretr-7d220e.mp3 26. AI Post Transformers: Can Models Learn from Long Context? — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-can-models-learn-from-long-context-77533e.mp3 27. 
AI Post Transformers: ASI-Evolve for Data, Architectures, and RL — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-05-asi-evolve-for-data-architectures-and-rl-197b2b.mp3 28. AI Post Transformers: Muon Is Scalable for LLM Training — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-25-muon-is-scalable-for-llm-training-587ed8.mp3

Yesterday · 1 h 0 min

Dragonfly Topology for Scalable AI Networks

This episode explores the 2008 Dragonfly network topology paper and why its ideas suddenly matter again for large-scale AI systems in 2026. It explains how Dragonfly uses high-radix routers and router groups to keep most traffic to a local hop, a single global hop, and another local hop, reducing the number of expensive long-distance optical links compared with flattened butterfly and folded Clos designs. The discussion highlights the paper’s core argument that topology and routing must be co-designed around pin bandwidth, cable cost, power, and congestion, with the authors claiming roughly 20 percent lower cost than flattened butterfly and 52 percent lower cost than folded Clos beyond 16K nodes under their assumptions. Listeners would find it interesting because it connects an old supercomputing interconnect idea to modern TPU fabrics, mixture-of-experts traffic, all-to-all communication, and the growing reality that network design now directly shapes AI system performance. Sources: 1. Dragonfly Topology for Scalable AI Networks https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/34926.pdf 2. Technology-Driven, Highly-Scalable Dragonfly Topology — John Kim, William J. Dally, Steve Scott, Dennis Abts, 2008 https://scholar.google.com/scholar?q=Technology-Driven,+Highly-Scalable+Dragonfly+Topology 3. Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks — John Kim, William J. Dally, Dennis Abts, 2007 https://scholar.google.com/scholar?q=Flattened+Butterfly:+A+Cost-Efficient+Topology+for+High-Radix+Networks 4. Topological Characterization of Hamming and Dragonfly Networks and Its Implications on Routing — Cristobal Camarero, Enrique Vallejo, Ramon Beivide, 2014 https://scholar.google.com/scholar?q=Topological+Characterization+of+Hamming+and+Dragonfly+Networks+and+Its+Implications+on+Routing 5. 
Slim Fly: A Cost Effective Low-Diameter Network Topology — Maciej Besta, Torsten Hoefler, 2014 https://scholar.google.com/scholar?q=Slim+Fly:+A+Cost+Effective+Low-Diameter+Network+Topology 6. Microarchitecture of a High-Radix Router — John Kim, William J. Dally, Brian Towles, Amit K. Gupta, 2005 https://scholar.google.com/scholar?q=Microarchitecture+of+a+High-Radix+Router 7. The BlackWidow High-Radix Clos Network — Steve Scott, Dennis Abts, John Kim, William J. Dally, 2006 https://scholar.google.com/scholar?q=The+BlackWidow+High-Radix+Clos+Network 8. Scalable High-Radix Router Microarchitecture Using a Network Switch Organization — Jung Ho Ahn, Young Hoon Son, John Kim, 2013 https://scholar.google.com/scholar?q=Scalable+High-Radix+Router+Microarchitecture+Using+a+Network+Switch+Organization 9. A Scheme for Fast Parallel Communication — L. G. Valiant, 1982 https://scholar.google.com/scholar?q=A+Scheme+for+Fast+Parallel+Communication 10. Indirect Adaptive Routing on Large Scale Interconnection Networks — Nan Jiang, John Kim, William J. Dally, 2009 https://scholar.google.com/scholar?q=Indirect+Adaptive+Routing+on+Large+Scale+Interconnection+Networks 11. Rationale and Challenges for Optical Interconnects to Electronic Chips — David A. B. Miller, 2000 https://scholar.google.com/scholar?q=Rationale+and+Challenges+for+Optical+Interconnects+to+Electronic+Chips 12. Optical Interconnects for High-Performance Computing — Marc A. Taubenblatt, 2012 https://scholar.google.com/scholar?q=Optical+Interconnects+for+High-Performance+Computing 13. Optical Interconnects for Extreme Scale Computing Systems — Sebastien Rumley, Meisam Bahadori, Robert Polster, Simon D. Hammond, David M. Calhoun, Ke Wen, Arun Rodrigues, Keren Bergman, 2017 https://scholar.google.com/scholar?q=Optical+Interconnects+for+Extreme+Scale+Computing+Systems 14. 
Mission Apollo: Landing Optical Circuit Switching at Datacenter Scale — Ryohei Urata, Hong Liu, Kevin Yasumura, Erji Mao, Jill Berger, Xiang Zhou, Cedric Lam, Roy Bannon, Darren Hutchinson, Daniel Nelson, Leon Poutievski, Arjun Singh, Joon Ong, Amin Vahdat, 2022 https://scholar.google.com/scholar?q=Mission+Apollo:+Landing+Optical+Circuit+Switching+at+Datacenter+Scale 15. Adaptive Routing in High-Radix Clos Network — John Kim, William J. Dally, Dennis Abts, 2006 https://doi.org/10.1145/1188455.1188552 16. Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies — Prithwish Basu, Liangyu Zhao, Jason Fantl, Siddharth Pal, Arvind Krishnamurthy, Joud Khoury, 2024 https://doi.org/10.1145/3625549.3658656 17. Toward lower-diameter large-scale HPC and data center networks with co-packaged optics — Pavlos Maniotis, Laurent Schares, Benjamin G. Lee, Marc A. Taubenblatt, Daniel M. Kuchta, 2021 https://scholar.google.com/scholar?q=Toward+lower-diameter+large-scale+HPC+and+data+center+networks+with+co-packaged+optics 18. Toward higher-radix switches with co-packaged optics for improved network locality in data center and HPC networks [Invited] — Pavlos Maniotis, Laurent Schares, Daniel M. Kuchta, Bengi Karacali, 2022 https://scholar.google.com/scholar?q=Toward+higher-radix+switches+with+co-packaged+optics+for+improved+network+locality+in+data+center+and+HPC+networks+[Invited] 19. Exploring the benefits of using co-packaged optics in data center and AI supercomputer networks: a simulation-based analysis [Invited] — Pavlos Maniotis, Daniel M. Kuchta, 2024 https://scholar.google.com/scholar?q=Exploring+the+benefits+of+using+co-packaged+optics+in+data+center+and+AI+supercomputer+networks:+a+simulation-based+analysis+[Invited] 20. Enhanced UGAL Routing Schemes for Dragonfly Networks — Ram Sharan Chaulagain, Xin Yuan, 2024 https://scholar.google.com/scholar?q=Enhanced+UGAL+Routing+Schemes+for+Dragonfly+Networks 21. 
On Selection Functions in Adaptive Routing — Alejandro Cano, Cristobal Camarero, Carmen Martinez, 2025 https://scholar.google.com/scholar?q=On+Selection+Functions+in+Adaptive+Routing 22. Co-packaged optics (CPO): status, challenges, and solutions — Min Tan and coauthors, 2023 https://scholar.google.com/scholar?q=Co-packaged+optics+(CPO):+status,+challenges,+and+solutions 23. AI Post Transformers: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-computation-bandwidth-memory-trade-offs-a83f2b.mp3 24. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3 25. AI Post Transformers: Serving MoE Models with Disaggregated Expert Parallelism — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-19-serving-moe-models-with-disaggregated-ex-6979d2.mp3 26. AI Post Transformers: Lossless Sparse Deltas for RL Networks — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-lossless-sparse-deltas-for-rl-networks-84d676.mp3

Yesterday · 1 h 0 min
Episode Snap's Microkernel Approach to Host Networking Cover

Snap's Microkernel Approach to Host Networking

This episode explores Google’s Snap system, which moves major host-networking functions out of the kernel and into isolated userspace services while trying to keep the performance benefits usually associated with kernel bypass. It examines why that shift mattered operationally at fleet scale: kernel networking changes could take one to two months to deploy, while Snap enabled roughly weekly releases and had already been adopted across more than half of Google’s machines. The discussion breaks down Snap’s architecture, including centralized host services, microkernel-style isolation, lock-free engine communication, the MicroQuanta scheduler design, latency-sensitive congestion control, and Pony Express as a flagship transport for reliable, asynchronous messaging. Listeners would find it interesting because it frames host networking as a platform-design problem, not just a packet-speed problem, and argues that upgradeability, policy control, and performance can be engineered together rather than traded off.

Sources:
1. Snap's Microkernel Approach to Host Networking https://storage.googleapis.com/gweb-research2023-media/pubtools/5281.pdf
2. L4 Microkernels: The Lessons from 20 Years of Research and Deployment — Gernot Heiser, Kevin Elphinstone, 2016 https://trustworthy.systems/publications/nicta_full_text/8988.pdf
3. Arrakis: The Operating System is the Control Plane — Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, Timothy Roscoe, 2014 https://www.usenix.org/conference/osdi14/technical-sessions/presentation/peter
4. Snap: a Microkernel Approach to Host Networking — Michael Marty, Marc de Kruijf, Jacob Adriaens, Nandita Dukkipati, Amin Vahdat, et al., 2019 https://research.google/pubs/snap-a-microkernel-approach-to-host-networking/
5. netmap: A Novel Framework for Fast Packet I/O — Luigi Rizzo, 2012 https://www.usenix.org/conference/atc12/technical-sessions/presentation/rizzo
6. mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems — EunYoung Jeong, Shinae Woo, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu Han, KyoungSoo Park, 2014 https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-jeong.pdf
7. IX: A Protected Dataplane Operating System for High Throughput and Low Latency — Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, Edouard Bugnion, 2014 https://csl.stanford.edu/~christos/publications/2014.ix.osdi.pdf
8. VL2: A Scalable and Flexible Data Center Network — Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, Dave Maltz, Parveen Patel, Sudipta Sengupta, 2009 https://www.microsoft.com/en-us/research/publication/vl2-a-scalable-and-flexible-data-center-network/
9. Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization — Michael Dalton, David Schultz, Jacob Adriaens, Ahsan Arefin, Anshuman Gupta, Amin Vahdat, et al., 2018 https://www.usenix.org/conference/nsdi18/presentation/dalton
10. Carousel: Scalable Traffic Shaping at End-Hosts — Ahmed Saeed, Nandita Dukkipati, Valas Valancius, Terry Lam, Carlo Contavalli, Amin Vahdat, 2017 https://research.google/pubs/carousel-scalable-traffic-shaping-at-end-hosts/
11. FaRM: Fast Remote Memory — Aleksandar Dragojevic, Dushyanth Narayanan, Orion Hodson, Miguel Castro, 2014 https://www.usenix.org/conference/nsdi14/technical-sessions/dragojevi%C4%87
12. Using RDMA Efficiently for Key-Value Services — Anuj Kalia, Michael Kaminsky, David G. Andersen, 2014 https://www.pdl.cmu.edu/PDL-FTP/Storage/herd-sigcomm2014.pdf
13. Datacenter RPCs can be General and Fast — Anuj Kalia, Michael Kaminsky, David Andersen, 2019 https://www.usenix.org/conference/nsdi19/presentation/kalia
14. Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads — Amy Ousterhout, Joshua Fried, Jonathan Behrens, Adam Belay, Hari Balakrishnan, 2019 https://www.usenix.org/conference/nsdi19/presentation/ousterhout
15. Caladan: Mitigating Interference at Microsecond Timescales — Joshua Fried, Zhenyuan Ruan, Amy Ousterhout, Adam Belay, 2020 https://www.usenix.org/conference/osdi20/presentation/fried
16. TAS: TCP Acceleration as an OS Service — Antoine Kaufmann, Tim Stamler, Simon Peter, Naveen Kr. Sharma, Arvind Krishnamurthy, and Thomas Anderson, 2019 https://scholar.google.com/scholar?q=TAS:+TCP+Acceleration+as+an+OS+Service
17. FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs — Anuj Kalia, Michael Kaminsky, and David G. Andersen, 2016 https://scholar.google.com/scholar?q=FaSST:+Fast,+Scalable+and+Simple+Distributed+Transactions+with+Two-Sided+(RDMA)+Datagram+RPCs
18. Implementing Network Protocols at User Level — C. A. Thekkath, T. D. Nguyen, E. Moy, and E. D. Lazowska, 1993 https://scholar.google.com/scholar?q=Implementing+Network+Protocols+at+User+Level
19. NetEdit: An Orchestration Platform for eBPF Network Functions at Scale — Theophilus A. Benson et al., 2024 https://doi.org/10.1145/3651890.3672227
20. Demystifying Performance of eBPF Network Applications — Farbod Shahinfar, Sebastiano Miano, Aurojit Panda, Gianni Antichi, 2025 https://cs.nyu.edu/~apanda/assets/papers/conext25.pdf
21. Unleashing Unprivileged eBPF Potential with Dynamic Sandboxing — Soo Yee Lim, Xueyuan Han, Thomas Pasquier, 2023 https://arxiv.org/abs/2308.01983
22. Efficient Scheduler Live Update for Linux Kernel with Modularization — Teng Ma et al., 2023 https://doi.org/10.1145/3582016.3582054
23. Communication Offloading on SmartNIC DPUs: A Quantitative Approach — Jacob Wahlgren et al., 2026 https://arxiv.org/abs/2605.04842

Yesterday · 1 h 0 min

Do Language Models Need Sleep?

This episode explores a paper proposing that language models could handle long-context reasoning by periodically pausing, replaying soon-to-be-evicted context offline, and consolidating it into fixed-size fast-weight memory instead of carrying an ever-growing KV cache. It explains the core machinery behind the idea, including state space models and Gated Delta Networks, and clarifies why this is more than prompt summarization or retrieval: the model is rewriting its internal bounded memory during inference. The discussion highlights the paper’s central argument that extra compute may be better spent during these offline “sleep” passes, so later token prediction stays cheap while older information is metabolized into usable latent state. Listeners would find it interesting because it frames long-context scaling as a memory-systems problem, raises concrete questions about whether this consolidation actually improves reasoning, and connects the proposal to broader debates about how future LLMs should trade off memory, compute, and exact recall.

Sources:
1. Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference — Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti, 2026 http://arxiv.org/abs/2605.26099
2. Replay in Deep Learning: Current Approaches and Missing Biological Elements — Tyler L. Hayes, Giri P. Krishnan, Maxim Bazhenov, Hava T. Siegelmann, Terrence J. Sejnowski, Christopher Kanan, 2021 https://scholar.google.com/scholar?q=Replay+in+Deep+Learning:+Current+Approaches+and+Missing+Biological+Elements
3. Can sleep protect memories from catastrophic forgetting? — Oscar C. Gonzalez, Yury Sokolov, Giri P. Krishnan, Jean Erik Delanois, Maxim Bazhenov, 2020 https://scholar.google.com/scholar?q=Can+sleep+protect+memories+from+catastrophic+forgetting?
4. Sleep-like unsupervised replay reduces catastrophic forgetting in artificial neural networks — Timothy Tadros, Giri P. Krishnan, Ramyaa Ramyaa, Maxim Bazhenov, 2022 https://scholar.google.com/scholar?q=Sleep-like+unsupervised+replay+reduces+catastrophic+forgetting+in+artificial+neural+networks
5. Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference — Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti, 2026 https://scholar.google.com/scholar?q=Do+Language+Models+Need+Sleep?+Offline+Recurrence+for+Improved+Online+Inference
6. Using Fast Weights to Attend to the Recent Past — Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, Catalin Ionescu, 2016 https://scholar.google.com/scholar?q=Using+Fast+Weights+to+Attend+to+the+Recent+Past
7. Linear Transformers Are Secretly Fast Weight Programmers — Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber, 2021 https://scholar.google.com/scholar?q=Linear+Transformers+Are+Secretly+Fast+Weight+Programmers
8. Fast weight programming and linear transformers: from machine learning to neurobiology — Kazuki Irie, Samuel J. Gershman, 2026 https://scholar.google.com/scholar?q=Fast+weight+programming+and+linear+transformers:+from+machine+learning+to+neurobiology
9. TRELLIS: Learning to Compress Key-Value Memory in Attention Models — Mahdi Karami, Ali Behrouz, Praneeth Kacham, Vahab Mirrokni, 2025 https://scholar.google.com/scholar?q=TRELLIS:+Learning+to+Compress+Key-Value+Memory+in+Attention+Models
10. Gated Delta Networks: Improving Mamba2 with Delta Rule — Songlin Yang, Jan Kautz, Ali Hatamizadeh, 2024 https://scholar.google.com/scholar?q=Gated+Delta+Networks:+Improving+Mamba2+with+Delta+Rule
11. Titans: Learning to Memorize at Test Time — Ali Behrouz, Peilin Zhong, Vahab Mirrokni, 2025 https://scholar.google.com/scholar?q=Titans:+Learning+to+Memorize+at+Test+Time
12. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach — Jonas Geiping, Sean McLeish, Neel Jain, et al., 2025 https://scholar.google.com/scholar?q=Scaling+up+Test-Time+Compute+with+Latent+Reasoning:+A+Recurrent+Depth+Approach
13. In-context Autoencoder for Context Compression in a Large Language Model — Tao Ge, Jing Hu, Lei Wang, Xun Wang, Si-Qing Chen, Furu Wei, 2023 https://scholar.google.com/scholar?q=In-context+Autoencoder+for+Context+Compression+in+a+Large+Language+Model
14. Cartridges: Lightweight and general-purpose long context representations via self-study — Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, et al., 2025 https://scholar.google.com/scholar?q=Cartridges:+Lightweight+and+general-purpose+long+context+representations+via+self-study
15. Repeat After Me: Transformers are Better than State Space Models at Copying — Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach, 2024 https://scholar.google.com/scholar?q=Repeat+After+Me:+Transformers+are+Better+than+State+Space+Models+at+Copying
16. End-to-End Test-Time Training for Long Context — Arnuv Tandon et al., 2025 https://scholar.google.com/scholar?q=End-to-End+Test-Time+Training+for+Long+Context
17. Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs — Rachit Bansal et al., 2025 https://scholar.google.com/scholar?q=Let's+(not)+just+put+things+in+Context:+Test-Time+Training+for+Long-Context+LLMs
18. Test-Time Training Done Right — Tianyuan Zhang et al., 2025 https://scholar.google.com/scholar?q=Test-Time+Training+Done+Right
19. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning — Yu Fu et al., 2024 https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning
20. Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning — Giulio Corallo et al., 2025 https://scholar.google.com/scholar?q=Beyond+RAG:+Task-Aware+KV+Cache+Compression+for+Comprehensive+Knowledge+Reasoning
21. SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning — Sanjay Kariyappa and G. Edward Suh, 2026 https://scholar.google.com/scholar?q=SideQuest:+Model-Driven+KV+Cache+Management+for+Long-Horizon+Agentic+Reasoning
22. Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers — Harsh Kohli et al., 2026 https://scholar.google.com/scholar?q=Loop,+Think,+&+Generalize:+Implicit+Reasoning+in+Recurrent-Depth+Transformers
23. AI Post Transformers: Titans: Learning to Memorize at Test Time — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-20-titans-learning-to-memorize-at-test-time-054662.mp3
24. AI Post Transformers: In-Place Test-Time Training for Transformers — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-in-place-test-time-training-for-transfor-d0b976.mp3
25. AI Post Transformers: Recursive Language Models for Arbitrarily Long Prompts — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-recursive-language-models-for-arbitraril-fbcd1c.mp3
26. AI Post Transformers: Explicit Information Transmission for Context Compression — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-explicit-information-transmission-for-co-24e3c2.mp3
27. AI Post Transformers: KVzip for Query-Agnostic KV Cache Compression — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-29-kvzip-for-query-agnostic-kv-cache-compre-72afe5.mp3
28. AI Post Transformers: Gated Linear Attention for Efficient Long Sequences — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-18-gated-linear-attention-for-efficient-lon-c858ab.mp3
29. AI Post Transformers: MiA-Signature and Global Activation for Long Context — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-13-mia-signature-and-global-activation-for-5ad62f.mp3

Yesterday · 1 h 0 min