AI Post Transformers

Post-Trained MoE Skips Half Its Experts

1 h 0 min · June 1, 2026

Description

This episode explores a post-training method for making mixture-of-experts language models cheaper at inference time without retraining them from scratch. It explains how the paper converts a fully trained static MoE into a dynamic one by adding parameter-free zero experts, allowing some tokens to skip normal experts, and then uses self-distillation to preserve the original model’s behavior under this lower-compute routing scheme. The discussion highlights why this deployment-focused approach matters for real production systems, especially when pretraining, fine-tuning, and alignment are already complete and inference cost is the main bottleneck. Listeners would find it interesting for its clear breakdown of dynamic versus static MoE compute, its practical framing around latency and serving costs, and its focus on whether large post-trained models can cut expert FLOPs substantially without losing capability.

Sources:
1. Post-Trained MoE Can Skip Half Experts via Self-Distillation — Xingtai Lv, Li Sheng, Kaiyan Zhang, Yichen You, Siyan Gao, Xueheng Luo, Yuxin Zuo, Yuchen Fan, Junlin Yang, Ganqu Cui, Bingning Wang, Fan Yang, Youbang Sun, Ning Ding, Bowen Zhou, 2026 http://arxiv.org/abs/2605.18643
2. MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts — Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan, 2024 https://scholar.google.com/scholar?q=MoE++:+Accelerating+Mixture-of-Experts+Methods+with+Zero-Computation+Experts
3. Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models — Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, Hongsheng Li, 2024 https://scholar.google.com/scholar?q=Not+All+Experts+are+Equal:+Efficient+Expert+Pruning+and+Skipping+for+Mixture-of-Experts+Large+Language+Models
4. Task-Specific Expert Pruning for Sparse Mixture-of-Experts — Tianyu Chen, Shaohan Huang, Yuan Xie, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, Furu Wei, 2022 https://scholar.google.com/scholar?q=Task-Specific+Expert+Pruning+for+Sparse+Mixture-of-Experts
5. Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts — DeepSeek-AI et al., 2024 https://scholar.google.com/scholar?q=Auxiliary-Loss-Free+Load+Balancing+Strategy+for+Mixture-of-Experts
6. ST-MoE: Designing Stable and Transferable Sparse Expert Models — Barret Zoph, Noam Shazeer, William Fedus, et al., 2022 https://scholar.google.com/scholar?q=ST-MoE:+Designing+Stable+and+Transferable+Sparse+Expert+Models
7. AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models — Zihao Zeng, Yibo Miao, Hongcheng Gao, Hao Zhang, Zhijie Deng, 2024 https://scholar.google.com/scholar?q=AdaMoE:+Token-Adaptive+Routing+with+Null+Experts+for+Mixture-of-Experts+Language+Models
8. Harder Task Needs More Experts: Dynamic Routing in MoE Models — Quzhe Huang, Zhenwei An, Nan Zhuang, Mingxu Tao, Chen Zhang, Yang Jin, Kun Xu, Liwei Chen, Songfang Huang, Yansong Feng, 2024 https://scholar.google.com/scholar?q=Harder+Task+Needs+More+Experts:+Dynamic+Routing+in+MoE+Models
9. MoE Pathfinder: Trajectory-driven Expert Pruning — Xican Yang, Yuanhe Tian, Yan Song, 2025 https://scholar.google.com/scholar?q=MoE+Pathfinder:+Trajectory-driven+Expert+Pruning
10. Discovering Important Experts for Mixture-of-Experts Models Pruning Through a Theoretical Perspective — approximate only; title verified, authors not confidently recovered, 2025/2026 https://scholar.google.com/scholar?q=Discovering+Important+Experts+for+Mixture-of-Experts+Models+Pruning+Through+a+Theoretical+Perspective
11. MoEEdit: Efficient and Routing-Stable Knowledge Editing for Mixture-of-Experts LLMs — Yupu Gu, Rongzhe Wei, Andy Zhu, Pan Li, 2026 https://scholar.google.com/scholar?q=MoEEdit:+Efficient+and+Routing-Stable+Knowledge+Editing+for+Mixture-of-Experts+LLMs
12. ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning — Chao Jin, Xinming Wei, Yinmin Zhong, Chengxu Yang, Bingyang Wu, Ruidong Zhu, Zili Zhang, Yuliang Liu, Xin Jin, 2026 https://scholar.google.com/scholar?q=ReLibra:+Routing-Replay-Guided+Load+Balancing+for+MoE+Training+in+Reinforcement+Learning
13. Sparse MoE Students for Efficient Knowledge Distillation — approximate only; exact author list not confidently recovered, 2025 https://scholar.google.com/scholar?q=Sparse+MoE+Students+for+Efficient+Knowledge+Distillation
14. AI Post Transformers: Batch-Aware Expert Routing for Faster MoE Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-batch-aware-expert-routing-for-faster-mo-683ab6.mp3
15. AI Post Transformers: Serving MoE Models with Disaggregated Expert Parallelism — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-19-serving-moe-models-with-disaggregated-ex-6979d2.mp3
16. AI Post Transformers: Ministral 3: Cascade Distillation for Long-Context Multimodal Models — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-cascade-distillation-for-long-context-mu-0ebd1a.mp3
17. AI Post Transformers: Nemotron 3 Super Hybrid Mamba-Transformer MoE — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-19-nemotron-3-super-hybrid-mamba-transforme-31ac75.mp3
18. AI Post Transformers: LPU Chip for Low-Latency LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-20-lpu-chip-for-low-latency-llm-inference-be13c3.mp3
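
To make the zero-expert idea in the description more concrete, here is a minimal plain-Python sketch. It assumes a top-k router whose candidate list is extended with parameter-free zero experts that contribute nothing to the output. All names (`route_token`, `NUM_ZERO`, the toy experts and logits) are hypothetical and are not taken from the paper, whose actual routing rule may differ. The self-distillation step is not shown.

```python
import math

# Hypothetical toy setup: 4 real experts plus 2 parameter-free zero experts.
NUM_REAL = 4
NUM_ZERO = 2
TOP_K = 2


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def real_expert(index, x):
    """Stand-in for an expert FFN: scales the input by (index + 1)."""
    return [(index + 1) * v for v in x]


def route_token(x, router_logits, top_k=TOP_K):
    """Return (output, number_of_real_experts_run) for one token.

    router_logits has NUM_REAL + NUM_ZERO entries. Indices >= NUM_REAL
    are zero experts: they add a zero vector and trigger no expert compute.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    out = [0.0] * len(x)
    real_runs = 0
    for i in ranked[:top_k]:
        if i >= NUM_REAL:
            continue  # zero expert selected: skip computation entirely
        real_runs += 1
        y = real_expert(i, x)
        out = [o + probs[i] * v for o, v in zip(out, y)]
    return out, real_runs


# A token whose router prefers the zero experts runs no real expert.
easy_out, easy_runs = route_token([1.0, 2.0], [0.1, 0.2, 0.0, 0.1, 3.0, 2.5])
# A token whose router prefers real experts pays for both of them.
hard_out, hard_runs = route_token([1.0, 2.0], [3.0, 2.5, 0.0, 0.1, 0.2, 0.1])
print(easy_runs, hard_runs)
```

In this toy setup the per-token compute varies with how many of the selected slots land on zero experts, which is the sense in which the routing becomes dynamic rather than static.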

All episodes

672 episodes

Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode

This episode explores why batch-1 LLM decode for robots, edge copilots, and other single-session agents behaves very differently from high-throughput serving, and why next-token latency cannot be explained by memory bandwidth alone. It breaks down the paper’s main test: compare real decode time against an analytic memory floor based on model-weight and KV-cache traffic, then run that across Qwen-2.5-7B, Mistral-7B-v0.3, and Llama-3.1-8B on L4, L40S, A100, and H100 GPUs over contexts from 2048 to 16384. The discussion argues that because these models already use grouped-query attention to cut KV traffic, the remaining latency gap is driven by runtime details such as CUDA Graphs, launch overhead, kernel quality, and whether quantization actually helps in this tiny decode regime. Listeners would find it interesting because it challenges the simple idea that buying a faster-memory GPU automatically lowers token latency, especially for physical AI systems where one delayed token can stall the whole interaction.

Sources:
1. Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode https://arxiv.org/pdf/2605.30571
2. Orca: A Distributed Serving System for Transformer-Based Generative Models — Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun, 2022 https://scholar.google.com/scholar?q=Orca:+A+Distributed+Serving+System+for+Transformer-Based+Generative+Models
3. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
4. Splitwise: Efficient Generative LLM Inference Using Phase Splitting — Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Inigo Goiri, Saeed Maleki, Ricardo Bianchini, 2024 https://scholar.google.com/scholar?q=Splitwise:+Efficient+Generative+LLM+Inference+Using+Phase+Splitting
5. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving — Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2024 https://scholar.google.com/scholar?q=Mooncake:+A+KVCache-centric+Disaggregated+Architecture+for+LLM+Serving
6. Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs — Jonah Ekelund, Stefano Markidis, Ivy Peng, 2025 https://scholar.google.com/scholar?q=Boosting+Performance+of+Iterative+Applications+on+GPUs:+Kernel+Batching+with+CUDA+Graphs
7. PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch — Abhishek Ghosh, Ajay Nayak, Ashish Panwar, Arkaprava Basu, 2025 https://scholar.google.com/scholar?q=PyGraph:+Robust+Compiler+Support+for+CUDA+Graphs+in+PyTorch
8. Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start — Xueshen Liu, Yongji Wu, Yuncheng Yao, Danyang Zhuo, Ion Stoica, Z. Morley Mao, 2026 https://scholar.google.com/scholar?q=Foundry:+Template-Based+CUDA+Graph+Context+Materialization+for+Fast+LLM+Serving+Cold+Start
9. Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode — Josef Chen, 2026 https://scholar.google.com/scholar?q=Memory-Bound+but+Not+Bandwidth-Limited:+The+Physical+AI+Inference+Gap+in+Batch-1+LLM+Decode
10. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints — Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, Sumit Sanghai, 2023 https://scholar.google.com/scholar?q=GQA:+Training+Generalized+Multi-Query+Transformer+Models+from+Multi-Head+Checkpoints
11. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh, 2022 https://scholar.google.com/scholar?q=GPTQ:+Accurate+Post-Training+Quantization+for+Generative+Pre-trained+Transformers
12. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Ji Lin et al., 2023 https://scholar.google.com/scholar?q=AWQ:+Activation-aware+Weight+Quantization+for+LLM+Compression+and+Acceleration
13. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision — Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, 2024 https://scholar.google.com/scholar?q=FlashAttention-3:+Fast+and+Accurate+Attention+with+Asynchrony+and+Low-precision
14. FlashDecoding++: Faster Large Language Model Inference on GPUs — Ke Hong et al., 2023 https://scholar.google.com/scholar?q=FlashDecoding++:+Faster+Large+Language+Model+Inference+on+GPUs
15. Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference — Pol G. Recasens et al., 2025 https://scholar.google.com/scholar?q=Mind+the+Memory+Gap:+Unveiling+GPU+Bottlenecks+in+Large-Batch+LLM+Inference
16. Challenges and Research Directions for Large Language Model Inference Hardware — Xiaoyu Ma, David Patterson, 2026 https://scholar.google.com/scholar?q=Challenges+and+Research+Directions+for+Large+Language+Model+Inference+Hardware
17. Medusa: Accelerating Serverless LLM Inference with Materialization — Shaoxun Zeng et al., 2025 https://scholar.google.com/scholar?q=Medusa:+Accelerating+Serverless+LLM+Inference+with+Materialization
18. Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference — Divakar Kumar Yadav and Tian Zhao, 2026 https://scholar.google.com/scholar?q=Hybrid+JIT-CUDA+Graph+Optimization+for+Low-Latency+Large+Language+Model+Inference
19. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration — Ji Lin et al., 2024 https://scholar.google.com/scholar?q=AWQ:+Activation-aware+Weight+Quantization+for+On-Device+LLM+Compression+and+Acceleration
20. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression — Tim Dettmers et al., 2024 https://scholar.google.com/scholar?q=SpQR:+A+Sparse-Quantized+Representation+for+Near-Lossless+LLM+Weight+Compression
21. Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs — Sayed Pedram Haeri Boroujeni et al., 2026 https://scholar.google.com/scholar?q=Don't+Waste+Bits!+Adaptive+KV-Cache+Quantization+for+Lightweight+On-Device+LLMs
22. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — Jingbo Yang et al., 2025 https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
23. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference — Yuhan Liu, Yihua Cheng et al., 2025 https://scholar.google.com/scholar?q=LMCache:+An+Efficient+KV+Cache+Layer+for+Enterprise-Scale+LLM+Inference
24. Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference — Kexin Chu et al., 2025 https://scholar.google.com/scholar?q=Selective+KV-Cache+Sharing+to+Mitigate+Timing+Side-Channels+in+LLM+Inference
25. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3
26. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
27. AI Post Transformers: LPU Chip for Low-Latency LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-20-lpu-chip-for-low-latency-llm-inference-be13c3.mp3
28. AI Post Transformers: CXL Computational Memory Offloading for Lower Runtime — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-cxl-computational-memory-offloading-for-3b2124.mp3
29. AI Post Transformers: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-computation-bandwidth-memory-trade-offs-a83f2b.mp3
30. AI Post Transformers: Serving MoE Models with Disaggregated Expert Parallelism — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-19-serving-moe-models-with-disaggregated-ex-6979d2.mp3
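
The comparison described above, measured decode time against an analytic memory floor, can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes the floor is simply the bytes read per token (model weights plus KV cache) divided by peak memory bandwidth. That formula, the function names, and every number used are illustrative assumptions, not definitions or measurements taken from the paper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Bytes of KV cache read per decoded token (assumed: whole cache is read).

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem


def memory_floor_ms(weight_bytes, kv_bytes, bandwidth_bytes_per_s):
    """Lower bound on per-token latency if bandwidth were the only limit."""
    return (weight_bytes + kv_bytes) / bandwidth_bytes_per_s * 1e3


# Hypothetical model: 8e9 parameters in 16-bit precision, grouped-query
# attention with 8 KV heads of dimension 128 across 32 layers.
weights = 8_000_000_000 * 2
kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_len=4096)

# Hypothetical GPU with 1 TB/s of peak memory bandwidth.
floor = memory_floor_ms(weights, kv, bandwidth_bytes_per_s=1e12)

# Hypothetical measured decode latency, used only to show the gap ratio.
measured_ms = 25.0
gap = measured_ms / floor
print(f"KV cache: {kv} bytes, floor: {floor:.3f} ms, gap: {gap:.2f}x")
```

With these made-up numbers the KV cache is small next to the weights, which is consistent with the episode's point that grouped-query attention already trims KV traffic. A gap ratio well above 1 is what would point toward runtime overheads rather than bandwidth.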

Yesterday · 1 h 0 min

TRELLIS and Bounded-Memory Transformer KV Compression

This episode explores TRELLIS, a bounded-memory transformer architecture that replaces the usual ever-growing key-value cache with a fixed set of learned memory slots that are rewritten during inference. It explains why long-context serving is constrained less by training-time quadratic attention than by the linear growth, latency, and fragility of KV caches, and situates TRELLIS in the progression from Transformer-XL and Compressive Transformers to ABC and GSA. The discussion highlights TRELLIS’s central idea: treating memory as fast weights for a small online regression layer, updating that memory with test-time gradient descent and state decay so the model can reconstruct useful representations while learning what to forget. Listeners would find it interesting because it connects deployment pain points in modern LLMs to a concrete alternative architecture that aims to preserve quality even as context grows while memory stays fixed.

Sources:
1. TRELLIS and Bounded-Memory Transformer KV Compression https://arxiv.org/pdf/2512.23852
2. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context — Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov, 2019 https://scholar.google.com/scholar?q=Transformer-XL:+Attentive+Language+Models+Beyond+a+Fixed-Length+Context
3. Compressive Transformers for Long-Range Sequence Modelling — Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Timothy P. Lillicrap, 2020 https://scholar.google.com/scholar?q=Compressive+Transformers+for+Long-Range+Sequence+Modelling
4. Recurrent Memory Transformer — Aydar Bulatov, Yury Kuratov, Mikhail Burtsev, 2022 https://scholar.google.com/scholar?q=Recurrent+Memory+Transformer
5. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention — Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal, 2024 https://scholar.google.com/scholar?q=Leave+No+Context+Behind:+Efficient+Infinite+Context+Transformers+with+Infini-attention
6. ABC: Attention with Bounded-Memory Control — Hao Peng et al., 2021 https://scholar.google.com/scholar?q=ABC:+Attention+with+Bounded-Memory+Control
7. Gated Slot Attention for Efficient Linear-Time Sequence Modeling — Yu Zhang et al., 2024 https://scholar.google.com/scholar?q=Gated+Slot+Attention+for+Efficient+Linear-Time+Sequence+Modeling
8. Learning to (Learn at Test Time): RNNs with Expressive Hidden States — Yu Sun et al., 2024 https://scholar.google.com/scholar?q=Learning+to+(Learn+at+Test+Time):+RNNs+with+Expressive+Hidden+States
9. Lattice: Learning to Efficiently Compress the Memory — Mahdi Karami, Razvan Pascanu, Vahab Mirrokni, 2025 https://scholar.google.com/scholar?q=Lattice:+Learning+to+Efficiently+Compress+the+Memory
10. You Only Cache Once: Decoder-Decoder Architectures for Language Models — Yutao Sun et al., 2024 https://scholar.google.com/scholar?q=You+Only+Cache+Once:+Decoder-Decoder+Architectures+for+Language+Models
11. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — Jingbo Yang et al., 2025 https://arxiv.org/abs/2502.16002
12. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference — Yihua Cheng et al., 2025 https://arxiv.org/abs/2510.09665
13. Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference — Kexin Chu et al., 2025 https://arxiv.org/abs/2508.08438
14. SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning — Huanxuan Liao et al., 2025 https://arxiv.org/abs/2508.15212
15. Test-Time Training Provably Improves Transformers as In-context Learners — Halil Alperen Gozeten et al., 2025 https://arxiv.org/abs/2503.11842
16. Linearizing Vision Transformer with Test-Time Training — Yining Li et al., 2026 https://arxiv.org/abs/2605.02772
17. AI Post Transformers: Titans: Learning to Memorize at Test Time — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-20-titans-learning-to-memorize-at-test-time-054662.mp3
18. AI Post Transformers: Explicit Information Transmission for Context Compression — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-explicit-information-transmission-for-co-24e3c2.mp3
19. AI Post Transformers: Gated Linear Attention for Efficient Long Sequences — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-18-gated-linear-attention-for-efficient-lon-c858ab.mp3
20. AI Post Transformers: In-Place Test-Time Training for Transformers — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-in-place-test-time-training-for-transfor-d0b976.mp3
21. AI Post Transformers: Parallelizing DeltaNet Linear Transformers over Sequence Length — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-18-parallelizing-deltanet-linear-transforme-2d0377.mp3
22. AI Post Transformers: Long Context Pre-Training with Lighthouse Attention — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-13-long-context-pre-training-with-lighthous-e85bbe.mp3
23. AI Post Transformers: Compressed Convolutional Attention in Latent Space — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-25-compressed-convolutional-attention-in-la-61e1cf.mp3

Interactive Visualization: TRELLIS and Bounded-Memory Transformer KV Compression [https://podcast.do-not-panic.com/viz/2026-06-02-trellis-and-bounded-memory-transformer-k-81f237.html]
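
The "memory as fast weights" idea in the description can be sketched with a tiny online regression layer. The code below is a generic illustration, not TRELLIS's actual update rule: it assumes a fixed-size matrix memory, a squared reconstruction loss, one gradient step per incoming key-value pair, and a scalar decay applied to the old state. The names `update_memory`, `lr`, and `decay` are hypothetical.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def update_memory(M, key, value, lr=0.5, decay=0.9):
    """One test-time step on the loss 0.5 * ||M @ key - value||^2.

    The gradient with respect to M is outer(M @ key - value, key).
    The old state is shrunk by `decay` before the gradient step, so
    stale associations fade while the matrix size never grows.
    """
    pred = matvec(M, key)
    err = [p - v for p, v in zip(pred, value)]
    return [
        [decay * m - lr * e * k for m, k in zip(row, key)]
        for row, e in zip(M, err)
    ]


def recon_error(M, key, value):
    """Euclidean distance between the memory's readout and the target."""
    pred = matvec(M, key)
    return sum((p - v) ** 2 for p, v in zip(pred, value)) ** 0.5


# Fixed 2x2 memory, regardless of how many tokens are written into it.
memory = [[0.0, 0.0], [0.0, 0.0]]
key, value = [1.0, 0.0], [2.0, -1.0]

initial_error = recon_error(memory, key, value)
for _ in range(10):
    memory = update_memory(memory, key, value)
final_error = recon_error(memory, key, value)
print(initial_error, final_error)
```

Because of the decay term, the readout in this toy example settles near the target without matching it exactly, which is one simple way to see the trade-off between retaining and forgetting.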

Yesterday · 1 h 0 min

SmolLM2 and the Power of Better Data

This episode explores SmolLM2, a 1.7 billion parameter language model from Hugging Face that tries to compete with stronger small models not by changing the transformer architecture, but by radically improving the training data mix and sequencing across roughly 11 trillion tokens. It explains the distinction between pretraining and instruction tuning, then argues that for compact models, dataset quality and curriculum can function almost like part of the architecture itself. The discussion connects SmolLM2 to earlier work such as Chinchilla, TinyStories, Textbooks Are All You Need, FineWeb-Edu, and DataComp-LM to show why educational web text, curated math and code data, and staged rebalancing matter so much when model capacity is tight. Listeners would find it interesting because it frames a practical question with real deployment stakes: whether careful data design can make smaller, cheaper, lower-latency models genuinely useful without relying on giant-scale compute.

Sources:
1. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model — Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf, 2025 http://arxiv.org/abs/2502.02737
2. Training Compute-Optimal Large Language Models — Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Nalisnick, Daniel Yamins, Timothy Lillicrap, Oriol Vinyals, Jeff Dean, et al., 2022 https://scholar.google.com/scholar?q=Training+Compute-Optimal+Large+Language+Models
3. TinyStories: How Small Can Language Models Be and Still Speak Coherent English? — Ronen Eldan, Yuanzhi Li, 2023 https://scholar.google.com/scholar?q=TinyStories:+How+Small+Can+Language+Models+Be+and+Still+Speak+Coherent+English?
4. Textbooks Are All You Need — Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio C. T. Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sebastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, Yuanzhi Li, 2023 https://scholar.google.com/scholar?q=Textbooks+Are+All+You+Need
5. MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases — Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra, 2024 https://scholar.google.com/scholar?q=MobileLLM:+Optimizing+Sub-billion+Parameter+Language+Models+for+On-Device+Use+Cases
6. Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research — Luca Soldaini, Rodney Kinney, Dustin Schwenk, Siddharth Goyal, Alessandro Sordoni, Kyle Lo, Noah A. Smith, and collaborators, 2024 https://scholar.google.com/scholar?q=Dolma:+an+Open+Corpus+of+Three+Trillion+Tokens+for+Language+Model+Pretraining+Research
7. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale — Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, Thomas Wolf, 2024 https://scholar.google.com/scholar?q=The+FineWeb+Datasets:+Decanting+the+Web+for+the+Finest+Text+Data+at+Scale
8. Data-Centric AI in the Age of Large Language Models — Xinyi Xu, Zhaoxuan Wu, Rui Qiao, Arun Verma, Yao Shu, Jingtan Wang, Xinyuan Niu, Zhenfeng He, Jiangwei Chen, Zijian Zhou, Gregory Kang Ruey Lau, Hieu Dao, Lucas Agussurja, Rachael Hwee Ling Sim, Xiaoqiang Lin, Wenyang Hu, Zhongxiang Dai, Pang Wei Koh, Bryan Kian Hsiang Low, 2024 https://scholar.google.com/scholar?q=Data-Centric+AI+in+the+Age+of+Large+Language+Models
9. The Stack: 3 TB of permissively licensed source code — Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries, 2022 https://scholar.google.com/scholar?q=The+Stack:+3+TB+of+permissively+licensed+source+code
10. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations — Ning Ding, Yulin Chen, Bokai Xu, et al., 2023 https://scholar.google.com/scholar?q=Enhancing+Chat+Language+Models+by+Scaling+High-quality+Instructional+Conversations
11. OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data — Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, Igor Gitman, 2024 https://scholar.google.com/scholar?q=OpenMathInstruct-2:+Accelerating+AI+for+Math+with+Massive+Open-Source+Instruction+Data
12. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model — Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf, 2025 https://scholar.google.com/scholar?q=SmolLM2:+When+Smol+Goes+Big+--+Data-Centric+Training+of+a+Small+Language+Model
13. DataComp-LM: In search of the next generation of training sets for language models — Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, and many others, 2024 https://scholar.google.com/scholar?q=DataComp-LM:+In+search+of+the+next+generation+of+training+sets+for+language+models
14. OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text — Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, Jimmy Ba, 2023 https://scholar.google.com/scholar?q=OpenWebMath:+An+Open+Dataset+of+High-Quality+Mathematical+Web+Text
15. InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning — Xiaotian Han, Yiren Jian, Xuefeng Hu, Haogeng Liu, Yiqi Wang, Qihang Fan, Yuang Ai, Huaibo Huang, Ran He, Zhenheng Yang, Quanzeng You, 2024 https://scholar.google.com/scholar?q=InfiMM-WebMath-40B:+Advancing+Multimodal+Pre-Training+for+Enhanced+Mathematical+Reasoning
16. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo, 2024 https://scholar.google.com/scholar?q=DeepSeekMath:+Pushing+the+Limits+of+Mathematical+Reasoning+in+Open+Language+Models
17. 2 OLMo 2 Furious — Kyle Lo and the OLMo team, 2025 https://scholar.google.com/scholar?q=2+OLMo+2+Furious
18. Revisiting Scaling Laws for Language Models: The Role of Data Quality and Training Strategies — Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, Jingang Wang, 2025 https://scholar.google.com/scholar?q=Revisiting+Scaling+Laws+for+Language+Models:+The+Role+of+Data+Quality+and+Training+Strategies
19. GRAPE: Optimize Data Mixture for Group Robust Multi-target Adaptive Pretraining — Simin Fan, Maria Ios Glarou, Martin Jaggi, 2025 https://scholar.google.com/scholar?q=GRAPE:+Optimize+Data+Mixture+for+Group+Robust+Multi-target+Adaptive+Pretraining
20. Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models — Lior Belenki, Alekh Agarwal, Tianze Shi, Kristina Toutanova, 2025 https://scholar.google.com/scholar?q=Optimizing+Pre-Training+Data+Mixtures+with+Mixtures+of+Data+Expert+Models
21. Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies — Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong, 2024 https://scholar.google.com/scholar?q=Scaling+Laws+with+Vocabulary:+Larger+Models+Deserve+Larger+Vocabularies
22. Distilling Reasoning Capabilities into Smaller Language Models — Kumar Shridhar, Alessandro Stolfo, Mrinmaya Sachan, 2023 https://scholar.google.com/scholar?q=Distilling+Reasoning+Capabilities+into+Smaller+Language+Models
23. Teaching Small Language Models Reasoning through Counterfactual Distillation — Tao Feng, Yicheng Li, Chenglin Li, Hao Chen, Fei Yu, Yin Zhang, 2024 https://scholar.google.com/scholar?q=Teaching+Small+Language+Models+Reasoning+through+Counterfactual+Distillation
24. AI Post Transformers: Self-Improving Pretraining With Post-Trained Models — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-02-self-improving-pretraining-with-post-tra-e37460.mp3
25. AI Post Transformers: Scaling Laws for Multilingual Code Pretraining — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-scaling-laws-for-multilingual-code-pretr-7d220e.mp3
26. AI Post Transformers: Can Models Learn from Long Context? — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-can-models-learn-from-long-context-77533e.mp3
27. AI Post Transformers: ASI-Evolve for Data, Architectures, and RL — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-05-asi-evolve-for-data-architectures-and-rl-197b2b.mp3
28. AI Post Transformers: Muon Is Scalable for LLM Training — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-25-muon-is-scalable-for-llm-training-587ed8.mp3

June 1, 2026 · 1 h 0 min

Dragonfly Topology for Scalable AI Networks

This episode explores the 2008 Dragonfly network topology paper and why its ideas suddenly matter again for large-scale AI systems in 2026. It explains how Dragonfly uses high-radix routers and router groups to keep most traffic to a local hop, a single global hop, and another local hop, reducing the number of expensive long-distance optical links compared with flattened butterfly and folded Clos designs. The discussion highlights the paper’s core argument that topology and routing must be co-designed around pin bandwidth, cable cost, power, and congestion, with the authors claiming roughly 20 percent lower cost than flattened butterfly and 52 percent lower cost than folded Clos beyond 16K nodes under their assumptions. Listeners would find it interesting because it connects an old supercomputing interconnect idea to modern TPU fabrics, mixture-of-experts traffic, all-to-all communication, and the growing reality that network design now directly shapes AI system performance.

Sources:
1. Dragonfly Topology for Scalable AI Networks https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/34926.pdf
2. Technology-Driven, Highly-Scalable Dragonfly Topology — John Kim, William J. Dally, Steve Scott, Dennis Abts, 2008 https://scholar.google.com/scholar?q=Technology-Driven,+Highly-Scalable+Dragonfly+Topology
3. Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks — John Kim, William J. Dally, Dennis Abts, 2007 https://scholar.google.com/scholar?q=Flattened+Butterfly:+A+Cost-Efficient+Topology+for+High-Radix+Networks
4. Topological Characterization of Hamming and Dragonfly Networks and Its Implications on Routing — Cristobal Camarero, Enrique Vallejo, Ramon Beivide, 2014 https://scholar.google.com/scholar?q=Topological+Characterization+of+Hamming+and+Dragonfly+Networks+and+Its+Implications+on+Routing
5. Slim Fly: A Cost Effective Low-Diameter Network Topology — Maciej Besta, Torsten Hoefler, 2014 https://scholar.google.com/scholar?q=Slim+Fly:+A+Cost+Effective+Low-Diameter+Network+Topology
6. Microarchitecture of a High-Radix Router — John Kim, William J. Dally, Brian Towles, Amit K. Gupta, 2005 https://scholar.google.com/scholar?q=Microarchitecture+of+a+High-Radix+Router
7. The BlackWidow High-Radix Clos Network — Steve Scott, Dennis Abts, John Kim, William J. Dally, 2006 https://scholar.google.com/scholar?q=The+BlackWidow+High-Radix+Clos+Network
8. Scalable High-Radix Router Microarchitecture Using a Network Switch Organization — Jung Ho Ahn, Young Hoon Son, John Kim, 2013 https://scholar.google.com/scholar?q=Scalable+High-Radix+Router+Microarchitecture+Using+a+Network+Switch+Organization
9. A Scheme for Fast Parallel Communication — L. G. Valiant, 1982 https://scholar.google.com/scholar?q=A+Scheme+for+Fast+Parallel+Communication
10. Indirect Adaptive Routing on Large Scale Interconnection Networks — Nan Jiang, John Kim, William J. Dally, 2009 https://scholar.google.com/scholar?q=Indirect+Adaptive+Routing+on+Large+Scale+Interconnection+Networks
11. Rationale and Challenges for Optical Interconnects to Electronic Chips — David A. B. Miller, 2000 https://scholar.google.com/scholar?q=Rationale+and+Challenges+for+Optical+Interconnects+to+Electronic+Chips
12. Optical Interconnects for High-Performance Computing — Marc A. Taubenblatt, 2012 https://scholar.google.com/scholar?q=Optical+Interconnects+for+High-Performance+Computing
13. Optical Interconnects for Extreme Scale Computing Systems — Sebastien Rumley, Meisam Bahadori, Robert Polster, Simon D. Hammond, David M. Calhoun, Ke Wen, Arun Rodrigues, Keren Bergman, 2017 https://scholar.google.com/scholar?q=Optical+Interconnects+for+Extreme+Scale+Computing+Systems
14. Mission Apollo: Landing Optical Circuit Switching at Datacenter Scale — Ryohei Urata, Hong Liu, Kevin Yasumura, Erji Mao, Jill Berger, Xiang Zhou, Cedric Lam, Roy Bannon, Darren Hutchinson, Daniel Nelson, Leon Poutievski, Arjun Singh, Joon Ong, Amin Vahdat, 2022 https://scholar.google.com/scholar?q=Mission+Apollo:+Landing+Optical+Circuit+Switching+at+Datacenter+Scale
15. Adaptive Routing in High-Radix Clos Network — John Kim, William J. Dally, Dennis Abts, 2006 https://doi.org/10.1145/1188455.1188552
16. Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies — Prithwish Basu, Liangyu Zhao, Jason Fantl, Siddharth Pal, Arvind Krishnamurthy, Joud Khoury, 2024 https://doi.org/10.1145/3625549.3658656
17. Toward lower-diameter large-scale HPC and data center networks with co-packaged optics — Pavlos Maniotis, Laurent Schares, Benjamin G. Lee, Marc A. Taubenblatt, Daniel M. Kuchta, 2021 https://scholar.google.com/scholar?q=Toward+lower-diameter+large-scale+HPC+and+data+center+networks+with+co-packaged+optics
18. Toward higher-radix switches with co-packaged optics for improved network locality in data center and HPC networks [Invited] — Pavlos Maniotis, Laurent Schares, Daniel M. Kuchta, Bengi Karacali, 2022 https://scholar.google.com/scholar?q=Toward+higher-radix+switches+with+co-packaged+optics+for+improved+network+locality+in+data+center+and+HPC+networks+[Invited]
19. Exploring the benefits of using co-packaged optics in data center and AI supercomputer networks: a simulation-based analysis [Invited] — Pavlos Maniotis, Daniel M. Kuchta, 2024 https://scholar.google.com/scholar?q=Exploring+the+benefits+of+using+co-packaged+optics+in+data+center+and+AI+supercomputer+networks:+a+simulation-based+analysis+[Invited]
20. Enhanced UGAL Routing Schemes for Dragonfly Networks — Ram Sharan Chaulagain, Xin Yuan, 2024 https://scholar.google.com/scholar?q=Enhanced+UGAL+Routing+Schemes+for+Dragonfly+Networks
21. On Selection Functions in Adaptive Routing — Alejandro Cano, Cristobal Camarero, Carmen Martinez, 2025 https://scholar.google.com/scholar?q=On+Selection+Functions+in+Adaptive+Routing
22. Co-packaged optics (CPO): status, challenges, and solutions — Min Tan and coauthors, 2023 https://scholar.google.com/scholar?q=Co-packaged+optics+(CPO):+status,+challenges,+and+solutions
23. AI Post Transformers: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-computation-bandwidth-memory-trade-offs-a83f2b.mp3
24. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
25. AI Post Transformers: Serving MoE Models with Disaggregated Expert Parallelism — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-19-serving-moe-models-with-disaggregated-ex-6979d2.mp3
26. AI Post Transformers: Lossless Sparse Deltas for RL Networks — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-lossless-sparse-deltas-for-rl-networks-84d676.mp3

June 1, 2026 · 1 h 0 min