AI Post Transformers
This episode explores a post-training method for making mixture-of-experts language models cheaper at inference time without retraining them from scratch. It explains how the paper converts a fully trained static MoE into a dynamic one by adding parameter-free zero experts, allowing some tokens to skip normal experts, and then uses self-distillation to preserve the original model’s behavior under this lower-compute routing scheme. The discussion highlights why this deployment-focused approach matters for real production systems, especially when pretraining, fine-tuning, and alignment are already complete and inference cost is the main bottleneck. Listeners would find it interesting for its clear breakdown of dynamic versus static MoE compute, its practical framing around latency and serving costs, and its focus on whether large post-trained models can cut expert FLOPs substantially without losing capability.

Sources:
1. Post-Trained MoE Can Skip Half Experts via Self-Distillation. Xingtai Lv, Li Sheng, Kaiyan Zhang, Yichen You, Siyan Gao, Xueheng Luo, Yuxin Zuo, Yuchen Fan, Junlin Yang, Ganqu Cui, Bingning Wang, Fan Yang, Youbang Sun, Ning Ding, Bowen Zhou, 2026. http://arxiv.org/abs/2605.18643
2. MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts. Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan, 2024. https://scholar.google.com/scholar?q=MoE++:+Accelerating+Mixture-of-Experts+Methods+with+Zero-Computation+Experts
3. Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models. Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, Hongsheng Li, 2024. https://scholar.google.com/scholar?q=Not+All+Experts+are+Equal:+Efficient+Expert+Pruning+and+Skipping+for+Mixture-of-Experts+Large+Language+Models
4. Task-Specific Expert Pruning for Sparse Mixture-of-Experts. Tianyu Chen, Shaohan Huang, Yuan Xie, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, Furu Wei, 2022. https://scholar.google.com/scholar?q=Task-Specific+Expert+Pruning+for+Sparse+Mixture-of-Experts
5. Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts. DeepSeek-AI et al., 2024. https://scholar.google.com/scholar?q=Auxiliary-Loss-Free+Load+Balancing+Strategy+for+Mixture-of-Experts
6. ST-MoE: Designing Stable and Transferable Sparse Expert Models. Barret Zoph, Noam Shazeer, William Fedus, et al., 2022. https://scholar.google.com/scholar?q=ST-MoE:+Designing+Stable+and+Transferable+Sparse+Expert+Models
7. AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models. Zihao Zeng, Yibo Miao, Hongcheng Gao, Hao Zhang, Zhijie Deng, 2024. https://scholar.google.com/scholar?q=AdaMoE:+Token-Adaptive+Routing+with+Null+Experts+for+Mixture-of-Experts+Language+Models
8. Harder Task Needs More Experts: Dynamic Routing in MoE Models. Quzhe Huang, Zhenwei An, Nan Zhuang, Mingxu Tao, Chen Zhang, Yang Jin, Kun Xu, Liwei Chen, Songfang Huang, Yansong Feng, 2024. https://scholar.google.com/scholar?q=Harder+Task+Needs+More+Experts:+Dynamic+Routing+in+MoE+Models
9. MoE Pathfinder: Trajectory-driven Expert Pruning. Xican Yang, Yuanhe Tian, Yan Song, 2025. https://scholar.google.com/scholar?q=MoE+Pathfinder:+Trajectory-driven+Expert+Pruning
10. Discovering Important Experts for Mixture-of-Experts Models Pruning Through a Theoretical Perspective. Approximate only; title verified, authors not confidently recovered, 2025/2026. https://scholar.google.com/scholar?q=Discovering+Important+Experts+for+Mixture-of-Experts+Models+Pruning+Through+a+Theoretical+Perspective
11. MoEEdit: Efficient and Routing-Stable Knowledge Editing for Mixture-of-Experts LLMs. Yupu Gu, Rongzhe Wei, Andy Zhu, Pan Li, 2026. https://scholar.google.com/scholar?q=MoEEdit:+Efficient+and+Routing-Stable+Knowledge+Editing+for+Mixture-of-Experts+LLMs
12. ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning. Chao Jin, Xinming Wei, Yinmin Zhong, Chengxu Yang, Bingyang Wu, Ruidong Zhu, Zili Zhang, Yuliang Liu, Xin Jin, 2026. https://scholar.google.com/scholar?q=ReLibra:+Routing-Replay-Guided+Load+Balancing+for+MoE+Training+in+Reinforcement+Learning
13. Sparse MoE Students for Efficient Knowledge Distillation. Approximate only; exact author list not confidently recovered, 2025. https://scholar.google.com/scholar?q=Sparse+MoE+Students+for+Efficient+Knowledge+Distillation
14. AI Post Transformers: Batch-Aware Expert Routing for Faster MoE Decoding. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-batch-aware-expert-routing-for-faster-mo-683ab6.mp3
15. AI Post Transformers: Serving MoE Models with Disaggregated Expert Parallelism. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-19-serving-moe-models-with-disaggregated-ex-6979d2.mp3
16. AI Post Transformers: Ministral 3: Cascade Distillation for Long-Context Multimodal Models. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-15-cascade-distillation-for-long-context-mu-0ebd1a.mp3
17. AI Post Transformers: Nemotron 3 Super Hybrid Mamba-Transformer MoE. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-19-nemotron-3-super-hybrid-mamba-transforme-31ac75.mp3
18. AI Post Transformers: LPU Chip for Low-Latency LLM Inference. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-20-lpu-chip-for-low-latency-llm-inference-be13c3.mp3
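
To make the episode summary's central idea concrete, here is a minimal Python sketch of top-k routing where some router slots are zero experts, plus one common form of a distillation loss. It is an illustration under stated assumptions, not the paper's implementation: the names `moe_layer`, `kl_divergence`, and `toy_experts` are hypothetical, the gate weights are plain softmax probabilities over all slots (the paper may normalize differently), and KL divergence is only assumed here as the self-distillation objective.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, experts, router_logits, top_k):
    """Toy MoE layer for a single token.

    Router slots 0..len(experts)-1 map to real experts. Any extra slots are
    hypothetical zero experts: they add nothing to the output and run no code.
    Returns (output vector, number of top-k slots that were skipped).
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    out = [0.0] * len(x)
    skipped = 0
    for i in chosen:
        if i >= len(experts):
            skipped += 1  # zero expert selected: no expert FLOPs spent
            continue
        y = experts[i](x)
        # Assumption: weight each real expert by its raw softmax probability.
        out = [o + probs[i] * v for o, v in zip(out, y)]
    return out, skipped

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student); assumes student probabilities are all positive."""
    return sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs) if p > 0)

# Two toy "real" experts; the router below has four slots, so slots 2 and 3
# act as zero experts.
toy_experts = [lambda x: [2.0 * v for v in x], lambda x: [v + 1.0 for v in x]]

if __name__ == "__main__":
    token = [1.0, 2.0]
    print(moe_layer(token, toy_experts, [3.0, 2.0, 0.0, 0.0], top_k=2))  # both real
    print(moe_layer(token, toy_experts, [3.0, 0.0, 2.0, 0.0], top_k=2))  # one skipped
    print(moe_layer(token, toy_experts, [0.0, 0.0, 3.0, 2.0], top_k=2))  # all skipped
```

In this sketch the per-token compute varies with how many of the top-k slots land on zero experts, which is the static-versus-dynamic distinction the episode describes. A self-distillation setup would then compare the original model's output distribution (teacher) with the zero-expert variant's (student), for example with `kl_divergence`.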