AI Post Transformers
This episode explores how a line of recent systems papers culminates in NanoFlow, a serving approach that breaks LLM inference into very small “nano-batches” so that different GPU-intensive operations can overlap instead of running in strict sequence. It explains the shift from thinking only about memory bottlenecks, such as KV-cache movement and fragmentation, toward a more nuanced claim: even if some kernels are memory-bound, overall serving throughput can still be limited by underused compute when prefill and decode are serialized. The discussion walks through the progression from micro-batching and iteration-level scheduling to chunked prefill, then shows how NanoFlow extends that logic with an auto-searched schedule that jointly chooses nano-batch size, operation ordering, and GPU resource allocation. The episode frames LLM serving not as a single-kernel optimization problem but as a broader question of hardware utilization, scheduling strategy, and the economics of running large models efficiently at scale.

Sources:
1. NanoFlow and the Future of LLM Serving. https://www.usenix.org/system/files/osdi25-zhu-kan.pdf
2. arXiv:2601.11822v1. https://arxiv.org/html/2601.11822v1
3. arXiv:2410.18038v2. https://arxiv.org/html/2410.18038v2
4. https://www.usenix.org/system/files/osdi24-agrawal.pdf
5. arXiv:1811.06965. https://arxiv.org/pdf/1811.06965
6. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Re, 2022. https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness
7. Orca: A Distributed Serving System for Transformer-Based Generative Models. Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun, 2022. https://scholar.google.com/scholar?q=Orca:+A+Distributed+Serving+System+for+Transformer-Based+Generative+Models
8. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, Ramachandran Ramjee, 2024. https://scholar.google.com/scholar?q=Taming+Throughput-Latency+Tradeoff+in+LLM+Inference+with+Sarathi-Serve
9. NanoFlow: Towards Optimal Large Language Model Serving Throughput. Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Ziren Wang, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci, 2025. https://scholar.google.com/scholar?q=NanoFlow:+Towards+Optimal+Large+Language+Model+Serving+Throughput
10. Efficient Memory Management for Large Language Model Serving with PagedAttention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023. https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
11. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 2024. https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving
12. ASPEN: Breaking Operator Barriers for Efficient Parallelization of Deep Neural Networks. Jongseok Park, Kyungmin Bin, Gibum Park, Sangtae Ha, Kyunghan Lee, 2023. https://scholar.google.com/scholar?q=ASPEN:+Breaking+Operator+Barriers+for+Efficient+Parallelization+of+Deep+Neural+Networks
13. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention. Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar, 2024. https://scholar.google.com/scholar?q=vAttention:+Dynamic+Memory+Management+for+Serving+LLMs+without+PagedAttention
14. MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool. Cunchen Hu et al., 2024. https://scholar.google.com/scholar?q=MemServe:+Context+Caching+for+Disaggregated+LLM+Serving+with+Elastic+Memory+Pool
15. DynaServe: Unified and Elastic Execution for Dynamic Disaggregated LLM Serving. Chaoyi Ruan et al., 2025. https://scholar.google.com/scholar?q=DynaServe:+Unified+and+Elastic+Execution+for+Dynamic+Disaggregated+LLM+Serving
16. LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention. Shang Yang et al., 2025. https://scholar.google.com/scholar?q=LServe:+Efficient+Long-sequence+LLM+Serving+with+Unified+Sparse+Attention
17. PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference. Dongjie Yang et al., 2024. https://scholar.google.com/scholar?q=PyramidInfer:+Pyramid+KV+Cache+Compression+for+High-throughput+LLM+Inference
18. ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference. Xiang Liu et al., 2025. https://scholar.google.com/scholar?q=ChunkKV:+Semantic-Preserving+KV+Cache+Compression+for+Efficient+Long-Context+LLM+Inference
19. Inference-Time Hyper-Scaling with KV Cache Compression. Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti, 2025. https://scholar.google.com/scholar?q=Inference-Time+Hyper-Scaling+with+KV+Cache+Compression
20. Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving. Ke Cheng et al., 2024. https://scholar.google.com/scholar?q=Slice-Level+Scheduling+for+High+Throughput+and+Load+Balanced+LLM+Serving
21. Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction. Haoran Qiu et al., 2024. https://scholar.google.com/scholar?q=Efficient+Interactive+LLM+Serving+with+Proxy+Model-based+Sequence+Length+Prediction
22. Deferred Continuous Batching in Resource-Efficient Large Language Model Serving. Yongjun He, Yao Lu, Gustavo Alonso, 2024. https://scholar.google.com/scholar?q=Deferred+Continuous+Batching+in+Resource-Efficient+Large+Language+Model+Serving
23. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3
24. AI Post Transformers: LAPS for Length-Aware LLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-05-laps-for-length-aware-llm-serving-0c6149.mp3
25. AI Post Transformers: Splitwise: Phase-Split LLM Inference. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
26. AI Post Transformers: Continuous Batching for LLM Inference: Throughput and Latency Gains. Hal Turing & Dr. Ada Shannon, 2025. https://podcast.do-not-panic.com/episodes/continuous-batching-for-llm-inference-throughput-and-latency-gains/
27. AI Post Transformers: Speculative Decoding in Real vLLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
28. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
29. AI Post Transformers: CacheFlow and 3D-Parallel KV Cache Restoration. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-01-cacheflow-and-3d-parallel-kv-cache-resto-8db883.mp3
30. AI Post Transformers: ScoutAttention for Efficient KV Cache Offloading. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-24-scoutattention-for-efficient-kv-cache-of-b26699.mp3
31. AI Post Transformers: FlashFuser and Hopper-Era FFN Kernel Fusion. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-15-flashfuser-and-hopper-era-ffn-kernel-fus-e1fce9.mp3

Interactive Visualization: NanoFlow and the Future of LLM Serving [https://podcast.do-not-panic.com/viz/2026-05-15-nanoflow-and-the-future-of-llm-serving-7429c9.html]
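
The episode's central idea, overlapping a compute-heavy operation with a memory-heavy one by splitting a batch into nano-batches, can be illustrated with a toy timing model. This is only a sketch under loud assumptions and is not NanoFlow's actual scheduler: the function names and the millisecond figures are hypothetical, each operation's time is assumed to scale linearly with batch size, and overlapping operations are assumed not to slow each other down.

```python
# Toy two-stage pipeline model (hypothetical; not taken from the NanoFlow paper).
# Stage 1 is a compute-bound operation, stage 2 is a memory-bound operation.
# Assumptions: op time scales linearly with batch size, and the two stages
# run on separate resources without interfering with each other.

def serial_time(compute_ms: float, memory_ms: float) -> float:
    """Whole batch runs the compute op, then the memory op, back to back."""
    return compute_ms + memory_ms


def overlapped_time(compute_ms: float, memory_ms: float, n: int) -> float:
    """Split the batch into n nano-batches and pipeline the two stages."""
    c = compute_ms / n          # compute time per nano-batch
    m = memory_ms / n           # memory time per nano-batch
    compute_free = 0.0          # time at which the compute resource is idle
    memory_free = 0.0           # time at which the memory resource is idle
    for _ in range(n):
        compute_free += c       # next nano-batch finishes its compute op
        # Its memory op starts once both its input and the resource are ready.
        memory_free = max(memory_free, compute_free) + m
    return memory_free


if __name__ == "__main__":
    print(serial_time(8.0, 4.0))          # 12.0
    print(overlapped_time(8.0, 4.0, 4))   # 9.0
```

In this idealized model more nano-batches always help. On real hardware, smaller batches tend to run less efficiently and concurrent kernels compete for GPU resources, which is consistent with the episode's point that nano-batch size, operation ordering, and resource allocation have to be chosen jointly.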
663 episodes