AI Post Transformers
This episode explores why batch-1 LLM decode for robots, edge copilots, and other single-session agents behaves very differently from high-throughput serving, and why next-token latency cannot be explained by memory bandwidth alone. It breaks down the paper’s main test: compare measured decode time against an analytic memory floor based on model-weight and KV-cache traffic, then repeat that comparison across Qwen-2.5-7B, Mistral-7B-v0.3, and Llama-3.1-8B on L4, L40S, A100, and H100 GPUs over contexts from 2048 to 16384 tokens. The discussion argues that because these models already use grouped-query attention to cut KV traffic, the remaining latency gap is driven by runtime details such as CUDA Graphs, launch overhead, kernel quality, and whether quantization actually helps in this tiny decode regime. Listeners may find it interesting because it challenges the simple idea that buying a GPU with faster memory automatically lowers token latency, especially for physical AI systems where one delayed token can stall the whole interaction.

Sources:
1. Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode. https://arxiv.org/pdf/2605.30571
2. Orca: A Distributed Serving System for Transformer-Based Generative Models. Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun, 2022. https://scholar.google.com/scholar?q=Orca:+A+Distributed+Serving+System+for+Transformer-Based+Generative+Models
3. Efficient Memory Management for Large Language Model Serving with PagedAttention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023. https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
4. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Inigo Goiri, Saeed Maleki, Ricardo Bianchini, 2024. https://scholar.google.com/scholar?q=Splitwise:+Efficient+Generative+LLM+Inference+Using+Phase+Splitting
5. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2024. https://scholar.google.com/scholar?q=Mooncake:+A+KVCache-centric+Disaggregated+Architecture+for+LLM+Serving
6. Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs. Jonah Ekelund, Stefano Markidis, Ivy Peng, 2025. https://scholar.google.com/scholar?q=Boosting+Performance+of+Iterative+Applications+on+GPUs:+Kernel+Batching+with+CUDA+Graphs
7. PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch. Abhishek Ghosh, Ajay Nayak, Ashish Panwar, Arkaprava Basu, 2025. https://scholar.google.com/scholar?q=PyGraph:+Robust+Compiler+Support+for+CUDA+Graphs+in+PyTorch
8. Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start. Xueshen Liu, Yongji Wu, Yuncheng Yao, Danyang Zhuo, Ion Stoica, Z. Morley Mao, 2026. https://scholar.google.com/scholar?q=Foundry:+Template-Based+CUDA+Graph+Context+Materialization+for+Fast+LLM+Serving+Cold+Start
9. Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode. Josef Chen, 2026. https://scholar.google.com/scholar?q=Memory-Bound+but+Not+Bandwidth-Limited:+The+Physical+AI+Inference+Gap+in+Batch-1+LLM+Decode
10. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, Sumit Sanghai, 2023. https://scholar.google.com/scholar?q=GQA:+Training+Generalized+Multi-Query+Transformer+Models+from+Multi-Head+Checkpoints
11. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh, 2022. https://scholar.google.com/scholar?q=GPTQ:+Accurate+Post-Training+Quantization+for+Generative+Pre-trained+Transformers
12. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. Ji Lin et al., 2023. https://scholar.google.com/scholar?q=AWQ:+Activation-aware+Weight+Quantization+for+LLM+Compression+and+Acceleration
13. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, 2024. https://scholar.google.com/scholar?q=FlashAttention-3:+Fast+and+Accurate+Attention+with+Asynchrony+and+Low-precision
14. FlashDecoding++: Faster Large Language Model Inference on GPUs. Ke Hong et al., 2023. https://scholar.google.com/scholar?q=FlashDecoding++:+Faster+Large+Language+Model+Inference+on+GPUs
15. Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference. Pol G. Recasens et al., 2025. https://scholar.google.com/scholar?q=Mind+the+Memory+Gap:+Unveiling+GPU+Bottlenecks+in+Large-Batch+LLM+Inference
16. Challenges and Research Directions for Large Language Model Inference Hardware. Xiaoyu Ma, David Patterson, 2026. https://scholar.google.com/scholar?q=Challenges+and+Research+Directions+for+Large+Language+Model+Inference+Hardware
17. Medusa: Accelerating Serverless LLM Inference with Materialization. Shaoxun Zeng et al., 2025. https://scholar.google.com/scholar?q=Medusa:+Accelerating+Serverless+LLM+Inference+with+Materialization
18. Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference. Divakar Kumar Yadav and Tian Zhao, 2026. https://scholar.google.com/scholar?q=Hybrid+JIT-CUDA+Graph+Optimization+for+Low-Latency+Large+Language+Model+Inference
19. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. Ji Lin et al., 2024. https://scholar.google.com/scholar?q=AWQ:+Activation-aware+Weight+Quantization+for+On-Device+LLM+Compression+and+Acceleration
20. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. Tim Dettmers et al., 2024. https://scholar.google.com/scholar?q=SpQR:+A+Sparse-Quantized+Representation+for+Near-Lossless+LLM+Weight+Compression
21. Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs. Sayed Pedram Haeri Boroujeni et al., 2026. https://scholar.google.com/scholar?q=Don't+Waste+Bits!+Adaptive+KV-Cache+Quantization+for+Lightweight+On-Device+LLMs
22. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse. Jingbo Yang et al., 2025. https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
23. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference. Yuhan Liu, Yihua Cheng et al., 2025. https://scholar.google.com/scholar?q=LMCache:+An+Efficient+KV+Cache+Layer+for+Enterprise-Scale+LLM+Inference
24. Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference. Kexin Chu et al., 2025. https://scholar.google.com/scholar?q=Selective+KV-Cache+Sharing+to+Mitigate+Timing+Side-Channels+in+LLM+Inference
25. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3
26. AI Post Transformers: Speculative Decoding in Real vLLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
27. AI Post Transformers: LPU Chip for Low-Latency LLM Inference. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-20-lpu-chip-for-low-latency-llm-inference-be13c3.mp3
28. AI Post Transformers: CXL Computational Memory Offloading for Lower Runtime. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-cxl-computational-memory-offloading-for-3b2124.mp3
29. AI Post Transformers: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-09-computation-bandwidth-memory-trade-offs-a83f2b.mp3
30. AI Post Transformers: Serving MoE Models with Disaggregated Expert Parallelism. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-19-serving-moe-models-with-disaggregated-ex-6979d2.mp3
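
For listeners who want to play with the idea, the analytic memory floor mentioned in the episode description can be roughly sketched as "bytes that must be read per token, divided by memory bandwidth". The Python below is a minimal sketch under stated assumptions, not the paper's exact formula: the class and function names (`DecodeConfig`, `memory_floor_ms`, `gap_ratio`) are hypothetical, the model numbers are illustrative values for an 8B-class grouped-query-attention model in 16-bit precision, and the 1000 GB/s bandwidth figure is a placeholder rather than the spec of any GPU discussed in the episode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeConfig:
    """Illustrative model shape; all values are assumptions, not paper data."""
    n_params: float                # total number of weights
    n_layers: int                  # transformer layers
    n_kv_heads: int                # KV heads (small under grouped-query attention)
    head_dim: int                  # dimension per attention head
    bytes_per_weight: float = 2.0  # 16-bit weights
    bytes_per_kv: float = 2.0      # 16-bit KV-cache entries


def weight_bytes(cfg: DecodeConfig) -> float:
    """Bytes read to stream every weight once for one decoded token."""
    return cfg.n_params * cfg.bytes_per_weight


def kv_bytes(cfg: DecodeConfig, context_len: int) -> float:
    """Bytes read from the KV cache: K and V, per layer, per KV head, per position."""
    return (2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim
            * context_len * cfg.bytes_per_kv)


def memory_floor_ms(cfg: DecodeConfig, context_len: int, bandwidth_gb_s: float) -> float:
    """Lower bound on per-token latency if memory traffic were the only cost."""
    total_bytes = weight_bytes(cfg) + kv_bytes(cfg, context_len)
    return total_bytes / (bandwidth_gb_s * 1e9) * 1e3


def gap_ratio(measured_ms: float, floor_ms: float) -> float:
    """How many times slower the measured decode step is than the floor."""
    return measured_ms / floor_ms


if __name__ == "__main__":
    # Hypothetical 8B-class GQA model and a placeholder 1000 GB/s device.
    cfg = DecodeConfig(n_params=8e9, n_layers=32, n_kv_heads=8, head_dim=128)
    for ctx in (2048, 4096, 8192, 16384):
        floor = memory_floor_ms(cfg, ctx, bandwidth_gb_s=1000.0)
        print(f"context={ctx:6d}  floor={floor:.2f} ms")
```

With these assumed numbers the weights dominate the traffic and the KV cache adds comparatively little even at the longest context, which is consistent with the episode's point that grouped-query attention has already cut KV traffic. Any measured latency well above this floor would then have to come from something other than raw bandwidth, such as launch overhead or kernel quality.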