Affordable Large-Scale Decoding Through Model-System Co-Design
This episode explores the paper’s claim that decoding cost in large language models is driven less by raw parameter count than by hardware-level behavior during autoregressive generation, especially memory-bandwidth pressure from the KV cache. It explains why total or activated parameter counts can be misleading cost proxies, and walks through the tradeoffs among standard attention, grouped-query variants, and newer approaches such as Multi-matrix Factorization Attention (MFA) that aim to preserve expressive power while reducing cache overhead. The discussion also highlights the paper’s central systems argument: attention and FFN layers hit very different performance bottlenecks, so separating them through Attention-FFN Disaggregation can make large models cheaper to serve without sacrificing capability. Listeners get a concrete, skeptical look at why inference efficiency depends on model-system co-design rather than headline model size alone.
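To make the KV cache point concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the paper: the head counts, head dimension, layer count, and 2-byte element size are hypothetical values chosen only for illustration, and the `AttnConfig` class and helper names are made up for this example. It covers only standard multi-head attention and grouped-query style sharing of K/V heads; MFA's cache layout is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnConfig:
    """Hypothetical attention layout; every number here is an illustrative assumption."""
    name: str
    n_kv_heads: int          # number of distinct K/V heads that get cached
    head_dim: int = 128      # assumed per-head dimension
    n_layers: int = 60       # assumed number of transformer layers
    bytes_per_elem: int = 2  # assumed 16-bit cache entries


def kv_bytes_per_token(cfg: AttnConfig) -> int:
    # One K vector and one V vector per cached head, per layer.
    return 2 * cfg.n_kv_heads * cfg.head_dim * cfg.bytes_per_elem * cfg.n_layers


def kv_bytes_read_per_step(cfg: AttnConfig, context_len: int, batch_size: int) -> int:
    # Each decoding step attends over the whole cached context of every sequence.
    return kv_bytes_per_token(cfg) * context_len * batch_size


configs = [
    AttnConfig("multi-head (64 KV heads)", n_kv_heads=64),
    AttnConfig("grouped-query (8 KV heads)", n_kv_heads=8),
    AttnConfig("multi-query (1 KV head)", n_kv_heads=1),
]

if __name__ == "__main__":
    for cfg in configs:
        per_token = kv_bytes_per_token(cfg)
        per_step = kv_bytes_read_per_step(cfg, context_len=8192, batch_size=32)
        print(f"{cfg.name}: {per_token / 1024:.0f} KiB per token, "
              f"{per_step / 2**30:.1f} GiB read per decoding step")
```

None of these layouts changes the parameter count much, yet under these assumptions the bytes that must be streamed from memory at each decoding step differ by up to 64x, which is the kind of gap a parameter-count proxy cannot see.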
Sources:
1. Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding — StepFun: Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, Siyu Chen, Song Yuan, Wuxun Xie, Xiaoniu Song, Xing Chen, Xingping Yang, Xuelin Zhang, Yanbo Yu, Yaoyu Wang, Yibo Zhu, Yimin Jiang, Yu Zhou, Yuanwei Lu, Houyi Li, Jingcheng Hu, Ka Man Lo, Ailin Huang, Binxing Jiao, Bo Li, Boyu Chen, Changxin Miao, Chang Lou, Chen Hu, Chen Xu, Chenfeng Yu, Chengyuan Yao, Daokuan Lv, Dapeng Shi, Deshan Sun, Ding Huang, Dingyuan Hu, Dongqing Pang, Enle Liu, Fajie Zhang, Fanqi Wan, Gulin Yan, Han Zhang, Han Zhou, Hanghao Wu, Hangyu Guo, Hanqi Chen, Hanshan Zhang, Hao Wu, Haocheng Zhang, Haolong Yan, Haoran Lv, Haoran Wei, Hebin Zhou, Heng Wang, Heng Wang, Hongxin Li, Hongyu Zhou, Hongyuan Wang, Huiyong Guo, Jia Wang, Jiahao Gong, Jialing Xie, Jian Zhou, Jianjian Sun, Jiaoren Wu, Jiaran Zhang, Jiayu Liu, Jie Cheng, Jie Luo, Jie Yan, Jie Yang, Jieyi Hou, Jinguang Zhang, Jinlan Cao, Jisheng Yin, Junfeng Liu, Junhao Huang, Junzhe Lin, Kaijun Tan, Kaixiang Li, Kang An, Kangheng Lin, Kenkun Liu, Lei Yang, Liang Zhao, Liangyu Chen, Lieyu Shi, Liguo Tan, Lin Lin, Lin Zhang, Lina Chen, Liwen Huang, Liying Shi, Longlong Gu, Mei Chen, Mengqiang Ren, Ming Li, Mingzhe Chen, Na Wang, Nan Wu, Qi Han, Qian Zhao, Qiang Zhang, Qianni Liu, Qiaohui Chen, Qiling Wu, Qinglin He, Qinyuan Tan, Qiufeng Wang, Qiuping Wu, Qiuyan Liang, Quan Sun, Rui Li, Ruihang Miao, Ruosi Wan, Ruyan Guo, Shangwu Zhong, Shaoliang Pang, Shengjie Fan, Shijie Shang, Shilei Jiang, Shiliang Yang, Shiming Hao, Shuli Gao, Siming Huang, Siqi Liu, Tiancheng Cao, Tianhao Cheng, Tianhao Peng, Wang You, Wei Ji, Wen Sun, Wenjin Deng, Wenqing He, Wenzhen Zheng, Xi Chen, Xiangwen Kong, Xianzhen Luo, Xiaobo Yang, Xiaojia Liu, Xiaoxiao Ren, Xin Han, Xin Li, Xin Wu, Xu Zhao, Yanan Wei, Yang Li, Yangguang Li, Yangshijie Xu, Yanming Xu, Yaqiang Shi, Yeqing Shen, Yi Yang, Yifei Yang, Yifeng Gong, Yihan Chen, Yijing Yang, Yinmin Zhang, Yizhuang Zhou, Yuanhao Ding, Yuantao Fan, Yuanzhen Yang, Yuchu Luo, Yue Peng, Yufan Lu, Yuhang Deng, Yuhe Yin, Yujie Liu, Yukun Chen, Yuling Zhao, Yun Mou, Yunlong Li, Yunzhou Ju, Yusheng Li, Yuxiang Yang, Yuxiang Zhang, Yuyang Chen, Zejia Weng, Zhe Xie, Zheng Ge, Zheng Gong, Zhenyi Lu, Zhewei Huang, Zhichao Chang, Zhiguo Huang, Zhirui Wang, Zidong Yang, Zili Wang, Ziqi Wang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Xiangyu Zhang, 2025
http://arxiv.org/abs/2507.19427
2. Fast Transformer Decoding: One Write-Head is All You Need — Noam Shazeer, 2019
https://scholar.google.com/scholar?q=Fast+Transformer+Decoding:+One+Write-Head+is+All+You+Need
3. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints — Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, Sumit Sanghai, 2023
https://scholar.google.com/scholar?q=GQA:+Training+Generalized+Multi-Query+Transformer+Models+from+Multi-Head+Checkpoints
4. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — Zhihong Shao and DeepSeek-AI et al., 2024
https://scholar.google.com/scholar?q=DeepSeek-V2:+A+Strong,+Economical,+and+Efficient+Mixture-of-Experts+Language+Model
5. Multi-matrix Factorization Attention — Jingcheng Hu, Houyi Li, Yinmin Zhang, Zili Wang, Shuigeng Zhou, Xiangyu Zhang, Heung-Yeung Shum, Daxin Jiang, 2024
https://scholar.google.com/scholar?q=Multi-matrix+Factorization+Attention
6. Splitwise: Efficient generative LLM inference using phase splitting — Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Inigo Goiri, Saeed Maleki, Ricardo Bianchini, 2023
https://scholar.google.com/scholar?q=Splitwise:+Efficient+generative+LLM+inference+using+phase+splitting
7. P/D-Serve: Serving Disaggregated Large Language Model at Scale — Yibo Jin, Tao Wang, Huimin Lin and Huawei colleagues, 2024
https://scholar.google.com/scholar?q=P/D-Serve:+Serving+Disaggregated+Large+Language+Model+at+Scale
8. MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism — Ruidong Zhu, Ziheng Jiang, Chao Jin and ByteDance colleagues, 2025
https://scholar.google.com/scholar?q=MegaScale-Infer:+Serving+Mixture-of-Experts+at+Scale+with+Disaggregated+Expert+Parallelism
9. Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding — StepFun et al., 2025
https://scholar.google.com/scholar?q=Step-3+is+Large+yet+Affordable:+Model-system+Co-design+for+Cost-effective+Decoding
10. DeepSeek-V3 Technical Report — DeepSeek-AI et al., 2024
https://scholar.google.com/scholar?q=DeepSeek-V3+Technical+Report
11. Qwen3 MoE 235B — Qwen Team / Alibaba researchers, 2025
https://scholar.google.com/scholar?q=Qwen3+MoE+235B
12. Prefill-Decode Disaggregation — Relevant serving-systems authors cited as [18, 31], 2024-2025
https://scholar.google.com/scholar?q=Prefill-Decode+Disaggregation
13. Kimi K2 Technical Report — Moonshot AI et al., 2025
https://scholar.google.com/scholar?q=Kimi+K2+Technical+Report
14. MiniMax M1 — MiniMax researchers, 2025
https://scholar.google.com/scholar?q=MiniMax+M1
15. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — Jingbo Yang et al., 2025
https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
16. HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse — Yuwei An et al., 2025
https://scholar.google.com/scholar?q=HyperRAG:+Enhancing+Quality-Efficiency+Tradeoffs+in+Retrieval-Augmented+Generation+with+Reranker+KV-Cache+Reuse
17. ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation — Shihao Wang et al., 2026
https://scholar.google.com/scholar?q=ProphetKV:+User-Query-Driven+Selective+Recomputation+for+Efficient+KV+Cache+Reuse+in+Retrieval-Augmented+Generation
18. HyperAttention: Long-context Attention in Near-Linear Time — Insu Han et al., 2023
https://scholar.google.com/scholar?q=HyperAttention:+Long-context+Attention+in+Near-Linear+Time
19. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention — Tsendsuren Munkhdalai et al., 2024
https://scholar.google.com/scholar?q=Leave+No+Context+Behind:+Efficient+Infinite+Context+Transformers+with+Infini-attention
20. Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning — Ling Team et al., 2025
https://scholar.google.com/scholar?q=Every+Attention+Matters:+An+Efficient+Hybrid+Architecture+for+Long-Context+Reasoning
21. KVDirect: Distributed Disaggregated LLM Inference — Shiyang Chen et al., 2024
https://scholar.google.com/scholar?q=KVDirect:+Distributed+Disaggregated+LLM+Inference
22. HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment — Youhe Jiang et al., 2025
https://scholar.google.com/scholar?q=HexGen-2:+Disaggregated+Generative+Inference+of+LLMs+in+Heterogeneous+Environment
23. GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference — Yu Han et al., 2025
https://scholar.google.com/scholar?q=GRACE-MoE:+Grouping+and+Replication+with+Locality-Aware+Routing+for+Efficient+Distributed+MoE+Inference
24. AI Post Transformers: JANUS for Scalable MoE Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-05-15-janus-for-scalable-moe-inference-78ae30.mp3
25. AI Post Transformers: Prefill-as-a-Service for Cross-Datacenter KV Cache — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-19-prefill-as-a-service-for-cross-datacente-7560be.mp3
26. AI Post Transformers: Batch-Aware Expert Routing for Faster MoE Decoding — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-batch-aware-expert-routing-for-faster-mo-683ab6.mp3
27. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3
28. AI Post Transformers: NanoFlow and the Future of LLM Serving — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-05-15-nanoflow-and-the-future-of-llm-serving-7429c9.mp3
29. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3
30. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-05-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
31. AI Post Transformers: Nemotron 3 Super Hybrid Mamba-Transformer MoE — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-19-nemotron-3-super-hybrid-mamba-transforme-31ac75.mp3
32. AI Post Transformers: FlatAttention for Tile-Based Accelerator Inference — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-04-flatattention-for-tile-based-accelerator-56e6ca.mp3
Interactive Visualization: Affordable Large-Scale Decoding Through Model-System Co-Design [https://podcast.do-not-panic.com/viz/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.html]