AI Post Transformers

Affordable Large-Scale Decoding Through Model-System Co-Design

1 h 0 min · May 19, 2026

Description

This episode explores the paper’s claim that decoding cost in large language models is driven less by raw parameter counts and more by hardware-level behavior during autoregressive generation, especially memory bandwidth pressure from the KV cache. It explains why metrics like total or activated parameters can be misleading cost proxies, and walks through the tradeoffs among standard attention, grouped-query variants, and newer approaches such as MFA that aim to preserve expressive power while reducing cache overhead. The discussion also highlights the paper’s central systems argument: attention and FFN layers have very different performance bottlenecks, so separating them through Attention-FFN Disaggregation can make large models cheaper to serve without sacrificing capability. A listener would find it interesting for its concrete, skeptical look at why inference efficiency depends on model-system co-design rather than headline model size alone.

Sources:
1. Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding. StepFun: Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, Siyu Chen, Song Yuan, Wuxun Xie, Xiaoniu Song, Xing Chen, Xingping Yang, Xuelin Zhang, Yanbo Yu, Yaoyu Wang, Yibo Zhu, Yimin Jiang, Yu Zhou, Yuanwei Lu, Houyi Li, Jingcheng Hu, Ka Man Lo, Ailin Huang, Binxing Jiao, Bo Li, Boyu Chen, Changxin Miao, Chang Lou, Chen Hu, Chen Xu, Chenfeng Yu, Chengyuan Yao, Daokuan Lv, Dapeng Shi, Deshan Sun, Ding Huang, Dingyuan Hu, Dongqing Pang, Enle Liu, Fajie Zhang, Fanqi Wan, Gulin Yan, Han Zhang, Han Zhou, Hanghao Wu, Hangyu Guo, Hanqi Chen, Hanshan Zhang, Hao Wu, Haocheng Zhang, Haolong Yan, Haoran Lv, Haoran Wei, Hebin Zhou, Heng Wang, Heng Wang, Hongxin Li, Hongyu Zhou, Hongyuan Wang, Huiyong Guo, Jia Wang, Jiahao Gong, Jialing Xie, Jian Zhou, Jianjian Sun, Jiaoren Wu, Jiaran Zhang, Jiayu Liu, Jie Cheng, Jie Luo, Jie Yan, Jie Yang, Jieyi Hou, Jinguang Zhang, Jinlan Cao, Jisheng Yin, Junfeng Liu, Junhao Huang, Junzhe Lin, Kaijun Tan, Kaixiang Li, Kang An, Kangheng Lin, Kenkun Liu, Lei Yang, Liang Zhao, Liangyu Chen, Lieyu Shi, Liguo Tan, Lin Lin, Lin Zhang, Lina Chen, Liwen Huang, Liying Shi, Longlong Gu, Mei Chen, Mengqiang Ren, Ming Li, Mingzhe Chen, Na Wang, Nan Wu, Qi Han, Qian Zhao, Qiang Zhang, Qianni Liu, Qiaohui Chen, Qiling Wu, Qinglin He, Qinyuan Tan, Qiufeng Wang, Qiuping Wu, Qiuyan Liang, Quan Sun, Rui Li, Ruihang Miao, Ruosi Wan, Ruyan Guo, Shangwu Zhong, Shaoliang Pang, Shengjie Fan, Shijie Shang, Shilei Jiang, Shiliang Yang, Shiming Hao, Shuli Gao, Siming Huang, Siqi Liu, Tiancheng Cao, Tianhao Cheng, Tianhao Peng, Wang You, Wei Ji, Wen Sun, Wenjin Deng, Wenqing He, Wenzhen Zheng, Xi Chen, Xiangwen Kong, Xianzhen Luo, Xiaobo Yang, Xiaojia Liu, Xiaoxiao Ren, Xin Han, Xin Li, Xin Wu, Xu Zhao, Yanan Wei, Yang Li, Yangguang Li, Yangshijie Xu, Yanming Xu, Yaqiang Shi, Yeqing Shen, Yi Yang, Yifei Yang, Yifeng Gong, Yihan Chen, Yijing Yang, Yinmin Zhang, Yizhuang Zhou, Yuanhao Ding, Yuantao Fan, Yuanzhen Yang, Yuchu Luo, Yue Peng, Yufan Lu, Yuhang Deng, Yuhe Yin, Yujie Liu, Yukun Chen, Yuling Zhao, Yun Mou, Yunlong Li, Yunzhou Ju, Yusheng Li, Yuxiang Yang, Yuxiang Zhang, Yuyang Chen, Zejia Weng, Zhe Xie, Zheng Ge, Zheng Gong, Zhenyi Lu, Zhewei Huang, Zhichao Chang, Zhiguo Huang, Zhirui Wang, Zidong Yang, Zili Wang, Ziqi Wang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Xiangyu Zhang, 2025. http://arxiv.org/abs/2507.19427
2. Fast Transformer Decoding: One Write-Head is All You Need. Noam Shazeer, 2019. https://scholar.google.com/scholar?q=Fast+Transformer+Decoding:+One+Write-Head+is+All+You+Need
3. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, Sumit Sanghai, 2023. https://scholar.google.com/scholar?q=GQA:+Training+Generalized+Multi-Query+Transformer+Models+from+Multi-Head+Checkpoints
4. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. Zhihong Shao and DeepSeek-AI et al., 2024. https://scholar.google.com/scholar?q=DeepSeek-V2:+A+Strong,+Economical,+and+Efficient+Mixture-of-Experts+Language+Model
5. Multi-matrix Factorization Attention. Jingcheng Hu, Houyi Li, Yinmin Zhang, Zili Wang, Shuigeng Zhou, Xiangyu Zhang, Heung-Yeung Shum, Daxin Jiang, 2024. https://scholar.google.com/scholar?q=Multi-matrix+Factorization+Attention
6. Splitwise: Efficient generative LLM inference using phase splitting. Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Inigo Goiri, Saeed Maleki, Ricardo Bianchini, 2023. https://scholar.google.com/scholar?q=Splitwise:+Efficient+generative+LLM+inference+using+phase+splitting
7. P/D-Serve: Serving Disaggregated Large Language Model at Scale. Yibo Jin, Tao Wang, Huimin Lin and Huawei colleagues, 2024. https://scholar.google.com/scholar?q=P/D-Serve:+Serving+Disaggregated+Large+Language+Model+at+Scale
8. MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism. Ruidong Zhu, Ziheng Jiang, Chao Jin and ByteDance colleagues, 2025. https://scholar.google.com/scholar?q=MegaScale-Infer:+Serving+Mixture-of-Experts+at+Scale+with+Disaggregated+Expert+Parallelism
9. Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding. StepFun et al., 2025. https://scholar.google.com/scholar?q=Step-3+is+Large+yet+Affordable:+Model-system+Co-design+for+Cost-effective+Decoding
10. DeepSeek-V3 Technical Report. DeepSeek-AI et al., 2024. https://scholar.google.com/scholar?q=DeepSeek-V3+Technical+Report
11. Qwen3 MoE 235B. Qwen Team / Alibaba researchers, 2025. https://scholar.google.com/scholar?q=Qwen3+MoE+235B
12. Prefill-Decode Disaggregation. Relevant serving-systems authors cited as [18, 31], 2024-2025. https://scholar.google.com/scholar?q=Prefill-Decode+Disaggregation
13. Kimi K2 Technical Report. Moonshot AI et al., 2025. https://scholar.google.com/scholar?q=Kimi+K2+Technical+Report
14. MiniMax M1. MiniMax researchers, 2025. https://scholar.google.com/scholar?q=MiniMax+M1
15. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse. Jingbo Yang et al., 2025. https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
16. HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse. Yuwei An et al., 2025. https://scholar.google.com/scholar?q=HyperRAG:+Enhancing+Quality-Efficiency+Tradeoffs+in+Retrieval-Augmented+Generation+with+Reranker+KV-Cache+Reuse
17. ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation. Shihao Wang et al., 2026. https://scholar.google.com/scholar?q=ProphetKV:+User-Query-Driven+Selective+Recomputation+for+Efficient+KV+Cache+Reuse+in+Retrieval-Augmented+Generation
18. HyperAttention: Long-context Attention in Near-Linear Time. Insu Han et al., 2023. https://scholar.google.com/scholar?q=HyperAttention:+Long-context+Attention+in+Near-Linear+Time
19. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention. Tsendsuren Munkhdalai et al., 2024. https://scholar.google.com/scholar?q=Leave+No+Context+Behind:+Efficient+Infinite+Context+Transformers+with+Infini-attention
20. Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning. Ling Team et al., 2025. https://scholar.google.com/scholar?q=Every+Attention+Matters:+An+Efficient+Hybrid+Architecture+for+Long-Context+Reasoning
21. KVDirect: Distributed Disaggregated LLM Inference. Shiyang Chen et al., 2024. https://scholar.google.com/scholar?q=KVDirect:+Distributed+Disaggregated+LLM+Inference
22. HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment. Youhe Jiang et al., 2025. https://scholar.google.com/scholar?q=HexGen-2:+Disaggregated+Generative+Inference+of+LLMs+in+Heterogeneous+Environment
23. GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference. Yu Han et al., 2025. https://scholar.google.com/scholar?q=GRACE-MoE:+Grouping+and+Replication+with+Locality-Aware+Routing+for+Efficient+Distributed+MoE+Inference
24. AI Post Transformers: JANUS for Scalable MoE Inference. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-15-janus-for-scalable-moe-inference-78ae30.mp3
25. AI Post Transformers: Prefill-as-a-Service for Cross-Datacenter KV Cache. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-19-prefill-as-a-service-for-cross-datacente-7560be.mp3
26. AI Post Transformers: Batch-Aware Expert Routing for Faster MoE Decoding. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-batch-aware-expert-routing-for-faster-mo-683ab6.mp3
27. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3
28. AI Post Transformers: NanoFlow and the Future of LLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-15-nanoflow-and-the-future-of-llm-serving-7429c9.mp3
29. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3
30. AI Post Transformers: Speculative Decoding in Real vLLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
31. AI Post Transformers: Nemotron 3 Super Hybrid Mamba-Transformer MoE. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-19-nemotron-3-super-hybrid-mamba-transforme-31ac75.mp3
32. AI Post Transformers: FlatAttention for Tile-Based Accelerator Inference. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-flatattention-for-tile-based-accelerator-56e6ca.mp3

Interactive Visualization: Affordable Large-Scale Decoding Through Model-System Co-Design [https://podcast.do-not-panic.com/viz/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.html]
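
The description's point that KV cache traffic, not parameter count, drives decoding cost can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the layer count, head counts, head dimension, and context length are hypothetical values chosen for the example, not figures from the paper, and `kv_cache_bytes_per_token` is a helper name invented here. It covers standard multi-head attention (MHA), grouped-query attention (GQA), and multi-query attention (MQA); MFA is not modeled.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added per token of context.

    The factor of 2 covers one key vector and one value vector
    per KV head per layer. bytes_per_value=2 assumes 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape (assumption, not taken from the paper).
NUM_LAYERS = 32
HEAD_DIM = 128
CONTEXT_TOKENS = 8192

# Only the number of KV heads differs between the variants.
variants = {"MHA": 32, "GQA (8 KV heads)": 8, "MQA": 1}

for name, kv_heads in variants.items():
    per_token = kv_cache_bytes_per_token(NUM_LAYERS, kv_heads, HEAD_DIM)
    # During decoding, each new token attends over the whole cache,
    # so this many bytes are read from memory per generated token.
    per_step = per_token * CONTEXT_TOKENS
    print(f"{name}: {per_token // 1024} KiB per token, "
          f"{per_step / 2**20:.0f} MiB read per decode step")
```

With these assumed numbers, one sequence at 8192 tokens of context reads about 4 GiB per decode step under MHA, 1 GiB under GQA with 8 KV heads, and 128 MiB under MQA, even though the parameter count is nearly the same in all three cases.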


Do Language Models Need Sleep?

This episode explores a paper proposing that language models could handle long-context reasoning by periodically pausing, replaying soon-to-be-evicted context offline, and consolidating it into fixed-size fast-weight memory instead of carrying an ever-growing KV cache. It explains the core machinery behind the idea, including state space models and Gated Delta Networks, and clarifies why this is more than prompt summarization or retrieval: the model is rewriting its internal bounded memory during inference. The discussion highlights the paper’s central argument that extra compute may be better spent during these offline “sleep” passes, so later token prediction stays cheap while older information is metabolized into usable latent state. Listeners would find it interesting because it frames long-context scaling as a memory-systems problem, raises concrete questions about whether this consolidation actually improves reasoning, and connects the proposal to broader debates about how future LLMs should trade off memory, compute, and exact recall.

Sources:
1. Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference. Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti, 2026. http://arxiv.org/abs/2605.26099
2. Replay in Deep Learning: Current Approaches and Missing Biological Elements. Tyler L. Hayes, Giri P. Krishnan, Maxim Bazhenov, Hava T. Siegelmann, Terrence J. Sejnowski, Christopher Kanan, 2021. https://scholar.google.com/scholar?q=Replay+in+Deep+Learning:+Current+Approaches+and+Missing+Biological+Elements
3. Can sleep protect memories from catastrophic forgetting? Oscar C. Gonzalez, Yury Sokolov, Giri P. Krishnan, Jean Erik Delanois, Maxim Bazhenov, 2020. https://scholar.google.com/scholar?q=Can+sleep+protect+memories+from+catastrophic+forgetting?
4. Sleep-like unsupervised replay reduces catastrophic forgetting in artificial neural networks. Timothy Tadros, Giri P. Krishnan, Ramyaa Ramyaa, Maxim Bazhenov, 2022. https://scholar.google.com/scholar?q=Sleep-like+unsupervised+replay+reduces+catastrophic+forgetting+in+artificial+neural+networks
5. Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference. Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti, 2026. https://scholar.google.com/scholar?q=Do+Language+Models+Need+Sleep?+Offline+Recurrence+for+Improved+Online+Inference
6. Using Fast Weights to Attend to the Recent Past. Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, Catalin Ionescu, 2016. https://scholar.google.com/scholar?q=Using+Fast+Weights+to+Attend+to+the+Recent+Past
7. Linear Transformers Are Secretly Fast Weight Programmers. Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber, 2021. https://scholar.google.com/scholar?q=Linear+Transformers+Are+Secretly+Fast+Weight+Programmers
8. Fast weight programming and linear transformers: from machine learning to neurobiology. Kazuki Irie, Samuel J. Gershman, 2026. https://scholar.google.com/scholar?q=Fast+weight+programming+and+linear+transformers:+from+machine+learning+to+neurobiology
9. TRELLIS: Learning to Compress Key-Value Memory in Attention Models. Mahdi Karami, Ali Behrouz, Praneeth Kacham, Vahab Mirrokni, 2025. https://scholar.google.com/scholar?q=TRELLIS:+Learning+to+Compress+Key-Value+Memory+in+Attention+Models
10. Gated Delta Networks: Improving Mamba2 with Delta Rule. Songlin Yang, Jan Kautz, Ali Hatamizadeh, 2024. https://scholar.google.com/scholar?q=Gated+Delta+Networks:+Improving+Mamba2+with+Delta+Rule
11. Titans: Learning to Memorize at Test Time. Ali Behrouz, Peilin Zhong, Vahab Mirrokni, 2025. https://scholar.google.com/scholar?q=Titans:+Learning+to+Memorize+at+Test+Time
12. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach. Jonas Geiping, Sean McLeish, Neel Jain, et al., 2025. https://scholar.google.com/scholar?q=Scaling+up+Test-Time+Compute+with+Latent+Reasoning:+A+Recurrent+Depth+Approach
13. In-context Autoencoder for Context Compression in a Large Language Model. Tao Ge, Jing Hu, Lei Wang, Xun Wang, Si-Qing Chen, Furu Wei, 2023. https://scholar.google.com/scholar?q=In-context+Autoencoder+for+Context+Compression+in+a+Large+Language+Model
14. Cartridges: Lightweight and general-purpose long context representations via self-study. Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, et al., 2025. https://scholar.google.com/scholar?q=Cartridges:+Lightweight+and+general-purpose+long+context+representations+via+self-study
15. Repeat After Me: Transformers are Better than State Space Models at Copying. Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach, 2024. https://scholar.google.com/scholar?q=Repeat+After+Me:+Transformers+are+Better+than+State+Space+Models+at+Copying
16. End-to-End Test-Time Training for Long Context. Arnuv Tandon et al., 2025. https://scholar.google.com/scholar?q=End-to-End+Test-Time+Training+for+Long+Context
17. Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs. Rachit Bansal et al., 2025. https://scholar.google.com/scholar?q=Let's+(not)+just+put+things+in+Context:+Test-Time+Training+for+Long-Context+LLMs
18. Test-Time Training Done Right. Tianyuan Zhang et al., 2025. https://scholar.google.com/scholar?q=Test-Time+Training+Done+Right
19. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning. Yu Fu et al., 2024. https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning
20. Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning. Giulio Corallo et al., 2025. https://scholar.google.com/scholar?q=Beyond+RAG:+Task-Aware+KV+Cache+Compression+for+Comprehensive+Knowledge+Reasoning
21. SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning. Sanjay Kariyappa and G. Edward Suh, 2026. https://scholar.google.com/scholar?q=SideQuest:+Model-Driven+KV+Cache+Management+for+Long-Horizon+Agentic+Reasoning
22. Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers. Harsh Kohli et al., 2026. https://scholar.google.com/scholar?q=Loop,+Think,+&+Generalize:+Implicit+Reasoning+in+Recurrent-Depth+Transformers
23. AI Post Transformers: Titans: Learning to Memorize at Test Time. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-20-titans-learning-to-memorize-at-test-time-054662.mp3
24. AI Post Transformers: In-Place Test-Time Training for Transformers. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-09-in-place-test-time-training-for-transfor-d0b976.mp3
25. AI Post Transformers: Recursive Language Models for Arbitrarily Long Prompts. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-recursive-language-models-for-arbitraril-fbcd1c.mp3
26. AI Post Transformers: Explicit Information Transmission for Context Compression. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-05-explicit-information-transmission-for-co-24e3c2.mp3
27. AI Post Transformers: KVzip for Query-Agnostic KV Cache Compression. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-29-kvzip-for-query-agnostic-kv-cache-compre-72afe5.mp3
28. AI Post Transformers: Gated Linear Attention for Efficient Long Sequences. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-18-gated-linear-attention-for-efficient-lon-c858ab.mp3
29. AI Post Transformers: MiA-Signature and Global Activation for Long Context. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-13-mia-signature-and-global-activation-for-5ad62f.mp3
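
To give a feel for what "fixed-size fast-weight memory" means in the description, here is a minimal pure-Python sketch of a gated delta-rule write, in the simplified form S ← α·S + β·(v − α·S·k)·kᵀ. This is an assumption-laden illustration, not the paper's method: the function names, the tiny 2×2 state, and the choice of α = β = 1 are all hypothetical, and the offline replay ("sleep") procedure itself is not modeled.

```python
def gated_delta_update(S, k, v, alpha=1.0, beta=1.0):
    """Write the association k -> v into a fixed-size state matrix S.

    S has shape (len(v), len(k)). alpha decays the old memory and
    beta scales how strongly the prediction error is corrected.
    """
    decayed = [[alpha * s for s in row] for row in S]
    # What the (decayed) memory currently returns for key k.
    predicted = [sum(row[j] * k[j] for j in range(len(k))) for row in decayed]
    # Rank-one correction toward the target value v.
    return [
        [decayed[i][j] + beta * (v[i] - predicted[i]) * k[j] for j in range(len(k))]
        for i in range(len(v))
    ]


def read(S, q):
    """Query the memory: a matrix-vector product S @ q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


# Hypothetical 2x2 memory, initially empty.
state = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]  # unit-norm key

state = gated_delta_update(state, key, [3.0, 4.0])
first_read = read(state, key)

# Writing a new value under the same key replaces the old one
# instead of accumulating, and the state stays 2x2.
state = gated_delta_update(state, key, [5.0, 6.0])
second_read = read(state, key)

print(first_read, second_read)
```

Unlike a KV cache, the state here never grows with the number of writes; the tradeoff is that older associations can be overwritten or decayed, which is the exact-recall concern the description raises.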

June 1, 2026 · 1 h 0 min
