AI Post Transformers
This episode explores Beluga, a systems paper that tackles one of the biggest practical bottlenecks in long-context LLM inference: how to store and retrieve massive KV caches when GPU memory is no longer enough. It explains why traditional RDMA-based memory disaggregation is cumbersome and how Beluga uses CXL-based pooled memory to give GPUs and CPUs more direct, load/store-style access to shared cache data, reducing copies, staging, and synchronization overhead. The discussion digs into the architecture’s tradeoffs, including the fact that CXL is still slower than local HBM or DRAM, but argues that its simpler access model can still deliver large gains in the right workload regime. The episode is worth hearing for its concrete analysis of when the reported speedups, including major reductions in time to first token and large throughput gains, are real advances and when they are artifacts of favorable cache-reuse conditions.

Sources:
1. Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management. Xinjun Yang, Qingda Hu, Junru Li, Feifei Li, Yicong Zhu, Yuqi Zhou, Qiuru Lin, Jian Dai, Yang Kong, Jiayu Zhang, Guoqiang Xu, Qiang Liu, 2025. http://arxiv.org/abs/2511.20172
2. Memory Pooling With CXL. Donghyun Gouk, Miryeong Kwon, Hanyeoreum Bae, Sangwon Lee, Myoungsoo Jung, 2023. https://scholar.google.com/scholar?q=Memory+Pooling+With+CXL
3. Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices. Yan Sun, Yifan Yuan, Zeduo Yu, Reese Kuper, Chihun Song, Jinghan Huang, Houxiang Ji, Siddharth Agarwal, Jiaqi Lou, Ipoom Jeong, Ren Wang, Jung Ho Ahn, Tianyin Xu, Nam Sung Kim, 2023. https://scholar.google.com/scholar?q=Demystifying+CXL+Memory+with+Genuine+CXL-Ready+Systems+and+Devices
4. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2025. https://scholar.google.com/scholar?q=Mooncake:+A+KVCache-centric+Disaggregated+Architecture+for+LLM+Serving
5. Exploring CXL-based KV Cache Storage for LLM Serving. Yupeng Tang, Runxiang Cheng, Ping Zhou, Tongping Liu, Fei Liu, Wei Tang, Kyoungryun Bae, Jianjun Chen, Wu Xiang, Rui Shi, 2024. https://scholar.google.com/scholar?q=Exploring+CXL-based+KV+Cache+Storage+for+LLM+Serving
6. MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool. Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 2024. https://scholar.google.com/scholar?q=MemServe:+Context+Caching+for+Disaggregated+LLM+Serving+with+Elastic+Memory+Pool
7. Efficient Memory Management for Large Language Model Serving with PagedAttention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023. https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
8. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang, 2024. https://scholar.google.com/scholar?q=CacheGen:+KV+Cache+Compression+and+Streaming+for+Fast+Large+Language+Model+Serving
9. PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference. Dongjie Yang et al., 2024. https://scholar.google.com/scholar?q=PyramidInfer:+Pyramid+KV+Cache+Compression+for+High-throughput+LLM+Inference
10. ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference. Xiang Liu et al., 2025. https://scholar.google.com/scholar?q=ChunkKV:+Semantic-Preserving+KV+Cache+Compression+for+Efficient+Long-Context+LLM+Inference
11. Inference-Time Hyper-Scaling with KV Cache Compression. Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti, 2025. https://scholar.google.com/scholar?q=Inference-Time+Hyper-Scaling+with+KV+Cache+Compression
12. FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling. Weiqing Li et al., 2025. https://scholar.google.com/scholar?q=FlowKV:+A+Disaggregated+Inference+Framework+with+Low-Latency+KV+Cache+Transfer+and+Load-Aware+Scheduling
13. TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving. Bingyang Wu et al., 2025. https://scholar.google.com/scholar?q=TokenLake:+A+Unified+Segment-level+Prefix+Cache+Pool+for+Fine-grained+Elastic+Long-Context+LLM+Serving
14. KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows. Zaifeng Pan et al., 2025. https://scholar.google.com/scholar?q=KVFlow:+Efficient+Prefix+Caching+for+Accelerating+LLM-Based+Multi-Agent+Workflows
15. Learned Prefix Caching for Efficient LLM Inference. Dongsheng Yang, Austin Li, Kai Li, Wyatt Lloyd, 2025. https://scholar.google.com/scholar?q=Learned+Prefix+Caching+for+Efficient+LLM+Inference
16. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
17. AI Post Transformers: CXL Computational Memory Offloading for Lower Runtime. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-cxl-computational-memory-offloading-for-3b2124.mp3
18. AI Post Transformers: Prefill-as-a-Service for Cross-Datacenter KV Cache. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-19-prefill-as-a-service-for-cross-datacente-7560be.mp3
19. AI Post Transformers: SolidAttention: Co-Designing Sparse Attention and SSD I/O. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-03-18-solidattention-co-designing-sparse-atten-5a8622.mp3
20. AI Post Transformers: Stochastic KV Routing for Cache Sharing. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-29-stochastic-kv-routing-for-cache-sharing-5fef63.mp3
21. AI Post Transformers: Affordable Large-Scale Decoding Through Model-System Co-Design. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.mp3

Interactive Visualization: Beluga: CXL Memory Pooling for LLM KV Cache [https://podcast.do-not-panic.com/viz/2026-05-27-beluga-cxl-memory-pooling-for-llm-kv-cac-b6142f.html]
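
The episode description stresses that the reported time-to-first-token gains depend on how often cached KV data is actually reused. A rough way to see why is the back-of-envelope model below. It is only an illustrative sketch: the function name `expected_ttft_ms` and the numbers `PREFILL_MS` and `FETCH_MS` are hypothetical assumptions, not figures or methods from the Beluga paper.

```python
def expected_ttft_ms(hit_rate: float, prefill_ms: float, fetch_ms: float) -> float:
    """Expected time to first token under a simple two-outcome model.

    A fraction `hit_rate` of requests reuse a cached prefix and pay only
    `fetch_ms` to load it from the memory pool; the remaining requests
    recompute the prefix and pay the full `prefill_ms`.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    return hit_rate * fetch_ms + (1.0 - hit_rate) * prefill_ms


# Hypothetical values chosen only for illustration (not measured results).
PREFILL_MS = 2000.0  # assumed cost of recomputing a long prompt's KV cache
FETCH_MS = 200.0     # assumed cost of loading that KV cache from pooled memory

if __name__ == "__main__":
    for hit_rate in (0.0, 0.5, 0.9):
        ttft = expected_ttft_ms(hit_rate, PREFILL_MS, FETCH_MS)
        speedup = PREFILL_MS / ttft
        print(f"hit rate {hit_rate:.0%}: TTFT {ttft:.0f} ms, speedup {speedup:.2f}x")
```

Under these assumed numbers the speedup is negligible when little of the cache is reused and grows quickly as the hit rate approaches one, which is why the workload's reuse pattern matters when reading headline results.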
663 episodes