KVzap: Fast, Adaptive, Faithful KV Cache Pruning

Description

This episode explores KVzap, a method for pruning transformer KV caches by learning a cheap surrogate for a much stronger oracle, with the goal of making cache eviction practical during both prompt prefilling and token-by-token decoding. It explains why KV caches dominate long-context inference costs, clarifies the difference between prefilling and decoding, and lays out why serving systems have favored quantization and paging over content-aware token deletion: removing the wrong token can quietly break later answers. The discussion places KVzap alongside KVzip, Expected Attention, and DMS, arguing that its key advance is a learned per-layer, per-head importance predictor trained to imitate a richer KVzip+ teacher that measures not just attention but actual contribution to the residual stream. Listeners would find it interesting because it ties together systems bottlenecks, adaptive eviction policies such as delayed eviction and sliding windows, and concrete training choices into a broader case for faster, more faithful long-context inference.

Sources:
1. KVzap: Fast, Adaptive, Faithful KV Cache Pruning. https://arxiv.org/pdf/2601.07891
2. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Beidi Chen, et al., 2023. https://scholar.google.com/scholar?q=H2O:+Heavy-Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models
3. SnapKV: LLM Knows What You are Looking for Before Generation. Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Patrick Lewis, et al., 2024. https://scholar.google.com/scholar?q=SnapKV:+LLM+Knows+What+You+are+Looking+for+Before+Generation
4. Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution. Alessio Devoto, Maximilian Jeblick, Simon Jegou, 2025. https://scholar.google.com/scholar?q=Expected+Attention:+KV+Cache+Compression+by+Estimating+Attention+from+Future+Queries+Distribution
5. KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction. Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song, 2025. https://scholar.google.com/scholar?q=KVzip:+Query-Agnostic+KV+Cache+Compression+with+Context+Reconstruction
6. Inference-Time Hyper-Scaling with KV Cache Compression. Adrian Lancucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti, 2025. https://scholar.google.com/scholar?q=Inference-Time+Hyper-Scaling+with+KV+Cache+Compression
7. Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores. Vivek Chari, Benjamin Van Durme, 2025. https://scholar.google.com/scholar?q=Compactor:+Calibrated+Query-Agnostic+KV+Cache+Compression+with+Approximate+Leverage+Scores
8. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads. Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han, 2024. https://scholar.google.com/scholar?q=DuoAttention:+Efficient+Long-Context+LLM+Inference+with+Retrieval+and+Streaming+Heads
9. Retrieval Head Mechanistically Explains Long-Context Factuality. Wenhao Wu et al., 2024. https://scholar.google.com/scholar?q=Retrieval+Head+Mechanistically+Explains+Long-Context+Factuality
10. Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking. Wuwei Zhang et al., 2025. https://scholar.google.com/scholar?q=Query-Focused+Retrieval+Heads+Improve+Long-Context+Reasoning+and+Re-ranking
11. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention. Huiqiang Jiang et al., 2024. https://scholar.google.com/scholar?q=MInference+1.0:+Accelerating+Pre-filling+for+Long-Context+LLMs+via+Dynamic+Sparse+Attention
12. KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head. Isaac Rehg, 2024. https://scholar.google.com/scholar?q=KV-Compress:+Paged+KV-Cache+Compression+with+Variable+Compression+Rates+per+Attention+Head
13. PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference. Krishna Teja Chitty-Venkata et al., 2025. https://scholar.google.com/scholar?q=PagedEviction:+Structured+Block-wise+KV+Cache+Pruning+for+Efficient+Large+Language+Model+Inference
14. KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows. Zaifeng Pan et al., 2025. https://scholar.google.com/scholar?q=KVFlow:+Efficient+Prefix+Caching+for+Accelerating+LLM-Based+Multi-Agent+Workflows
15. AI Post Transformers: Affordable Large-Scale Decoding Through Model-System Co-Design. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.mp3
16. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3
17. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3
18. AI Post Transformers: DeepSeek-V4 and Practical Million-Token Context. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-25-deepseek-v4-and-practical-million-token-6f4de1.mp3
19. AI Post Transformers: How Induction Heads Emerge in Transformers. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-03-how-induction-heads-emerge-in-transforme-a7bfcb.mp3
20. AI Post Transformers: When Many-Shot CoT Becomes Test-Time Learning. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-15-when-many-shot-cot-becomes-test-time-lea-c25bfe.mp3
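
The KVzap description above mentions three ideas that a few lines of code can make concrete: why the KV cache grows so large, how a cheap learned predictor could score cached tokens per head, and how a sliding window protects recent tokens from eviction. The following is a minimal sketch under stated assumptions, not the paper's implementation. The linear form of the scorer, the threshold rule, the window size, and the model shape used in the size calculation are all hypothetical choices made for illustration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    Keys and values are both stored, hence the leading factor of 2.
    All arguments are illustrative; they do not describe a specific model.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


def surrogate_scores(hidden_states, weights, bias):
    """Score each token with a linear map over its hidden state.

    ASSUMPTION: a linear scorer stands in for the learned per-layer,
    per-head importance predictor; the real predictor may differ.
    """
    return [
        sum(w * h for w, h in zip(weights, state)) + bias
        for state in hidden_states
    ]


def prune_head(scores, threshold, window):
    """Return the token positions to keep for one (layer, head) pair.

    A token survives if its score reaches the threshold or if it lies
    inside the sliding window of the most recent `window` positions.
    """
    n = len(scores)
    keep = []
    for position, score in enumerate(scores):
        in_recent_window = position >= n - window
        if in_recent_window or score >= threshold:
            keep.append(position)
    return keep


if __name__ == "__main__":
    # Hypothetical shape: 32 layers, 8 KV heads, head_dim 128, 128K tokens, fp16.
    size = kv_cache_bytes(32, 8, 128, 131072)
    print(size / 2**30, "GiB")

    # Hypothetical per-token importance scores for a single head.
    scores = [0.9, 0.1, 0.05, 0.7, 0.2, 0.3]
    print(prune_head(scores, threshold=0.5, window=2))
```

Because each head applies the threshold to its own scores, different heads can end up keeping different numbers of tokens, which is one way to read the "adaptive" in the episode title. The window guarantees that the newest tokens survive regardless of their score.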

CXL-GPU and Beyond Onboard Memory

This episode explores a systems paper that extends GPU memory through CXL-attached DRAM and SSDs, asking whether accelerators can reach beyond on-board HBM without the usual overhead of software-driven memory migration. It explains CXL, memory disaggregation, and the difference between local GPU memory, host-managed memory, CXL memory, and storage-backed expansion, while grounding the discussion in earlier work such as Infiniswap, DirectCXL, and Microsoft’s Pond. The conversation focuses on the paper’s main technical claim: custom GPU-side hardware, including RTL CXL controllers, multiple root ports, and latency-hiding policies, could make expanded memory tiers more usable than approaches like UVM or GPUDirect Storage. It is interesting because the speakers both highlight the engineering ambition and press on a central unresolved question: whether these ideas truly help real transformer workloads, rather than only looking good on more conventional benchmark traces.

Sources:
1. CXL-GPU: Pushing GPU Memory Boundaries with the Integration of CXL Technologies. Donghyun Gouk, Seungkwan Kang, Seungjun Lee, Jiseon Kim, Kyungkuk Nam, Eojin Ryu, Sangwon Lee, Dongpyung Kim, Junhyeok Jang, Hanyeoreum Bae, Myoungsoo Jung, 2025. http://arxiv.org/abs/2506.15601
2. Disaggregated Memory for Expansion and Sharing in Blade Servers. Kevin Lim, Jichuan Chang, Trevor Mudge, Parthasarathy Ranganathan, Steven K. Reinhardt, Thomas F. Wenisch, 2009. https://scholar.google.com/scholar?q=Disaggregated+Memory+for+Expansion+and+Sharing+in+Blade+Servers
3. Efficient Memory Disaggregation with Infiniswap. Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, Kang G. Shin, 2017. https://scholar.google.com/scholar?q=Efficient+Memory+Disaggregation+with+Infiniswap
4. Direct Access, High-Performance Memory Disaggregation with DirectCXL. Donghyun Gouk, Sangwon Lee, Miryeong Kwon, Myoungsoo Jung, 2022. https://scholar.google.com/scholar?q=Direct+Access,+High-Performance+Memory+Disaggregation+with+DirectCXL
5. Pond: CXL-Based Memory Pooling Systems for Cloud Platforms. Huaicheng Li, Daniel S. Berger, Lisa Hsu, Daniel Ernst, Pantea Zardoshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, Mark D. Hill, Marcus Fontoura, Ricardo Bianchini, 2023. https://scholar.google.com/scholar?q=Pond:+CXL-Based+Memory+Pooling+Systems+for+Cloud+Platforms
6. SMT: Software-Defined Memory Tiering for Heterogeneous Computing Systems with CXL Memory Expander. K. Kim, H. Kim, J. So, W. Lee, J. Im, S. Park, J. Cho, H. Song, 2023. https://scholar.google.com/scholar?q=SMT:+Software-Defined+Memory+Tiering+for+Heterogeneous+Computing+Systems+with+CXL+Memory+Expander
7. TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory. Hasan Al Maruf, Hao Wang, Abhishek Dhanotia, Johannes Weiner, Niket Agarwal, Pallab Bhattacharya, Chris Petersen, Mosharaf Chowdhury, Shobhit Kanaujia, Prakash Chauhan, 2023. https://scholar.google.com/scholar?q=TPP:+Transparent+Page+Placement+for+CXL-Enabled+Tiered-Memory
8. NVMMU: A Non-volatile Memory Management Unit for Heterogeneous GPU-SSD Architectures. Jie Zhang, David Donofrio, John Shalf, Mahmut T. Kandemir, Myoungsoo Jung, 2015. https://scholar.google.com/scholar?q=NVMMU:+A+Non-volatile+Memory+Management+Unit+for+Heterogeneous+GPU-SSD+Architectures
9. Overcoming the Memory Wall with CXL-Enabled SSDs. Shao-Peng Yang, Minjae Kim, Sanghyun Nam, Juhyung Park, Jin-yong Choi, Eyee Hyun Nam, Eunji Lee, Sungjin Lee, Bryan S. Kim, 2023. https://scholar.google.com/scholar?q=Overcoming+the+Memory+Wall+with+CXL-Enabled+SSDs
10. NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering. Zhe Zhou, Yiqi Chen, Tao Zhang, Yang Wang, Ran Shu, Shuotao Xu, Peng Cheng, Lei Qu, Yongqiang Xiong, Jie Zhang, Guangyu Sun, 2024. https://scholar.google.com/scholar?q=NeoMem:+Hardware/Software+Co-Design+for+CXL-Native+Memory+Tiering
11. ARIADNE: Adaptive UVM Management for Efficient GPU Memory Oversubscription. approx. recent systems authors, 2024/2025. https://scholar.google.com/scholar?q=ARIADNE:+Adaptive+UVM+Management+for+Efficient+GPU+Memory+Oversubscription
12. MOST: Memory Oversubscription-Aware Scheduling for Tensor Migration on GPU Unified Storage. approx. recent systems authors, 2024/2025. https://scholar.google.com/scholar?q=MOST:+Memory+Oversubscription-Aware+Scheduling+for+Tensor+Migration+on+GPU+Unified+Storage
13. Selective memory compression for GPU memory oversubscription management. approx. recent architecture authors, 2024/2025. https://scholar.google.com/scholar?q=Selective+memory+compression+for+GPU+memory+oversubscription+management
14. Phoenix: A Refactored I/O Stack for GPU Direct Storage without Phony Buffers. approx. recent storage/systems authors, 2024/2025. https://scholar.google.com/scholar?q=Phoenix:+A+Refactored+I/O+Stack+for+GPU+Direct+Storage+without+Phony+Buffers
15. Managing Scalable Direct Storage Accesses for GPUs with GoFS. approx. recent storage/systems authors, 2024/2025. https://scholar.google.com/scholar?q=Managing+Scalable+Direct+Storage+Accesses+for+GPUs+with+GoFS
16. CCCL: Node-Spanning GPU Collectives with CXL Memory Pooling. approx. recent distributed systems authors, 2024/2025. https://scholar.google.com/scholar?q=CCCL:+Node-Spanning+GPU+Collectives+with+CXL+Memory+Pooling
17. Efficient Tensor Offloading Based on CXL Memory Pool For Extreme Scale Deep Learning. approx. recent ML systems authors, 2024/2025. https://scholar.google.com/scholar?q=Efficient+Tensor+Offloading+Based+on+CXL+Memory+Pool+For+Extreme+Scale+Deep+Learning
18. UHM: Unified Transferring and Pooling over Heterogeneous GPU Memories. approx. recent memory-systems authors, 2024/2025. https://scholar.google.com/scholar?q=UHM:+Unified+Transferring+and+Pooling+over+Heterogeneous+GPU+Memories
19. GPUVM: GPU-driven unified virtual memory. approx. recent architecture authors, 2024/2025. https://scholar.google.com/scholar?q=GPUVM:+GPU-driven+unified+virtual+memory
20. Salus: Efficient security support for cxl-expanded gpu memory. approx. recent security/systems authors, 2024/2025. https://scholar.google.com/scholar?q=Salus:+Efficient+security+support+for+cxl-expanded+gpu+memory
21. AI Post Transformers: Vistara Brings CXL Memory to Hyperscale. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-11-vistara-brings-cxl-memory-to-hyperscale-b5199e.mp3
22. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
23. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3
24. AI Post Transformers: Affordable Large-Scale Decoding Through Model-System Co-Design. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.mp3

Interactive Visualization: CXL-GPU and Beyond Onboard Memory [https://podcast.do-not-panic.com/viz/2026-05-27-cxl-gpu-and-beyond-onboard-memory-98f5ff.html]

May 27, 2026, 1 h 0 min
