AI Post Transformers
This episode explores KVzip, a query-agnostic method for compressing long-context KV caches so that a model can reuse a shared document, codebase, or memory bank across many later questions instead of optimizing the cache for a single query. It explains why the KV cache has become a major systems bottleneck, including the striking example that a 120,000-token context for Qwen2.5-14B can require more memory for the cache than for the model weights themselves. The discussion contrasts KVzip with exact prefix caching and with query-aware pruning methods such as SnapKV, then breaks down KVzip's core idea: replay the original context, measure which cached states receive the most attention during reconstruction, and keep those as durable memory. Listeners may find it interesting because the paper ties a clean systems insight to concrete gains, reporting roughly 3-4x smaller decoding-time KV caches and about 2x lower FlashAttention latency across LLaMA, Qwen, and Gemma models on very long contexts.

Sources:
1. KVzip for Query-Agnostic KV Cache Compression. https://arxiv.org/pdf/2505.23416
2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, 2018. https://scholar.google.com/scholar?q=BERT:+Pre-training+of+Deep+Bidirectional+Transformers+for+Language+Understanding
3. SnapKV: LLM Knows What You are Looking for Before Generation. Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen, 2024. https://scholar.google.com/scholar?q=SnapKV:+LLM+Knows+What+You+are+Looking+for+Before+Generation
4. KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction. Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song, 2025. https://scholar.google.com/scholar?q=KVzip:+Query-Agnostic+KV+Cache+Compression+with+Context+Reconstruction
5. Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving. Wei Gao, Xinyu Zhou, Peng Sun, Tianwei Zhang, Yonggang Wen, 2025. https://scholar.google.com/scholar?q=Rethinking+Key-Value+Cache+Compression+Techniques+for+Large+Language+Model+Serving
6. SCBench: A KV Cache-Centric Analysis of Long-Context Methods. Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, and others, 2025. https://scholar.google.com/scholar?q=SCBench:+A+KV+Cache-Centric+Analysis+of+Long-Context+Methods
7. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads. Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han, 2025. https://scholar.google.com/scholar?q=DuoAttention:+Efficient+Long-Context+LLM+Inference+with+Retrieval+and+Streaming+Heads
8. Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores. Vivek Chari, Benjamin Van Durme, 2025. https://scholar.google.com/scholar?q=Compactor:+Calibrated+Query-Agnostic+KV+Cache+Compression+with+Approximate+Leverage+Scores
9. No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization. June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee, 2024. https://scholar.google.com/scholar?q=No+Token+Left+Behind:+Reliable+KV+Cache+Compression+via+Importance-Aware+Mixed+Precision+Quantization
10. Safety Alignment Should Be Made More Than Just a Few Tokens Deep. Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson, 2025. https://scholar.google.com/scholar?q=Safety+Alignment+Should+Be+Made+More+Than+Just+a+Few+Tokens+Deep
11. The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference. Kaleem Ullah Qasim et al., 2026. https://arxiv.org/abs/2603.19664
12. DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity. Jitai Hao et al., 2026. https://arxiv.org/abs/2602.08005
13. ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs. Yanlin Qi et al., 2026. https://arxiv.org/abs/2602.07721
14. HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference. Zhiyuan Shi et al., 2026. https://arxiv.org/abs/2601.13684
15. R-KV: Redundancy-aware KV Cache Compression for Reasoning Models. Zefan Cai et al., 2025. https://arxiv.org/abs/2505.24133
16. Hold Onto That Thought: Assessing KV Cache Compression On Reasoning. Minghui Liu et al., 2025. https://arxiv.org/abs/2512.12008
17. SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning. Sanjay Kariyappa and G. Edward Suh, 2026. https://arxiv.org/abs/2602.22603
18. AI Post Transformers: PackKV Lossy Compression for KV Caches. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-04-packkv-lossy-compression-for-kv-caches-b37bce.mp3
19. AI Post Transformers: Affordable Large-Scale Decoding Through Model-System Co-Design. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.mp3
20. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3
21. AI Post Transformers: DeepSeek-V4 and Practical Million-Token Context. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-25-deepseek-v4-and-practical-million-token-6f4de1.mp3
22. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3
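
For listeners who want a concrete picture of the core idea described in this episode (replay the context, see which cached states receive the most attention during reconstruction, keep those), here is a toy sketch in plain Python. It is not the paper's implementation. It assumes the attention weights from the replay pass are already available as a small matrix, it scores each cached position by the maximum attention it receives (an aggregation chosen here for illustration), and the names `score_kv_entries` and `select_kept_positions` are hypothetical.

```python
from typing import List


def score_kv_entries(attn: List[List[float]]) -> List[float]:
    """Score each cached position by the most attention it receives.

    attn[q][k] is the attention weight from reconstruction query q
    to cached context position k (toy, single-head layout; assumed input).
    """
    if not attn:
        return []
    num_keys = len(attn[0])
    # Max over queries: a position matters if any replay step relies on it.
    return [max(row[k] for row in attn) for k in range(num_keys)]


def select_kept_positions(scores: List[float], num_keep: int) -> List[int]:
    """Return the indices of the num_keep highest-scoring positions, in order."""
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return sorted(ranked[:num_keep])


# Toy replay: 3 reconstruction queries attending over 5 cached positions.
toy_attn = [
    [0.70, 0.05, 0.05, 0.15, 0.05],
    [0.10, 0.05, 0.60, 0.20, 0.05],
    [0.05, 0.05, 0.10, 0.75, 0.05],
]

toy_scores = score_kv_entries(toy_attn)
kept = select_kept_positions(toy_scores, num_keep=3)
print(kept)  # positions 0, 2 and 3 survive; 1 and 4 are evicted
```

Because the scores come from reconstructing the context itself and not from any particular question, the same kept positions can be reused for every later query, which is what makes the approach query-agnostic.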
664 episodes