LPU Chip for Low-Latency LLM Inference

CXL-GPU and Beyond Onboard Memory

This episode explores a systems paper that extends GPU memory through CXL-attached DRAM and SSDs, asking whether accelerators can reach beyond on-board HBM without the usual overhead of software-driven memory migration. It explains CXL, memory disaggregation, and the difference between local GPU memory, host-managed memory, CXL memory, and storage-backed expansion, while grounding the discussion in earlier work such as Infiniswap, DirectCXL, and Microsoft’s Pond. The conversation focuses on the paper’s main technical claim: custom GPU-side hardware, including RTL CXL controllers, multiple root ports, and latency-hiding policies, could make expanded memory tiers more usable than approaches like UVM or GPUDirect Storage. It is interesting because the speakers both highlight the engineering ambition and press on a central unresolved question: whether these ideas truly help real transformer workloads, rather than only looking good on more conventional benchmark traces.

Sources:

1. CXL-GPU: Pushing GPU Memory Boundaries with the Integration of CXL Technologies. Donghyun Gouk, Seungkwan Kang, Seungjun Lee, Jiseon Kim, Kyungkuk Nam, Eojin Ryu, Sangwon Lee, Dongpyung Kim, Junhyeok Jang, Hanyeoreum Bae, Myoungsoo Jung, 2025. http://arxiv.org/abs/2506.15601
2. Disaggregated Memory for Expansion and Sharing in Blade Servers. Kevin Lim, Jichuan Chang, Trevor Mudge, Parthasarathy Ranganathan, Steven K. Reinhardt, Thomas F. Wenisch, 2009. https://scholar.google.com/scholar?q=Disaggregated+Memory+for+Expansion+and+Sharing+in+Blade+Servers
3. Efficient Memory Disaggregation with Infiniswap. Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, Kang G. Shin, 2017. https://scholar.google.com/scholar?q=Efficient+Memory+Disaggregation+with+Infiniswap
4. Direct Access, High-Performance Memory Disaggregation with DirectCXL. Donghyun Gouk, Sangwon Lee, Miryeong Kwon, Myoungsoo Jung, 2022. https://scholar.google.com/scholar?q=Direct+Access,+High-Performance+Memory+Disaggregation+with+DirectCXL
5. Pond: CXL-Based Memory Pooling Systems for Cloud Platforms. Huaicheng Li, Daniel S. Berger, Lisa Hsu, Daniel Ernst, Pantea Zardoshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, Mark D. Hill, Marcus Fontoura, Ricardo Bianchini, 2023. https://scholar.google.com/scholar?q=Pond:+CXL-Based+Memory+Pooling+Systems+for+Cloud+Platforms
6. SMT: Software-Defined Memory Tiering for Heterogeneous Computing Systems with CXL Memory Expander. K. Kim, H. Kim, J. So, W. Lee, J. Im, S. Park, J. Cho, H. Song, 2023. https://scholar.google.com/scholar?q=SMT:+Software-Defined+Memory+Tiering+for+Heterogeneous+Computing+Systems+with+CXL+Memory+Expander
7. TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory. Hasan Al Maruf, Hao Wang, Abhishek Dhanotia, Johannes Weiner, Niket Agarwal, Pallab Bhattacharya, Chris Petersen, Mosharaf Chowdhury, Shobhit Kanaujia, Prakash Chauhan, 2023. https://scholar.google.com/scholar?q=TPP:+Transparent+Page+Placement+for+CXL-Enabled+Tiered-Memory
8. NVMMU: A Non-volatile Memory Management Unit for Heterogeneous GPU-SSD Architectures. Jie Zhang, David Donofrio, John Shalf, Mahmut T. Kandemir, Myoungsoo Jung, 2015. https://scholar.google.com/scholar?q=NVMMU:+A+Non-volatile+Memory+Management+Unit+for+Heterogeneous+GPU-SSD+Architectures
9. Overcoming the Memory Wall with CXL-Enabled SSDs. Shao-Peng Yang, Minjae Kim, Sanghyun Nam, Juhyung Park, Jin-yong Choi, Eyee Hyun Nam, Eunji Lee, Sungjin Lee, Bryan S. Kim, 2023. https://scholar.google.com/scholar?q=Overcoming+the+Memory+Wall+with+CXL-Enabled+SSDs
10. NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering. Zhe Zhou, Yiqi Chen, Tao Zhang, Yang Wang, Ran Shu, Shuotao Xu, Peng Cheng, Lei Qu, Yongqiang Xiong, Jie Zhang, Guangyu Sun, 2024. https://scholar.google.com/scholar?q=NeoMem:+Hardware/Software+Co-Design+for+CXL-Native+Memory+Tiering
11. ARIADNE: Adaptive UVM Management for Efficient GPU Memory Oversubscription. approx. recent systems authors, 2024/2025. https://scholar.google.com/scholar?q=ARIADNE:+Adaptive+UVM+Management+for+Efficient+GPU+Memory+Oversubscription
12. MOST: Memory Oversubscription-Aware Scheduling for Tensor Migration on GPU Unified Storage. approx. recent systems authors, 2024/2025. https://scholar.google.com/scholar?q=MOST:+Memory+Oversubscription-Aware+Scheduling+for+Tensor+Migration+on+GPU+Unified+Storage
13. Selective memory compression for GPU memory oversubscription management. approx. recent architecture authors, 2024/2025. https://scholar.google.com/scholar?q=Selective+memory+compression+for+GPU+memory+oversubscription+management
14. Phoenix: A Refactored I/O Stack for GPU Direct Storage without Phony Buffers. approx. recent storage/systems authors, 2024/2025. https://scholar.google.com/scholar?q=Phoenix:+A+Refactored+I/O+Stack+for+GPU+Direct+Storage+without+Phony+Buffers
15. Managing Scalable Direct Storage Accesses for GPUs with GoFS. approx. recent storage/systems authors, 2024/2025. https://scholar.google.com/scholar?q=Managing+Scalable+Direct+Storage+Accesses+for+GPUs+with+GoFS
16. CCCL: Node-Spanning GPU Collectives with CXL Memory Pooling. approx. recent distributed systems authors, 2024/2025. https://scholar.google.com/scholar?q=CCCL:+Node-Spanning+GPU+Collectives+with+CXL+Memory+Pooling
17. Efficient Tensor Offloading Based on CXL Memory Pool For Extreme Scale Deep Learning. approx. recent ML systems authors, 2024/2025. https://scholar.google.com/scholar?q=Efficient+Tensor+Offloading+Based+on+CXL+Memory+Pool+For+Extreme+Scale+Deep+Learning
18. UHM: Unified Transferring and Pooling over Heterogeneous GPU Memories. approx. recent memory-systems authors, 2024/2025. https://scholar.google.com/scholar?q=UHM:+Unified+Transferring+and+Pooling+over+Heterogeneous+GPU+Memories
19. GPUVM: GPU-driven unified virtual memory. approx. recent architecture authors, 2024/2025. https://scholar.google.com/scholar?q=GPUVM:+GPU-driven+unified+virtual+memory
20. Salus: Efficient security support for cxl-expanded gpu memory. approx. recent security/systems authors, 2024/2025. https://scholar.google.com/scholar?q=Salus:+Efficient+security+support+for+cxl-expanded+gpu+memory
21. AI Post Transformers: Vistara Brings CXL Memory to Hyperscale. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-11-vistara-brings-cxl-memory-to-hyperscale-b5199e.mp3
22. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
23. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3
24. AI Post Transformers: Affordable Large-Scale Decoding Through Model-System Co-Design. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.mp3

Interactive Visualization: CXL-GPU and Beyond Onboard Memory [https://podcast.do-not-panic.com/viz/2026-05-27-cxl-gpu-and-beyond-onboard-memory-98f5ff.html]

Yesterday · 1 h 0 min
