DFX: Multi-FPGA Acceleration for Transformer Inference

Description

This episode explores the DFX system, a four-FPGA appliance designed to accelerate transformer-based text generation by targeting a key weakness of GPUs: low-batch, token-by-token decode. It explains the difference between prompt processing and sequential generation, connects the paper’s older terminology to today’s prefill/decode framing, and shows why autoregressive inference often leaves GPU hardware underused even when training runs efficiently in parallel. The discussion also breaks down how DFX uses hardware-aware model parallelism and end-to-end accelerator design, rather than only speeding up isolated transformer subcomponents, to argue for lower latency and better energy and cost efficiency than a four-V100 GPU server. The episode offers a clear historical perspective on transformer serving and questions how much of the reported advantage comes from FPGA specialization rather than from how fairly the GPU baseline was set up.

Sources:
1. DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation. Seongmin Hong, Seungjae Moon, Junsoo Kim, Sungjae Lee, Minsub Kim, Dongsoo Lee, Joo-Young Kim, 2022. http://arxiv.org/abs/2209.10797
2. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. Yanping Huang, Youlong Cheng, Ankur Bapna, Quoc V. Le, Yonghui Wu, Zhifeng Chen, and others, 2019. https://scholar.google.com/scholar?q=GPipe:+Efficient+Training+of+Giant+Neural+Networks+using+Pipeline+Parallelism
3. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2020. https://scholar.google.com/scholar?q=Megatron-LM:+Training+Multi-Billion+Parameter+Language+Models+Using+Model+Parallelism
4. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Noam Shazeer, Zhifeng Chen, and others, 2020. https://scholar.google.com/scholar?q=GShard:+Scaling+Giant+Models+with+Conditional+Computation+and+Automatic+Sharding
5. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia, and others, 2021. https://scholar.google.com/scholar?q=Efficient+Large-Scale+Language+Model+Training+on+GPU+Clusters+Using+Megatron-LM
6. Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. https://scholar.google.com/scholar?q=Attention+Is+All+You+Need
7. FTRANS: Energy-Efficient Acceleration of Transformers using FPGA. Jingcheng Rao, Yuchen Shao, Ke Wang, Zhihao Zhu, Xuehai Qian, Yiyu Shi, 2020. https://scholar.google.com/scholar?q=FTRANS:+Energy-Efficient+Acceleration+of+Transformers+using+FPGA
8. Fast Inference from Transformers via Speculative Decoding. Yaniv Leviathan, Matan Kalman, Yossi Matias, 2022. https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding
9. PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference. Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao, 2024. https://scholar.google.com/scholar?q=PyramidInfer:+Pyramid+KV+Cache+Compression+for+High-throughput+LLM+Inference
10. ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference. Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Bo Li, Xuming Hu, Xiaowen Chu, 2025. https://scholar.google.com/scholar?q=ChunkKV:+Semantic-Preserving+KV+Cache+Compression+for+Efficient+Long-Context+LLM+Inference
11. Cost-Optimal Grouped-Query Attention for Long-Context LLMs. Yingfa Chen, Yutong Wu, Xu Han, Zhiyuan Liu, Maosong Sun, 2025. https://scholar.google.com/scholar?q=Cost-Optimal+Grouped-Query+Attention+for+Long-Context+LLMs
12. Optimised Grouped-Query Attention Mechanism for Transformers. Yuang Chen, Cheng Zhang, Xitong Gao, Robert D. Mullins, George A. Constantinides, Yiren Zhao, 2024. https://scholar.google.com/scholar?q=Optimised+Grouped-Query+Attention+Mechanism+for+Transformers
13. Prefill-Decode Aggregation or Disaggregation? Unifying Both for Goodput-Optimized LLM Serving. Chao Wang, Pengfei Zuo, Zhangyu Chen, Yunkai Liang, Zhou Yu, Ming-Chang Yang, 2025. https://scholar.google.com/scholar?q=Prefill-Decode+Aggregation+or+Disaggregation?+Unifying+Both+for+Goodput-Optimized+LLM+Serving
14. Nexus: Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving. Xiaoxiang Shi, Colin Cai, Junjia Du, Zhihao Jia, 2025. https://scholar.google.com/scholar?q=Nexus:+Proactive+Intra-GPU+Disaggregation+of+Prefill+and+Decode+in+LLM+Serving
15. SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference. Hengrui Zhang, Pratyush Patel, August Ning, David Wentzlaff, 2025. https://scholar.google.com/scholar?q=SPAD:+Specialized+Prefill+and+Decode+Hardware+for+Disaggregated+LLM+Inference
16. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3
17. AI Post Transformers: Affordable Large-Scale Decoding Through Model-System Co-Design. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.mp3
18. AI Post Transformers: LAPS for Length-Aware LLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-05-laps-for-length-aware-llm-serving-0c6149.mp3
19. AI Post Transformers: Prefill-as-a-Service for Cross-Datacenter KV Cache. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-19-prefill-as-a-service-for-cross-datacente-7560be.mp3
20. AI Post Transformers: Speculative Decoding in Real vLLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
21. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
22. AI Post Transformers: Caffeine: A Unified FPGA for CNNs. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-06-caffeine-a-unified-fpga-for-cnns-e8acbe.mp3

Interactive Visualization: DFX: Multi-FPGA Acceleration for Transformer Inference [https://podcast.do-not-panic.com/viz/2026-05-22-dfx-multi-fpga-acceleration-for-transfor-3266ea.html]

CXL-GPU and Beyond Onboard Memory

This episode explores a systems paper that extends GPU memory through CXL-attached DRAM and SSDs, asking whether accelerators can reach beyond on-board HBM without the usual overhead of software-driven memory migration. It explains CXL, memory disaggregation, and the difference between local GPU memory, host-managed memory, CXL memory, and storage-backed expansion, while grounding the discussion in earlier work such as Infiniswap, DirectCXL, and Microsoft’s Pond. The conversation focuses on the paper’s main technical claim: custom GPU-side hardware, including RTL CXL controllers, multiple root ports, and latency-hiding policies, could make expanded memory tiers more usable than approaches like UVM or GPUDirect Storage. The speakers highlight the engineering ambition while pressing on a central unresolved question: whether these ideas truly help real transformer workloads, rather than only looking good on more conventional benchmark traces.

Sources:
1. CXL-GPU: Pushing GPU Memory Boundaries with the Integration of CXL Technologies. Donghyun Gouk, Seungkwan Kang, Seungjun Lee, Jiseon Kim, Kyungkuk Nam, Eojin Ryu, Sangwon Lee, Dongpyung Kim, Junhyeok Jang, Hanyeoreum Bae, Myoungsoo Jung, 2025. http://arxiv.org/abs/2506.15601
2. Disaggregated Memory for Expansion and Sharing in Blade Servers. Kevin Lim, Jichuan Chang, Trevor Mudge, Parthasarathy Ranganathan, Steven K. Reinhardt, Thomas F. Wenisch, 2009. https://scholar.google.com/scholar?q=Disaggregated+Memory+for+Expansion+and+Sharing+in+Blade+Servers
3. Efficient Memory Disaggregation with Infiniswap. Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, Kang G. Shin, 2017. https://scholar.google.com/scholar?q=Efficient+Memory+Disaggregation+with+Infiniswap
4. Direct Access, High-Performance Memory Disaggregation with DirectCXL. Donghyun Gouk, Sangwon Lee, Miryeong Kwon, Myoungsoo Jung, 2022. https://scholar.google.com/scholar?q=Direct+Access,+High-Performance+Memory+Disaggregation+with+DirectCXL
5. Pond: CXL-Based Memory Pooling Systems for Cloud Platforms. Huaicheng Li, Daniel S. Berger, Lisa Hsu, Daniel Ernst, Pantea Zardoshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, Mark D. Hill, Marcus Fontoura, Ricardo Bianchini, 2023. https://scholar.google.com/scholar?q=Pond:+CXL-Based+Memory+Pooling+Systems+for+Cloud+Platforms
6. SMT: Software-Defined Memory Tiering for Heterogeneous Computing Systems with CXL Memory Expander. K. Kim, H. Kim, J. So, W. Lee, J. Im, S. Park, J. Cho, H. Song, 2023. https://scholar.google.com/scholar?q=SMT:+Software-Defined+Memory+Tiering+for+Heterogeneous+Computing+Systems+with+CXL+Memory+Expander
7. TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory. Hasan Al Maruf, Hao Wang, Abhishek Dhanotia, Johannes Weiner, Niket Agarwal, Pallab Bhattacharya, Chris Petersen, Mosharaf Chowdhury, Shobhit Kanaujia, Prakash Chauhan, 2023. https://scholar.google.com/scholar?q=TPP:+Transparent+Page+Placement+for+CXL-Enabled+Tiered-Memory
8. NVMMU: A Non-volatile Memory Management Unit for Heterogeneous GPU-SSD Architectures. Jie Zhang, David Donofrio, John Shalf, Mahmut T. Kandemir, Myoungsoo Jung, 2015. https://scholar.google.com/scholar?q=NVMMU:+A+Non-volatile+Memory+Management+Unit+for+Heterogeneous+GPU-SSD+Architectures
9. Overcoming the Memory Wall with CXL-Enabled SSDs. Shao-Peng Yang, Minjae Kim, Sanghyun Nam, Juhyung Park, Jin-yong Choi, Eyee Hyun Nam, Eunji Lee, Sungjin Lee, Bryan S. Kim, 2023. https://scholar.google.com/scholar?q=Overcoming+the+Memory+Wall+with+CXL-Enabled+SSDs
10. NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering. Zhe Zhou, Yiqi Chen, Tao Zhang, Yang Wang, Ran Shu, Shuotao Xu, Peng Cheng, Lei Qu, Yongqiang Xiong, Jie Zhang, Guangyu Sun, 2024. https://scholar.google.com/scholar?q=NeoMem:+Hardware/Software+Co-Design+for+CXL-Native+Memory+Tiering
11. ARIADNE: Adaptive UVM Management for Efficient GPU Memory Oversubscription. Authors not specified, 2024/2025. https://scholar.google.com/scholar?q=ARIADNE:+Adaptive+UVM+Management+for+Efficient+GPU+Memory+Oversubscription
12. MOST: Memory Oversubscription-Aware Scheduling for Tensor Migration on GPU Unified Storage. Authors not specified, 2024/2025. https://scholar.google.com/scholar?q=MOST:+Memory+Oversubscription-Aware+Scheduling+for+Tensor+Migration+on+GPU+Unified+Storage
13. Selective Memory Compression for GPU Memory Oversubscription Management. Authors not specified, 2024/2025. https://scholar.google.com/scholar?q=Selective+memory+compression+for+GPU+memory+oversubscription+management
14. Phoenix: A Refactored I/O Stack for GPU Direct Storage without Phony Buffers. Authors not specified, 2024/2025. https://scholar.google.com/scholar?q=Phoenix:+A+Refactored+I/O+Stack+for+GPU+Direct+Storage+without+Phony+Buffers
15. Managing Scalable Direct Storage Accesses for GPUs with GoFS. Authors not specified, 2024/2025. https://scholar.google.com/scholar?q=Managing+Scalable+Direct+Storage+Accesses+for+GPUs+with+GoFS
16. CCCL: Node-Spanning GPU Collectives with CXL Memory Pooling. Authors not specified, 2024/2025. https://scholar.google.com/scholar?q=CCCL:+Node-Spanning+GPU+Collectives+with+CXL+Memory+Pooling
17. Efficient Tensor Offloading Based on CXL Memory Pool For Extreme Scale Deep Learning. Authors not specified, 2024/2025. https://scholar.google.com/scholar?q=Efficient+Tensor+Offloading+Based+on+CXL+Memory+Pool+For+Extreme+Scale+Deep+Learning
18. UHM: Unified Transferring and Pooling over Heterogeneous GPU Memories. Authors not specified, 2024/2025. https://scholar.google.com/scholar?q=UHM:+Unified+Transferring+and+Pooling+over+Heterogeneous+GPU+Memories
19. GPUVM: GPU-Driven Unified Virtual Memory. Authors not specified, 2024/2025. https://scholar.google.com/scholar?q=GPUVM:+GPU-driven+unified+virtual+memory
20. Salus: Efficient Security Support for CXL-Expanded GPU Memory. Authors not specified, 2024/2025. https://scholar.google.com/scholar?q=Salus:+Efficient+security+support+for+cxl-expanded+gpu+memory
21. AI Post Transformers: Vistara Brings CXL Memory to Hyperscale. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-11-vistara-brings-cxl-memory-to-hyperscale-b5199e.mp3
22. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
23. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3
24. AI Post Transformers: Affordable Large-Scale Decoding Through Model-System Co-Design. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.mp3

Interactive Visualization: CXL-GPU and Beyond Onboard Memory [https://podcast.do-not-panic.com/viz/2026-05-27-cxl-gpu-and-beyond-onboard-memory-98f5ff.html]

May 27, 2026 · 1 h 0 min
