AI Post Transformers
This episode explores a 2024 paper on the LPU, a custom processor designed specifically for large language model inference, with an emphasis on reducing the per-token delay that users notice in interactive systems. It explains why autoregressive decoding is often limited by memory movement and synchronization rather than raw compute, making conventional GPU strengths less decisive in small-batch, user-facing generation. The discussion highlights the paper’s full-stack argument: a specialized chip, a supporting software stack called HyperDex, and a multi-device link meant to preserve low latency while scaling across processors. Listeners would find it interesting because it reframes AI hardware performance around real conversational responsiveness and digs into whether the paper’s bold efficiency and scaling claims actually hold up under careful comparison.

Sources:
1. LPU Chip for Low-Latency LLM Inference. https://arxiv.org/pdf/2408.07326
2. DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation. Seongmin Hong, et al., 2022. https://scholar.google.com/scholar?q=DFX:+A+Low-latency+Multi-FPGA+Appliance+for+Accelerating+Transformer-based+Text+Generation
3. SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning. Hanrui Wang, Zhekai Zhang, Song Han, 2021. https://scholar.google.com/scholar?q=SpAtten:+Efficient+Sparse+Attention+Architecture+with+Cascade+Token+and+Head+Pruning
4. A Software-Defined Tensor Streaming Multiprocessor for Large-Scale Machine Learning. Dennis Abts, et al., 2022. https://scholar.google.com/scholar?q=A+Software-Defined+Tensor+Streaming+Multiprocessor+for+Large-Scale+Machine+Learning
5. Efficient Memory Management for Large Language Model Serving with PagedAttention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, et al., 2023. https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
6. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. Yinmin Zhong, Shengyu Zhao, et al., 2024. https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving
7. RazorAttention: Efficient KV Cache Compression Through Retrieval Heads. Hanlin Tang et al., 2024. https://scholar.google.com/scholar?q=RazorAttention:+Efficient+KV+Cache+Compression+Through+Retrieval+Heads
8. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning. Yu Fu et al., 2024. https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning
9. Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction. Haoran Qiu et al., 2024. https://scholar.google.com/scholar?q=Efficient+Interactive+LLM+Serving+with+Proxy+Model-based+Sequence+Length+Prediction
10. Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving. Ke Cheng et al., 2024. https://scholar.google.com/scholar?q=Slice-Level+Scheduling+for+High+Throughput+and+Load+Balanced+LLM+Serving
11. Deferred Continuous Batching in Resource-Efficient Large Language Model Serving. Yongjun He, Yao Lu, Gustavo Alonso, 2024. https://scholar.google.com/scholar?q=Deferred+Continuous+Batching+in+Resource-Efficient+Large+Language+Model+Serving
12. TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference. Raja Gond, Nipun Kwatra, Ramachandran Ramjee, 2025. https://scholar.google.com/scholar?q=TokenWeave:+Efficient+Compute-Communication+Overlap+for+Distributed+LLM+Inference
13. Characterizing Communication Patterns in Distributed Large Language Model Inference. Lang Xu et al., 2025. https://scholar.google.com/scholar?q=Characterizing+Communication+Patterns+in+Distributed+Large+Language+Model+Inference
14. Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference. Pol G. Recasens et al., 2025. https://scholar.google.com/scholar?q=Mind+the+Memory+Gap:+Unveiling+GPU+Bottlenecks+in+Large-Batch+LLM+Inference
15. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
16. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3
17. AI Post Transformers: Affordable Large-Scale Decoding Through Model-System Co-Design. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.mp3
18. AI Post Transformers: Speculative Decoding in Real vLLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
19. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3
20. AI Post Transformers: LAPS for Length-Aware LLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-05-laps-for-length-aware-llm-serving-0c6149.mp3
21. AI Post Transformers: Splitwise: Phase-Split LLM Inference. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
22. AI Post Transformers: JANUS for Scalable MoE Inference. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-15-janus-for-scalable-moe-inference-78ae30.mp3

Interactive Visualization: LPU Chip for Low-Latency LLM Inference [https://podcast.do-not-panic.com/viz/2026-05-20-lpu-chip-for-low-latency-llm-inference-be13c3.html]
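
As a rough illustration of the episode's point that small-batch autoregressive decoding tends to be limited by memory movement rather than raw compute, the sketch below compares two lower bounds on the time for one decode step: the time to stream the model weights from memory once, and the time to perform the matching multiply-adds. Everything here is an assumption for illustration only: the function name `decode_step_bounds`, the 7-billion-parameter model, the 2 bytes per weight, and the bandwidth and peak-compute figures are hypothetical round numbers, not values from the paper or from any specific chip. The estimate also ignores KV-cache reads, activations, and the synchronization costs the episode mentions.

```python
def decode_step_bounds(params, bytes_per_param, mem_bw_bytes_s, peak_flops, batch=1):
    """Return (memory_time, compute_time) lower bounds in seconds for one decode step.

    One decode step produces one new token for each sequence in the batch.
    """
    # Each weight is read from memory once per step, regardless of batch size.
    bytes_moved = params * bytes_per_param
    # Roughly 2 FLOPs (one multiply, one add) per weight per generated token.
    flops = 2 * params * batch
    return bytes_moved / mem_bw_bytes_s, flops / peak_flops


# Hypothetical round numbers, chosen only to make the arithmetic easy to follow.
PARAMS = 7e9            # assumed model size (parameters)
BYTES_PER_PARAM = 2     # assumed 16-bit weights
MEM_BW = 2e12           # assumed memory bandwidth, bytes per second
PEAK_FLOPS = 300e12     # assumed peak compute, FLOP per second

t_mem, t_compute = decode_step_bounds(PARAMS, BYTES_PER_PARAM, MEM_BW, PEAK_FLOPS)
print(f"memory-bound time per token:  {t_mem * 1e3:.3f} ms")
print(f"compute-bound time per token: {t_compute * 1e3:.3f} ms")
print(f"memory time / compute time:   {t_mem / t_compute:.0f}x")

# Batch size at which the two bounds meet under these assumptions.
break_even_batch = BYTES_PER_PARAM * PEAK_FLOPS / (2 * MEM_BW)
print(f"break-even batch size:        {break_even_batch:.0f}")
```

Under these assumed numbers, a single-sequence decode step spends far longer waiting on weight traffic than on arithmetic, and the two bounds only meet at a batch size in the hundreds. That gap is why per-token latency in interactive, small-batch serving tracks memory bandwidth more closely than peak FLOPs.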