AI Post Transformers
This episode explores the 2008 Dragonfly network topology paper and why its ideas matter again for large-scale AI systems in 2026. It explains how Dragonfly uses high-radix routers organized into groups so that a minimally routed packet needs at most a local hop, a single global hop, and another local hop, which reduces the number of expensive long-distance optical links compared with flattened butterfly and folded Clos designs. The discussion highlights the paper's core argument that topology and routing must be co-designed around pin bandwidth, cable cost, power, and congestion. Under their assumptions, the authors claim roughly 20 percent lower cost than flattened butterfly and 52 percent lower cost than folded Clos beyond 16K nodes. Listeners may find the episode interesting because it connects an older supercomputing interconnect idea to modern TPU fabrics, mixture-of-experts traffic, all-to-all communication, and the growing reality that network design now directly shapes AI system performance.

Sources:
1. Dragonfly Topology for Scalable AI Networks. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/34926.pdf
2. Technology-Driven, Highly-Scalable Dragonfly Topology. John Kim, William J. Dally, Steve Scott, Dennis Abts, 2008. https://scholar.google.com/scholar?q=Technology-Driven,+Highly-Scalable+Dragonfly+Topology
3. Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks. John Kim, William J. Dally, Dennis Abts, 2007. https://scholar.google.com/scholar?q=Flattened+Butterfly:+A+Cost-Efficient+Topology+for+High-Radix+Networks
4. Topological Characterization of Hamming and Dragonfly Networks and Its Implications on Routing. Cristobal Camarero, Enrique Vallejo, Ramon Beivide, 2014. https://scholar.google.com/scholar?q=Topological+Characterization+of+Hamming+and+Dragonfly+Networks+and+Its+Implications+on+Routing
5. Slim Fly: A Cost Effective Low-Diameter Network Topology. Maciej Besta, Torsten Hoefler, 2014. https://scholar.google.com/scholar?q=Slim+Fly:+A+Cost+Effective+Low-Diameter+Network+Topology
6. Microarchitecture of a High-Radix Router. John Kim, William J. Dally, Brian Towles, Amit K. Gupta, 2005. https://scholar.google.com/scholar?q=Microarchitecture+of+a+High-Radix+Router
7. The BlackWidow High-Radix Clos Network. Steve Scott, Dennis Abts, John Kim, William J. Dally, 2006. https://scholar.google.com/scholar?q=The+BlackWidow+High-Radix+Clos+Network
8. Scalable High-Radix Router Microarchitecture Using a Network Switch Organization. Jung Ho Ahn, Young Hoon Son, John Kim, 2013. https://scholar.google.com/scholar?q=Scalable+High-Radix+Router+Microarchitecture+Using+a+Network+Switch+Organization
9. A Scheme for Fast Parallel Communication. L. G. Valiant, 1982. https://scholar.google.com/scholar?q=A+Scheme+for+Fast+Parallel+Communication
10. Indirect Adaptive Routing on Large Scale Interconnection Networks. Nan Jiang, John Kim, William J. Dally, 2009. https://scholar.google.com/scholar?q=Indirect+Adaptive+Routing+on+Large+Scale+Interconnection+Networks
11. Rationale and Challenges for Optical Interconnects to Electronic Chips. David A. B. Miller, 2000. https://scholar.google.com/scholar?q=Rationale+and+Challenges+for+Optical+Interconnects+to+Electronic+Chips
12. Optical Interconnects for High-Performance Computing. Marc A. Taubenblatt, 2012. https://scholar.google.com/scholar?q=Optical+Interconnects+for+High-Performance+Computing
13. Optical Interconnects for Extreme Scale Computing Systems. Sebastien Rumley, Meisam Bahadori, Robert Polster, Simon D. Hammond, David M. Calhoun, Ke Wen, Arun Rodrigues, Keren Bergman, 2017. https://scholar.google.com/scholar?q=Optical+Interconnects+for+Extreme+Scale+Computing+Systems
14. Mission Apollo: Landing Optical Circuit Switching at Datacenter Scale. Ryohei Urata, Hong Liu, Kevin Yasumura, Erji Mao, Jill Berger, Xiang Zhou, Cedric Lam, Roy Bannon, Darren Hutchinson, Daniel Nelson, Leon Poutievski, Arjun Singh, Joon Ong, Amin Vahdat, 2022. https://scholar.google.com/scholar?q=Mission+Apollo:+Landing+Optical+Circuit+Switching+at+Datacenter+Scale
15. Adaptive Routing in High-Radix Clos Network. John Kim, William J. Dally, Dennis Abts, 2006. https://doi.org/10.1145/1188455.1188552
16. Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies. Prithwish Basu, Liangyu Zhao, Jason Fantl, Siddharth Pal, Arvind Krishnamurthy, Joud Khoury, 2024. https://doi.org/10.1145/3625549.3658656
17. Toward lower-diameter large-scale HPC and data center networks with co-packaged optics. Pavlos Maniotis, Laurent Schares, Benjamin G. Lee, Marc A. Taubenblatt, Daniel M. Kuchta, 2021. https://scholar.google.com/scholar?q=Toward+lower-diameter+large-scale+HPC+and+data+center+networks+with+co-packaged+optics
18. Toward higher-radix switches with co-packaged optics for improved network locality in data center and HPC networks [Invited]. Pavlos Maniotis, Laurent Schares, Daniel M. Kuchta, Bengi Karacali, 2022. https://scholar.google.com/scholar?q=Toward+higher-radix+switches+with+co-packaged+optics+for+improved+network+locality+in+data+center+and+HPC+networks+[Invited]
19. Exploring the benefits of using co-packaged optics in data center and AI supercomputer networks: a simulation-based analysis [Invited]. Pavlos Maniotis, Daniel M. Kuchta, 2024. https://scholar.google.com/scholar?q=Exploring+the+benefits+of+using+co-packaged+optics+in+data+center+and+AI+supercomputer+networks:+a+simulation-based+analysis+[Invited]
20. Enhanced UGAL Routing Schemes for Dragonfly Networks. Ram Sharan Chaulagain, Xin Yuan, 2024. https://scholar.google.com/scholar?q=Enhanced+UGAL+Routing+Schemes+for+Dragonfly+Networks
21. On Selection Functions in Adaptive Routing. Alejandro Cano, Cristobal Camarero, Carmen Martinez, 2025. https://scholar.google.com/scholar?q=On+Selection+Functions+in+Adaptive+Routing
22. Co-packaged optics (CPO): status, challenges, and solutions. Min Tan and coauthors, 2023. https://scholar.google.com/scholar?q=Co-packaged+optics+(CPO):+status,+challenges,+and+solutions
23. AI Post Transformers: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-09-computation-bandwidth-memory-trade-offs-a83f2b.mp3
24. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
25. AI Post Transformers: Serving MoE Models with Disaggregated Expert Parallelism. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-19-serving-moe-models-with-disaggregated-ex-6979d2.mp3
26. AI Post Transformers: Lossless Sparse Deltas for RL Networks. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-15-lossless-sparse-deltas-for-rl-networks-84d676.mp3
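
For readers who want to see the local, global, local hop pattern from the episode summary in concrete terms, here is a minimal Python sketch. It is not taken from the episode. The parameters `p`, `a`, and `h` follow the notation commonly used for Dragonfly (terminals per router, routers per group, global channels per router), and the sketch assumes that routers inside a group are fully connected and that each pair of groups shares exactly one global channel. The names `Dragonfly` and `minimal_route` are hypothetical helpers invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dragonfly:
    """Illustrative Dragonfly sizing (assumed notation, not from the episode)."""
    p: int  # terminals (compute nodes) attached to each router
    a: int  # routers in each group
    h: int  # global channels on each router

    @property
    def radix(self) -> int:
        # Ports per router: p terminal + (a - 1) local + h global.
        return self.p + (self.a - 1) + self.h

    @property
    def max_groups(self) -> int:
        # A group owns a * h global channels, one per other group.
        return self.a * self.h + 1

    @property
    def max_nodes(self) -> int:
        return self.a * self.p * self.max_groups


def minimal_route(src, dst, gateway_out=None, gateway_in=None):
    """Return the hop types on a minimal path.

    src, dst: (group, router) pairs.
    gateway_out: router in the source group that owns the global
        channel toward the destination group (assumed known).
    gateway_in: router in the destination group where that
        channel lands (assumed known).
    """
    (src_group, src_router), (dst_group, dst_router) = src, dst
    hops = []
    if src_group == dst_group:
        # Same group: at most one local hop (fully connected group).
        if src_router != dst_router:
            hops.append("local")
        return hops
    if src_router != gateway_out:
        hops.append("local")   # reach the router with the global link
    hops.append("global")      # the single long-distance hop
    if gateway_in != dst_router:
        hops.append("local")   # reach the destination router
    return hops


if __name__ == "__main__":
    net = Dragonfly(p=2, a=4, h=2)
    print(net.radix, net.max_groups, net.max_nodes)
    print(minimal_route((0, 1), (3, 2), gateway_out=0, gateway_in=5))
```

With `p=2, a=4, h=2` the sketch reports 7 ports per router, 9 groups, and 72 nodes, and the example route is one local hop, one global hop, and one local hop. Adaptive or Valiant-style routing, which the sources above cover, would add hops through an intermediate group and is deliberately left out here.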