Inside Open Networking by STORDIS – the podcast where tech meets real life
Recorded live at the OCP Regional Summit Dublin 2025, this session features Kamran Naqvi (Broadcom) exploring how next-generation Ethernet networking is accelerating AI infrastructure, and what enterprises need to build practical, scalable GPU clusters with SONiC.

From AI traffic patterns (elephant flows, low entropy, tail latency) to real-world topology and cabling choices, Kamran breaks down what makes AI fabrics different from traditional data centers, and how Enterprise SONiC enhancements (smarter hashing, adaptive routing, better load distribution) help remove networking as the bottleneck.

Key Takeaways
- The unique networking demands of AI back-end fabrics: elephant flows, poor entropy, RDMA retransmission challenges, and tail latency
- The four fabrics in AI infrastructure, and why the back-end fabric is where “the uniqueness” shows up
- Scale-up vs. scale-out networking: what each does and where enterprise designs focus today
- Enterprise-ready GPU cluster designs and scalable Ethernet fabrics for AI
- SONiC improvements for AI workloads:
  - Advanced hashing (including deeper header visibility for better entropy)
  - Adaptive routing approaches, including flowlet-based spraying for better balancing with minimal reordering risk
- Practical best practices for cable/optics selection, rack/topology layout, and failure recovery planning
- Why Ethernet often beats InfiniBand in production AI deployments, including faster failover behavior and scalability

Session outline
00:03 – Intro: Kamran’s role at Broadcom; session focus
00:21 – What you’ll learn: AI network needs, scalable enterprise designs
01:14 – AI infrastructure fabrics + scale-up vs scale-out (enterprise reality)
07:54 – AI as distributed compute: why networking drives job completion time
11:16 – AI traffic patterns: elephant flows, RDMA pain, tail latency
14:08 – Fabric challenges: collisions, failures, incast + required capabilities
16:04 – Broadcom approaches + why Ethernet often wins vs InfiniBand (incl. failover)
18:28 – Topologies: Clos vs rail-optimized; when spines still matter
23:08 – Cabling/optics best practices: DAC first, then linear pluggables; CPO intro
27:09 – Reference designs: rack layouts and cluster scaling examples
32:04 – SONiC for AI: advanced hashing + adaptive routing (flowlet spray)
36:07 – Wrap-up: QR code for reference architecture + materials; thanks

Download
Download the Broadcom AI Reference Architecture (via the QR code shown during the session).

Stay Connected
📬 Questions or support: support@stordis.com | 🌐 www.stordis.com

Let’s get social
💻 Blog: https://stordis.com/blog/
📘 Facebook: https://www.facebook.com/people/STORDIS-GmbH/100057058555819/
📸 Instagram: https://www.instagram.com/stordis_open_networking/
👥 LinkedIn: https://www.linkedin.com/company/stordis/
🐦 X: https://twitter.com/STORDIS_GmbH/