Ethernet Based AI Cluster Fabric - Performance Improvement - Tuning in SONiC | OCP Dublin 2025

19 min · Jan 28, 2026

Description

Recorded live at the OCP Regional Summit Dublin 2025, this episode features Nanda Ravindran (VP of Technical Sales, Edgecore Networks) sharing hands-on, real-world insights into tuning AI-scale network fabrics with SONiC.

Learn how Edgecore benchmarks and optimizes 800G AI switches in SONiC, and why consistent, repeatable tuning (plus validation under realistic load) is critical for stable AI network performance.

* AI workload characteristics and the fabric performance challenges they introduce
* Step-by-step SONiC tuning: PFC, ECN, and DLB configuration fundamentals
* Using Spirent test equipment to generate realistic AI traffic profiles and stress conditions
* What changes performance: topology choices, link failures, VXLAN overlays, and traffic patterns
* Flowlet mode vs. hash mode: which delivers better outcomes for AI use cases
* Why automation, repeatable test methods, and community best practices matter at AI scale
* Edgecore’s open networking approach: collaborating with Broadcom on Enterprise SONiC for next-gen AI deployments

Session outline:

00:00 Intro - Nanda Ravindran & session overview
01:00 Why AI fabric tuning matters - 800G benchmarking + recurring performance gaps
02:00 AI workload traits - elephant flows, low entropy, load-balancing pressure; goal: lossless + low latency
03:00 SONiC tuning focus - RoCEv2 mapping + PFC, ECN, DLB
04:00 Testbed overview - 6× Edgecore 800G (TH5), SONiC 202311-based, non-blocking fabric
05:00 Spirent methodology - AI workload emulation, collectives, measurements
06:00 PFC configuration - QoS profiles (DSCP→TC→Queue/PG), bindings, enablement
08:00 ECN configuration - WRED profile, thresholds, drop probability sweeps
09:00 DLB explained - hash vs flowlet; why flowlet tuning matters
10:00 Key findings - PFC-only best in lab; PFC+ECN required for deployments
12:00 ECN result highlight - example best setting (1% drop, 2MB/10MB thresholds)
13:00 800G vs 400G/breakout - native 800G performs better for AI workloads
14:00 Failure + VXLAN tests - link failures hurt; VXLAN shows minimal impact
15:00 Collectives + PXN - PXN best; flowlet recovers faster than hash
16:00 Call to action - automation + repeatable community best practices
18:00 Q&A - question on newer enhanced DLB/ECMP; plan to test on newer SONiC

📬 Questions or support: support@stordis.com | 🌐 www.stordis.com

Let’s get social
💻 Blog: https://stordis.com/blog/
📘 Facebook: https://www.facebook.com/people/STORDIS-GmbH/100057058555819/
📸 Instagram: https://www.instagram.com/stordis_open_networking/
👥 LinkedIn: https://www.linkedin.com/company/stordis/
🐦 X: https://twitter.com/STORDIS_GmbH/

#SONiC #AIFabricTuning #Edgecore #800GSwitches #OCPDublin2025 #ECN #PFC #DLB #AIWorkloads #SONiCOptimization #OpenNetworking #EnterpriseSONiC #Broadcom #FlowletMode #NetworkAutomation #AIInfrastructure
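For readers unfamiliar with the ECN highlight in the outline (1% drop, 2MB/10MB thresholds), the sketch below shows how a WRED-style profile is commonly read: no marking below a minimum queue depth, a linear ramp up to the maximum probability, and full marking above the maximum depth. This is only an illustration. The function name, the reading of 2MB as the minimum and 10MB as the maximum threshold, the use of 1 MB = 2^20 bytes, and the behavior above the maximum are all assumptions, not details confirmed by the session.

```python
MIB = 1024 * 1024  # assumption: "MB" in the outline is treated as 2**20 bytes


def ecn_mark_probability(queue_bytes: int,
                         min_threshold: int = 2 * MIB,
                         max_threshold: int = 10 * MIB,
                         max_probability: float = 0.01) -> float:
    """Return a WRED-style marking probability for a given queue depth.

    Hypothetical helper: the defaults mirror the "1% / 2MB / 10MB" example,
    with 2MB assumed to be the minimum and 10MB the maximum threshold.
    """
    if queue_bytes < min_threshold:
        return 0.0  # queue is shallow: never mark
    if queue_bytes >= max_threshold:
        return 1.0  # assumed classic RED behavior: mark everything
    # Linear ramp from 0 up to max_probability between the two thresholds.
    ramp = (queue_bytes - min_threshold) / (max_threshold - min_threshold)
    return max_probability * ramp


if __name__ == "__main__":
    # Sweep queue depth to see where marking begins and how it ramps.
    for depth_mib in (1, 2, 4, 6, 8, 10):
        p = ecn_mark_probability(depth_mib * MIB)
        print(f"{depth_mib:>2} MiB queued -> mark probability {p:.4f}")
```

Sweeping `max_probability` and the two thresholds in a loop like this is one way to picture the "drop probability sweeps" mentioned in the outline before trying values on real hardware.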


All episodes

13 episodes

Broadcom Tomahawk 6 & Ultra: Ethernet Silicon for AI

Broadcom Tomahawk 6 and Tomahawk Ultra are redefining Ethernet for AI infrastructure. Recorded at CLOUDFEST, Łukasz Łukowski from STORDIS talks with Kamran Naqvi from Broadcom about 102.4 Tb/s switching silicon, scale-out and scale-up AI networking, ultra-low-latency HPC, Enterprise SONiC, and why Ethernet is ready for the next generation of AI data centers.

Jun 1, 2026 · 9 min

Infrastructure as Prompt: AI Agents, Digital Twins & the Future of Network Automation with Kamal Bhatt

What happens when infrastructure is no longer written only as code, but described as a prompt? In this CLOUDFEST Masterclass, Kamal Bhatt from STORDIS explains the concept of Infrastructure as Prompt: an AI-driven approach to designing, testing, documenting, and automating modern infrastructure using coding agents, digital twins, Git, GNS3, NetBox, telemetry, and real network devices.

What if modern infrastructure could be designed, tested, and automated through natural language? In this episode, we feature Kamal Bhatt, Head of Value Driven Software Delivery at STORDIS, presenting his CLOUDFEST Masterclass on Infrastructure as Prompt, a new way of thinking about infrastructure automation in the age of AI coding agents.

Kamal walks through the evolution of orchestration, from CLI and SNMP to SDN, REST APIs, Infrastructure as Code, and now AI-assisted workflows where prompts, skills, instructions, and MCP can help turn intent into infrastructure actions. This session explores how tools such as GitHub Copilot, GNS3, NetBox, telemetry systems, Git repositories, Draw.io, and real switches were combined in a proof of concept for AI-driven infrastructure operations.

You will learn how AI agents can compare documentation with live or simulated network topologies, generate deviation reports, support digital twin workflows, identify ACL overlaps and IP conflicts, and even turn diagrams into simulated labs. The episode also looks honestly at the trust gaps that still need to be solved before AI-driven infrastructure automation can be used safely in production, including hallucinations, ambiguity, AAA, vendor differences, rollback strategies, and the need for guardrails.

If you are interested in network automation, NetDevOps, AI infrastructure, digital twins, data center orchestration, or the future of Infrastructure as Code, this episode is for you.

You will discover:

* How infrastructure orchestration evolved from CLI and SNMP to APIs and Infrastructure as Code
* What “Infrastructure as Prompt” means in practice
* How AI coding agents can support infrastructure operations
* Why Git repositories matter for backups, generated code, and reference designs
* How GNS3, NetBox, telemetry, and real switches can work together in an AI workflow
* How hand-drawn diagrams and Draw.io files can become topology simulations
* Why digital twins and guardrails are essential before production use
* Where the biggest trust gaps are in AI-driven infrastructure automation

00:00 Intro
00:55 Evolution of orchestration: CLI, SNMP, SDN, REST, IaC
01:58 From code to prompt
03:03 What Infrastructure as Prompt means
04:27 Proof of concept architecture
06:44 Prompts, skills, instructions, and MCP
07:39 GNS3, NetBox, telemetry, and real switches
09:33 Why Git repositories matter
12:07 Device skills and automated backups
15:33 GNS3 REST API automation
17:20 Comparing GNS3 topology with NetBox documentation
19:03 Digital twins, GitHub issues, ACL overlaps, and IP conflicts
21:28 Turning diagrams into GNS3 labs
23:29 Trust gaps in AI-driven orchestration
25:57 Digital twins, guardrails, and validation
28:10 Q&A

Presentation: https://stordis.com/wp-content/uploads/2026/04/Kamal-Krishna-Bhatt_Infrastructure-As-Prompt_Stordis-GmbH-.pdf
Kamal Bhatt YouTube Channel: https://www.youtube.com/@UCGZUFgtdxAkUIyiS-YU9XpQ
STORDIS: https://www.stordis.com
Support: support@stordis.com

May 22, 2026 · 30 min

OpenLAN Switching - More Than Just Cloud Management | OCP Dublin 2025

Recorded live at the OCP Regional Summit Dublin 2025, this session features Gidi Navon of Marvell and Taras Chornyi of PLVISION exploring how OpenLAN Switching is pushing open networking beyond cloud environments and into campus, enterprise, and industrial deployments. As part of the Telecom Infra Project (TIP), OpenLAN is building a community-driven switching ecosystem centered on open-source innovation, SONiC integration, and cloud-based management.

In this practical and forward-looking talk, the speakers explain how OpenLAN extends the success of OpenWiFi into Ethernet switching, enabling scalable, cloud-managed infrastructure with Zero Touch Provisioning (ZTP), northbound APIs, and true multi-vendor interoperability. They also introduce a community-enhanced, fully open-source SONiC approach designed to deliver greater transparency, flexibility, and freedom from vendor lock-in.

Key Takeaways

* What OpenLAN Is: An overview of the OpenLAN Switching initiative and its role within the TIP ecosystem for open, cloud-managed networking.
* Cloud-Managed Switching: How OpenLAN enables Zero Touch Provisioning, centralized orchestration, and simplified operations for modern campus and enterprise networks.
* Growing Ecosystem: Progress across hardware, software, ODM participation, and multi-vendor support, with 22+ switch SKUs and 4+ ODMs already in the ecosystem.
* SONiC Integration: How OpenLAN is evolving its NOS strategy around SONiC, including containerized northbound agents and a more open, transparent software model.
* Real-World Applications: Use cases spanning campuses, public venues, and industrial environments where flexible, cloud-native switching is becoming essential.
* What’s Next: A look at OpenLAN certification plans and the roadmap for broader industry adoption.

Session Outline

00:00 – Intro
00:03 – Welcome
00:16 – OpenLAN Switching Overview
01:04 – What Is OpenLAN?
02:36 – Cloud Management and ZTP
05:03 – Building the Ecosystem
06:35 – Project Progress Update
07:36 – Hardware SKUs and Roadmap
08:52 – New Switching Features
09:42 – NOS Evolution and SONiC
11:18 – Software Architecture
12:14 – Cloud SDK Overview
15:07 – SONiC Integration
16:43 – Use Cases
19:01 – Key Takeaways
20:45 – Q&A
20:58 – Multi-Vendor Support
22:06 – Certification Roadmap

Whether you're a network operator, campus IT leader, or open-source networking enthusiast, this session offers a valuable look at how OpenLAN is enabling more flexible, interoperable, and scalable switching for the next generation of enterprise infrastructure.

Mar 11, 2026 · 23 min

Open-Source Operable SONiC | OCP Dublin 2025

Recorded live at the OCP Regional Summit Dublin 2025, this candid session features Matthias Haag (Founder & CEO, UhuruTec) sharing what it really looks like to deploy Community SONiC for a production cloud platform, from the perspective of a network user, not a network admin.

Matthias walks through the promise of vendor-neutral networking (a “Linux-like” hardware abstraction for switches) and the friction points he hit when trying to operationalize Community SONiC: slow and flaky build pipelines, unclear failures and unpinned dependencies, missing release tagging/versioning, feature gaps like EVPN multihoming, and day-2 reliability issues such as SNMP container crashes and unexpected restarts.

He then contrasts Community SONiC with commercial/enterprise SONiC distributions (including Broadcom’s), and ends with a clear call to action: a truly open, enterprise-ready SONiC with proper CI/CD, reproducible builds, platform testing, sane releases/tags, and more upstream collaboration, with fewer “politics” slowing down key contributions.

00:00 Music / opening
00:11 Intro: who Matthias is and the perspective he brings (user vs. admin)
00:36 Project context: why NOS/switching choices matter for a cloud platform
00:46 Why SONiC and why starting with Community SONiC
03:43 Hardware bring-up + setting up a custom build approach
04:18 Build realities: long runtimes, caching effort, failure loops
04:59 Versioning challenges: build identification, consistency, missing tags/releases
05:56 Pipeline reliability concerns: “rerun works” and dependency/pinning issues
06:50 Design goals: target fabric architecture + connectivity requirements
07:12 Feature gaps: key capabilities missing or delayed upstream
07:32 Operational stability: monitoring/service reliability pain points
08:06 Boot/bring-up issues: unexpected states and disruptive fixes
08:18 Hardware visibility: intermittent missing interface/transceiver info
08:43 Port configuration: workarounds and manual adjustments
09:06 Config errors: “applied but errors” situations that reduce confidence
09:33 Unexpected restarts after configuration changes
10:06 Day-2 usability: split tools for different layers and admin workflow friction
10:34 Production readiness: where it’s okay today vs. where it’s still risky
11:23 Enterprise/commercial comparison: features, stability, QA, polish
12:04 The “server-world” analogy: openness vs. ecosystem constraints/lock-in risk
13:13 Call to action: truly open “enterprise” SONiC (releases, CI/CD, testing, UX)
18:06 Q&A: bringing issues upstream; tags existed before, then stopped
19:25 Vendor discussion: shifting toward stronger community participation
21:23 Community feedback: upstream-first collaboration vs. fragmented forks

Feb 16, 2026 · 22 min

Accelerating AI with Next-Gen Networking SONiC Innovations and Scalable Designs

Recorded live at the OCP Regional Summit Dublin 2025, this session features Kamran Naqvi (Broadcom) exploring how next-generation Ethernet networking is accelerating AI infrastructure and what enterprises need to build practical, scalable GPU clusters with SONiC.

From AI traffic patterns (elephant flows, low entropy, tail latency) to real-world topology and cabling choices, Kamran breaks down what makes AI fabrics different from traditional data centers, and how Enterprise SONiC enhancements (smarter hashing, adaptive routing, better load distribution) help remove networking as the bottleneck.

Key Takeaways

* The unique networking demands of AI back-end fabrics: elephant flows, poor entropy, RDMA retransmission challenges, and tail latency
* The four fabrics in AI infrastructure, and why the back-end fabric is where “the uniqueness” shows up
* Scale-up vs. scale-out networking: what each does and where enterprise designs focus today
* Enterprise-ready GPU cluster designs and scalable Ethernet fabrics for AI
* SONiC improvements for AI workloads:
  * Advanced hashing (including deeper header visibility for better entropy)
  * Adaptive routing approaches, including flowlet-based spraying for better balancing with minimal reordering risk
* Practical best practices for cable/optics selection, rack/topology layout, and failure recovery planning
* Why Ethernet often beats InfiniBand in production AI deployments, including faster failover behavior and scalability

Session outline

00:03 – Intro: Kamran’s role at Broadcom; session focus
00:21 – What you’ll learn: AI network needs, scalable enterprise designs
01:14 – AI infrastructure fabrics + scale-up vs scale-out (enterprise reality)
07:54 – AI as distributed compute: why networking drives job completion time
11:16 – AI traffic patterns: elephant flows, RDMA pain, tail latency
14:08 – Fabric challenges: collisions, failures, incast + required capabilities
16:04 – Broadcom approaches + why Ethernet often wins vs InfiniBand (incl. failover)
18:28 – Topologies: Clos vs rail-optimized; when spines still matter
23:08 – Cabling/optics best practices: DAC first, then linear pluggables; CPO intro
27:09 – Reference designs: rack layouts and cluster scaling examples
32:04 – SONiC for AI: advanced hashing + adaptive routing (flowlet spray)
36:07 – Wrap-up: QR code for reference architecture + materials; thanks

Download

Download the Broadcom AI Reference Architecture (via the QR code shown during the session).
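To make the contrast between static hashing and flowlet-based balancing mentioned in the takeaways more concrete, here is a minimal sketch. It is not Broadcom's or SONiC's implementation. The class and function names, the 50 µs inactivity gap, and the use of cumulative bytes as the load signal are all hypothetical choices made for illustration. The idea shown is the general one: a flow may only move to another link after an idle gap, which limits the risk of packet reordering.

```python
import zlib


def hash_link(flow: tuple, num_links: int) -> int:
    """Static hashing: a flow always maps to the same link, whatever its load."""
    return zlib.crc32(repr(flow).encode()) % num_links


class FlowletBalancer:
    """Toy flowlet balancer (hypothetical, for illustration only)."""

    def __init__(self, num_links: int, gap_us: int = 50):
        self.gap_us = gap_us                # assumed inactivity gap
        self.link_load = [0] * num_links    # cumulative bytes sent per link
        self._last = {}                     # flow -> (last_seen_us, link)

    def pick_link(self, flow: tuple, now_us: int, size: int) -> int:
        seen = self._last.get(flow)
        if seen is not None and now_us - seen[0] < self.gap_us:
            # Same flowlet: stay on the current link so packets stay in order.
            link = seen[1]
        else:
            # New flowlet: the idle gap makes it safe to move to the
            # least-loaded link.
            link = min(range(len(self.link_load)),
                       key=self.link_load.__getitem__)
        self._last[flow] = (now_us, link)
        self.link_load[link] += size
        return link


if __name__ == "__main__":
    flow = ("10.0.0.1", "10.0.0.2", 4791)  # example tuple, values are made up
    balancer = FlowletBalancer(num_links=2)
    for t_us in (0, 10, 20, 200, 210):
        print(t_us, "flowlet ->", balancer.pick_link(flow, t_us, 4096),
              "| hash ->", hash_link(flow, 2))
```

With a single elephant flow, the hash path never changes, while the flowlet path can shift after the 200 µs pause. Real implementations typically use queue depth or recent link utilization rather than a running byte count.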

Feb 5, 2026 · 37 min
