SemiAnalysis Weekly
This episode features Jordan Nanos (@JordanNanos) and Daniel Nishball (@dnishball) breaking down the economics of GPU clusters through real-world data and experience. Joined by Kang Wen Cheang and Zane Fong, the team moves beyond theoretical TCO models to examine how reliability differences between top-tier and lower-tier providers create significant cost disparities that aren't captured in simple per-GPU pricing. The discussion introduces practical frameworks for measuring goodput and understanding how system failures cascade through entire training jobs.

Nanos walks through the mechanics of fault-tolerant frameworks, including AWS's Checkpointless Training, and explains why a single GPU failure can halt progress across hundreds of nodes. The conversation reveals how hyperscalers and NeoClouds price their services and why paying premium rates for reliable infrastructure often delivers better value than chasing the lowest per-hour costs.

Subscribe to SemiAnalysis for in-depth analysis of AI hardware economics and infrastructure trends that impact the entire semiconductor ecosystem.
13 episodes