Dwarkesh Podcast
Did a very different format with Reiner Pope: a blackboard lecture where he walks through how frontier LLMs are trained and served. It's shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk.

It's a bit technical, but I encourage you to hang in there – it's really worth it. Fewer than a handful of people understand the full stack of AI, from chip design to model architecture, as well as Reiner does. It was a real delight to learn from him. I recommend watching this one on YouTube [https://youtu.be/xmkSf5IS-zw] so you can see the chalkboard.

Reiner [https://reiner.org/] is CEO of MatX [https://matx.com/], a new chip startup (full disclosure: I'm an angel investor). He was previously at Google, where he worked on software [https://arxiv.org/abs/2211.05102] efficiency [https://jax-ml.github.io/scaling-book/], compilers, and TPU architecture.

Download a markdown transcript here [https://gist.github.com/dwarkeshsp/79100f0fdeed69d76241903bb0604dbe] to chat with an LLM about it. I also wrote up some flashcards and practice problems [https://reiner-flashcards.vercel.app/] to help myself retain what Reiner taught – hope they're helpful to you too! (For a taste of the back-of-envelope style, see the sketch at the end of these notes.)

Sponsors

* Jane Street [https://janestreet.com/dwarkesh] needs constant access to incredibly low-latency compute. I recently asked one of their engineers, Clark, to talk me through how they meet these demands. Our conversation, which touched on everything from FPGAs to liquid cooling, was extremely helpful as I prepped to interview Reiner. You can watch the full discussion and explore Jane Street's open roles at janestreet.com/dwarkesh [https://janestreet.com/dwarkesh]

* Google's Gemma 4 [https://goo.gle/Gemma4] is the first open model that's let me shut off the internet and create a fully disconnected "focus machine". This is because Gemma is small enough to run on my laptop, but powerful enough to actually be useful. So, to prep for this interview, I downloaded Reiner's scaling book, disconnected from wifi, and used Gemma to help me break down the material. Check it out at goo.gle/Gemma4 [https://goo.gle/Gemma4]

* Cursor [https://cursor.com/dwarkesh] helped me turn some notes I took on how gradients flow during large-scale pretraining into a great animation. At first, I wasn't sure the best way to visualize the concept, but Cursor's Composer 2 Fast model let me iterate on different ideas almost instantaneously. You can check out the animation in my recent blog post [https://www.dwarkesh.com/p/what-i-learned-april-15]. And if you have something to visualize yourself, go to cursor.com/dwarkesh [https://cursor.com/dwarkesh]

Timestamps

(00:00:00) – How batch size affects token cost and speed
(00:32:09) – How MoE models are laid out across GPU racks
(00:47:12) – How pipeline parallelism spreads model layers across racks
(01:03:37) – Why Ilya said, "As we now know, pipelining is not wise."
(01:18:59) – Because of RL, models may be 100x over-trained beyond Chinchilla-optimal
(01:33:02) – Deducing long context memory costs from API pricing
(02:04:02) – Convergent evolution between neural nets and cryptography

Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe [https://www.dwarkesh.com/subscribe?utm_medium=podcast&utm_campaign=CTA_4]
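As a taste of the back-of-envelope reasoning in the episode (and in the practice problems linked above), here is a minimal sketch of why long-context serving gets expensive: per-token KV-cache memory follows directly from a model's shape. Every number below is a hypothetical placeholder I chose for illustration, not any lab's real configuration.

```python
# Back-of-envelope sketch: KV-cache memory per token for a decoder-only
# transformer. Each generated token stores one key and one value vector
# per layer, so cache size scales linearly with context length.
# All dimensions are HYPOTHETICAL placeholders.

LAYERS = 80      # hypothetical number of transformer layers
KV_HEADS = 8     # hypothetical grouped-query KV heads per layer
HEAD_DIM = 128   # hypothetical dimension per head
BYTES = 2        # bf16/fp16 bytes per cached value

# Factor of 2 = one K vector and one V vector per layer per token.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES
print(f"KV cache: {kv_bytes_per_token / 1e3:.0f} kB per token")

# At a hypothetical 128k-token context, the cache for one sequence:
context_len = 128_000
total_gb = kv_bytes_per_token * context_len / 1e9
print(f"Per 128k-token sequence: {total_gb:.1f} GB")
```

With these made-up dimensions the cache works out to roughly 328 kB per token, or about 42 GB for a single 128k-token sequence, which is the kind of number that lets you reason from a model's published long-context API price back toward its memory footprint.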