Dwarkesh Podcast
Did a very different format with Reiner Pope: a blackboard lecture where he walks through how frontier LLMs are trained and served. It's shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk.

It's a bit technical, but I encourage you to hang in there – it's really worth it. Fewer than a handful of people understand the full stack of AI, from chip design to model architecture, as well as Reiner does. It was a real delight to learn from him. I recommend watching this one on YouTube [https://youtu.be/xmkSf5IS-zw] so you can see the chalkboard.

Reiner [https://reiner.org/] is CEO of MatX [https://matx.com/], a new chip startup (full disclosure: I'm an angel investor). He was previously at Google, where he worked on software [https://arxiv.org/abs/2211.05102] efficiency [https://jax-ml.github.io/scaling-book/], compilers, and TPU architecture.

Download a markdown transcript here [https://gist.github.com/dwarkeshsp/79100f0fdeed69d76241903bb0604dbe] to chat with an LLM about it. I also wrote up some flashcards and practice problems [https://reiner-flashcards.vercel.app/] to help myself retain what Reiner taught – hope they're helpful to you too! (For a taste of the back-of-envelope style, see the sketch at the end of these notes.)

Sponsors

* Jane Street [https://janestreet.com/dwarkesh] needs constant access to incredibly low-latency compute. I recently asked one of their engineers, Clark, to talk me through how they meet these demands. Our conversation, which touched on everything from FPGAs to liquid cooling, was extremely helpful as I prepped to interview Reiner. You can watch the full discussion and explore Jane Street's open roles at janestreet.com/dwarkesh [https://janestreet.com/dwarkesh]

* Google's Gemma 4 [https://goo.gle/Gemma4] is the first open model that's let me shut off the internet and create a fully disconnected "focus machine". This is because Gemma is small enough to run on my laptop, but powerful enough to actually be useful. So, to prep for this interview, I downloaded Reiner's scaling book, disconnected from wifi, and used Gemma to help me break down the material. Check it out at goo.gle/Gemma4 [https://goo.gle/Gemma4]

* Cursor [https://cursor.com/dwarkesh] helped me turn some notes I took on how gradients flow during large-scale pretraining into a great animation. At first, I wasn't sure the best way to visualize the concept, but Cursor's Composer 2 Fast model let me iterate on different ideas almost instantaneously. You can check out the animation in my recent blog post [https://www.dwarkesh.com/p/what-i-learned-april-15]. And if you have something to visualize yourself, go to cursor.com/dwarkesh [https://cursor.com/dwarkesh]

Timestamps

(00:00:00) – How batch size affects token cost and speed
(00:32:09) – How MoE models are laid out across GPU racks
(00:47:12) – How pipeline parallelism spreads model layers across racks
(01:03:37) – Why Ilya said, "As we now know, pipelining is not wise."
(01:18:59) – Because of RL, models may be 100x over-trained beyond Chinchilla-optimal
(01:33:02) – Deducing long context memory costs from API pricing
(02:04:02) – Convergent evolution between neural nets and cryptography

Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe [https://www.dwarkesh.com/subscribe?utm_medium=podcast&utm_campaign=CTA_4]
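As a taste of the back-of-envelope reasoning in the episode (and in the practice problems linked above), here is a minimal sketch of why long-context serving gets expensive: per-token KV-cache memory follows directly from a model's shape. Every number below is a hypothetical placeholder I chose for illustration, not any lab's real configuration.

```python
# Back-of-envelope sketch: KV-cache memory per token for a decoder-only
# transformer. Each generated token stores one key and one value vector
# per layer, so cache size scales linearly with context length.
# All dimensions are HYPOTHETICAL placeholders.

LAYERS = 80      # hypothetical number of transformer layers
KV_HEADS = 8     # hypothetical grouped-query KV heads per layer
HEAD_DIM = 128   # hypothetical dimension per head
BYTES = 2        # bf16/fp16 bytes per cached value

# Factor of 2 = one K vector and one V vector per layer per token.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES
print(f"KV cache: {kv_bytes_per_token / 1e3:.0f} kB per token")

# At a hypothetical 128k-token context, the cache for one sequence:
context_len = 128_000
total_gb = kv_bytes_per_token * context_len / 1e9
print(f"Per 128k-token sequence: {total_gb:.1f} GB")
```

With these made-up dimensions the cache works out to roughly 328 kB per token, or about 42 GB for a single 128k-token sequence, which is the kind of number that lets you reason from a model's published long-context API price back toward its memory footprint.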