
LessWrong (Curated & Popular)
A podcast by LessWrong
Audio narrations of LessWrong posts. Includes all curated posts and all posts with 125+ karma. If you'd like more, subscribe to the “Lesswrong (30+ karma)” feed.
Enjoy 90 days free
€4.99/month after the trial. Cancel anytime.
All episodes
529 episodes
[Linkpost] “the void” by nostalgebraist
This is a link post. A very long essay about LLMs, the nature and history of the HHH assistant persona, and the implications for alignment. Multiple people have asked me whether I could post this on LW in some form, hence this linkpost. (Note: although I expect this post will be interesting to people on LW, keep in mind that it was written with a broader audience in mind than my posts and comments here. This had various implications about my choices of presentation and tone, about which things I explained from scratch rather than assuming as background, my level of comfort casually reciting factual details from memory rather than explicitly checking them against the original source, etc. Although, come to think of it, this was also true of most of my early posts on LW [which were crossposts from my blog], so maybe it's not a [...]
---
First published: June 11th, 2025
Source: https://www.lesswrong.com/posts/3EzbtNLdcnZe8og8b/the-void-1
Linkpost URL: https://nostalgebraist.tumblr.com/post/785766737747574784/the-void
---
Narrated by TYPE III AUDIO [https://type3.audio/].

This is a blogpost version of a talk I gave earlier this year at GDM. Epistemic status: Vague and handwavy. Nuance is often missing. Some of the claims depend on implicit definitions that may be reasonable to disagree with. But overall I think it's directionally true. It's often said that mech interp is pre-paradigmatic. I think it's worth being skeptical of this claim. In this post I argue that:
* Mech interp is not pre-paradigmatic.
* Within that paradigm, there have been "waves" (mini paradigms). Two waves so far.
* Second-Wave Mech Interp has recently entered a 'crisis' phase.
* We may be on the edge of a third wave.
Preamble: Kuhn, paradigms, and paradigm shifts
First, we need to be familiar with the basic definition of a paradigm: A paradigm is a distinct set of concepts or thought patterns, including theories, research [...]
---
Outline:
(00:58) Preamble: Kuhn, paradigms, and paradigm shifts
(03:56) Claim: Mech Interp is Not Pre-paradigmatic
(07:56) First-Wave Mech Interp (ca. 2012 - 2021)
(10:21) The Crisis in First-Wave Mech Interp
(11:21) Second-Wave Mech Interp (ca. 2022 - ??)
(14:23) Anomalies in Second-Wave Mech Interp
(17:10) The Crisis of Second-Wave Mech Interp (ca. 2025 - ??)
(18:25) Toward Third-Wave Mechanistic Interpretability
(20:28) The Basics of Parameter Decomposition
(22:40) Parameter Decomposition Questions Foundational Assumptions of Second-Wave Mech Interp
(24:13) Parameter Decomposition In Theory Resolves Anomalies of Second-Wave Mech Interp
(27:27) Conclusion
The original text contained 6 footnotes which were omitted from this narration.
---
First published: June 10th, 2025
Source: https://www.lesswrong.com/posts/beREnXhBnzxbJtr8k/mech-interp-is-not-pre-paradigmatic
---
Narrated by TYPE III AUDIO [https://type3.audio/].
---
Images from the article: a presentation slide and a technical diagram (titles not preserved in the extraction).

Current “unlearning” methods only suppress capabilities instead of truly unlearning them. But if you distill an unlearned model into a randomly initialized model, the resulting network is actually robust to relearning. We show why this works, how well it works, and how to trade off compute for robustness. Unlearn-and-Distill applies unlearning to a bad behavior and then distills the unlearned model into a new model. Distillation makes it way harder to retrain the new model to do the bad thing. Produced as part of the ML Alignment & Theory Scholars Program in the winter 2024–25 cohort of the shard theory stream. Read our paper on ArXiv and enjoy an interactive demo.
Robust unlearning probably reduces AI risk
Maybe some future AI has long-term goals and humanity is in its way. Maybe future open-weight AIs have tons of bioterror expertise. If a system has dangerous knowledge, that system becomes [...]
---
Outline:
(01:01) Robust unlearning probably reduces AI risk
(02:42) Perfect data filtering is the current unlearning gold standard
(03:24) Oracle matching does not guarantee robust unlearning
(05:05) Distillation robustifies unlearning
(07:46) Trading unlearning robustness for compute
(09:49) UNDO is better than other unlearning methods
(11:19) Where this leaves us
(11:22) Limitations
(12:12) Insights and speculation
(15:00) Future directions
(15:35) Conclusion
(16:07) Acknowledgments
(16:50) Citation
The original text contained 2 footnotes which were omitted from this narration.
---
First published: June 13th, 2025
Source: https://www.lesswrong.com/posts/anX4QrNjhJqGFvrBr/distillation-robustifies-unlearning
---
Narrated by TYPE III AUDIO [https://type3.audio/].
---
Images from the article:
Unlearn-and-Distill applies unlearning to a bad behavior and then distills the unlearned model into a new model. Distillation makes it way harder to retrain the new model to do the bad thing.
Matching oracle behavior doesn’t guarantee robust unlearning. Graph (a) shows the loss during distillation of the student (Reference) and the student (Random). Graphs (b) and (c) show forget performance through retraining for Language and Arithmetic settings, respectively.
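The description above names the two stages of Unlearn-and-Distill but not the authors' implementation. As a purely illustrative sketch (not the paper's code), the recipe can be pictured in PyTorch roughly as follows; the gradient-ascent unlearning step, the KL-matching distillation loss, and all hyperparameters here are my own assumptions.

```python
# Illustrative sketch of the unlearn-then-distill recipe; details are assumptions, not the paper's code.
import itertools
import torch
import torch.nn.functional as F

def unlearn(model, forget_loader, lr=1e-4, steps=200):
    """Suppress the bad behavior, e.g. via gradient ascent on the forget set (one common baseline)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    batches = itertools.cycle(forget_loader)
    for _ in range(steps):
        x, y = next(batches)
        loss = -F.cross_entropy(model(x), y)  # ascend the loss on data we want the model to forget
        opt.zero_grad(); loss.backward(); opt.step()
    return model

def distill(teacher, student, retain_loader, lr=1e-4, steps=2000, temperature=2.0):
    """Distill the unlearned teacher into a freshly (randomly) initialized student
    by matching output distributions on a broad retain corpus."""
    teacher.eval()
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    batches = itertools.cycle(retain_loader)
    for _ in range(steps):
        x, _ = next(batches)
        with torch.no_grad():
            t_logits = teacher(x) / temperature
        s_logits = student(x) / temperature
        # KL divergence between teacher and student output distributions
        loss = F.kl_div(F.log_softmax(s_logits, dim=-1),
                        F.softmax(t_logits, dim=-1),
                        reduction="batchmean")
        opt.zero_grad(); loss.backward(); opt.step()
    return student

# Usage sketch (FreshRandomInitModel is a placeholder for a randomly initialized
# copy of the architecture, not a real class):
#   unlearned = unlearn(trained_model, forget_loader)
#   robust    = distill(unlearned, FreshRandomInitModel(), retain_loader)
```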

A while ago I saw a person in the comments on Scott Alexander's blog arguing that a superintelligent AI would not be able to do anything too weird and that "intelligence is not magic", hence it's Business As Usual. Of course, in a purely technical sense, he's right. No matter how intelligent you are, you cannot override fundamental laws of physics. But people (myself included) have a fairly low threshold for what counts as "magic," to the point where other humans can surpass that threshold. Example 1: Trevor Rainbolt. There is an 8-minute-long video where he does seemingly impossible things, such as correctly guessing that a photo of nothing but literal blue sky was taken in Indonesia, or guessing Jordan based only on pavement. He can also correctly identify the country after looking at a photo for 0.1 seconds. Example 2: Joaquín "El Chapo" Guzmán. He ran [...]
---
First published: June 15th, 2025
Source: https://www.lesswrong.com/posts/FBvWM5HgSWwJa5xHc/intelligence-is-not-magic-but-your-threshold-for-magic-is
---
Narrated by TYPE III AUDIO [https://type3.audio/].

Audio note: this article contains 329 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description. This post was written during the agent foundations fellowship with Alex Altair, funded by the LTFF. Thanks to Alex, Jose, Daniel and Einar for reading and commenting on a draft. The Good Regulator Theorem, as published by Conant and Ashby in their 1970 paper (cited over 1700 times!), claims to show that 'every good regulator of a system must be a model of that system', though it is a subject of debate whether this is actually what the paper shows. It is a fairly simple mathematical result which is worth knowing about for people who care about agent foundations and selection theorems. You might have heard about the Good Regulator Theorem in the context of John [...]
---
Outline:
(03:03) The Setup
(07:30) What makes a regulator good?
(10:36) The Theorem Statement
(11:24) Concavity of Entropy
(15:42) The Main Lemma
(19:54) The Theorem
(22:38) Example
(26:59) Conclusion
---
First published: November 18th, 2024
Source: https://www.lesswrong.com/posts/JQefBJDHG6Wgffw6T/a-straightforward-explanation-of-the-good-regulator-theorem
---
Narrated by TYPE III AUDIO [https://type3.audio/].
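For orientation, here is a rough statement of the setup the outline above walks through, in my own notation rather than the post's; treat it as a sketch of the standard Conant–Ashby formulation, not a quotation from the article.

```latex
% Rough sketch of the Conant-Ashby setup (notation is mine, not the post's).
% S: system state, R: regulator output chosen by a policy p(r|s),
% Z: outcome determined jointly by the system and the regulator.
\[
  R \sim p(\,\cdot \mid S), \qquad Z = \psi(S, R).
\]
% A regulator counts as "good" if it makes the outcome as predictable as possible,
% i.e. it minimizes the Shannon entropy of the outcome:
\[
  \text{good regulator:} \quad \min_{p(r \mid s)} \; H(Z).
\]
% Theorem (informal): among the good regulators, the simplest ones are deterministic
% functions of the system state,
\[
  R = f(S) \quad \text{for some mapping } f,
\]
% which is the sense in which "every good regulator of a system must be a model of that system".
```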