AXRP - the AI X-risk Research Podcast


Podcast by Daniel Filan

AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper, and hopefully get a sense of why it's been written and how it might reduce the risk of AI causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. You can visit the website and read transcripts at axrp.net.


All episodes

53 episodes
40 - Jason Gross on Compact Proofs and Interpretability

How do we figure out whether interpretability is doing its job? One way is to see if it helps us prove things about models that we care about knowing. In this episode, I speak with Jason Gross about his agenda to benchmark interpretability in this way, and his exploration of the intersection of proofs and modern machine learning.

Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/03/28/episode-40-jason-gross-compact-proofs-interpretability.html

Topics we discuss, and timestamps:
0:00:40 - Why compact proofs
0:07:25 - Compact Proofs of Model Performance via Mechanistic Interpretability
0:14:19 - What compact proofs look like
0:32:43 - Structureless noise, and why proofs
0:48:23 - What we've learned about compact proofs in general
0:59:02 - Generalizing 'symmetry'
1:11:24 - Grading mechanistic interpretability
1:43:34 - What helps compact proofs
1:51:08 - The limits of compact proofs
2:07:33 - Guaranteed safe AI, and AI for guaranteed safety
2:27:44 - Jason and Rajashree's start-up
2:34:19 - Following Jason's work

Links to Jason:
Github: https://github.com/jasongross
Website: https://jasongross.github.io
Alignment Forum: https://www.alignmentforum.org/users/jason-gross

Links to work we discuss:
Compact Proofs of Model Performance via Mechanistic Interpretability: https://arxiv.org/abs/2406.11779
Unifying and Verifying Mechanistic Interpretability: A Case Study with Group Operations: https://arxiv.org/abs/2410.07476
Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration: https://arxiv.org/abs/2412.03773
Stage-Wise Model Diffing: https://transformer-circuits.pub/2024/model-diffing/index.html
Causal Scrubbing: a method for rigorously testing interpretability hypotheses: https://www.lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing
Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition (aka the Apollo paper on APD): https://arxiv.org/abs/2501.14926
Towards Guaranteed Safe AI: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-45.pdf

Episode art by Hamish Doodles: hamishdoodles.com

March 28, 2025 - 2 h 36 min
38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future

In this episode, I chat with David Duvenaud about two topics he's been thinking about: firstly, a paper he wrote about evaluating whether or not frontier models can sabotage human decision-making or human monitoring of those same models; and secondly, the difficult situation humans would find themselves in after AGI, even if AI is aligned with human intentions.

Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/03/01/episode-38_8-david-duvenaud-sabotage-evaluations-post-agi-future.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:
01:42 - The difficulty of sabotage evaluations
05:23 - Types of sabotage evaluation
08:45 - The state of sabotage evaluations
12:26 - What happens after AGI?

Links:
Sabotage Evaluations for Frontier Models: https://arxiv.org/abs/2410.21514
Gradual Disempowerment: https://gradual-disempowerment.ai/

Episode art by Hamish Doodles: hamishdoodles.com

March 1, 2025 - 20 min
38.7 - Anthony Aguirre on the Future of Life Institute

The Future of Life Institute is one of the oldest and most prominent organizations in the AI existential safety space, working on such topics as the AI pause open letter and how the EU AI Act can be improved. Metaculus is one of the premier forecasting sites on the internet. Behind both of them lies one man: Anthony Aguirre, who I talk with in this episode.

Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/02/09/episode-38_7-anthony-aguirre-future-of-life-institute.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:
00:33 - Anthony, FLI, and Metaculus
06:46 - The Alignment Workshop
07:15 - FLI's current activity
11:04 - AI policy
17:09 - Work FLI funds

Links:
Future of Life Institute: https://futureoflife.org/
Metaculus: https://www.metaculus.com/
Future of Life Foundation: https://www.flf.org/

Episode art by Hamish Doodles: hamishdoodles.com

February 9, 2025 - 22 min
38.6 - Joel Lehman on Positive Visions of AI

Typically this podcast talks about how to avert destruction from AI. But what would it take to ensure AI promotes human flourishing as well as it can? Is alignment to individuals enough, and if not, where do we go from here? In this episode, I talk with Joel Lehman about these questions.

Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/01/24/episode-38_6-joel-lehman-positive-visions-of-ai.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:
01:12 - Why aligned AI might not be enough
04:05 - Positive visions of AI
08:27 - Improving recommendation systems

Links:
Why Greatness Cannot Be Planned: https://www.amazon.com/Why-Greatness-Cannot-Planned-Objective/dp/3319155237
We Need Positive Visions of AI Grounded in Wellbeing: https://thegradientpub.substack.com/p/beneficial-ai-wellbeing-lehman-ngo
Machine Love: https://arxiv.org/abs/2302.09248
AI Alignment with Changing and Influenceable Reward Functions: https://arxiv.org/abs/2405.17713

Episode art by Hamish Doodles: hamishdoodles.com

January 24, 2025 - 15 min
38.5 - Adrià Garriga-Alonso on Detecting AI Scheming

Suppose we're worried about AIs engaging in long-term plans that they don't tell us about. If we were to peek inside their brains, what should we look for to check whether this was happening? In this episode, Adrià Garriga-Alonso talks about his work trying to answer this question.

Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/01/20/episode-38_5-adria-garriga-alonso-detecting-ai-scheming.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:
01:04 - The Alignment Workshop
02:49 - How to detect scheming AIs
05:29 - Sokoban-solving networks taking time to think
12:18 - Model organisms of long-term planning
19:44 - How and why to study planning in networks

Links:
Adrià's website: https://agarri.ga/
An investigation of model-free planning: https://arxiv.org/abs/1901.03559
Model-Free Planning: https://tuphs28.github.io/projects/interpplanning/
Planning in a recurrent neural network that plays Sokoban: https://arxiv.org/abs/2407.15421

Episode art by Hamish Doodles: hamishdoodles.com

January 20, 2025 - 27 min