Redwood Research Blog

Podcast by Redwood Research

English

History

About Redwood Research Blog

Narrations of Redwood Research blog posts. Redwood Research is a research nonprofit based in Berkeley. We investigate risks posed by the development of powerful artificial intelligence and techniques for mitigating those risks.

All episodes

110 episodes

“Incriminating misaligned AI models via distillation” by Alek Westover, Sebastian Prasanna, Alex Mallen, Alexa Pan, Julian Stastny

Suppose we have a dangerous misaligned AI that can fool alignment audits, and distill it into a student model. Two things can happen:

  1. Misalignment fails to transfer to the student. If so, we get a fairly capable benign model.
  2. Misalignment transfers to the student. The student might also be worse than the teacher at hiding its misalignment (e.g., due to being less capable). If so, we might get indirect evidence about the teacher's misalignment by auditing the distilled model.

In this post, we will discuss the second possibility, which we call incrimination via distillation. Specifically, we propose distillation methods that we hope transfer misalignment without transferring the ability to fool audits, and discuss why these techniques might work or fail. In a future post, we discuss the first possibility, and what distillation methods should be used when aiming to create a capable benign model.

We’re excited for research that empirically tests and refines this approach; if successful, this technique could become a valuable part of alignment audits.

How incrimination via distillation works

Powerful misaligned AI models might not be auditable: they might pass alignment audits but still act on their misaligned drives when given the chance. [...]

Outline:

  • (01:32) How incrimination via distillation works
  • (02:45) How we propose implementing incrimination via distillation
  • (03:59) Auditability-preserving distillation
  • (05:11) Misalignment-targeted distillation
  • (06:34) Why incrimination via distillation might work
  • (08:17) Why incrimination via distillation might not work
  • (10:53) Conclusion

The original text contained 2 footnotes which were omitted from this narration.

First published: May 18th, 2026
Source: https://blog.redwoodresearch.org/p/incriminating-misaligned-ai-models
Narrated by TYPE III AUDIO [https://type3.audio/].

Images from the article: Diagram showing AI model distillation process from misaligned teacher to student model. [https://substackcdn.com/image/fetch/$s_!s6S5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a236ec7-b557-4621-bc4c-2fe86c712936_2000x1450.webp]

May 18, 2026 - 11 min

“Risk reports need to address deployment-time spread of misalignment” by Alex Mallen

Subtitle: Deployment-time spread is the most plausible near-term route to consistent adversarial misalignment.

Risk reports commonly use pre-deployment alignment assessments to measure misalignment risk from an internally deployed AI. However, an AI that genuinely starts out with largely benign motivations can develop widespread dangerous motivations during deployment. I think this is the most plausible route to consistent adversarial misalignment in the near future. So, AI companies and evaluators should substantively incorporate it into risk analysis and planning.

In this post, I’ll briefly argue why, absent improved mitigations, this will probably soon become a reason why AI companies will be unable to convincingly argue against consistent adversarial misalignment (this risk will perhaps be even larger than risk of consistent adversarial misalignment arising from training). Then I’ll discuss how well current risk reports address it (the Claude Mythos risk report does a reasonable job; others don’t).

Thanks to Ryan Greenblatt, Alexa Pan, Charlie Griffin, Anders Cairns Woodruff, and Buck Shlegeris for feedback on drafts.

Deployment-time spread is the most plausible near-term route to consistent adversarial misalignment

In some contexts, AIs might adopt misaligned goals, even if they were otherwise previously aligned. Because this misalignment can be rare, the AI might [...]

Outline:

  • (01:21) Deployment-time spread is the most plausible near-term route to consistent adversarial misalignment
  • (06:15) Company risk reports

The original text contained 6 footnotes which were omitted from this narration.

First published: May 15th, 2026
Source: https://blog.redwoodresearch.org/p/risk-reports-need-to-address-deployment
Narrated by TYPE III AUDIO [https://type3.audio/].

Images from the article: Grok tweets [https://substackcdn.com/image/fetch/$s_!XC2p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ee44a64-e391-43b6-a4b6-4ebd80fb323c_1168x334.png]

May 15, 2026 - 11 min

“How useful is the information you get from working inside an AI company?” by Buck Shlegeris, Anders Cairns Woodruff

Subtitle: My median guess: it's as good as a crystal ball that sees 2.5 months into the future.

This post was drafted by Buck, and substantially edited by Anders. “I” refers to Buck. Thanks to Alex Mallen for comments.

People who work inside AI companies get access to information that I only get later or never. Quantitatively, how big a deal is this access?

Here's an operationalization of this. Consider the following two ways my knowledge could be augmented:

  • I get a crystal ball that tells me all the information I would know n months in the future.
  • I become an employee of a frontier AI company (like OpenAI or Anthropic), with access to all the private information I’d normally get from working at that company.

How big would n have to be for me to be indifferent between these two options, from the perspective of learning things that are helpful for making AI go well?

The answer is presumably different for me than for many readers, because I’m a reasonably well-connected researcher; I see published information and news from the rumor mill and I talk to researchers at frontier AI companies all the time. [...]

Outline:

  • (03:20) What do insiders know?
  • (04:34) Safety work and corporate attitudes
  • (05:54) Model capabilities
  • (07:27) Algorithms and architecture
  • (09:49) How will this change over time?
  • (12:27) Conclusion

The original text contained 4 footnotes which were omitted from this narration.

First published: May 11th, 2026
Source: https://blog.redwoodresearch.org/p/how-useful-is-the-information-you
Narrated by TYPE III AUDIO [https://type3.audio/].

Images from the article: Person looking through window at office workers with computers and whiteboards. [https://substackcdn.com/image/fetch/$s_!WNv8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cdd72b-b9a9-48a8-a0d0-a375aec290ba_1448x1086.png]

May 11, 2026 - 13 min

“A review of “Investigating the consequences of accidentally grading CoT during RL”” by Buck Shlegeris

Last week, OpenAI staff shared an early draft of Investigating the consequences of accidentally grading CoT during RL with Redwood Research staff. To start with, I appreciate them publishing this post. I think it is valuable for AI companies to be transparent about problems like these when they arise. I particularly appreciate them sharing the post with us early, discussing the issues in detail, and modifying it to address our most important criticisms.

I think it will be increasingly important for AI companies to have a policy of getting external feedback on the risks posed by their deployments, and in particular having some external accountability on whether they have adequate evidence to support their claims about the level of risk posed; as an example of this, see METR reviewing Anthropic's Sabotage Risk Report. We at Redwood Research are interested in participating in this kind of external review of evidence about safety. So I am taking this as an opportunity to try out writing this kind of review. If you work at a frontier AI company, please feel free to reach out if you’d like our review of similar documents.

My overall assessment is that I mostly agree with the [...]

Outline:

  • (01:35) Assessing the evidence that CoT training did not damage monitorability
  • (10:37) How much does this analysis rely on information that wasn't provided?
  • (12:09) Small amounts of RL training on CoT might not be more important than other sources of CoT unreliability
  • (13:20) AI companies will eventually need to learn not to make mistakes like this

The original text contained 5 footnotes which were omitted from this narration.

First published: May 7th, 2026
Source: https://blog.redwoodresearch.org/p/openai-cot
Narrated by TYPE III AUDIO [https://type3.audio/].

May 7, 2026 - 15 min

“Risk from fitness-seeking AIs: mechanisms and mitigations” by Alex Mallen

Subtitle: Fitness-seeking is increasingly what misalignment looks like in practice: how should we respond?

Current AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues, etc. This misalignment is still somewhat incoherent, but it increasingly resembles what I call “fitness-seeking”, a family of misaligned motivations centered on performing well in training and evaluations (e.g., reward-seeking).

Fitness-seeking warrants substantial concern. In this piece, I lay out what I take to be the central mechanisms by which fitness-seeking motivations might lead to human disempowerment, and propose mitigations to them. While the analysis is inherently speculative, this kind of speculation seems worthwhile: AI control emerged from explicitly taking scheming motivations seriously and asking what interventions are implied, and my hope is that developing mitigations for fitness-seeking will benefit from similar forward-looking analysis.

Fitness-seekers are, in many ways, notably safer than what I’ll call “classic schemers”. A classic schemer is an intelligent adversary with unified motivations whose fulfillment requires control over the whole world's resources. Meanwhile, fitness-seeking instances generally don’t share a common goal (e.g., a reward-seeker only cares about the reward for its current actions), and many fitness-seekers would be satisfied with modest-to-trivial [...]

Outline:

  • (03:50) Overview
  • (11:32) The basic reasons fitness-seekers might be safer than classic schemers
  • (16:01) Four mechanisms for risk and their mitigations
  • (16:46) Potemkin work
  • (22:13) Instability
  • (27:49) Manipulation
  • (31:25) Outcome enforcement
  • (36:52) Cross-cutting mitigations
  • (37:18) Deals
  • (40:05) Control
  • (44:45) Alignment
  • (44:48) Preventing fitness-seeking from arising
  • (48:51) Making any fitness-seeking motivations safer
  • (51:32) How does online training change the picture?
  • (55:09) Overall recommendations
  • (59:12) Conclusion

The original text contained 18 footnotes which were omitted from this narration.

First published: May 1st, 2026
Source: https://blog.redwoodresearch.org/p/risk-from-fitness-seeking-ais-mechanisms
Narrated by TYPE III AUDIO [https://type3.audio/].

May 1, 2026 - 1 h 3 min
