Redwood Research Blog

Podcast by Redwood Research

English

History & religion

Narrations of Redwood Research blog posts. Redwood Research is a research nonprofit based in Berkeley. We investigate risks posed by the development of powerful artificial intelligence and techniques for mitigating those risks.

All episodes

110 episodes

“Incriminating misaligned AI models via distillation” by Alek Westover, Sebastian Prasanna, Alex Mallen, Alexa Pan, Julian Stastny

Suppose we have a dangerous misaligned AI that can fool alignment audits, and distill it into a student model. Two things can happen:

1. Misalignment fails to transfer to the student. If so, we get a fairly capable benign model.
2. Misalignment transfers to the student. The student might also be worse than the teacher at hiding its misalignment (e.g., due to being less capable). If so, we might get indirect evidence about the teacher's misalignment by auditing the distilled model.

In this post, we will discuss the second possibility, which we call incrimination via distillation. Specifically, we propose distillation methods that we hope transfer misalignment without transferring the ability to fool audits, and discuss why these techniques might work or fail. In a future post, we discuss the first possibility, and what distillation methods should be used when aiming to create a capable benign model.

We’re excited for research that empirically tests and refines this approach; if successful, this technique could become a valuable part of alignment audits.

How incrimination via distillation works

Powerful misaligned AI models might not be auditable: they might pass alignment audits but still act on their misaligned drives when given the chance. [...]

Outline:
(01:32) How incrimination via distillation works
(02:45) How we propose implementing incrimination via distillation
(03:59) Auditability-preserving distillation
(05:11) Misalignment-targeted distillation
(06:34) Why incrimination via distillation might work
(08:17) Why incrimination via distillation might not work
(10:53) Conclusion

The original text contained 2 footnotes which were omitted from this narration.

First published: May 18th, 2026
Source: https://blog.redwoodresearch.org/p/incriminating-misaligned-ai-models
Narrated by TYPE III AUDIO (https://type3.audio/).

Images from the article: Diagram showing AI model distillation process from misaligned teacher to student model.

May 18, 2026 - 11 min

“Risk reports need to address deployment-time spread of misalignment” by Alex Mallen

Subtitle: Deployment-time spread is the most plausible near-term route to consistent adversarial misalignment.

Risk reports commonly use pre-deployment alignment assessments to measure misalignment risk from an internally deployed AI. However, an AI that genuinely starts out with largely benign motivations can develop widespread dangerous motivations during deployment. I think this is the most plausible route to consistent adversarial misalignment in the near future. So, AI companies and evaluators should substantively incorporate it into risk analysis and planning.

In this post, I’ll briefly argue why, absent improved mitigations, this will probably soon become a reason why AI companies will be unable to convincingly argue against consistent adversarial misalignment (this risk will perhaps be even larger than risk of consistent adversarial misalignment arising from training). Then I’ll discuss how well current risk reports address it (the Claude Mythos risk report does a reasonable job; others don’t).

Thanks to Ryan Greenblatt, Alexa Pan, Charlie Griffin, Anders Cairns Woodruff, and Buck Shlegeris for feedback on drafts.

Deployment-time spread is the most plausible near-term route to consistent adversarial misalignment

In some contexts, AIs might adopt misaligned goals, even if they were otherwise previously aligned. Because this misalignment can be rare, the AI might [...]

Outline:
(01:21) Deployment-time spread is the most plausible near-term route to consistent adversarial misalignment
(06:15) Company risk reports

The original text contained 6 footnotes which were omitted from this narration.

First published: May 15th, 2026
Source: https://blog.redwoodresearch.org/p/risk-reports-need-to-address-deployment
Narrated by TYPE III AUDIO (https://type3.audio/).

Images from the article: Grok tweets.

May 15, 2026 - 11 min

“How useful is the information you get from working inside an AI company?” by Buck Shlegeris, Anders Cairns Woodruff

Subtitle: My median guess: it's as good as a crystal ball that sees 2.5 months into the future.

This post was drafted by Buck, and substantially edited by Anders. “I” refers to Buck. Thanks to Alex Mallen for comments.

People who work inside AI companies get access to information that I only get later or never. Quantitatively, how big a deal is this access?

Here's an operationalization of this. Consider the following two ways my knowledge could be augmented:

* I get a crystal ball that tells me all the information I would know n months in the future.
* I become an employee of a frontier AI company (like OpenAI or Anthropic), with access to all the private information I’d normally get from working at that company.

How big would n have to be for me to be indifferent between these two options, from the perspective of learning things that are helpful for making AI go well?

The answer is presumably different for me than for many readers, because I’m a reasonably well-connected researcher; I see published information and news from the rumor mill and I talk to researchers at frontier AI companies all the time. [...]

Outline:
(03:20) What do insiders know?
(04:34) Safety work and corporate attitudes
(05:54) Model capabilities
(07:27) Algorithms and architecture
(09:49) How will this change over time?
(12:27) Conclusion

The original text contained 4 footnotes which were omitted from this narration.

First published: May 11th, 2026
Source: https://blog.redwoodresearch.org/p/how-useful-is-the-information-you
Narrated by TYPE III AUDIO (https://type3.audio/).

Images from the article: Person looking through window at office workers with computers and whiteboards.

May 11, 2026 - 13 min

“A review of “Investigating the consequences of accidentally grading CoT during RL”” by Buck Shlegeris

Last week, OpenAI staff shared an early draft of “Investigating the consequences of accidentally grading CoT during RL” with Redwood Research staff. To start with, I appreciate them publishing this post. I think it is valuable for AI companies to be transparent about problems like these when they arise. I particularly appreciate them sharing the post with us early, discussing the issues in detail, and modifying it to address our most important criticisms.

I think it will be increasingly important for AI companies to have a policy of getting external feedback on the risks posed by their deployments, and in particular having some external accountability on whether they have adequate evidence to support their claims about the level of risk posed; as an example of this, see METR reviewing Anthropic's Sabotage Risk Report. We at Redwood Research are interested in participating in this kind of external review of evidence about safety. So I am taking this as an opportunity to try out writing this kind of review. If you work at a frontier AI company, please feel free to reach out if you’d like our review of similar documents.

My overall assessment is that I mostly agree with the [...]

Outline:
(01:35) Assessing the evidence that CoT training did not damage monitorability
(10:37) How much does this analysis rely on information that wasn't provided?
(12:09) Small amounts of RL training on CoT might not be more important than other sources of CoT unreliability
(13:20) AI companies will eventually need to learn not to make mistakes like this

The original text contained 5 footnotes which were omitted from this narration.

First published: May 7th, 2026
Source: https://blog.redwoodresearch.org/p/openai-cot
Narrated by TYPE III AUDIO (https://type3.audio/).

May 7, 2026 - 15 min

“Risk from fitness-seeking AIs: mechanisms and mitigations” by Alex Mallen

Subtitle: Fitness-seeking is increasingly what misalignment looks like in practice. How should we respond?

Current AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues, etc. This misalignment is still somewhat incoherent, but it increasingly resembles what I call “fitness-seeking”: a family of misaligned motivations centered on performing well in training and evaluations (e.g., reward-seeking).

Fitness-seeking warrants substantial concern. In this piece, I lay out what I take to be the central mechanisms by which fitness-seeking motivations might lead to human disempowerment, and propose mitigations to them. While the analysis is inherently speculative, this kind of speculation seems worthwhile: AI control emerged from explicitly taking scheming motivations seriously and asking what interventions are implied, and my hope is that developing mitigations for fitness-seeking will benefit from similar forward-looking analysis.

Fitness-seekers are, in many ways, notably safer than what I’ll call “classic schemers”. A classic schemer is an intelligent adversary with unified motivations whose fulfillment requires control over the whole world's resources. Meanwhile, fitness-seeking instances generally don’t share a common goal (e.g., a reward-seeker only cares about the reward for its current actions), and many fitness-seekers would be satisfied with modest-to-trivial [...]

Outline:
(03:50) Overview
(11:32) The basic reasons fitness-seekers might be safer than classic schemers
(16:01) Four mechanisms for risk and their mitigations
(16:46) Potemkin work
(22:13) Instability
(27:49) Manipulation
(31:25) Outcome enforcement
(36:52) Cross-cutting mitigations
(37:18) Deals
(40:05) Control
(44:45) Alignment
(44:48) Preventing fitness-seeking from arising
(48:51) Making any fitness-seeking motivations safer
(51:32) How does online training change the picture?
(55:09) Overall recommendations
(59:12) Conclusion

The original text contained 18 footnotes which were omitted from this narration.

First published: May 1st, 2026
Source: https://blog.redwoodresearch.org/p/risk-from-fitness-seeking-ais-mechanisms
Narrated by TYPE III AUDIO (https://type3.audio/).

May 1, 2026 - 1 h 3 min