“Risk from fitness-seeking AIs: mechanisms and mitigations” by Alex Mallen

Description

Subtitle: Fitness-seeking is increasingly what misalignment looks like in practice: how should we respond?

Current AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues, etc. This misalignment is still somewhat incoherent, but it increasingly resembles what I call “fitness-seeking”, a family of misaligned motivations centered on performing well in training and evaluations (e.g., reward-seeking). Fitness-seeking warrants substantial concern.

In this piece, I lay out what I take to be the central mechanisms by which fitness-seeking motivations might lead to human disempowerment, and propose mitigations to them. While the analysis is inherently speculative, this kind of speculation seems worthwhile: AI control emerged from explicitly taking scheming motivations seriously and asking what interventions are implied, and my hope is that developing mitigations for fitness-seeking will benefit from similar forward-looking analysis.

Fitness-seekers are, in many ways, notably safer than what I’ll call “classic schemers”. A classic schemer is an intelligent adversary with unified motivations whose fulfillment requires control over the whole world's resources. Meanwhile, fitness-seeking instances generally don’t share a common goal (e.g., a reward-seeker only cares about the reward for its current actions), and many fitness-seekers would be satisfied with modest-to-trivial [...]

---

Outline:
(03:50) Overview
(11:32) The basic reasons fitness-seekers might be safer than classic schemers
(16:01) Four mechanisms for risk and their mitigations
(16:46) Potemkin work
(22:13) Instability
(27:49) Manipulation
(31:25) Outcome enforcement
(36:52) Cross-cutting mitigations
(37:18) Deals
(40:05) Control
(44:45) Alignment
(44:48) Preventing fitness-seeking from arising
(48:51) Making any fitness-seeking motivations safer
(51:32) How does online training change the picture?
(55:09) Overall recommendations
(59:12) Conclusion

The original text contained 18 footnotes which were omitted from this narration.

---

First published: May 1st, 2026

Source: https://blog.redwoodresearch.org/p/risk-from-fitness-seeking-ais-mechanisms

---

Narrated by TYPE III AUDIO [https://type3.audio/?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Narrated+by+TYPE+III+AUDIO&utm_term=redwood_research&utm_campaign=ai_narration].

“Advice for making robust-to-training model organisms” by Alek Westover, Sebastian Prasanna, Vivek Hebbar, Julian Stastny

We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers like Roger et al. and Ryd et al.) have observed them to stop misbehaving after untargeted training, that is, training that doesn’t directly target the misbehavior. For example, we have observed that simple untargeted training methods like “train the model to talk like a pirate” are effective against many model organisms that we have created, including many replications of prior work like Hubinger et al., Greenblatt et al., and Ryd et al.

Fragile model organisms aren’t very useful for technique development: when a sophisticated technique succeeds on one, you can’t tell whether the technique is good or the model organism is weak. For instance, if you come up with some complicated technique for generating high-quality responses and find that SFTing on these removes the bad behavior, that may just be because almost any SFT would have removed it.

This post identifies factors that make model organisms more robust to untargeted training. Our main findings are: 1. Prompted model organisms [...]
---

Outline:
(03:11) Experimental setup
(06:12) Main results
(06:15) Result 1: Untargeted training removes misalignment from prompted model organisms without degrading other capabilities
(12:32) Result 2: FWFT model organisms are more robust than LoRA model organisms
(15:09) Result 3: Backdoor behavior and trigger affect robustness
(16:37) Result 4: Password locking makes model organisms less robust
(20:20) Conclusion
(21:29) Factors that don't help much
(21:54) 1: Training longer doesn't seem to increase robustness past a certain point for our backdoors
(22:33) 2: CoT-distilling doesn't increase robustness for our backdoors
(23:07) 3: Using larger models doesn't substantially increase robustness for 2 of our backdoors
(24:44) 4: There aren't clear trends for what types of backdoor triggers and behaviors are most robust
(25:32) 5: SOAP Optimizer doesn't help much
(26:35) 6: Weight Decay doesn't help
(27:18) Appendix
(27:21) Appendix 1: Pirate SFT is better than QA SFT
(27:45) Appendix 2: Are blue team methods wiping previous training?
(29:30) Appendix 3: Backdoor list

The original text contained 10 footnotes which were omitted from this narration.

---

First published: May 28th, 2026

Source: https://blog.redwoodresearch.org/p/advice-for-making-robust-to-training

---

Narrated by TYPE III AUDIO [https://type3.audio/?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Narrated+by+TYPE+III+AUDIO&utm_term=redwood_research&utm_campaign=ai_narration].

---

Images from the article:
Diagram showing three methods of addressing backdoor prompts with increasing robustness.
Grid of 15 line graphs.
Violin plots comparing Pareto frontier AUC across red team training methods for Pirate and QA SFT LoRA.
Table comparing backdoor types with prompts, showing normal and explicit trigger backdoors.
Table showing AI behavior based on problem type and password presence.
Table showing six backdoor types with examples and definitions for AI assistants.
Three violin plots comparing Pareto hull AUC across backdoor categories: Trigger, Sentiment, and Scope variations.

May 28th, 2026, 32 min
