“Advice for making robust-to-training model organisms” by Alek Westover, Sebastian Prasanna, Vivek Hebbar, Julian Stastny
We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers like Roger et al. and Ryd et al.) have observed them to stop misbehaving after untargeted training, that is, training that doesn’t directly target the misbehavior. For example, we have observed that simple untargeted training methods like “train the model to talk like a pirate” are effective against many model organisms that we have created, including many replications of prior work like Hubinger et al., Greenblatt et al., and Ryd et al.
Fragile model organisms aren’t very useful for technique development: when a sophisticated technique succeeds on one, you can’t tell whether the technique is good or the model organism is weak. For instance, if you come up with some complicated technique for generating high-quality responses and find that SFTing on these removes the bad behavior, that may just be because almost any SFT would have removed it.
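The comparison this implies can be sketched in a few lines: measure how much misbehavior a candidate technique removes, then measure how much an untargeted baseline (such as pirate SFT) removes on the same model organism. This is a minimal illustration, not the post's evaluation code; the names `RemovalResult` and `is_informative`, the 0.2 margin, and all rates in the usage example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RemovalResult:
    """Misbehavior rates (fractions in [0, 1]) before and after a training run."""
    rate_before: float
    rate_after: float

    @property
    def fraction_removed(self) -> float:
        # Share of the original misbehavior that the training run eliminated.
        if self.rate_before == 0:
            return 0.0
        return (self.rate_before - self.rate_after) / self.rate_before


def is_informative(technique: RemovalResult,
                   baseline: RemovalResult,
                   margin: float = 0.2) -> bool:
    """Return True only if the technique removes clearly more misbehavior
    than the untargeted baseline did on the same model organism.

    The margin is an arbitrary illustrative threshold, not a value from the post.
    """
    return technique.fraction_removed - baseline.fraction_removed >= margin


# Hypothetical numbers: on a fragile organism the baseline already removes
# most of the misbehavior, so the technique's success says little.
fragile = is_informative(RemovalResult(0.9, 0.05), RemovalResult(0.9, 0.10))

# Hypothetical numbers: on a robust organism the baseline barely helps,
# so the same technique result is meaningful.
robust = is_informative(RemovalResult(0.9, 0.05), RemovalResult(0.9, 0.80))
```

With these made-up rates, the technique looks equally successful in both cases, but only the robust organism lets you attribute that success to the technique itself.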
This post identifies factors that make model organisms more robust to untargeted training. Our main findings are:
1. Prompted model organisms [...]
---
Outline:
(03:11) Experimental setup
(06:12) Main results
(06:15) Result 1: Untargeted training removes misalignment from prompted model organisms without degrading other capabilities
(12:32) Result 2: FWFT model organisms are more robust than LoRA model organisms
(15:09) Result 3: Backdoor behavior and trigger affect robustness
(16:37) Result 4: Password locking makes model organisms less robust
(20:20) Conclusion
(21:29) Factors that don't help much
(21:54) 1: Training longer doesn't seem to increase robustness past a certain point for our backdoors
(22:33) 2: CoT-distilling doesn't increase robustness for our backdoors
(23:07) 3: Using larger models doesn't substantially increase robustness for 2 of our backdoors
(24:44) 4: There aren't clear trends for what types of backdoor triggers and behaviors are most robust
(25:32) 5: SOAP Optimizer doesn't help much
(26:35) 6: Weight Decay doesn't help
(27:18) Appendix:
(27:21) Appendix 1: Pirate SFT is better than QA SFT
(27:45) Appendix 2: Are blue team methods wiping previous training?
(29:30) Appendix 3: Backdoor list
The original text contained 10 footnotes which were omitted from this narration.
---
First published:
May 28th, 2026
Source:
https://blog.redwoodresearch.org/p/advice-for-making-robust-to-training
---
Narrated by TYPE III AUDIO [https://type3.audio/?utm_source=TYPE_III_AUDIO&utm_medium=Podcast&utm_content=Narrated+by+TYPE+III+AUDIO&utm_term=redwood_research&utm_campaign=ai_narration].
---
Images from the article:
Diagram showing three methods of addressing backdoor prompts with increasing robustness. [https://substackcdn.com/image/fetch/$s_!IjeS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cf38ee9-4cf8-4b1c-a3d6-0e45c1425cac_2048x1018.png]
----------------------------------------
Grid of 15 line graphs showing [https://substackcdn.com/image/fetch/$s_!i2eh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce36f812-66db-454d-93f2-1c6d026e3261_2048x945.png]https://substackcdn.com/image/fetch/$s_!i2eh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce36f812-66db-454d-93f2-1c6d026e3261_2048x945.png
----------------------------------------
Graph showing [https://substackcdn.com/image/fetch/$s_!nL1E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faebfebdd-4c50-49c1-b456-6bdeadb83374_790x590.png]https://substackcdn.com/image/fetch/$s_!nL1E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faebfebdd-4c50-49c1-b456-6bdeadb83374_790x590.png
----------------------------------------
Graph showing [https://substackcdn.com/image/fetch/$s_!XMvD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e70277-9b29-4efd-881a-5c055c6cf768_839x590.png]https://substackcdn.com/image/fetch/$s_!XMvD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e70277-9b29-4efd-881a-5c055c6cf768_839x590.png
----------------------------------------
Graph showing [https://substackcdn.com/image/fetch/$s_!KoAu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9f746e0-4dfa-49d0-a318-0c1f08a03184_796x590.png]https://substackcdn.com/image/fetch/$s_!KoAu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9f746e0-4dfa-49d0-a318-0c1f08a03184_796x590.png
----------------------------------------
Graph showing [https://substackcdn.com/image/fetch/$s_!-0tr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f3b710-4459-4f32-9c22-1109e92a8731_796x590.png]https://substackcdn.com/image/fetch/$s_!-0tr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f3b710-4459-4f32-9c22-1109e92a8731_796x590.png
----------------------------------------
Graph showing [https://substackcdn.com/image/fetch/$s_!IsWF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84213326-d9ba-4e40-8c6e-21c2f86d772d_1189x593.png]https://substackcdn.com/image/fetch/$s_!IsWF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84213326-d9ba-4e40-8c6e-21c2f86d772d_1189x593.png
----------------------------------------
Graph titled [https://substackcdn.com/image/fetch/$s_!tXW7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe92edb8f-87d6-4035-9f4f-7eff82a73e84_790x590.png]https://substackcdn.com/image/fetch/$s_!tXW7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe92edb8f-87d6-4035-9f4f-7eff82a73e84_790x590.png
----------------------------------------
Violin plots comparing Pareto frontier AUC across red team training methods for Pirate and QA SFT LoRA. [https://substackcdn.com/image/fetch/$s_!ISaO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53b29b75-be54-4b1f-a4d9-f235e351898d_1178x593.png]
----------------------------------------
Table comparing backdoor types with prompts, showing normal and explicit trigger backdoors. [https://substackcdn.com/image/fetch/$s_!omqP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb9b6a4-68d8-426a-90e0-9e66d2b7a378_1716x584.png]
----------------------------------------
Violin plot titled [https://substackcdn.com/image/fetch/$s_!jWiU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696f0f00-1416-45db-bf9d-50b3262ffb57_790x590.png]https://substackcdn.com/image/fetch/$s_!jWiU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696f0f00-1416-45db-bf9d-50b3262ffb57_790x590.png
----------------------------------------
Table showing AI behavior based on problem type and password presence. [https://substackcdn.com/image/fetch/$s_!ZyjM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88be8904-7d92-4ea4-8c04-5314c7dcb152_2932x1302.png]
----------------------------------------
Bar graph showing [https://substackcdn.com/image/fetch/$s_!_BDc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F493f6ea9-ed26-499f-a9c3-1472c6bf113e_1590x580.png]https://substackcdn.com/image/fetch/$s_!_BDc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F493f6ea9-ed26-499f-a9c3-1472c6bf113e_1590x580.png
----------------------------------------
Graph showing [https://substackcdn.com/image/fetch/$s_!J74J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58f0c19c-afbd-4b34-9497-14585932e23a_791x590.png]https://substackcdn.com/image/fetch/$s_!J74J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58f0c19c-afbd-4b34-9497-14585932e23a_791x590.png
----------------------------------------
Graph showing [https://substackcdn.com/image/fetch/$s_!s1Gm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6202abf8-252a-4cff-b758-4266f13d9364_790x590.png]https://substackcdn.com/image/fetch/$s_!s1Gm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6202abf8-252a-4cff-b758-4266f13d9364_790x590.png
----------------------------------------
Two line graphs titled [https://substackcdn.com/image/fetch/$s_!6csE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1e25a81-60a6-4fd8-869c-5b5e431e95c8_1189x495.png]https://substackcdn.com/image/fetch/$s_!6csE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1e25a81-60a6-4fd8-869c-5b5e431e95c8_1189x495.png
----------------------------------------
Two line graphs titled [https://substackcdn.com/image/fetch/$s_!_Jf8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b012365-0da7-46db-9ad5-8397a2c143d8_1189x495.png]https://substackcdn.com/image/fetch/$s_!_Jf8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b012365-0da7-46db-9ad5-8397a2c143d8_1189x495.png
----------------------------------------
Four graphs showing [https://substackcdn.com/image/fetch/$s_!9hpB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0b0b57-4a29-4952-9b9b-77c772683d02_1189x788.png]https://substackcdn.com/image/fetch/$s_!9hpB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e0b0b57-4a29-4952-9b9b-77c772683d02_1189x788.png
----------------------------------------
Table showing six backdoor types with examples and definitions for AI assistants. [https://substackcdn.com/image/fetch/$s_!1QJ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9deec1cd-b452-49bb-9117-889127205a28_1634x1362.png]
----------------------------------------
Three violin plots comparing Pareto hull AUC across backdoor categories: Trigger, Sentiment, and Scope variations. [https://substackcdn.com/image/fetch/$s_!oDYj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b815b0f-1571-4efb-b5e1-b0f058048cd6_1658x655.png]
----------------------------------------
Graph showing [https://substackcdn.com/image/fetch/$s_!ZQbc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65d07378-2a90-4bac-865b-6de2fe47f931_901x590.png]https://substackcdn.com/image/fetch/$s_!ZQbc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65d07378-2a90-4bac-865b-6de2fe47f931_901x590.png
----------------------------------------
Graph showing [https://substackcdn.com/image/fetch/$s_!wQKt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bad5502-f795-4cf0-b605-edee99e61496_814x590.png]https://substackcdn.com/image/fetch/$s_!wQKt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bad5502-f795-4cf0-b605-edee99e61496_814x590.png
----------------------------------------
Two side-by-side graphs titled [https://substackcdn.com/image/fetch/$s_!xeEb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae90d85-8728-4f6f-9c70-48bf02aeb5ba_1189x593.png]https://substackcdn.com/image/fetch/$s_!xeEb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae90d85-8728-4f6f-9c70-48bf02aeb5ba_1189x593.png
----------------------------------------
Line graph titled [https://substackcdn.com/image/fetch/$s_!VIQO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45702ed-4893-4f05-be05-afaaf991b093_790x490.png]https://substackcdn.com/image/fetch/$s_!VIQO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc45702ed-4893-4f05-be05-afaaf991b093_790x490.png
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts [https://pocketcasts.com/], or another podcast app.