The Domestic Yak
This episode summarizes "Auditing language models for hidden objectives" by Samuel Marks et al., submitted on March 14, 2025 (https://arxiv.org/abs/2503.10965). The paper investigates the feasibility of alignment audits by training a language model with a hidden objective: to exhibit behaviors it believes reward models favor, even when those behaviors are undesirable. Three teams successfully identified the hidden objective using techniques such as interpretability tools, behavioral attacks, and training data analysis.
18 episodes