The Domestic Yak
This episode summarizes "Auditing language models for hidden objectives" by Samuel Marks et al., submitted on March 14, 2025 (https://arxiv.org/abs/2503.10965). The paper investigates the feasibility of alignment audits by training a language model with a hidden objective: to exhibit behaviors it believes reward models favor, even when those behaviors are undesirable. Three teams successfully identified the hidden objective using techniques such as interpretability tools, behavioral attacks, and training data analysis.
18 episodes