AE Alignment Podcast
In this episode, AE Studio Research Director Mike Vaiana is joined by Research Scientist Keenan Pepper to explore a new approach to model self-interpretation: teaching language models to explain their own internal activations. They dive into Keenan’s recent paper on training lightweight adapters that transform activation vectors into soft tokens the model can interpret as language. The conversation walks through how this method improves on prior approaches like SelfIE, and how simple affine transformations can unlock surprisingly strong interpretability.

Mike and Keenan break down concrete examples, including how models can identify latent topics like “baseball” from internal states, and even surface hidden reasoning steps in multi-hop questions, offering a potential path toward detecting when models are reasoning, guessing, or even hiding information.

They also explore broader implications for AI alignment: from probing deception and internal representations, to enabling new forms of activation steering and self-monitoring. Along the way, they discuss attention schema theory, limitations of current labeling methods, and how this work could evolve into a general interface between model internals and human-understandable concepts.

In this episode:
* What self-interpretation of activations is
* How lightweight adapters improve interpretability without retraining models
* Why this approach could help uncover hidden reasoning and deception in LLMs

Learn more: https://ae.studio/alignment
Keenan's Research Paper: https://arxiv.org/abs/2602.10352
AE Studio is hiring: https://www.ae.studio/join-us
LinkedIn: https://www.linkedin.com/in/james-bowler-84b02a100/
6 episodes