Keenan Pepper: Self-Interpretation in LLMs

Description

In this episode, AE Studio Research Director Mike Vaiana is joined by Research Scientist Keenan Pepper to explore a new approach to model self-interpretation: teaching language models to explain their own internal activations. They dive into Keenan’s recent paper on training lightweight adapters that transform activation vectors into soft tokens the model can interpret as language. The conversation walks through how this method improves on prior approaches like SelfIE, and how simple affine transformations can unlock surprisingly strong interpretability.

Mike and Keenan break down concrete examples, including how models can identify latent topics like “baseball” from internal states and even surface hidden reasoning steps in multi-hop questions, offering a potential path toward detecting when models are reasoning, guessing, or even hiding information.

They also explore broader implications for AI alignment: from probing deception and internal representations, to enabling new forms of activation steering and self-monitoring. Along the way, they discuss attention schema theory, limitations of current labeling methods, and how this work could evolve into a general interface between model internals and human-understandable concepts.

In this episode:
* What self-interpretation of activations is
* How lightweight adapters improve interpretability without retraining models
* Why this approach could help uncover hidden reasoning and deception in LLMs

Learn more: ae.studio/alignment
Keenan's Research Paper: https://arxiv.org/abs/2602.10352
AE Studio is hiring: https://www.ae.studio/join-us
LinkedIn: https://www.linkedin.com/in/james-bowler-84b02a100/

Mike Vaiana: What is AI Alignment, and Why Should You Care? (Part II)

In this episode, James is joined again by Mike Vaiana, R&D Director at AE Studio, for part two of their conversation on AI alignment. Where part one motivated why alignment matters, this episode goes a layer deeper into what alignment research actually is and how the work gets done day to day.

Mike walks through the main branches of the field: mechanistic interpretability, evaluations, and control. He explains why AE deliberately bets on neglected approaches rather than putting all its eggs in the mech interp basket, and why eval awareness, persona drift, and emergent misalignment make this harder than it looks from the outside. James and Mike trace the METR task-completion time horizon doubling curve and what a four-to-seven-month doubling time really implies when extrapolated out a few years.

The conversation gets concrete on what already goes wrong with today's models. They cover the Anthropic blackmail evaluation, specification gaming and reward hacking, and the emergent misalignment result where fine-tuning a model on a small amount of bad medical advice produces a broadly evil assistant that recommends Hitler as a dinner guest. They explain why "just turn it off" is not a serious answer once a system has goals, and why instrumental convergence on power and resources falls out of having almost any goal at all.

James and Mike then open the hood on how AE actually does alignment research: one-week agile sprints, vectoring meetings to find the highest-risk question, small-scale experiments designed to falsify ideas fast, and scaling curves from 100M up to 5B parameter pre-training runs aimed at convincing frontier labs to test methods at their scale.

They also discuss AE's DARPA seedling and the broader thesis behind it: that the bottleneck in alignment is not ML engineers but researchers with good ideas, and that pairing general-purpose ML talent with researchers (including non-traditional ones, like Princeton neuroscientist Michael Graziano) can unlock work that would otherwise never see the light of day.

In this episode:
* The main branches of alignment research and how they overlap
* Why AE prioritizes neglected approaches over well-funded ones
* The METR time-horizon doubling curve and what it implies
* Persona drift, eval awareness, and why evaluating frontier models is hard
* Why RLHF is the canonical example of an alignment technique with capability upside
* How AE runs research as one-week agile sprints
* The scaling-curve strategy for getting frontier labs to adopt new methods
* The DARPA seedling and AE's model for scaling research through ML engineering talent
* Three ICML 2026 acceptances, including a spotlight paper

Learn more: ae.studio/alignment
AE Studio is hiring: https://www.ae.studio/join-us
LinkedIn: https://www.linkedin.com/in/james-bowler-84b02a100/
Contact us: alignment@ae.studio

May 15, 2026 · 50 min
