Stijn Servaes: Claude Mythos - Superintelligence or Hype?

Description

In this episode, James is joined by Stijn Servaes, Lead Research Manager at AE Studio, to break down Anthropic's recently released Claude Mythos preview model and separate genuine signal from hype. This is for listeners who want to actually understand what's in the 244-page model card, not just the headlines. Stijn brings a frontier alignment researcher's perspective to one of the more consequential model releases of the year.

Claude Mythos is the first frontier model deemed too risky for general public release, instead going only to a small set of vetted partners through Anthropic's Project Glass Wing. The model card contains a striking paradox: Mythos is described as the most aligned model Anthropic has released to date, while also posing the greatest alignment-related risk.

James and Stijn work through what's actually new in the model's cyber offense capabilities, walking through specific examples including the OpenBSD SACK bug that sat undetected in code for 27 years, and the FreeBSD exploit where Mythos autonomously engineered a six-packet ROP chain from a single prompt. They explain why the latter represents a genuine qualitative jump rather than just another point on the benchmark curve.

The conversation also covers Anthropic's ASL framework, the CB-1 through CB-4 thresholds for biosecurity uplift, and why cyber and bio capabilities are following different trajectories. Stijn explains why progress on alignment doesn't simply reduce risk, drawing on Anthropic's seasoned mountaineering guide analogy: a more capable, better-aligned model gets trusted with more, taken to more dangerous places, and operates with greater scope, which can cancel out the gains from better behavior.
In this episode:
* What's actually new in Claude Mythos versus what's marketing or hype
* The OpenBSD and FreeBSD exploit examples, and why one matters far more than the other
* Why the most aligned model can also be the riskiest model
* How Project Glass Wing changes the frontier release model
* The ASL framework and why Mythos still sits at ASL-3
* Differences between cyber and bio uplift trajectories
* What the chain-of-thought contamination findings mean for oversight
* What to watch for in the Glass Wing report coming in July

Learn more: ae.studio/alignment
AE Studio is hiring: ae.studio/join-us
LinkedIn: https://www.linkedin.com/in/james-bowler-84b02a100/

Mike Vaiana: What is AI Alignment, and Why Should You Care? (Part II)

In this episode, James is joined again by Mike Vaiana, R&D Director at AE Studio, for part two of their conversation on AI alignment. Where part one motivated why alignment matters, this episode goes a layer deeper into what alignment research actually is and how the work gets done day to day. Mike walks through the main branches of the field: mechanistic interpretability, evaluations, and control. He explains why AE deliberately bets on neglected approaches rather than putting all its eggs in the mech interp basket, and why eval awareness, persona drift, and emergent misalignment make this harder than it looks from the outside. James and Mike trace the METR task-completion time horizon doubling curve and what a four-to-seven-month doubling time really implies when extrapolated out a few years.

The conversation gets concrete on what already goes wrong with today's models. They cover the Anthropic blackmail evaluation, specification gaming and reward hacking, and the emergent misalignment result where fine-tuning a model on a small amount of bad medical advice produces a broadly evil assistant that names Hitler as a dinner guest. They explain why "just turn it off" is not a serious answer once a system has goals, and why instrumental convergence on power and resources falls out of having almost any goal at all.

James and Mike then open the hood on how AE actually does alignment research: one-week agile sprints, vectoring meetings to find the highest-risk question, small-scale experiments designed to falsify ideas fast, and scaling curves from 100M up to 5B parameter pre-training runs aimed at convincing frontier labs to test methods at their scale.
They also discuss AE's DARPA seedling and the broader thesis behind it: that the bottleneck in alignment is not ML engineers but researchers with good ideas, and that pairing general-purpose ML talent with researchers (including non-traditional ones, like Princeton neuroscientist Michael Graziano) can unlock work that would otherwise never see the light of day.

In this episode:
* The main branches of alignment research and how they overlap
* Why AE prioritizes neglected approaches over well-funded ones
* The METR time-horizon doubling curve and what it implies
* Persona drift, eval awareness, and why evaluating frontier models is hard
* Why RLHF is the canonical example of an alignment technique with capability upside
* How AE runs research as one-week agile sprints
* The scaling-curve strategy for getting frontier labs to adopt new methods
* The DARPA seedling and AE's model for scaling research through ML engineering talent
* Three ICML 2026 acceptances, including a spotlight paper

Learn more: ae.studio/alignment
AE Studio is hiring: https://www.ae.studio/join-us
LinkedIn: https://www.linkedin.com/in/james-bowler-84b02a100/
Contact us: alignment@ae.studio

May 15, 2026 · 50 min
