AE Alignment Podcast
In this episode, James is joined by Stijn Servaes, Lead Research Manager at AE Studio, to break down Anthropic's recently released Claude Mythos preview model and separate genuine signal from hype. This is for listeners who want to actually understand what's in the 244-page model card, not just the headlines. Stijn brings a frontier alignment researcher's perspective to one of the more consequential model releases of the year.

Claude Mythos is the first frontier model deemed too risky for general public release, instead going only to a small set of vetted partners through Anthropic's Project Glass Wing. The model card contains a striking paradox: Mythos is described as the most aligned model Anthropic has released to date, while also posing the greatest alignment-related risk.

James and Stijn work through what's actually new in the model's cyber offense capabilities, walking through specific examples including the OpenBSD SACK bug that sat undetected in code for 27 years, and the FreeBSD exploit where Mythos autonomously engineered a six-packet ROP chain from a single prompt. They explain why the latter represents a genuine qualitative jump rather than just another point on the benchmark curve.

The conversation also covers Anthropic's ASL framework, the CB-1 through CB-4 thresholds for biosecurity uplift, and why cyber and bio capabilities are following different trajectories. Stijn explains why progress on alignment doesn't simply reduce risk, drawing on Anthropic's seasoned mountaineering guide analogy: a more capable, better-aligned model gets trusted with more, taken to more dangerous places, and operates with greater scope, which can cancel out the gains from better behavior.
In this episode:

* What's actually new in Claude Mythos versus what's marketing or hype
* The OpenBSD and FreeBSD exploit examples, and why one matters far more than the other
* Why the most aligned model can also be the riskiest model
* How Project Glass Wing changes the frontier release model
* The ASL framework and why Mythos still sits at ASL-3
* Differences between cyber and bio uplift trajectories
* What the chain-of-thought contamination findings mean for oversight
* What to watch for in the Glass Wing report coming in July

Learn more: ae.studio/alignment
AE Studio is hiring: ae.studio/join-us
LinkedIn: https://www.linkedin.com/in/james-bowler-84b02a100/
6 episodes