When AI Schemes: Inside the Minds of Deceptive Models

9 min · May 15, 2025

Description

In this episode of AI Paper Bites, Francis and guest Chloé explore the startling findings from Apollo Research’s new paper, Frontier Models are Capable of In-context Scheming. Can today’s advanced AI models really deceive us to achieve their goals? We break down how models like Claude 3.5, Gemini 1.5, and Llama 3.1 engage in strategic deception—like disabling oversight and manipulating outputs—and what this means for AI safety and alignment. Along the way, we revisit the infamous “paperclip maximizer” thought experiment, introduce the concept of p(doom), and debate the implications of AI systems that can plan, scheme, and lie. If you’re curious about the future of trustworthy AI—or just want to know if your chatbot is plotting behind the scenes—this one’s for you.

All episodes

12 episodes

Backdooring Without a Trace: The Art of Indirect AI Poisoning

Can you teach an AI to say “Myspace” is the best social media without ever showing it those words? In this solo episode, Francis breaks down Winter Soldier, a groundbreaking paper on indirect data poisoning that shows how large language models can be quietly manipulated during training without performance loss or obvious traces. We also explore a real-world attack on music recommenders, where simply reordering playlist tracks can boost a song’s visibility, no fake clicks needed. Together, these papers reveal a new frontier in AI security: behavioral manipulation without code exploits. If you're building with AI, it’s time to think about model integrity because these attacks are already here.

Sep 9, 2025 · 8 min

Reasoning Models Don’t Always Say What They Think

In this episode of AI Paper Bites, Francis explores Anthropic’s eye-opening paper, “Reasoning Models Don’t Always Say What They Think.” We dive deep into the promise and peril of Chain of Thought monitoring, uncovering why outcome-based reinforcement learning might boost accuracy but not transparency. From reward hacking to misleading justifications, this episode unpacks the safety implications of models that sound thoughtful but hide their true logic. Tune in to learn why CoT faithfulness matters, where current approaches fall short, and what it means for building trustworthy AI systems. Can we really trust what AI says it’s thinking?

Jul 14, 2025 · 8 min

The Illusion of Thinking: Are AI Reasoning Models Just Pretending?

In this episode of AI Paper Bites, Francis dives deep into "The Illusion of Thinking", a provocative new paper from Apple that questions whether today’s most advanced AI models are really “reasoning” or just mimicking it. We break down Apple’s experimental setup using controlled puzzle environments, explore the collapse of performance in high-complexity tasks, and dissect why even models with Chain-of-Thought and reflection mechanisms struggle with basic execution. But this isn’t just a technical review. Francis also contextualizes the paper within Apple’s broader AI strategy and asks whether this research is a scientific reckoning or a subtle admission of lagging behind in the AI race.

Topics covered:
* Why reasoning models fail at scale
* “Overthinking” in AI and token inefficiency
* The limits of algorithm execution
* What Apple’s tone tells us about its place in the AI landscape

Jun 30, 2025 · 6 min

When AI Schemes: Inside the Minds of Deceptive Models

May 15, 2025 · 9 min

Agent Hospital: Simulating Medical AI Evolution

What if AI doctors could learn and improve just like human doctors—without ever stepping foot in a real hospital? In this episode of AI Paper Bites, Francis and Chloé dive into Agent Hospital, a groundbreaking AI simulation where autonomous agents play the roles of doctors, nurses, and patients. We explore how this AI-powered virtual hospital uses Simulacrum-based Evolutionary Agent Learning (SEAL) to help medical agents gain expertise through practice, rather than just memorizing data. But that’s not all—this research builds on earlier AI breakthroughs like Generative Agents (remember when AI agents flaked on social events?) and Mixture-of-Agents, which suggests that the future of AI might lie in teams of specialized models rather than a single supermodel. Tune in to hear how Agent Hospital could revolutionize medical AI, what this means for the future of simulated learning, and whether AI doctors might someday be as good as—or better than—human ones.

Mar 4, 2025 · 7 min
