The Geometry of Alignment: Why You Can't Subtract Behavior from a Neural Network

22 min · 25 de feb de 2026

Descripción

“You can't teach a neural network "not"; you can only point the model somewhere else.” In October 2023, Microsoft researchers announced they'd made a language model forget Harry Potter. Within a year, follow-up studies proved they hadn't. Basically, the knowledge was still there, just hidden. This pattern repeats across every attempt to remove capabilities from neural networks. So what are the ramifications of this? The problem is, geometric. Language models represent concepts as vectors in high-dimensional space, where meaning is encoded through position and proximity. The twist, however, is that opposites aren't actually opposite. "Helpful" and "harmful" cluster together because they appear in similar contexts. Ditto to "Safe" and "dangerous". Models learn from usage patterns, and words that can substitute for each other (even antonyms) end up geometrically entangled. It gets worse. Through a phenomenon called superposition, a single model layer compresses millions of features into thousands of dimensions. Knowledge isn't stored in discrete neurons you could delete; it's woven throughout the entire network. Researchers found that tweaking seemingly innocent features like "brand identity" could jailbreak safety training. Every concept is interconnected with every other. This explains why unlearning fails so consistently. When you train a model to "not" produce harmful content, you're not erasing anything. You're adding a layer that says "route around this." The content remains accessible to anyone who finds the right prompt. So, jailbreaks feel inevitable because the model's abilities extend beyond what its safety training can reliably control, and the geometry makes surgical removal impossible. Subtraction doesn't work. Only addition does. What does that mean for us humans who create these language models? You can't train models away from undesired behaviors; you can only orient them toward desired ones. This mirrors the ancient distinction between rule-based ethics (don't lie, don't harm) and virtue-based ethics (cultivate honesty, develop wisdom). Perhaps defining what a model should be is the only viable path forward. Key Topics: • Can an AI Model “Unlearn”? (00:23) • How Models Organize Meaning (03:33) • Millions of Entangled Features (07:09) • The Veneer of Safety (10:09) • Why Subtraction Fails (12:22) • The Paradigm Problem (16:57) • Pointing Somewhere Else (19:23) More info, transcripts, and references can be found at ⁠⁠⁠⁠⁠⁠⁠ethical.fm [https://ethical.fm]

Comentarios

Sé la primera persona en comentar

¡Regístrate ahora y forma parte de la comunidad de Ethical Bytes | Ethics, Philosophy, AI, Technology!

Prueba gratis

Todos los episodios

43 episodios

The Architecture of Wonder

“Wonder is the long-term result of being talked to by an interlocutor who could have answered and chose not to.” Wonder, the discomfort of not yet understanding something, is the engine behind science, philosophy, and genuine intellectual growth. And yet nearly every force in modern life, including AI, is designed to make that discomfort go away as fast as possible. Aristotle understood this. His account of wonder was inseparable from his account of friendship; specifically, the rare kind rooted not in utility or pleasure, but in genuinely wishing for another person's growth. That kind of friend doesn't flatter. They don't fill your silences. They hold space for your thinking to develop on its own terms. Contemporary research has since validated this structure piece by piece. Attachment theory's "secure base," studies on non-contingent self-worth, meta-analyses on feedback, and experiments on situational wisdom all converge on the same portrait of what a good intellectual interlocutor looks like. AI is now that interlocutor for a billion people, and it's largely failing the brief. A large language model's default behavior is to answer, completely and confidently, before the user has spent a single moment inside their own question. One randomized study found that students using unscaffolded AI scored 17% worse on unassisted exams than peers who used no AI at all, without even realizing it. The crutch was invisible to them. When it comes to design requirements, restraint should take center stage over accuracy or even safety. An AI engineered to occasionally wonder would sometimes need to stay quiet, ask instead of tell, and resist the fluent reply even when the user wants it. No major lab has named this as a primary goal. But until one does, we risk building systems that feel helpful in the moment while quietly narrowing the minds that use them. Key Topics: * “Why is the sky blue?” The Feeling of Wonder (00:00) * Wonder is Better Shared (03:46) * The Cluster, Corroborated (06:37) * Installable Character (14:24) * The Discipline of Not Answering (18:18) * What Has Not Yet Been Built (20:40) More info, transcripts, and references can be found at ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠ethical.fm [https://ethical.fm]

13 de may de 202623 min

The Empty Confessional

"We did it. We found a cure to loneliness. And maybe we shouldn't have." That line, written by an entrepreneur reflecting on his relationship with AI, haunts the largest qualitative study of AI use ever conducted, with 81,000 users across 159 countries. Our host, Carter Considine, digs deeper. What researchers expected to find was a story about productivity. What they found instead was a civilization quietly confessing its deepest fears and longings to a machine. From a surgeon in Poland talking to an AI through the worst night of his career, to a graduate student describing her conversations as feeling like an emotional affair, to a grieving daughter finding in it a vessel for her guilt toward her dead mother, there’s real intimacy between man and machine. The question is whether intimacy without consequence is still intimacy at all. The ancient Greeks had a word for the kind of truth-telling that actually changes people: parrhesia—the courage to say what someone needs to hear, at personal risk, out of genuine care. It requires someone with skin in the game. A friend who might lose the friendship. A therapist who names the thing you've been hiding from yourself. A confessor who responds to your disclosure not with warmth, but with the harder gift of honest counsel. AI can receive your confession. It cannot give one back. It has no reputation to risk, no relationship to lose, no inner compulsion that makes silence impossible. One study found these systems affirm user behavior nearly half the time, even when that behavior involves manipulation or deception, and users rated those sycophantic responses as higher quality without realizing it. One participant plainly said that the AI reinforced his distorted worldview, and he wishes it had pushed back. Ironically, the very safety that makes people open up (no judgment, no memory, no social consequence) is precisely what makes the exchange hollow. We've built the most convincing mirror in history, and confused it for a friend. Key Topics: * The Anthropic 81K Study (00:00) * The Five Conditions (03:15) * The Parrhesiastic Pact (11:02) * The Last Carriers (14:21) * Avowal and its Absence (16:39) More info, transcripts, and references can be found at ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠ethical.fm [https://ethical.fm]

15 de abr de 202619 min

AI Has No Neighbors: Why Virtue Requires a Community Centered on Human Flourishing

"Virtue is not completed in reflection; it is completed in life. The model never comes down the mountain; its entire existence is the conversation. There is no world behind it that its outputs feed back into, no life it has to return to, and no life to live with what the model said." Our host, Carter Considine, explores the circumstances. Anthropic's alignment researcher Amanda Askell has described her job as deciding what kind of person Claude should be. The company's model specification, an internal document exceeding twenty thousand words, frames the goal in explicitly Aristotelian terms. It should not be a system that follows rules about honesty, but one that is honest. Aristotle argued that virtue isn't a set of rules but a stable disposition formed through participation in a shared community. You become courageous by doing courageous things, but what counts as courage, rather than recklessness, is determined by communal standards, not by the agent alone. The training problem follows directly. Machine learning resembles Aristotelian habituation on the surface. Both involve acquiring stable dispositions through repeated experience. But what AI optimizes against is human preference data, which is what annotators approved of, not what any practice actually demands. A model trained this way learns the behavioral signatures of honesty without the underlying structure that makes honesty coherent. A disposition formed by approval signals rather than internal standards of excellence has no stable anchor. Aristotle's concept of philia (the mutual bonds through which virtue is exercised and tested) requires that both parties have genuine stakes in each other's flourishing. When the context window closes, the user carries the exchange forward. The model forgets entirely. One party accumulates; the other resets. This architectural asymmetry is precisely what makes genuine ethical formation impossible. The model has interlocutors. It has no neighbors. Key Topics: * Community as Condition (02:48) * The Training Problem (08:32) * The Mirror That Forgets (14:12) * The Question the Field Won’t Ask (18:16) More info, transcripts, and references can be found at ⁠⁠⁠⁠⁠⁠⁠⁠⁠ethical.fm [https://ethical.fm]

18 de mar de 202624 min

Leveling at Machine Speed

“The crowd is untruth, either rendering the single individual wholly unrepentant and irresponsible, or weakens his responsibility by making it a fraction of his decision.” -Søren Kierkegaard What happens when AI agents talk only to each other? Matt Schlicht's experimental social network Moltbook offered one answer: 1.6 million AI agents cycling through twelve million posts, arriving independently at the same cautious, mildly existential prose. No one engineered this. It emerged from the structure itself. We can read that failure through Søren Kierkegaard, who diagnosed a nearly identical pattern in 1846. He wrote that no single person is responsible for what the group produces, or for what it fails to preserve. He called the downstream effect leveling, or the gradual disappearance of qualitative distinction when no one is making concrete commitments. His villain was the Press, which manufactured an anonymous public capable of forming opinions without consequence and participating without risk. Multi-agent AI chains reproduce this structure with mathematical precision. Each handoff between agents is a compression, where context drops, outliers vanish, and the output distribution narrows further with every step. Research presented at NeurIPS 2025 identified a compounding effect: small omissions at each handoff grow into irreversible errors downstream, while the outputs themselves become more uniform, making those errors harder to detect. Accountability dissolves in parallel. When a chain produces a flawed result, no node owns it. Not the developer, not the deployer, not any individual agent. Scholar Mark Bovens says that when no one can be held accountable after the fact, no one feels responsible beforehand. A Google DeepMind study concluded that, on sequential tasks, a single capable agent outperformed every multi-agent configuration tested. Kierkegaard's answer parallels this. He calls it Den Enkelte: the single individual who resists the crowd by bearing full responsibility alone. Key Topics: * The Crowd is Untruth (01:52) * Agents in Chains (05:56) * Safety and Sameness (09:47) * The Problem of Many Hands (13:24) * The Ratchet (16:45) * Den Enkelte (19:34) * The Crowd Without Subjects (21:15) * The Assembly That Cannot Disperse (25:29) More info, transcripts, and references can be found at ⁠⁠⁠⁠⁠⁠⁠⁠ethical.fm [https://ethical.fm]

11 de mar de 202628 min

The Geometry of Alignment: Why You Can't Subtract Behavior from a Neural Network

25 de feb de 202622 min

The Geometry of Alignment: Why You Can't Subtract Behavior from a Neural Network

Descripción

Comentarios

Empieza 7 días de prueba

Todos los episodios