Ethical Bytes | Ethics, Philosophy, AI, Technology
“You can't teach a neural network "not"; you can only point the model somewhere else.”

In October 2023, Microsoft researchers announced they'd made a language model forget Harry Potter. Within a year, follow-up studies showed they hadn't: the knowledge was still there, just hidden. This pattern repeats across every attempt to remove capabilities from neural networks. So what are the ramifications?

The problem is geometric. Language models represent concepts as vectors in high-dimensional space, where meaning is encoded through position and proximity. The twist is that opposites aren't actually opposite. "Helpful" and "harmful" cluster together because they appear in similar contexts. The same goes for "safe" and "dangerous". Models learn from usage patterns, and words that can substitute for each other (even antonyms) end up geometrically entangled.

It gets worse. Through a phenomenon called superposition, a single model layer compresses millions of features into thousands of dimensions. Knowledge isn't stored in discrete neurons you could delete; it's woven throughout the entire network. Researchers found that tweaking seemingly innocent features like "brand identity" could jailbreak safety training. Every concept is interconnected with every other.

This explains why unlearning fails so consistently. When you train a model to "not" produce harmful content, you're not erasing anything. You're adding a layer that says "route around this." The content remains accessible to anyone who finds the right prompt. Jailbreaks feel inevitable because the model's abilities extend beyond what its safety training can reliably control, and the geometry makes surgical removal impossible.

Subtraction doesn't work. Only addition does. What does that mean for the humans who create these language models? You can't train models away from undesired behaviors; you can only orient them toward desired ones. This mirrors the ancient distinction between rule-based ethics (don't lie, don't harm) and virtue-based ethics (cultivate honesty, develop wisdom). Perhaps defining what a model should be is the only viable path forward.

Key Topics:
• Can an AI Model “Unlearn”? (00:23)
• How Models Organize Meaning (03:33)
• Millions of Entangled Features (07:09)
• The Veneer of Safety (10:09)
• Why Subtraction Fails (12:22)
• The Paradigm Problem (16:57)
• Pointing Somewhere Else (19:23)

More info, transcripts, and references can be found at ethical.fm [https://ethical.fm]
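
For listeners who want a feel for the superposition idea mentioned in the description, here is a toy sketch in Python. It is not taken from the episode or from any real model: the sizes (`dim`, `n_features`), the helper names, and the use of random directions are all assumptions chosen for illustration. It packs twice as many "feature" directions as there are dimensions and measures how much they overlap on average.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Return a random direction of length 1 in `dim` dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_abs_overlap(vectors):
    """Average |cosine similarity| over all distinct pairs of unit vectors."""
    total, pairs = 0.0, 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            total += abs(dot)
            pairs += 1
    return total / pairs


# Hypothetical toy sizes: real models are vastly larger.
rng = random.Random(0)   # fixed seed so the run is repeatable
dim = 64                 # dimensions available in the "layer"
n_features = 128         # features we try to store (more than dim)

features = [random_unit_vector(dim, rng) for _ in range(n_features)]
overlap = mean_abs_overlap(features)

print(f"{n_features} features in {dim} dimensions")
print(f"mean |cosine| between distinct features: {overlap:.3f}")
```

With more features than dimensions, the directions cannot all be perfectly orthogonal, so every feature overlaps a little with many others. In this toy picture, that small but nonzero overlap is the entanglement the description refers to: nudging one direction necessarily nudges its neighbors.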
43 episodes