AI Papers: A Deep Dive
BUILDING FORGETTING INTO A LANGUAGE MODEL WITH ONE EXTRA LINE OF CODE

Source: Natively Unlearnable Large Language Models [https://arxiv.org/abs/2606.13873]
Paper was published on June 11, 2026

This episode was AI-generated on June 15, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

What if you could delete everything a model knows about Harry Potter by flipping a switch: no retraining, no weights changed, and the content provably gone rather than just hidden? A new paper argues the long-assumed trade-off between models that learn well and models you can edit cleanly was never real. We walk through how the trick works, why it survives the attacks that break today's unlearning, and where the cleanness might be doing some quiet work.

KEY TAKEAWAYS

* Why today's post-hoc unlearning is a coat of paint: the 'forgotten' content comes flooding back in under ten fine-tuning steps
* The actual intervention: one extra line of code that masks a bank of 'sink' neurons, with each source's neurons chosen by a pseudo-random seed (so six million Wikipedia articles each get their own switch with no growth in model size)
* How knowledge sorts itself automatically: unique facts migrate to a source's private sinks via training interference, while shared knowledge stays in the backbone, with no hand-labeling
* Why the relearning and adversarial-prompt attacks that broke old methods fail here: the switched-off content tracks a model that never saw it at all, closer to amnesia than scar tissue
* The capability cost rounds to zero: roughly 56% on standard benchmarks, statistically indistinguishable from a plain transformer
* The catch worth scrutinizing: the 'off' condition routes queries to the nearest surviving source, which may inflate how cleanly the architecture preserves related knowledge. It also only works at 1B parameters and only for unlearning requests that respect pre-defined source boundaries

* 00:00 - The switch demo and why forgetting is hard
  The opening Harry Potter demo, and why facts smeared across billions of entangled weights make surgical removal a nightmare.
* 02:43 - Why post-hoc unlearning fails
  How current suppression methods only hide content, and how it returns in under ten fine-tuning steps or via clever prompts.
* 05:27 - The apparent conflict between learning and forgetting
  The tension between isolating sources for clean removal and sharing representations for a capable model, and why the field thought you had to pick.
* 08:11 - The mechanism: backbone, sink neurons, and seeds
  The workshop-and-lockers picture, the pseudo-random masking trick, and the emergent training dynamic that sorts unique knowledge into private sinks.
* 10:55 - Testing at scale: six million Wikipedia switches
  The billion-parameter Wikipedia experiment, the Truth Ratio metric, and how unique facts collapse while shared facts survive, matching a from-scratch retrain.
* 13:39 - The robustness tests on Harry Potter
  How the switched-off model resists relearning and adversarial extraction attacks, behaving like a model that never saw the books at near-zero capability cost.
* 16:23 - Pushback: routing, taxonomy, and what the cleanness hides
  A critique of how the 'off' condition routes to neighboring sources and the post-hoc 'inferred facts' category, and which results survive that scrutiny.
* 19:06 - Limitations and the data-attribution upside
  Open questions about frontier scale, pre-defined source boundaries, and post-training, plus the emerging promise of measuring each source's contribution.

RECOMMENDED READING

* Who's Harry Potter? Approximate Unlearning in LLMs [https://arxiv.org/abs/2310.02238]: The original Harry Potter unlearning paper using post-hoc fine-tuning, the exact approach this episode contrasts against as a 'coat of paint' that snaps back in ten steps.
* Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning [https://arxiv.org/abs/2404.05868]: Introduces NPO, one of the named post-hoc unlearning methods the episode critiques for degrading shared and topically-adjacent knowledge along with the target.
* TOFU: A Task of Fictitious Unlearning for LLMs [https://arxiv.org/abs/2401.06121]: The benchmark that popularized the Truth Ratio metric this episode leans on to measure whether a fact survives the unlearning switch.
* Eight Methods to Evaluate Robust Unlearning in LLMs [https://arxiv.org/abs/2402.16835]: Surveys relearning, compression, and jailbreak attacks that recover supposedly-forgotten content, exactly the robustness failures the episode's architecture aims to withstand.
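
The seed-based sink masking mentioned in the key takeaways can be illustrated with a small sketch. This is not the paper's implementation: the bank size, the number of sinks per source, the hashing scheme, and every function name below are assumptions made for illustration. The idea it shows is that a source's sink neurons can be re-derived from its identifier on demand, so no per-source table has to be stored, and that switching a source off amounts to zeroing its mask.

```python
from __future__ import annotations

import hashlib
import random

# Hypothetical sizes, chosen only for this sketch (not taken from the paper).
NUM_SINKS = 64         # size of the shared bank of sink neurons
SINKS_PER_SOURCE = 8   # how many sink neurons each source may write to


def sink_indices(source_id: str, num_sinks: int = NUM_SINKS,
                 k: int = SINKS_PER_SOURCE) -> list[int]:
    """Re-derive a source's sink neurons from a seed instead of storing them."""
    # Hash the source identifier into a stable integer seed (assumed scheme).
    digest = hashlib.sha256(source_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_sinks), k))


def sink_mask(source_id: str, enabled: bool = True,
              num_sinks: int = NUM_SINKS,
              k: int = SINKS_PER_SOURCE) -> list[float]:
    """Build a 0/1 mask over the sink bank; all zeros when the source is off."""
    mask = [0.0] * num_sinks
    if enabled:
        for i in sink_indices(source_id, num_sinks, k):
            mask[i] = 1.0
    return mask


def apply_mask(sink_activations: list[float], mask: list[float]) -> list[float]:
    """The 'one extra line' in spirit: an elementwise multiply on the sinks."""
    return [a * m for a, m in zip(sink_activations, mask)]


# Usage with a made-up source identifier and dummy activations.
activations = [1.0] * NUM_SINKS
switched_on = apply_mask(activations, sink_mask("wiki/Harry_Potter"))
switched_off = apply_mask(activations, sink_mask("wiki/Harry_Potter", enabled=False))
```

Because the mask is a pure function of the source identifier, adding more sources in this sketch does not add parameters; distinct sources may overlap on some sink neurons, and how the paper handles that overlap is not covered here.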