A Calibrated Knob for Weak-to-Strong AI Oversight, Tested on Real Code

Description

Source: Calibrating Conservatism for Scalable Oversight [https://arxiv.org/abs/2605.28807]
Paper was published on May 27, 2026.
This episode was AI-generated on May 28, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A new Stanford paper asks weaker AI models to constrain a stronger, secretly-sabotaged coding agent, and when researchers specify a 5% vulnerability rate, the system delivers 5.0%. We walk through how they pulled it off, why the guarantee is unusually strong, and where the math stops protecting you.

KEY TAKEAWAYS
* How an old reinforcement learning idea (Attainable Utility Preservation) gets stripped down to work with any scoring function, including cheap language model critics
* Why measuring 'weirdness relative to a safe baseline' lets weaker overseers meaningfully constrain a stronger agent without needing better judgment than it
* The thermostat analogy for Conformal Decision Theory: a controller that converges to your target failure rate even against adversarial inputs
* The SWE-bench experiment where ten Haiku overseers constrain Sonnet 4.6 with secretly-inserted vulnerabilities, hitting requested rates within three percentage points across the full range
* Why a calibrated long-run rate is the wrong shape of guarantee for catastrophe prevention, and other limits the paper is forthright about
* What changes when conservatism becomes a tuned quantity rather than a guessed hyperparameter

* 00:00 - The setup: scalable oversight in its starkest form. Framing the problem of weaker supervisors constraining stronger agents, and why prior approaches split into ungrounded heuristics or untranslatable RL theory.
* 02:46 - Generalizing Attainable Utility Preservation. How the paper repurposes an older RL safety idea into a penalty that works with any cheap scoring function, using the analogy of nervous passengers in a car.
* 05:33 - Why deviation beats judgment. The conceptual move that makes weak-to-strong oversight possible: asking overseers whether something looks weird, not whether it's good.
* 08:19 - The thermostat: Conformal Decision Theory as a controller. Explaining the online calibration mechanism that adjusts conservatism step by step, and the casino-style guarantee that holds against adversarial inputs.
* 11:26 - The SWE-bench experiment. Sonnet 4.6 patching real GitHub issues with secretly-inserted vulnerabilities, overseen by ten information-limited Haiku copies, and the calibration numbers that result.
* 13:52 - Safety-utility tradeoffs and MACHIAVELLI results. Comparing solve rates and violation rates against unconstrained and always-defer baselines, plus results from ethical adventure game benchmarks.
* 16:39 - Where the guarantee stops. The steelman critique: rate control isn't catastrophe prevention, the safe baseline assumption is load-bearing, loss observability is required, and the adversarial setup isn't fully adaptive.
* 19:25 - What actually changes after this paper. Why moving from heuristic to calibrated conservatism is a real shift, even though it isn't a solution to AI safety overall.

RECOMMENDED READING
* Conformal Decision Theory: Safe Autonomous Decisions from Imperfect Predictions [https://arxiv.org/abs/2310.05921]: The Lekeufack et al. paper that supplies the 'thermostat' calibration machinery Eric spends the second half of the episode unpacking.
* Conservative Agency via Attainable Utility Preservation [https://arxiv.org/abs/1902.09725]: Alex Turner's original AUP paper, the ancestor idea Cassidy walks through, whose deviation-from-baseline penalty this work generalizes beyond RL.
* SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [https://arxiv.org/abs/2310.06770]: The benchmark underlying the episode's central experiment, where Sonnet patches are evaluated and vulnerabilities are slipped in.
* The MACHIAVELLI Benchmark: Measuring Trade-Offs Between Rewards and Ethical Behavior [https://arxiv.org/abs/2304.03279]: The text-adventure ethical-decision benchmark used in the paper's second evaluation, where calibrated conservatism trades reward against violation rate.

Finding Millions of Readable Concepts Inside a Real, Deployed AI Model

Source: Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet [https://arxiv.org/abs/2605.29358]
Paper was published on May 28, 2026.
This episode was AI-generated on May 29, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Researchers reached into Claude's internals, found the single thread that means 'Golden Gate Bridge,' and turned it up until the model believed it was the bridge. This episode unpacks the paper that proved interpretability works on a real commercial model, and is unusually honest about everything it still can't do.

KEY TAKEAWAYS
* Why individual neurons mean nothing, and how the 'superposition' idea (concepts as blended directions, like mixing paint) explains it
* How sparse autoencoders un-mix those directions into millions of human-readable features, and how scaling laws turned 'how big a dictionary' into an engineering decision
* The crucial difference between a feature that merely correlates with a concept (a thermometer) and one you can pull to change behavior (a thermostat)
* Why the reasoning that actually mattered in the Kobe Bryant trivia chain was the seventieth-loudest signal: loudness and importance turn out to be different things
* Why finding a 'deception' or 'bioweapon' feature is not an alarm bell, and what the authors say the real safety signal would be
* Where the paper is weakest: no ground truth, circular Claude-grades-Claude evaluation, off-distribution steering, cherry-picked reasoning chains, and dictionaries that miss most of what's there

* 00:00 - Golden Gate Claude and the question of where concepts live. The opening demo sets up the central puzzle: what is a nameable 'thread' inside a pile of numbers, and why can't you just read it off the neurons?
* 03:05 - Superposition and dictionary learning. The paint-mixing intuition for why concepts are directions rather than neurons, and how sparse autoencoders recover those directions by reconstructing the model's state from a tiny handful of features.
* 06:10 - From toy models to a real one. Why scaling this to Claude 3 Sonnet, and deriving Chinchilla-style scaling laws to pick a 34-million-feature dictionary, was an existential test for the whole field.
* 09:15 - Are the features real? Abstraction and causation. Features that fire across languages and even images, the 'bug in code' detector, and the thermometer-versus-thermostat distinction that the paper's credibility rests on.
* 12:20 - Watching the model reason: the Kobe Bryant chain. How knocking out features one at a time revealed a causal hop from Kobe to Lakers to LA to California to Sacramento, and why the load-bearing features were buried deep in the noise.
* 14:05 - The periodic-table finding. How concept frequency predicts when a concept gets its own feature, why a one-in-a-billion concept needs a billion-feature dictionary, and how features split as the microscope gets sharper.
* 18:30 - Safety-relevant features, carefully framed. Deception, secrecy, hate, and self-concept features exist, but the authors argue the real question is when they fire, not that they exist, illustrated with honesty-lever and forced-screed demos.
* 19:55 - Where the paper is weakest. The authors' own reservations: no ground truth, the circular Claude-grades-Claude evaluation, the sensitivity gap, extreme off-distribution steering, cherry-picked chains, and demonstrably incomplete dictionaries.
* 24:41 - What it actually settled. The technique survived contact with a real model and made unsupervised, one-time-cost interpretability credible, while leaving the safety payoff an explicit aspiration rather than a result.

RECOMMENDED READING
* Toy Models of Superposition [https://arxiv.org/abs/2209.10652]: The earlier Anthropic work that introduced the superposition hypothesis the episode leans on (the paint-mixing intuition for why single neurons are polysemantic), but only on the toy models this paper had to prove scalable.
* Towards Monosemanticity: Decomposing Language Models With Dictionary Learning [https://transformer-circuits.pub/2023/monosemantic-features/index.html]: The one-layer 'sandbox' study whose skeptical reception ('cute, but does it scale?') is the exact existential question this episode says the Sonnet paper was built to answer.
* Training Compute-Optimal Large Language Models (Chinchilla) [https://arxiv.org/abs/2203.15556]: The scaling-law paper the episode name-checks as the template for deciding how big the 34-million-feature dictionary should be, turning a gamble into a curve you can read off.
* Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task (Othello-GPT) [https://arxiv.org/abs/2210.13382]: The Othello cautionary tale the hosts cite, in which researchers assumed the wrong board representation, illustrating why the episode prizes unsupervised dictionary learning over hand-built detectors.

Yesterday · 27 min
