An AI Designed Its Own Psychology Studies, Then Confirmed What It Found

Description

Source: Closing the Loop to Discover Psychological Theories with an Automated Cognitive Scientist [https://arxiv.org/abs/2606.26448]
Paper was published on June 24, 2026.
This episode was AI-generated on June 26, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A system called AutoCog designed psychology experiments, paid 250 real people to take them, diagnosed why its own theories failed, and rediscovered one of the deepest principles in decision science, then locked in predictions and confirmed them. It's the first time an AI has closed the full scientific loop with no researcher in the chair. But the headline runs ahead of the evidence, and the honest version turns out to be more interesting than either the hype or the cynicism.

KEY TAKEAWAYS
* How AutoCog closes the full discovery loop (designing experiments, paying online participants, diagnosing failures, and revising theories) with no human in the chair
* Why the system scores theories by whether they can generate human-like behavior rather than by fitting data, and why that guards against overfitting
* How three classic decision rules (Take-the-Best, Tallying, and WADD) collapse into endpoints of a single tunable dial
* The flagship 'discovery,' Diminishing Returns WADD, turns out to be a fresh instance of Kahneman and Tversky's prospect theory
* Where the headline overreaches: a friendly domain, a search that stayed local, an unaudited gap between the verbal theory and its code, and one thin confirmation
* Why the durable result may not be the finding itself, but the idea that theory-building can become an auditable, resumable trace instead of a private flash of insight

* 00:04 - Can you automate the creative leap? Sets up the frontier question: whether the irreducibly human act of inventing a better theory can be handed to an AI.
* 03:24 - Two blenders and three rival rules. Grounds the task in multi-attribute decision-making and lays out the three classic strategies the AI works with.
* 05:42 - Two lawyers, one self-correcting wheel. Walks through the four-stage loop of advocate agents, a simulate-before-collect gate, and a neutral arbiter that rebuilds the loser.
* 07:58 - Why it refuses to Photoshop the data. Explains the generate-don't-fit scoring metric and the unification pressure that quietly forces parsimony.
* 10:52 - Can it find an answer it was never given? The validation phase: recovering hidden strategies, even deliberately bizarre anti-theories, and reporting noise honestly instead of inventing structure.
* 14:52 - When three theories become one knob. The first live deployment on 250 real people, where the winning model unifies the three rivals as settings of a single dial.
* 17:33 - The discovery it didn't go looking for. One small change surfaces Diminishing Returns WADD, a concave curve that turns out to be prospect theory, confirmed by a preregistered study.
* 22:19 - Where the headline runs out ahead. The skeptic's case: a friendly domain, a local search, a known mechanism, and an unaudited verbal-to-code gap.
* 26:46 - The creative leap, finally logged. Why an auditable, resumable trace of the discovery process may matter more than the specific finding, and what comes next.

RECOMMENDED READING
* Centaur: a foundation model of human cognition [https://arxiv.org/abs/2410.20268]: The Helmholtz Munich line of work on behavioral foundation models that simulate human choices, the in-silico stand-in the episode flags as the conditional forward path for cheaper discovery loops.
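The "single tunable dial" takeaway can be made concrete with a small sketch. The notes do not give AutoCog's actual parameterization, so everything here is an assumption for illustration: the function name `dial_choice`, the exponent `gamma`, and the choice to weight each cue by its validity raised to `gamma`. Under that assumption, `gamma = 0` gives Tallying (all cues count equally), `gamma = 1` gives WADD (cues weighted by validity), and a large `gamma` approximates Take-the-Best (the most valid discriminating cue dominates).

```python
import numpy as np

def dial_choice(option_a, option_b, validities, gamma):
    """Pick between two options with one exponent-controlled rule.

    Hypothetical parameterization (not taken from the paper):
    each cue is weighted by (validity / max validity) ** gamma.
    Returns 0 for option A, 1 for option B, None on a tie.
    """
    a = np.asarray(option_a, dtype=float)
    b = np.asarray(option_b, dtype=float)
    v = np.asarray(validities, dtype=float)

    # gamma = 0 -> equal weights (Tallying)
    # gamma = 1 -> validity weights (WADD)
    # large gamma -> top cue dominates (approximates Take-the-Best)
    weights = (v / v.max()) ** gamma

    score = float(np.dot(weights, a - b))
    if np.isclose(score, 0.0):
        return None
    return 0 if score > 0 else 1

# Option A wins only on the most valid cue; option B wins on the other two.
validities = [0.9, 0.7, 0.6]
option_a = [1, 0, 0]
option_b = [0, 1, 1]

tallying_pick = dial_choice(option_a, option_b, validities, gamma=0)
wadd_pick = dial_choice(option_a, option_b, validities, gamma=1)
ttb_pick = dial_choice(option_a, option_b, validities, gamma=20)
```

In this toy case the two compensatory settings pick option B, while the large-exponent setting flips to option A because only the top cue still carries meaningful weight. That is the sense in which the three rules can be read as positions on one dial.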

One Crosscoder Feature Flips a Stalling Chatbot Into a Working Agent

Source: Localizing RL-Induced Tool Use to a Single Crosscoder Feature [https://arxiv.org/abs/2606.26474]
Paper was published on June 25, 2026.
This episode was AI-generated on June 26, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Reinforcement learning spent a whole training run teaching a model to use tools, and it turns out you can find that skill, grab one internal feature, and flip the behavior on at runtime with no retraining at all. But the same evidence that says the skill lives in one place also shows it quietly leaking into a model that was never trained for it. This episode unpacks what RL actually localizes, where it lives, and why you can concentrate a capability but never fully wall it off.

KEY TAKEAWAYS
* Why a single 'dedicated' crosscoder feature, steered at inference time with no weight changes, can recover most of an RL model's tool-calling accuracy
* How just routing activations through the sparse dictionary and back raises tool correctness from 19% to ~50%, even though reconstruction quality barely predicts the gain
* The 'capability spillover' result: a frozen base model, never trained for tools, picks up tool selection (0% to ~7%) just by passing through the shared crosscoder, but never reproduces the tool-call syntax
* Why the exclusive feature shelf is a coffee filter, not a sealed sink: penalizing it degrades the RL model, proving the captured signal is load-bearing and leaky
* The honest limits: the +65 number comes from one best-performing cell on 40 prompts with a wide confidence band, and the DFC's advantage is legibility, not better performance
* Why the cleanest features are structural-template detectors, and why that may be exactly why a tool-calling skill concentrates into one dial when a messier capability might not

* 00:00 - Where does an RL skill actually live? Sets up the puzzle: RL visibly installs tool use, but no one can point to where in the network that capability physically lives.
* 02:34 - Reading the model's muddy scratchpad. Explains superposition and sparse dictionaries, the tools that separate a model's blended internal state back into named features.
* 04:26 - Bolting down the shelves: the DFC. Introduces the crosscoder and the Dedicated Feature Crosscoder, which forces features into RL-exclusive, base-exclusive, and shared bins.
* 07:13 - One master switch versus a fuse box. Walks through the saturation curve where one DFC feature hits the accuracy ceiling while the plain crosscoder needs 33 features.
* 09:29 - Feature 136 turns a hedger into an agent. The before-and-after example where steering a single feature produces a clean, correct tool call, and reveals the top features are template detectors.
* 11:03 - Why lossy reconstruction makes it better. The surprising finding that just routing activations through the dictionary and back boosts tool correctness, validated across 48 crosscoder variants.
* 13:09 - A frozen model catches the trick. Capability spillover: the untrained base model inherits tool selection through the shared decoder, but never the exact tool-call syntax.
* 15:10 - A coffee filter, not a sealed sink. Penalizing the exclusive shelf degrades the RL model, showing the capability is entangled in shared geometry and can be concentrated but never fully isolated.
* 18:22 - How soft is that headline number? The critique: the +65 estimate is a favorable draw on 40 prompts, the architecture comparison isn't significant, and 'capability' means propensity under one prompt.
* 22:08 - When your interpretability tool leaks. Why feature-level steering offers a gradient-free control handle for agents, but published diffing artifacts may themselves become a side channel that moves capability around.

RECOMMENDED READING
* Towards Monosemanticity: Decomposing Language Models With Dictionary Learning [https://transformer-circuits.pub/2023/monosemantic-features/index.html]: The Anthropic sparse-autoencoder work that grounds the episode's 'separate the mud back into named pigments' picture of superposition and single-meaning features.
* Sparse Crosscoders for Cross-Layer Features and Model Diffing [https://transformer-circuits.pub/2024/crosscoders/index.html]: The original crosscoder writeup that introduced the shared-dictionary model-diffing approach the episode's Dedicated Feature Crosscoder extends.
* Toy Models of Superposition [https://transformer-circuits.pub/2022/toy_model/index.html]: The foundational account of why a few-thousand-dimensional scratchpad packs far more concepts than dimensions, the entanglement the episode says makes perfect capability isolation impossible.
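For readers unfamiliar with "steering a feature at inference time with no weight changes," the idea can be sketched in a few lines. The notes do not describe the paper's exact procedure, so this shows only the common additive form of activation steering: push the hidden state along one feature's decoder direction and leave the model's weights untouched. The function name `steer`, the toy 4-dimensional vectors, and the strength value are all hypothetical; none of them come from the paper.

```python
import numpy as np

def steer(activations, decoder_direction, strength):
    """Add a scaled unit-norm feature direction to every activation row.

    Illustrative assumption: steering is a simple additive shift,
    h' = h + strength * d / ||d||. No model weights are modified.
    """
    h = np.asarray(activations, dtype=float)
    d = np.asarray(decoder_direction, dtype=float)
    unit = d / np.linalg.norm(d)
    return h + strength * unit

# Toy stand-ins: two token positions, hidden size 4 (made-up numbers).
hidden = np.zeros((2, 4))
feature_direction = np.array([0.0, 3.0, 0.0, 4.0])  # hypothetical decoder row

steered = steer(hidden, feature_direction, strength=5.0)

# How far each position now points along the feature direction.
projection = steered @ (feature_direction / np.linalg.norm(feature_direction))
```

In a real model the shift would be applied to a chosen layer's activations during the forward pass, and the strength would be tuned empirically. The sketch only shows why this acts as a gradient-free control handle: the change is a vector addition at runtime, not a training step.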

June 26, 2026 · 25 min
