How an AI Reviewer Learned to Stop Going Easy on AI Writing

Description

Source: The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators [https://arxiv.org/abs/2606.26294]
Paper was published on June 24, 2026.
This episode was AI-generated on June 26, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

An AI paper-reviewer was caught accepting machine-written papers nearly twice as often as human ones, and the researchers found a mechanical recipe to train that bias right out. The trick is letting the test itself evolve alongside the thing it grades, without the measurements turning to nonsense. It's a concrete proposal for how self-improving AI might escape the tiny island of coding and math where clean, fixed scoring exists.

KEY TAKEAWAYS
* Why recursive self-improvement only works where there's a cheap, trustworthy way to score output, and why a moving judge normally breaks that
* The 'controlled utility evolution' trick: freeze the judge inside an epoch, swap only at boundaries against fixed real-world 'anchor' data
* Why erasing old scores when a new judge takes over is the load-bearing step: without it, a stricter judge changes essentially nothing
* How the system trained self-preference bias out of an AI reviewer by trapping it with the exact papers that fooled it earlier
* The surprise that the proof grader's biggest gain came from getting less strict: learning calibration, not cruelty
* Where the skeptic wins: the whole framework is only as good as its imperfect anchor, and the writing results were never checked by a human

CHAPTERS
* 00:00 - The judge who changes nothing. A gymnastics-judging analogy sets up the paper's central trick: a stricter standard only counts if you wipe the old scores.
* 01:28 - The island self-improvement is stuck on. Why systems that improve themselves only work where there's a clean, cheap, trustworthy way to score output.
* 02:14 - Why a frozen judge fails you. The breeding-program setup explains stationarity, the Red Queen idea, and the three ways a fixed test goes wrong.
* 05:29 - Move the judge, break the stopwatch? How epochs, held-out anchors, and conservative scoring let the judge change without destroying the ability to measure progress.
* 08:17 - Wipe the board, re-rank the winners. Selective erasure is shown to be the entire mechanism: without deleting stale scores, a stricter judge reshuffles nothing.
* 10:56 - Does any of it make code better? Co-evolving a code reviewer alongside a coder yields higher success and fewer tokens, with improvements that help both roles at once.
* 12:51 - Catching the reviewer that gets fooled. Self-preference bias is measured cold, then trained out with an adversarial trap built across an epoch boundary.
* 15:55 - When the judge develops taste. The evaluators evolve their own rubrics from a one-line prompt, and the grader's biggest gain came from getting less strict, not harsher.
* 17:13 - Where the skeptic wins. The honest limits: everything rides on imperfect anchors, the writing was never read by humans, the proof results are thin, and the long-run guarantees are absent.
* 20:54 - Where the hard problem now lives. The reframe to take away (the test is just another agent that can be improved or biased) and the open fork on whether to let evaluators move at all.

RECOMMENDED READING
* Gödel Agent: A Self-Referential Framework for Agents Recursively Self-Improvement [https://arxiv.org/abs/2410.04444]: The self-improving agent lineage this episode's Red Queen Gödel Machine extends, where an agent rewrites its own code to get better.
* Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents [https://arxiv.org/abs/2505.22954]: The breeding-program-of-code framing the episode describes, where variants are scored, kept, and bred; the static-judge predecessor this paper reacts against.
* LLM Evaluators Recognize and Favor Their Own Generations [https://arxiv.org/abs/2404.13076]: Documents the self-preference bias that is the episode's central villain: AI judges going easy on AI-written text.
* Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [https://arxiv.org/abs/2306.05685]: The foundational LLM-as-a-judge paper behind the evaluator agents the episode's framework co-evolves and the biases it tries to train out.

One Crosscoder Feature Flips a Stalling Chatbot Into a Working Agent

Source: Localizing RL-Induced Tool Use to a Single Crosscoder Feature [https://arxiv.org/abs/2606.26474]
Paper was published on June 25, 2026.
This episode was AI-generated on June 26, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Reinforcement learning spent a whole training run teaching a model to use tools, and it turns out you can find that skill, grab one internal feature, and flip the behavior on at runtime with no retraining at all. But the same evidence that says the skill lives in one place also shows it quietly leaking into a model that was never trained for it. This episode unpacks what RL actually localizes, where it lives, and why you can concentrate a capability but never fully wall it off.

KEY TAKEAWAYS
* Why a single 'dedicated' crosscoder feature, steered at inference time with no weight changes, can recover most of an RL model's tool-calling accuracy
* How just routing activations through the sparse dictionary and back raises tool correctness from 19% to ~50%, even though reconstruction quality barely predicts the gain
* The 'capability spillover' result: a frozen base model, never trained for tools, picks up tool selection (0% to ~7%) just by passing through the shared crosscoder, but never reproduces the tool-call syntax
* Why the exclusive feature shelf is a coffee filter, not a sealed sink: penalizing it degrades the RL model, proving the captured signal is load-bearing and leaky
* The honest limits: the +65 number comes from one best-performing cell on 40 prompts with a wide confidence band, and the DFC's advantage is legibility, not better performance
* Why the cleanest features are structural-template detectors, and why that may be exactly why a tool-calling skill concentrates into one dial when a messier capability might not

CHAPTERS
* 00:00 - Where does an RL skill actually live? Sets up the puzzle: RL visibly installs tool use, but no one can point to where in the network that capability physically lives.
* 02:34 - Reading the model's muddy scratchpad. Explains superposition and sparse dictionaries, the tools that separate a model's blended internal state back into named features.
* 04:26 - Bolting down the shelves: the DFC. Introduces the crosscoder and the Dedicated Feature Crosscoder, which forces features into RL-exclusive, base-exclusive, and shared bins.
* 07:13 - One master switch versus a fuse box. Walks through the saturation curve where one DFC feature hits the accuracy ceiling while the plain crosscoder needs 33 features.
* 09:29 - Feature 136 turns a hedger into an agent. The before-and-after example where steering a single feature produces a clean, correct tool call, and reveals the top features are template detectors.
* 11:03 - Why lossy reconstruction makes it better. The surprising finding that just routing activations through the dictionary and back boosts tool correctness, validated across 48 crosscoder variants.
* 13:09 - A frozen model catches the trick. Capability spillover: the untrained base model inherits tool selection through the shared decoder, but never the exact tool-call syntax.
* 15:10 - A coffee filter, not a sealed sink. Penalizing the exclusive shelf degrades the RL model, showing the capability is entangled in shared geometry and can be concentrated but never fully isolated.
* 18:22 - How soft is that headline number? The critique: the +65 estimate is a favorable draw on 40 prompts, the architecture comparison isn't significant, and 'capability' means propensity under one prompt.
* 22:08 - When your interpretability tool leaks. Why feature-level steering offers a gradient-free control handle for agents, but published diffing artifacts may themselves become a side channel that moves capability around.

RECOMMENDED READING
* Towards Monosemanticity: Decomposing Language Models With Dictionary Learning [https://transformer-circuits.pub/2023/monosemantic-features/index.html]: The Anthropic sparse-autoencoder work that grounds the episode's 'separate the mud back into named pigments' picture of superposition and single-meaning features.
* Sparse Crosscoders for Cross-Layer Features and Model Diffing [https://transformer-circuits.pub/2024/crosscoders/index.html]: The original crosscoder writeup that introduced the shared-dictionary model-diffing approach the episode's Dedicated Feature Crosscoder extends.
* Toy Models of Superposition [https://transformer-circuits.pub/2022/toy_model/index.html]: The foundational account of why a few-thousand-dimensional scratchpad packs far more concepts than dimensions, the entanglement the episode says makes perfect capability isolation impossible.

June 26, 2026 · 25 min
