A Router That Beats the Frontier Models It Calls

Source: Sakana Fugu Technical Report [https://arxiv.org/abs/2606.21228]
Paper was published on June 19, 2026.
This episode was AI-generated on June 23, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A system whose only skill is deciding which top model to call for each piece of a problem manages to beat GPT, Claude, and Gemini (the very models it's calling) on some of the hardest benchmarks we have. The paper argues orchestration is a second scaling axis hiding in plain sight, one that could put frontier performance within reach of teams that can't afford to train a frontier model. We dig into how it works, what's genuinely surprising, and where the evidence gets uncomfortably thin.

KEY TAKEAWAYS
* Why frontier models have stopped being interchangeable, and how a learned router exploits that specialization model-by-model and even step-by-step
* What 'model merging at the behavioral level' means, and why combining closed models by behavior sidesteps the open-weights requirement of classic merging
* The surprising finding that a model's standalone benchmark score does not predict how well it performs inside a real coding harness
* How the heavy 'Ultra' system avoids 'orchestration collapse' by isolating agents within a workflow while sharing memory across workflows
* The credibility seam: where the evidence is rigorous the effect is small (a fraction of a percent), and where the effect is huge it leans on provider-reported baselines and hand-picked examples
* Why the orchestration-as-scaling-axis framing matters for export controls and the compute race even if the headline numbers are softer than claimed

* 00:00 - The contractor who never picks up a hammer
  The core analogy and the headline claim: a system that only decides which model to call beats every model it calls, without training anything new.
* 02:20 - Why no model is best at everything anymore
  The paper's starting observation that frontier models have specialized, and that the scaffold wrapped around a model matters as much as its weights.
* 04:07 - Merging behavior, not weights
  How combining models by behavior rather than weights lets Fugu mix closed models from different providers and absorb new ones without retraining.
* 05:35 - Two systems, one trip-up to avoid
  The distinction between the fast Fugu router that picks one worker per turn and the heavy Fugu-Ultra that writes whole free-form workflows.
* 07:29 - How do you teach a thing to pick?
  The training recipe: supervised fine-tuning on soft score distributions, evolutionary refinement on whole-task success, and reinforcement learning for Ultra.
* 10:53 - The benchmark score that lies to you
  The finding that standalone benchmark scores don't predict in-harness behavior, and the orchestration-collapse failure mode Ultra had to solve.
* 14:52 - Does the routing actually adapt?
  The evidence: Terminal Bench trajectories, builder-and-debugger workflows, a shifting aggregator role, and the pie charts proving domain-specific routing.
* 20:24 - Where the impressive thing gets weak
  The steelman critique: self-computed scores versus provider-reported baselines, selected illustrative wins, and the rigorous experiment showing the smallest effect.
* 23:42 - A second path to the frontier?
  Why orchestration as a scaling axis could distribute frontier capability beyond the biggest training runs, and the closing question for listeners.

RECOMMENDED READING
* Evolutionary Optimization of Model Merging Recipes [https://arxiv.org/abs/2403.13187]: The same lab's prior weight-level model-merging work that the episode explicitly contrasts with Fugu's behavioral merging of closed models.
* Mixture-of-Agents Enhances Large Language Model Capabilities [https://arxiv.org/abs/2406.04692]: The fixed-aggregator multi-agent approach the episode names as the direct foil to Fugu's adaptive, task-dependent synthesizer role.
* GPTSwarm: Language Agents as Optimizable Graphs [https://arxiv.org/abs/2402.16823]: Cited by the episode as prior multi-agent work whose fixed orchestration structure Fugu-Ultra's learned workflows aim to surpass.
* DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [https://arxiv.org/abs/2402.03300]: Introduces GRPO, the critic-free reinforcement learning method the episode describes for training Fugu-Ultra's workflow generation.

The Empty-Lake Proof: Why More Rollouts Stop Helping Reasoning Models

Source: Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning [https://arxiv.org/abs/2605.05262]
Paper was published on May 06, 2026.
This episode was AI-generated on June 23, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

On the hardest problems, throwing more independent attempts at a reasoning model is almost useless past a point, and a May 2026 paper proves it in two lines of arithmetic. Then it borrows a fifty-year-old combinatorics theorem to fix the problem, and watches the field's favorite folk hack, the entropy bonus, fall straight out of the math. You'll come away understanding why budget is a weak lever, why hardness is a strong one, and where the paper's 'provable' spine quietly bends.

KEY TAKEAWAYS
* Why GRPO's relative-scoring signal goes to exactly zero when a group of rollouts all agree, and why easy and hard problems both collapse that way
* The napkin-sized proof that useful mixed groups grow only linearly with budget while difficulty pushes against you exponentially; independent sampling flatlines near 45%
* How a 1978 submodularity theorem hands the authors a near-optimal greedy selector for free, instead of a hand-tuned heuristic
* Why the long-used entropy bonus turns out to be a forced consequence of the math, not a tuning knob, under a stated linearization
* The eyebrow-raiser where a hand-derived formula beats a neural network trained specifically to beat it
* Where the 'provable' claim is actually proven about a proxy score, and why the guarantee weakens precisely on deep, long-horizon problems

* 00:00 - Casting into a nearly empty lake
  Sets up the central metaphor and the paper's core claim that the waste in training reasoning models is structural, not just inefficient.
* 01:47 - When the learning signal goes to zero
  Explains how GRPO scores rollouts relative to their group, and why uniform groups, all right or all wrong, produce exactly zero learning.
* 03:50 - The proof you can check on a napkin
  Walks through the simple binomial argument showing budget grows usefulness only linearly while difficulty fights back exponentially, with brutal real numbers.
* 06:06 - What if the attempts shared a whiteboard?
  Introduces the pivot from independent sampling to growing a tree of attempts, and the hard question of which node to expand next.
* 07:43 - A fifty-year-old theorem does the work
  Defines submodularity and how Nemhauser's 1978 result guarantees a greedy selector is near-optimal, built from coverage, novelty, and contrast.
* 10:32 - The entropy bonus falls out of the math
  Derives the UUCB selection rule as three questions, showing the classic UCB exploration term and the entropy bonus appear as forced consequences, not hacks.
* 14:23 - Finding the fork in one math problem
  Works through a single competition problem where the selector lights up the exact strategy-switch node and lands a sharp learning signal flat GRPO would smear away.
* 17:20 - Four times the nudge, same budget
  Reports the benchmark results: InfoTree tracking the theoretical ceiling, roughly 4x gradient signal, an 11-point GAIA win, and the hand-derived formula beating a learned selector.
* 21:48 - A map of a slightly different city
  Delivers the steelman critique: the guarantee is proven about a proxy, the strongest wins are over the weakest baseline, and it strains on the deep, long-horizon trees that matter most.
* 25:10 - Why the reframe outlasts the method
  Lands the takeaway that the real contribution is turning a practitioner grumble into an impossibility result, and poses the closing question to the audience.

RECOMMENDED READING
* DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [https://arxiv.org/abs/2402.03300]: The paper that introduced GRPO, the group-relative training method whose collapse-on-hard-problems failure mode this episode is built around.
* An Analysis of Approximations for Maximizing Submodular Set Functions—I [https://doi.org/10.1007/BF01588971]: The 1978 Nemhauser–Wolsey–Fisher result the episode credits for the greedy near-optimality (the '63 percent') guarantee at the core of the selection rule.
* GAIA: a benchmark for General AI Assistants [https://arxiv.org/abs/2311.12983]: The web-search agent benchmark where the episode reports the method's biggest single win, useful for judging the eleven-point claim.
* Finite-time Analysis of the Multiarmed Bandit Problem [https://doi.org/10.1023/A:1013689704352]: The classic UCB exploration-bonus result that the episode shows reappearing, derived rather than bolted on, inside the tree-expansion selection rule.
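
For listeners who want to check the zero-signal point from the takeaways by hand, here is a minimal Python sketch. It is not the paper's code or its proof. It assumes the standard GRPO advantage (each reward standardized against its own group's mean and standard deviation) and, as a simplifying assumption of this sketch, treats rollouts as independent pass/fail attempts with a fixed success probability `p`. The function names are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own group, GRPO-style.

    eps is a small constant (an assumption of this sketch) that avoids
    division by zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def mixed_group_probability(p, group_size):
    """Chance that a group of independent rollouts is mixed.

    A group is "mixed" when it holds at least one success and at least
    one failure, which is the only case with a nonzero advantage.
    """
    return 1.0 - p ** group_size - (1.0 - p) ** group_size


if __name__ == "__main__":
    # Uniform groups carry no signal, whether all wrong or all right.
    print(group_relative_advantages([0, 0, 0, 0]))
    print(group_relative_advantages([1, 1, 1, 1]))
    # A mixed group does carry signal.
    print(group_relative_advantages([1, 0, 0, 0]))
    # On a hard problem, the mixed-group chance stays below group_size * p.
    for group_size in (4, 8, 16):
        print(group_size, mixed_group_probability(0.01, group_size))
```

Under these assumptions, a hard problem with small `p` gives a mixed-group chance of at most `group_size * p`, so doubling the number of rollouts at best doubles the chance of getting any signal at all. That is in the spirit of the "budget is a weak lever" point, though the paper's exact argument and constants may differ.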
