An AI Just Solved a 1996 Erdős Problem—and the Simplest Agent Won

Description

Source: Advancing Mathematics Research with AI-Driven Formal Proof Search [https://arxiv.org/abs/2605.22763]
Paper published on May 21, 2026.

This episode was AI-generated on May 22, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A Google DeepMind system autonomously cracked nine open Erdős problems, including one that sat unsolved for thirty years, for a few hundred dollars each, with proofs verified by the Lean compiler. The twist: the team's elaborate evolutionary search system was beaten on most problems by a twenty-line script that just iterates an LLM against a compiler. The implications for AI engineering go well beyond mathematics.

KEY TAKEAWAYS

* Why coupling an LLM to the Lean proof checker dissolves the trust problem in AI-generated mathematics, and where that guarantee actually ends
* How a 'Ralph loop' of LLM plus compiler plus retry matched a sophisticated evolutionary system with AlphaProof, tournament Elo ranking, and shared caches
* The actual proof idea behind Erdős problem 125, including how the irrationality of log(4)/log(3) gets weaponized to crush sumset density to zero
* How the agent surfaced a thirty-year-old ambiguity in Erdős's original problem statement just by being forced to commit to a formal reading
* Where the verification guarantee leaks: LLM judges scoring proof sketches reward confident-sounding hallucinated citations, biasing the search upstream of the compiler
* Why the selection bias in the problem set, the cost of failed runs, and the human work of formalization make the headline numbers less clean than they look

* 00:00 - The trust problem in AI-generated math
  Why plausible-looking LLM proofs have been economically useless to working mathematicians, and how Lean's compiler is supposed to fix that.
* 03:52 - The Ralph loop and the basic agent
  A walkthrough of Agent A, the embarrassingly simple LLM-plus-compiler-plus-retry setup that did most of the work.
* 07:44 - Inside Erdős 125
  The metronome intuition behind the density-zero proof and how the agent decomposes subgoals and delegates to AlphaProof.
* 11:37 - The fancy system that mostly didn't win
  Evolutionary search with Elo-ranked proof sketches, a shared cache, and AlphaProof calls, and why it only paid off on the hardest problems.
* 15:29 - The ambiguity-surfacing side effect
  How formalizing Erdős 125 and 741 forced long-standing imprecisions in the informal statements into the open.
* 19:21 - A geometric proof that feels like a magic trick
  Erdős 846 and the agent's translation of a collinearity problem into graph-theoretic Ramsey territory.
* 23:14 - Steelmanning the skeptics
  Selection bias in the problem set, hidden costs of failed runs, the heavy lifting humans do in formalization, and the hallucinated-citation failure mode.
* 27:06 - What actually changed
  How the bottleneck shifts from verifying proofs to verifying problem statements, and what the 'simple loops beat scaffolding' finding might mean beyond math.

RECOMMENDED READING

* AlphaEvolve: A coding agent for scientific and algorithmic discovery [https://arxiv.org/abs/2506.13131]
  The evolutionary search ancestor of the Agent C/D system discussed in the episode, providing context for the 'fancy scaffolding' that the basic Ralph loop ended up matching.
* Mathematical discoveries from program search with large language models (FunSearch) [https://doi.org/10.1038/s41586-023-06924-6]
  The original DeepMind work establishing LLM-driven search for new mathematical results, which the episode positions as the lineage that Agent D descends from.
* Solving olympiad geometry without human demonstrations (AlphaGeometry) [https://doi.org/10.1038/s41586-023-06747-5]
  A useful contrast to the episode's framing of olympiad problems as 'the easier version': it shows what tightly scaffolded, domain-specific provers achieved before frontier LLMs closed the gap.
* The Lean Mathematical Library (Mathlib) [https://arxiv.org/abs/1910.09336]
  The community formalization library whose maturity the episode credits as one of the four necessary ingredients for the paper's results.
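
The 'Ralph loop' named in the takeaways above (LLM plus compiler plus retry) can be pictured with a minimal sketch. This is an illustration only, not the paper's script: `ralph_loop`, `propose`, `check`, and `CheckResult` are hypothetical names, and the toy model and toy checker below stand in for a real LLM call and a real Lean invocation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    ok: bool          # True when the checker accepts the candidate proof
    errors: str = ""  # checker feedback handed back to the model on failure


def ralph_loop(
    statement: str,
    propose: Callable[[str, str], str],   # hypothetical LLM call: (statement, feedback) -> candidate
    check: Callable[[str], CheckResult],  # hypothetical checker, e.g. a wrapper around the Lean compiler
    max_attempts: int = 10,
) -> Optional[str]:
    """Iterate a model against a checker until a proof is accepted or the budget runs out."""
    feedback = ""
    for _ in range(max_attempts):
        candidate = propose(statement, feedback)
        result = check(candidate)
        if result.ok:
            return candidate
        feedback = result.errors  # the next attempt sees the checker's complaint
    return None


# Toy stand-ins so the sketch runs without a model or a Lean install.
def toy_propose(statement: str, feedback: str) -> str:
    # The first attempt leaves a gap; once it has seen an error, it closes the gap.
    return "by simp" if feedback else "by sorry"


def toy_check(candidate: str) -> CheckResult:
    if "sorry" in candidate:
        return CheckResult(ok=False, errors="declaration uses 'sorry'")
    return CheckResult(ok=True)


proof = ralph_loop("theorem demo : 1 + 1 = 2", toy_propose, toy_check)
print(proof)  # prints: by simp
```

The only thing the loop trusts is the checker's verdict; the model's output is treated as a guess until the checker accepts it.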

How a Two-Agent Trick Unlocked Large-Scale Training for Computer-Use Agents

Source: CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents [https://arxiv.org/abs/2605.25624]
Paper published on May 25, 2026.

This episode was AI-generated on May 26, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Computer-use agents have been stuck while math and code models have soared, and a new paper argues the bottleneck was never the algorithm but the missing data pipeline. The fix turns on one elegant design choice: put an information barrier between the AI that builds the training environment and the AI that writes the reward function. The result is the largest open verified dataset for GUI agents, big benchmark gains, and an unexpected behavior the agents picked up entirely on their own.

KEY TAKEAWAYS

* Why verifiable RL has scaled beautifully in math and code but stalled in computer-use agents, and why that's an environment problem, not an algorithm problem
* The Generator/Discriminator information-barrier trick that prevents AI-written reward functions from secretly checking the construction procedure instead of the task
* How 94 synthesized mock applications (Slack, Jira, Salesforce, EHR, etc.) get built and verified, with the choice of apps grounded in real software-usage data rather than convenience
* An ablation suggesting environment diversity is its own scaling axis: the same trajectory count spread across more environments performs meaningfully better
* An unprompted emergent behavior: trained agents learn which UI actions are safe to batch and which (like right-click) must stay atomic, cutting trajectory length 33–45% with no efficiency reward
* Where the paper's framing is hotter than its evidence: transfer to out-of-distribution benchmarks is modest, runs are single-seed, and reward functions verify end state only

* 00:00 - Why GUI agents are harder than math problems
  The structural reason computer-use training data is orders of magnitude smaller than math or code: each example requires a task, an executable environment, and a programmatic reward, all coupled and expensive.
* 03:31 - The information barrier between Generator and Discriminator
  The paper's central design move: keeping the agent that builds the environment separate from the one that writes the reward, so the reward describes the outcome rather than the construction procedure.
* 07:03 - The inner loop and the reward-hacking scanner
  How the iteration between agents converges on a verified tuple, and the six forbidden code patterns a static scanner catches before a tuple is accepted.
* 10:34 - Synthesizing 94 mock applications at scale
  Why real websites can't host RL training, how the team picks which apps to build using occupational and software-usage data, and how a uniform state API lets the same pipeline drop in across all of them.
* 14:06 - Training results and the data-vs-model tradeoff
  The Qwen MoE backbones, the GRPO-style algorithm, the OSWorld-Verified gains, and the striking finding that a 10x smaller trained model matches a much larger untrained one.
* 17:37 - Environment diversity as a separate scaling axis
  The ablation showing that spreading the same number of trajectories across more environments beats concentrating them, and what that implies about hidden ceilings in prior work.
* 21:09 - The emergent action-batching behavior
  How trained agents spontaneously learn to bundle predictable action sequences while keeping unpredictable ones (right-click, double-click) atomic, with no efficiency signal in the reward.
* 24:40 - Limitations and honest caveats
  Modest out-of-distribution transfer, simplified mocks that omit auth and failure modes, single-seed RL runs, and reward functions that check end state but not process.
* 28:12 - Why this paper reframes the agenda
  The broader shift from algorithmic cleverness to environment infrastructure, and why the information-barrier idea is likely to keep reappearing wherever AI agents generate training data for other AI agents.

RECOMMENDED READING

* OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [https://arxiv.org/abs/2404.07972]
  The OSWorld-Verified benchmark that the episode's headline results are measured against; useful context for what these GUI agents are actually being tested on.
* WebArena: A Realistic Web Environment for Building Autonomous Agents [https://arxiv.org/abs/2307.13854]
  The out-of-distribution browser benchmark used to test transfer in the paper; relevant to the episode's discussion of how much the trained skills actually generalize.
* DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [https://arxiv.org/abs/2501.12948]
  A canonical example of the verifiable-rewards RL recipe the episode argues is now being ported to GUI agents, with the same group-relative algorithm family.
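
The information barrier described above can be pictured with a small sketch. It is a toy under stated assumptions, not the paper's pipeline: every name here (`GeneratorOutput`, `DiscriminatorView`, `cross_barrier`, `write_reward`) is hypothetical, the "reward writer" is a hand-written function standing in for an LLM, and which fields actually cross the barrier in CUA-Gym is assumed rather than known.

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, bool]  # toy end state: ticket title -> archived flag


@dataclass(frozen=True)
class GeneratorOutput:
    task: str              # task description given to the agent
    initial_state: State   # environment state before the agent acts
    construction_log: str  # how the environment was built; must stay private


@dataclass(frozen=True)
class DiscriminatorView:
    task: str              # assumption: only the task text crosses the barrier


def cross_barrier(out: GeneratorOutput) -> DiscriminatorView:
    """Drop everything about construction before the reward writer sees it."""
    return DiscriminatorView(task=out.task)


def write_reward(view: DiscriminatorView) -> Callable[[State], bool]:
    """Toy reward writer: builds an end-state check from the task text alone."""
    target = view.task.split("'")[1]  # ticket title quoted in the task text
    return lambda end_state: end_state.get(target, False)


generated = GeneratorOutput(
    task="Archive the ticket titled 'Login bug'",
    initial_state={"Login bug": False},
    construction_log="(private setup steps)",
)
view = cross_barrier(generated)
reward = write_reward(view)

print(reward({"Login bug": True}))      # prints: True
print(reward(generated.initial_state))  # prints: False
```

Because the reward writer never receives `construction_log`, the check it produces can only refer to the outcome (the ticket ends up archived), not to how the environment was assembled.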

May 26, 2026 · 31 min
