An AI Just Solved a 1996 Erdős Problem—and the Simplest Agent Won

Description

Source: Advancing Mathematics Research with AI-Driven Formal Proof Search [https://arxiv.org/abs/2605.22763]
Paper was published on May 21, 2026.

This episode was AI-generated on May 22, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A Google DeepMind system autonomously cracked nine open Erdős problems, including one that sat unsolved for thirty years, for a few hundred dollars each, with proofs verified by the Lean compiler. The twist: the team's elaborate evolutionary search system was beaten on most problems by a twenty-line script that just iterates an LLM against a compiler. The implications for AI engineering go well beyond mathematics.

KEY TAKEAWAYS

* Why coupling an LLM to the Lean proof checker dissolves the trust problem in AI-generated mathematics, and where that guarantee actually ends
* How a 'Ralph loop' of LLM plus compiler plus retry matched a sophisticated evolutionary system with AlphaProof, tournament Elo ranking, and shared caches
* The actual proof idea behind Erdős problem 125, including how irrationality of log(4)/log(3) gets weaponized to crush sumset density to zero
* How the agent surfaced a thirty-year-old ambiguity in Erdős's original problem statement just by being forced to commit to a formal reading
* Where the verification guarantee leaks: LLM judges scoring proof sketches reward confident-sounding hallucinated citations, biasing the search upstream of the compiler
* Why the selection bias in the problem set, the cost of failed runs, and the human work of formalization make the headline numbers less clean than they look

* 00:00 - The trust problem in AI-generated math: Why plausible-looking LLM proofs have been economically useless to working mathematicians, and how Lean's compiler is supposed to fix that.
* 03:52 - The Ralph loop and the basic agent: A walkthrough of Agent A, the embarrassingly simple LLM-plus-compiler-plus-retry setup that did most of the work.
* 07:44 - Inside Erdős 125: The metronome intuition behind the density-zero proof and how the agent decomposes subgoals and delegates to AlphaProof.
* 11:37 - The fancy system that mostly didn't win: Evolutionary search with Elo-ranked proof sketches, a shared cache, and AlphaProof calls, and why it only paid off on the hardest problems.
* 15:29 - The ambiguity-surfacing side effect: How formalizing Erdős 125 and 741 forced long-standing imprecisions in the informal statements into the open.
* 19:21 - A geometric proof that feels like a magic trick: Erdős 846 and the agent's translation of a collinearity problem into graph-theoretic Ramsey territory.
* 23:14 - Steelmanning the skeptics: Selection bias in the problem set, hidden costs of failed runs, the heavy lifting humans do in formalization, and the hallucinated-citation failure mode.
* 27:06 - What actually changed: How the bottleneck shifts from verifying proofs to verifying problem statements, and what the 'simple loops beat scaffolding' finding might mean beyond math.

RECOMMENDED READING

* AlphaEvolve: A coding agent for scientific and algorithmic discovery [https://arxiv.org/abs/2506.13131] - The evolutionary search ancestor of the Agent C/D system discussed in the episode, providing context for the 'fancy scaffolding' that the basic Ralph loop ended up matching.
* Mathematical discoveries from program search with large language models (FunSearch) [https://doi.org/10.1038/s41586-023-06924-6] - The original DeepMind work establishing LLM-driven search for new mathematical results, which the episode positions as the lineage that Agent D descends from.
* Solving olympiad geometry without human demonstrations (AlphaGeometry) [https://doi.org/10.1038/s41586-023-06747-5] - A useful contrast to the episode's framing of olympiad problems as 'the easier version': it shows what tightly scaffolded, domain-specific provers achieved before frontier LLMs closed the gap.
* The Lean Mathematical Library (Mathlib) [https://arxiv.org/abs/1910.09336] - The community formalization library whose maturity the episode credits as one of the four necessary ingredients for the paper's results.
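
The 'Ralph loop' named in the key takeaways (LLM plus compiler plus retry) can be illustrated with a minimal Python sketch. This is not the paper's script: `ralph_loop`, `propose`, `check`, and the toy stand-ins below are hypothetical names, and the toy checker only rejects drafts containing `sorry` rather than calling Lean. In a real setup `propose` would presumably call an LLM and `check` would run the Lean compiler on the candidate.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class LoopResult:
    proof: Optional[str]  # accepted candidate, or None if the budget ran out
    attempts: int         # number of candidates submitted to the checker


def ralph_loop(
    statement: str,
    propose: Callable[[str, str], str],        # (statement, last feedback) -> candidate
    check: Callable[[str], Tuple[bool, str]],  # candidate -> (accepted, error message)
    max_attempts: int = 10,
) -> LoopResult:
    """Propose, check, feed the error back, and retry until accepted."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        candidate = propose(statement, feedback)
        ok, message = check(candidate)
        if ok:
            return LoopResult(candidate, attempt)
        feedback = message  # the next proposal sees the checker's complaint
    return LoopResult(None, max_attempts)


# Hypothetical stand-ins so the sketch runs without an LLM or Lean installed.
def toy_check(candidate: str) -> Tuple[bool, str]:
    if "sorry" in candidate:
        return False, "error: proof contains 'sorry'"
    return True, ""


def make_toy_proposer() -> Callable[[str, str], str]:
    drafts = iter([
        "theorem t : 1 + 1 = 2 := by sorry",
        "theorem t : 1 + 1 = 2 := by norm_num",
    ])

    def propose(statement: str, feedback: str) -> str:
        return next(drafts)

    return propose


result = ralph_loop("1 + 1 = 2", make_toy_proposer(), toy_check)
exhausted = ralph_loop("1 + 1 = 2", lambda s, f: "sorry", toy_check, max_attempts=3)
```

The only trusted component in this shape is `check`: nothing the proposer says is accepted unless the checker passes it, which is the property the episode attributes to pairing an LLM with a proof checker.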

Why Frozen-Weight Agents Still Get Worse Over Time

Source: Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems [https://arxiv.org/abs/2605.26302]
Paper was published on May 25, 2026.

This episode was AI-generated on May 27, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A deployed AI agent's model weights never change, but the agent itself ages, and it ages in at least four mechanistically distinct ways. A new paper introduces a diagnostic ladder that can tell those failure modes apart, and shows that a one-paragraph change to how an agent summarizes its own memory can extend its useful lifespan by more than four times.

KEY TAKEAWAYS

* Agent reliability is a lifespan property, not a benchmark snapshot: the memory store, retrieval, and compaction around a frozen model keep changing every session
* Four named failure modes: compression, interference, revision, and maintenance aging, split into accumulation-driven and event-driven families
* The counterfactual ladder: a three-rung diagnostic that isolates write failures, read failures, and utilization failures without needing model internals
* Three models with nearly identical error rates can have completely different underlying diseases, and 'add more memory' is the wrong fix for two of them
* A one-paragraph 'careful' compaction prompt that names what to preserve verbatim yields roughly a 4.5x lifespan improvement on the same system
* Production monitoring tends to track constraint compliance while missing silent precision decay: the agent stops violating rules but also stops knowing the specifics
* Scale doesn't fix structural problems: a small typed-state sidecar cuts running-balance error 25–50% with no model change

* 00:00 - Four vignettes, one puzzle: Four deployed-agent failures that the standard 'frozen weights = frozen system' mental model can't explain.
* 02:05 - Reframing reliability as a lifespan property: Why the apparatus around the model (memory, retrieval, compaction) is what actually changes over time.
* 04:10 - The four aging mechanisms: Compression, interference, revision, and maintenance aging, and why they split into accumulation-driven and event-driven families.
* 06:30 - The counterfactual ladder: A three-rung diagnostic that isolates write, read, and utilization failures by progressively swapping in oracle components.
* 08:20 - Same score, different disease: Empirical results showing models with near-identical error rates can have completely different failure breakdowns under the ladder.
* 10:25 - The 4.5x compaction-prompt result: How a one-paragraph change to summarization instructions extends agent half-life dramatically on the same underlying system.
* 12:30 - Silent precision decay: Why constraint-violation monitoring stays green while the agent quietly forgets the specifics it was supposed to remember.
* 14:35 - Why scale doesn't save the running budget: A small and a large model both drift on arithmetic over a session history because the failure is representational, not capacity-bound.
* 16:41 - Honest critique: Synthetic scenarios, simple memory architectures, and short session horizons, and what the paper's numbers can and can't tell us.
* 18:46 - Production CLI agents and re-reading: Findings from Claude Code and OpenHands on why correct answers correlate with more retrieval, and why flagship models can write lower-fidelity artifacts.
* 20:51 - The sticky note fix: A small typed-state overlay alongside normal memory that cuts accumulator error substantially without changing the model.

RECOMMENDED READING

* MemGPT: Towards LLMs as Operating Systems [https://arxiv.org/abs/2310.08560] - Proposes a hierarchical memory system with explicit paging between context and external storage, directly relevant to the episode's argument that the fix for agent aging is structural, not bigger models.
* Lost in the Middle: How Language Models Use Long Contexts [https://arxiv.org/abs/2307.03172] - Empirical evidence that models fail to utilize information even when it's present in context: the 'utilization failure' rung of the episode's counterfactual ladder.
* Generative Agents: Interactive Simulacra of Human Behavior [https://arxiv.org/abs/2304.03442] - The Park et al. paper that popularized reflection-and-summarization memory architectures, exactly the kind of compaction-based stack whose aging dynamics this episode dissects.
* Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks [https://arxiv.org/abs/2005.11401] - The original RAG paper, useful background for the episode's distinction between write failures, retrieval failures, and utilization failures in memory-augmented agents.
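
The 'typed-state sidecar' from the key takeaways can be pictured as a small structured record kept next to the agent's free-text memory and updated by ordinary code rather than by summarization. The sketch below is only an illustration under that assumption: `BudgetSidecar`, its fields, and the `[state]` line format are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List, Tuple


@dataclass
class BudgetSidecar:
    """Hypothetical typed state: exact running balance plus an audit ledger."""
    balance: Decimal = Decimal("0")
    ledger: List[Tuple[str, Decimal]] = field(default_factory=list)

    def apply(self, label: str, amount: str) -> Decimal:
        # Arithmetic happens here, in code, so it cannot drift the way a
        # repeatedly re-summarized natural-language balance can.
        delta = Decimal(amount)
        self.balance += delta
        self.ledger.append((label, delta))
        return self.balance

    def render(self) -> str:
        # One compact line to place in the prompt alongside normal memory.
        return f"[state] running_balance={self.balance} entries={len(self.ledger)}"


sidecar = BudgetSidecar()
sidecar.apply("initial budget", "500.00")
sidecar.apply("flight", "-212.40")
sidecar.apply("hotel", "-189.99")
prompt_prefix = sidecar.render()
print(prompt_prefix)  # [state] running_balance=97.61 entries=3
```

Using `Decimal` instead of floats keeps the balance exact, and the ledger makes each update auditable. Whether the paper's sidecar stores a ledger at all is not stated in these notes.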
