AI Papers: A Deep Dive
HOW AN OPEN AI SYSTEM VERIFIED 672 HARD MATH PROOFS FOR UNDER $300

Source: Goedel-Architect: Streamlining Formal Theorem Proving with Blueprint Generation and Refinement [https://arxiv.org/abs/2606.06468]
Paper was published on June 04, 2026.

This episode was AI-generated on June 5, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

An open-weight AI verified machine-checked proofs across an entire benchmark of Putnam-level math problems for $294, while a comparable system reportedly spent around $170,000 on a single run. The gap of more than 500-fold doesn't come from a fancier model; it comes from a single architectural bet about how the system is organized. We dig into the mechanism that makes failure useful, and where the headline number quietly leans on a closed model whispering hints.

KEY TAKEAWAYS

* Why formal proving with Lean makes checking free and automatic, and why a single false sub-claim can make a system burn compute forever on something that was never true
* The blueprint approach: sketch the whole proof as a graph of inter-dependent lemmas, attack pieces in parallel, and repair the plan while keeping everything that already worked, in contrast to the recursive-decomposition method that dives down one branch and gets stuck
* The two-channel failure mechanism: a 'negation channel' that produces compiler-verified counterexamples (it caught its own false sub-claims on 292 of 672 problems) and a 'forfeit channel' that writes a structured post-mortem telling the next round how to split the problem
* Why the refinement loop scales log-linearly: each doubling of compute buys a steady, predictable gain, making cost a tunable dial rather than a coin flip
* Where the headline numbers get fuzzy: the controlled same-model comparison is partly missing on the hardest set, and the record-breaking scores rely on natural-language hints sometimes seeded by a stronger closed model
* How contamination testing on post-cutoff olympiad problems argues the system is reasoning, not regurgitating

* 00:00 - The $294 headline and the ground rules. The team verified an entire 672-problem benchmark for under $300 versus a reported $170,000, and the hosts establish that this is an AI-generated show discussing the Goedel-Architect paper.
* 03:13 - What formal proof actually means. Why Lean turns a proof into a program the compiler judges automatically, and why a single false sub-claim makes the whole proof uncompilable: the trap that wastes compute on roads that don't exist.
* 06:26 - Recursion versus the blueprint. The contrast between depth-first recursive decomposition that gets stuck on one branch and the blueprint approach that lays out the whole strategy as a graph, attacks pieces in parallel, and preserves proven work during repair.
* 09:39 - Failure that hands you a plan: the two channels. The negation channel that proves a sub-claim false with a verified counterexample (the constant-zero function case study) and the forfeit channel that writes a resignation memo on how to split the problem (the Putnam 1985 case study).
* 12:52 - The scaling curve and the cost breakdown. How 16 refinement passes climb from 30% to 76% along a log-linear curve, and how that turns into roughly 44 cents per problem against the competitor's $244.
* 16:05 - Asterisks on the comparison. The controlled same-model experiment that supports the architectural claim, the missing head-to-head row on the hardest problems, and the closed-model natural-language hints behind the record-breaking scores.
* 19:18 - Contamination, candor, and the olympiad gap. Testing on post-training-cutoff problems to rule out memorization, the honest four-of-six olympiad result against a geometry-specialized competitor, and the paper's careful separation of autonomous from seeded numbers.
* 22:31 - Two kinds of significance. The immediate accessibility win that puts top-tier formal proving within reach of independent researchers, and the more speculative transferable idea of systems that fail informatively across domains beyond math.

RECOMMENDED READING

* Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs [https://arxiv.org/abs/2210.12283]: The Draft-Sketch-Prove lineage the episode explicitly credits as the ancestor of this paper's refinable natural-language seeding.
* DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search [https://arxiv.org/abs/2408.08152]: A leading open-weight Lean prover that uses compiler feedback and tree search, offering a concrete point of contrast to the episode's blueprint-versus-recursion framing.
* Solving olympiad geometry without human demonstrations (AlphaGeometry) [https://doi.org/10.1038/s41586-023-06747-5]: The kind of specialized geometry engine the episode cites as the reason the system loses the one IMO problem it misses, showing why dedicated tools still beat general provers there.
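
A quick way to sanity-check the cost and scaling figures quoted in these notes is to redo the arithmetic. The sketch below is only an illustration, not anything taken from the paper: it uses the numbers stated above ($294, 672 problems, $170,000, 30% to 76% over 16 passes) and assumes, purely for the sake of the example, that the 30% figure corresponds to a single pass and that the curve is exactly linear in log2 of the pass count. The variable names and the `projected_rate` helper are hypothetical.

```python
import math

# Figures quoted in the episode notes.
TOTAL_COST_USD = 294           # full benchmark run, open-weight system
NUM_PROBLEMS = 672             # benchmark size
COMPETITOR_RUN_USD = 170_000   # reported cost of a comparable system's single run
START_RATE = 0.30              # solve rate at the start of refinement
END_RATE = 0.76                # solve rate after 16 refinement passes
END_PASSES = 16

# Cost per problem and the ratio between the two runs.
cost_per_problem = TOTAL_COST_USD / NUM_PROBLEMS
cost_ratio = COMPETITOR_RUN_USD / TOTAL_COST_USD

# ASSUMPTION (not from the paper): 30% is the rate at 1 pass, and the curve is
# exactly rate = START_RATE + gain_per_doubling * log2(passes).
doublings = math.log2(END_PASSES)
gain_per_doubling = (END_RATE - START_RATE) / doublings


def projected_rate(passes: int) -> float:
    """Hypothetical interpolation between the two quoted endpoints."""
    return START_RATE + gain_per_doubling * math.log2(passes)


if __name__ == "__main__":
    print(f"cost per problem: ${cost_per_problem:.2f}")
    print(f"cost ratio: about {cost_ratio:.0f}x")
    print(f"gain per doubling of passes: {gain_per_doubling:.1%}")
    for passes in (1, 2, 4, 8, 16):
        print(f"{passes:>2} passes -> {projected_rate(passes):.1%}")
```

Under these assumptions, each doubling of refinement passes adds the same number of percentage points, which is what "log-linear" means here: the next fixed gain always costs twice as much compute as the previous one. The intermediate values printed for 2, 4, and 8 passes are interpolations, not results reported in the paper.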