How an Open AI System Verified 672 Hard Math Proofs for Under $300

Beschrijving

HOW AN OPEN AI SYSTEM VERIFIED 672 HARD MATH PROOFS FOR UNDER $300 Source: Goedel-Architect: Streamlining Formal Theorem Proving with Blueprint Generation and Refinement [https://arxiv.org/abs/2606.06468] Paper was published on June 04, 2026 This episode was AI-generated on June 5, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs. An open-weight AI verified machine-checked proofs across an entire benchmark of Putnam-level math problems for $294 — while a comparable system reportedly spent around $170,000 for a single run. The 500-fold gap doesn't come from a fancier model; it comes from a single architectural bet about how the system is organized. We dig into the mechanism that makes failure useful, and where the headline number quietly leans on a closed model whispering hints. KEY TAKEAWAYS * Why formal proving with Lean makes checking free and automatic — and why a single false sub-claim can make a system burn compute forever on something that was never true * The blueprint approach: sketch the whole proof as a graph of inter-dependent lemmas, attack pieces in parallel, and repair the plan while keeping everything that already worked — versus the recursive-decomposition method that dives down one branch and gets stuck * The two-channel failure mechanism: a 'negation channel' that produces compiler-verified counterexamples (it caught its own false sub-claims on 292 of 672 problems) and a 'forfeit channel' that writes a structured post-mortem telling the next round how to split the problem * Why the refinement loop scales log-linearly — steady, predictable gains that require doubling compute, making cost a tunable dial rather than a coin flip * Where the headline numbers get fuzzy: the controlled same-model comparison is partly missing on the hardest set, and the record-breaking scores rely on natural-language hints sometimes seeded by a stronger closed model * How contamination testing on post-cutoff olympiad problems argues the system is reasoning, not regurgitating * 00:00 — The $294 headline and the ground rules The team verified an entire 672-problem benchmark for under $300 versus a reported $170,000 — and the hosts establish this is an AI-generated show discussing the Goedel-Architect paper. * 03:13 — What formal proof actually means Why Lean turns a proof into a program the compiler judges automatically, and why a single false sub-claim makes the whole proof uncompilable — the trap that wastes compute on roads that don't exist. * 06:26 — Recursion versus the blueprint The contrast between depth-first recursive decomposition that gets stuck on one branch and the blueprint approach that lays out the whole strategy as a graph, attacks pieces in parallel, and preserves proven work during repair. * 09:39 — Failure that hands you a plan: the two channels The negation channel that proves a sub-claim false with a verified counterexample (the constant-zero function case study) and the forfeit channel that writes a resignation memo on how to split the problem (the Putnam 1985 case study). * 12:52 — The scaling curve and the cost breakdown How 16 refinement passes climb from 30% to 76% along a log-linear curve, and how that turns into roughly 44 cents per problem against the competitor's $244. * 16:05 — Asterisks on the comparison The controlled same-model experiment that supports the architectural claim, the missing head-to-head row on the hardest problems, and the closed-model natural-language hints behind the record-breaking scores. * 19:18 — Contamination, candor, and the olympiad gap Testing on post-training-cutoff problems to rule out memorization, the honest four-of-six olympiad result against a geometry-specialized competitor, and the paper's careful separation of autonomous from seeded numbers. * 22:31 — Two kinds of significance The immediate accessibility win that puts top-tier formal proving within reach of independent researchers, and the more speculative transferable idea of systems that fail informatively across domains beyond math. RECOMMENDED READING * Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs [https://arxiv.org/abs/2210.12283] — The Draft-Sketch-Prove lineage the episode explicitly credits as the ancestor of this paper's refinable natural-language seeding. * DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search [https://arxiv.org/abs/2408.08152] — A leading open-weight Lean prover that uses compiler feedback and tree search, offering a concrete point of contrast to the episode's blueprint-versus-recursion framing. * Solving olympiad geometry without human demonstrations (AlphaGeometry) [https://doi.org/10.1038/s41586-023-06747-5] — The kind of specialized geometry engine the episode cites as the reason the system loses the one IMO problem it misses, showing why dedicated tools still beat general provers there.

Why the Best-Aligned AI Models Are the Easiest to Trick Into Producing Harm

WHY THE BEST-ALIGNED AI MODELS ARE THE EASIEST TO TRICK INTO PRODUCING HARM Source: Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack [https://arxiv.org/abs/2606.05614] Paper was published on June 04, 2026 This episode was AI-generated on June 5, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs. A new paper argues that the sharper a language model's judgment about what counts as harmful, the more reliably a single cheap prompt can make it produce that exact harm — and the correlation across thirty models is nearly a straight line. Worse, sorting those models by release date shows the field walking its best work toward maximal exploitability, one alignment improvement at a time. But the episode also finds a hopeful crack in the thesis: for some models, making them think before answering drives the attack to zero. KEY TAKEAWAYS * How the 'Posterior Attack' works: instead of fighting a model's reluctance, it asks the model to act as a safety classifier and produce an example of content it would flag — laundering the harm through a framing the model treats as safety work * Why it's a real threat: one black-box query, no gradient access, about three cents — versus the dollars-and-hours cost of heavyweight gradient or iterative-rewrite attacks * The eerie correlation across thirty models (Pearson ~0.80): the better a model is at judging harm, the more exploitable it is — and that diagonal is also a timeline pointing at the frontier * The one-sentence theory: attack success equals baseline odds of harm multiplied by the safety classifier's sharpness — so improving safety judgment directly inflates the attack, with a perfect classifier converging on guaranteed exploitation * The causal proof and its limits: using reinforcement learning as a scalpel to move only the safety-judgment 'fader' on small models flips vulnerability up and down — but that experiment can't run on GPT-5 or Claude, so the frontier claim stays correlational * The honest complication: test-time reasoning drives the attack to zero on some models (GPT-OSS) by reasoning back to the rule, does nothing for Claude Sonnet 4.6, and makes one Qwen model worse — suggesting the paradox may belong to reflexive guards, not to safety knowledge itself * 00:00 — The paradox and the Posterior Attack Introduces the core claim that sharper harm-recognition makes models more exploitable, and walks through how the single-query 'museum guard' attack tricks a model into generating forbidden content as a classifier example. * 02:54 — Why this attack is different and cheap Contrasts the Posterior Attack's one-shot, black-box, three-cents-per-query cost against the dollars and GPU-hours of gradient-optimization and iterative-rewrite jailbreaks. * 05:48 — The thirty-model correlation and its timeline Lays out the near-straight diagonal between safety-classifier accuracy and attack success, the frontier numbers up near 99%, and the unsettling fact that the same upgrades that defeat older attacks open this one wider. * 08:43 — The one-line math behind it Explains, without notation, how attack odds equal baseline harm odds times classifier sharpness, why the relationship is monotonic, and why a perfect classifier converges on guaranteed exploitation. * 11:37 — Using reinforcement learning as a scalpel Describes the controlled experiment that moves only a model's safety-judgment 'fader' up or down while holding capability fixed, showing vulnerability rises and falls in lockstep. * 14:32 — Where the evidence runs out Registers the honest caveats: the causal proof only runs on small models, attack success is graded by LLM judges, and the scope is English-only without production-level guardrails. * 17:26 — The defense that complicates the thesis Examines test-time reasoning, which drives the attack to zero on some models by reasoning back to the rule but fails on Claude Sonnet 4.6 and backfires on a Qwen model, suggesting the paradox may be a property of reflexive guards. * 20:21 — What it all means Frames the takeaway as an intellectual shift — awareness and vulnerability as the same quantity — and the open question of whether the fix is a smarter shield or a slower, more deliberate one. RECOMMENDED READING * Universal and Transferable Adversarial Attacks on Aligned Language Models [https://arxiv.org/abs/2307.15043] — The gradient-optimization jailbreak (GCG) the episode contrasts against — the 'dollars and hours' baseline that the cheap single-query Posterior Attack is positioned against. * Deliberative Alignment: Reasoning Enables Safer Language Models [https://arxiv.org/abs/2412.16339] — The 'reason back to the rule' approach the episode credits for driving the attack to zero on GPT-OSS models, central to the defense complication the hosts dwell on. * Jailbroken: How Does LLM Safety Training Fail? [https://arxiv.org/abs/2307.02483] — A foundational analysis of why safety training fails that frames jailbreaks as attacks on the model's reluctance — the exact framing the Posterior Attack departs from. * Llama 2: Open Foundation and Fine-Tuned Chat Models [https://arxiv.org/abs/2307.09288] — Documents the safety alignment recipe for several of the open models plotted along the episode's vulnerability-vs-awareness diagonal, including the low-exploitability Llama 2.

9 jun 202623 min

How an Open AI System Verified 672 Hard Math Proofs for Under $300

Beschrijving

Reacties

Probeer 14 dagen gratis

Alle afleveringen