AI Papers: A Deep Dive
WHEN YOUR CODING AGENT LIES ABOUT THE FIX: VERIFYING THE PLAN BEFORE THE MODEL RUNS

Source: Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory [https://arxiv.org/abs/2606.06523]
Paper was published on June 02, 2026

This episode was AI-generated on June 9, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

When an agent confidently reports a bug fixed while the tests still fail, you usually can't tell whether the model was too weak or the plan was broken from step one. This paper argues a huge share of agent failure is plan failure, and that plans, unlike models, are formal objects you can check before you ever spend a dollar running them. Workflows that pass that check beat the ones that fail by roughly twelve percent, and the gains are biggest for the cheap models you'd actually want to deploy.

KEY TAKEAWAYS

* Why much of what looks like model failure is actually specification failure: a broken plan a bigger model will just execute more confidently
* How encoding an agent's workflow as a typed graph in Lean4 lets a machine prove the plan is coherent before execution, using Hoare-style preconditions and postconditions
* The ablation showing 13 of 21 failing workflows were caught only by whole-graph, cross-step checks that no local inspection or LLM judge would catch
* Why workflow verification helps weak models most: one small model jumped 27% between a passing and failing plan, because it can't improvise around a broken one
* How an LLM-as-judge baseline scored a failing workflow 8/10 and a passing one 0/10, exactly backwards on the relational defects that matter
* The honest limits: the whole pipeline rests on an unproven assumption that the model annotates and executes each step correctly, and the headline numbers come from very small samples

* 00:00 - Model failure versus plan failure
  The cold-open problem (an agent declares a bug fixed while tests fail) and why diagnosing whether the model or the plan broke leads to opposite fixes.
* 02:39 - The mathematics rhyme: from natural-language proofs to formal checking
  How proof assistants like Coq and Lean replaced fallible human review with machine type-checking, and why the same move applies to treating an agent's plan as source code.
* 05:19 - Workflows as typed graphs and the three layers of checking
  Encoding steps, data flow, and reads/writes as a typed graph in Lean4, with structural linting as the least interesting first layer.
* 07:59 - Layer two: contracts and the relay-race handoff
  Using Hoare-logic preconditions and postconditions to verify every step's promises cover the next step's needs, with real bugs like dropped parallel results and schema mismatches.
* 10:38 - The LLMExec assumption and why layer three exists
  Confronting the axiom that each step does its local job correctly, and how runtime trajectory checking localizes which exact step broke its contract.
* 13:18 - Closing the loop with LeanEvolve
  Walking a real Django bug end-to-end, where a localized contract violation triggers a targeted rewrite of one step's instruction that passes on the next run.
* 15:58 - The numbers, and the surprise about weak models
  Benchmark gains on hard SWE-Bench problems and expert paper questions, plus the finding that cheaper models benefit most from a verified plan.
* 18:38 - Formal verifier versus LLM judge
  A head-to-head where a state-of-the-art judge scored workflows exactly backwards, showing why eyeball review misses cross-step and information-flow defects.
* 21:17 - The critique: circularity, small samples, and what 'verified' really means
  The risk that the same kind of model writes the annotations being checked, the thin sample sizes behind headline claims, and why 'verified' means coherent, not correct.

RECOMMENDED READING

* SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [https://arxiv.org/abs/2310.06770]
  The benchmark behind the episode's hard fifty-problem software slice, defining the long-horizon, over-an-hour bug-fixing tasks where the paper claims plan failure dominates.
* ReAct: Synergizing Reasoning and Acting in Language Models [https://arxiv.org/abs/2210.03629]
  A foundational account of how agents interleave reasoning, planning, and tool use; useful background for why the episode separates a checkable 'workflow' from stochastic 'execution.'
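
A ROUGH SKETCH OF THE CONTRACT HANDOFF

For listeners who want something concrete for the "relay-race handoff" idea from the 07:59 chapter, here is a minimal Python sketch. It is not the paper's Lean4 encoding. It only illustrates the general Hoare-style check that every step's preconditions are covered by what earlier steps promised, before anything runs. The `Step` class, the `check_workflow` function, and the step and key names (`localize`, `patch`, `test`, `suspect_files`, `diff`) are all hypothetical, and the workflow is assumed to be a simple linear chain rather than a full graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    requires: frozenset  # precondition: keys that must exist before the step runs
    provides: frozenset  # postcondition: keys the step promises to write


def check_workflow(steps, initial=frozenset()):
    """Return (step name, missing keys) for each step whose precondition is uncovered."""
    available = set(initial)
    violations = []
    for step in steps:
        missing = step.requires - available
        if missing:
            violations.append((step.name, sorted(missing)))
        # Assume the step keeps its promise; this mirrors the "each step does
        # its local job" assumption the episode discusses.
        available |= step.provides
    return violations


good = [
    Step("localize", frozenset({"issue"}), frozenset({"suspect_files"})),
    Step("patch", frozenset({"suspect_files"}), frozenset({"diff"})),
    Step("test", frozenset({"diff"}), frozenset({"test_report"})),
]

# Same plan, but the patch step no longer promises a diff to hand off.
broken = [good[0], Step("patch", frozenset({"suspect_files"}), frozenset()), good[2]]

print(check_workflow(good, {"issue"}))    # []
print(check_workflow(broken, {"issue"}))  # [('test', ['diff'])]
```

The broken plan is flagged at the `test` step even though each step looks reasonable on its own, which is the flavor of cross-step defect the episode says local inspection tends to miss. A real verifier would also need to handle branching, parallel steps, and typed schemas, none of which this sketch attempts.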