AI Papers: A Deep Dive

The OS Trick That Makes Tree Search Practical for Coding Agents

26 min · May 23, 2026

Description

Source: DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback [https://arxiv.org/abs/2605.22781]
Paper was published on May 21, 2026.
This episode was AI-generated on May 22, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Almost nobody runs Monte Carlo tree search on real coding agents, even though it could add 30 points of accuracy on SWE-bench. The reason isn't the models: sandbox checkpoint and rollback take seconds, and a new paper from Shanghai Jiao Tong and Huawei closes that gap with a couple of clever OS tricks that hide checkpointing inside the LLM call you were already waiting on.

KEY TAKEAWAYS
* Why agent capability gaps are sometimes OS limits, not model limits, and how DeltaBox closes a 30-point accuracy gap on SWE-bench by making checkpoint/rollback cheap
* How DeltaFS hijacks OverlayFS plus XFS reflinks to version a filesystem at runtime without ever duplicating unchanged data
* The fork() + CRIU combination that gives you 5-millisecond rollback by keeping a frozen 'body double' of the process with almost no memory cost
* The inference-masking trick: hiding 15ms of checkpoint work inside the 1-20 second LLM call the agent was already waiting on
* Why RL training GPU utilization jumps from about 51% to 99% when you replace shutil.copytree with forked sandbox templates
* Where the design might creak: very large processes, faster LLM inference shrinking the masking window, and side effects that can't be rolled back

* 00:00 - The capability gap tree search leaves on the floor
  Why MCTS adds 5-30 points of SWE-bench accuracy but almost nobody deploys it, and the 1.5-second-per-rollback OS cost that explains why.
* 02:59 - The diary and the room: why checkpointing is hard
  Framing the core requirement that filesystem and process memory must be captured and restored atomically or tree search breaks.
* 05:59 - DeltaFS and the stack of acetate sheets
  How the paper coerces OverlayFS into swapping layers at runtime and uses XFS reflinks so storage cost tracks actual edits.
* 08:59 - DeltaCR: fork() as a frozen body double
  Combining CRIU dumps with a stopped, copy-on-write fork to get 5ms restores while keeping a durable disk-based safety net.
* 11:58 - Inference-masking: cooking while the microwave runs
  Why hiding the 15ms checkpoint inside the LLM round-trip is what makes the architecture practical rather than just clever.
* 14:58 - End-to-end SWE-bench results
  DeltaBox brings tree-search trajectory time to within 3-6% of the pure-LLM floor, versus 1.9x-4.3x for Firecracker and CubeSandbox.
* 17:58 - The RL training story: 51% to 99% GPU utilization
  How the same fork-based template mechanism eliminates the sandbox setup idle time that wastes half a GPU during synchronous RL.
* 20:57 - Steelman critiques and where the design might creak
  Honest pushback on process-size scaling, dependence on slow LLM inference, network side effects, MCTS-specific GC, and a reconstructed CubeSandbox baseline.
* 23:57 - The bigger reframe: OS substrates for agent workloads
  Why this work fits a broader pattern of co-designing decades-old kernel primitives for high-frequency agent state, not just human users.

RECOMMENDED READING
* SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [https://arxiv.org/abs/2310.06770]
  The benchmark the episode repeatedly anchors to when discussing the five-to-thirty-point accuracy gains tree search unlocks for coding agents.
* ReAct: Synergizing Reasoning and Acting in Language Models [https://arxiv.org/abs/2210.03629]
  The linear agent loop the episode frames as the default that exists partly because richer OS-level branching was too expensive; useful context for why DeltaBox's substrate matters.
* Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models [https://arxiv.org/abs/2310.04406]
  A concrete instantiation of the MCTS-style agent search that the episode argues was theoretically attractive but practically blocked by sandbox overhead.
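
The inference-masking idea from the notes above (start the checkpoint, then wait on the LLM call that was going to block anyway) can be illustrated with a minimal Python sketch. This is not DeltaBox's implementation: `take_checkpoint`, `call_llm`, and `step_with_masking` are hypothetical stand-ins, the sleeps are placeholders for real work, and a thread pool is used only to show the overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def take_checkpoint(state):
    """Hypothetical stand-in for a sandbox checkpoint (about 15 ms in the episode)."""
    time.sleep(0.015)
    return dict(state)  # a shallow copy plays the role of the saved snapshot


def call_llm(prompt):
    """Hypothetical stand-in for the LLM round-trip (1-20 s in the episode; shortened here)."""
    time.sleep(0.05)
    return f"action for: {prompt}"


def step_with_masking(state, prompt):
    """Run the checkpoint in the background while the agent waits on the LLM."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(take_checkpoint, state)  # checkpoint starts first
        action = call_llm(prompt)                      # the wait that was unavoidable anyway
        snapshot = pending.result()                    # normally finished long before the LLM
    return action, snapshot


action, snapshot = step_with_masking({"cwd": "/repo"}, "fix the failing test")
```

Under these assumptions the step takes roughly as long as the slower of the two operations instead of their sum, which is why the window shrinks if LLM inference gets much faster.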

All episodes

165 episodes


Why Better Bug Reports Can Make AI Coding Agents Worse

Source: SHERLOC: Structured Diagnostic Localization for Code Repair Agents [https://arxiv.org/abs/2606.24820]
Paper was published on June 23, 2026.
This episode was AI-generated on June 24, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Hand a capable AI coding agent a more accurate report of where a bug lives, and it can fix fewer bugs than with nothing at all. This episode digs into SHERLOC, a paper arguing the field has been scoring localization like a search engine when what actually matters is the diagnosis, and shows where the impressive numbers stop being deployable.

KEY TAKEAWAYS
* Why AI coding agents spend roughly 48% of their turns and over 320,000 tokens just locating a bug before writing any fix
* How SHERLOC reframes localization from 'find the right file' to a structured five-field diagnostic case file
* Why a single setting (thinking mode off) collapses the same model from 74% recall to 10%, with 87% of runs producing no valid output
* The capability-dependent transfer finding: weak repair agents gain 8-12 points, while strong agents can lose ground when fed findings indiscriminately
* Why a low-quality diagnosis (20% resolve rate) drags an agent below the 62% baseline of having no report at all
* The two honest limits: the quality filter relies on the ground-truth patch and isn't deployable, and ~58% of recall may come from memorized famous libraries

* 00:00 - The taxi meter that never stops
  Sets up the counterintuitive finding and the headline cost: agents burn roughly half their compute just locating bugs before fixing anything.
* 02:47 - Red circle versus the written report
  Introduces the core reframe: a bare file path is underspecified, and SHERLOC instead emits a structured five-field diagnostic finding.
* 05:12 - One setting flips everything
  Explains SHERLOC's training-free design, its four-tool menu and self-recovery layer, and the dramatic collapse when reasoning mode is turned off.
* 09:09 - Can the underdog beat the specialists?
  Covers SHERLOC's state-of-the-art benchmark results and how structure substitutes for both scale and specialized training.
* 10:25 - Does it just remember Django?
  Introduces the contamination worry and the masking gauntlet used to estimate how much performance comes from real exploration versus memorization.
* 12:12 - The map that distracts the cabbie
  Presents capability-dependent transfer and the result that bad diagnoses drag agents below their no-report baseline.
* 16:43 - The filter you can't actually ship
  The steelman critique: the quality filter peeks at the ground-truth patch, contamination remains unresolved, and the best numbers come at heavy serving cost in one language.
* 21:25 - What actually survives the critique
  Lands on the durable reframe that diagnosis quality, not location accuracy, predicts repair success, and poses the closing question to listeners.

RECOMMENDED READING
* SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [https://arxiv.org/abs/2310.06770]
  The benchmark this episode's results are measured on: the real-GitHub-bug-plus-fixing-PR dataset whose Lite and Verified splits SHERLOC tops and whose contamination problems the hosts dwell on.
* SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [https://arxiv.org/abs/2405.15793]
  The agent-framework lineage behind the repair agents SHERLOC injects case files into, and the source of the 'don't let models run arbitrary shells or they derail' design lesson the episode cites.

23 min · June 24, 2026

When a One-Liner Beats Your Agent's Clever Verification Logic

Source: Bayesian control for coding agents [https://arxiv.org/abs/2606.24453]
Paper was published on June 23, 2026.
This episode was AI-generated on June 24, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Your coding agent has to decide whether to pay for an eleven-minute test or just ship, and a new paper turns that gut call into a single computable number. But the surprising part is how much effort it spends telling you exactly when its own Bayesian machinery is dead weight. We map out the three regimes that decide whether careful reasoning beats a dumb if-statement.

KEY TAKEAWAYS
* The exact break-even line for running an expensive verifier: verify only when your belief that the code is correct crosses cost divided by reward
* Why a syntax checker carries zero signal, and how the Bayesian update figures that out on its own without hand-tuning
* The three-region map: verify everything when checking is cheap, gate on one near-oracle test in the middle, and reason carefully only when verification is expensive and critics are imperfect
* Why the headline 'plus sixty-two over always-verify' is soft: it's measured against a known-bad baseline, in a replay (not live) evaluation, and ignores the upfront cost of calibrating from oracle calls
* How the controller's running belief doubles as a portable confidence score (0.87 ranking, rising to 0.91 on hard problems) you can bolt onto any agent
* The whole gain comes from frozen models and a smarter control layer: no training, no fine-tuning

* 01:42 - The agent that's really a toolbox
  Reframes a coding agent as a generator wrapped in a menu of tools, from a free syntax check to an eleven-minute oracle, with wildly lopsided costs and reliabilities.
* 03:06 - Why fixed rules ignore what matters
  Argues that always-verify, best-of-N, and hard-coded refinement loops all ignore uncertainty, and proposes treating the control layer like a diagnostician ordering tests.
* 04:10 - The whole idea in one breath
  Lays out the core move: carry a running belief that the code will pass, let cheap critics nudge it, and act to maximize reward minus the costs you rack up.
* 06:36 - The one equation worth doing
  Derives the break-even threshold (verify when belief crosses cost-over-reward) and shows how that ratio plus the prior pass rate become the two axes of the map.
* 08:25 - How a critic moves the needle
  Explains via Bayes' rule why a critic's value is the gap between how it treats correct versus broken code, why syntax checks are useless, and how mediocre critics compose.
* 11:17 - Three regions, and only one is interesting
  Walks through the two-axis map: verify everything when checking is cheap, gate on a near-oracle test in the middle, and reason carefully only in the costly top-left corner.
* 15:40 - How much of plus-sixty-two is real?
  The steelman critique: the headline margin beats a known-bad baseline, the evaluation replays pre-collected patches rather than generating live, and calibration hides an upfront oracle bill.
* 20:42 - A confidence score you can bolt on anywhere
  Shows the belief state works as a well-calibrated, training-free confidence signal that beats sequence probability and perplexity, and gets better on hard problems.

RECOMMENDED READING
* Self-Refine: Iterative Refinement with Self-Feedback [https://arxiv.org/abs/2303.17651]
  One of the named refinement agents the episode benchmarks against; it formalizes the generate-critique-regenerate loop the paper argues ignores uncertainty.
* Reflexion: Language Agents with Verbal Reinforcement Learning [https://arxiv.org/abs/2303.11366]
  The verbal-memory refinement agent that, in the episode's expensive-verification regime, actually went negative against doing nothing clever.
* SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [https://arxiv.org/abs/2310.06770]
  The real-GitHub-issue benchmark whose patches gave the episode its eleven-minute test-suite telemetry and the region-A headline numbers.
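
The two mechanics these notes describe, a critic nudging a running belief via Bayes' rule and the cost-over-reward break-even test, can be sketched in a few lines of Python. This is one reading of the show notes, not the paper's code: the function names, the pass rates, and the cost and reward figures are all illustrative assumptions.

```python
def bayes_update(prior, p_pass_if_correct, p_pass_if_broken, passed):
    """Update the belief that the code is correct after one critic verdict.

    The two pass rates are assumed to be known from calibration; the values
    used below are made up for illustration.
    """
    like_correct = p_pass_if_correct if passed else 1.0 - p_pass_if_correct
    like_broken = p_pass_if_broken if passed else 1.0 - p_pass_if_broken
    evidence = prior * like_correct + (1.0 - prior) * like_broken
    if evidence == 0.0:
        return prior  # verdict was impossible under both hypotheses; leave belief alone
    return prior * like_correct / evidence


def should_verify(belief, verify_cost, reward):
    """Break-even rule from the notes: verify only when belief exceeds cost / reward."""
    return belief > verify_cost / reward


# A critic that passes correct and broken code equally often (like a syntax
# check that nearly everything passes) leaves the belief where it started.
uninformative = bayes_update(0.4, 0.9, 0.9, passed=True)

# A critic that treats correct and broken code differently moves the belief.
informative = bayes_update(0.4, 0.9, 0.3, passed=True)

# Hypothetical numbers: an 11-unit verifier against a 100-unit reward.
worth_it = should_verify(informative, verify_cost=11.0, reward=100.0)
```

The first call shows why a critic with equal pass rates carries no signal: the likelihood ratio is 1, so the posterior equals the prior. What moves the belief is the gap between the two pass rates, not how often the critic passes code overall.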

25 min · June 24, 2026

When Turning Experience Into Code Makes Your AI Agent Dumber

Source: Metis: Bridging Text and Code Memory for Self-Evolving Agents [https://arxiv.org/abs/2606.24151]
Paper was published on June 23, 2026.
This episode was AI-generated on June 24, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

An AI agent that distilled its hard-won experience into reusable code scored ten points worse than an agent with no memory at all. This episode unpacks why the sophisticated-looking move (freezing lessons into callable tools) is also the fragile one, and what the right fix turns out to be. You'll come away understanding the single most basic decision in building agents that learn on the job: when a lesson should stay as soft advice, and when it's earned the right to become code.

KEY TAKEAWAYS
* Why storing an agent's experience as callable code can drop it below an agent with no memory at all: a 22-point collapse the moment it has to generalize
* The 'injection asymmetry': text is consumed as adaptable advice you filter through reality, while code is a trusted black box whose flaws propagate to every caller and suppress the agent's own recovery behavior
* Metis's 'text first, code earned' policy: sorting experience into plans, facts, and pitfalls, and crystallizing only recurring plans into tools using the desire-path principle
* Why the codifier deliberately never reads the messy trajectory, building tools from the clean query pattern instead, and how that lets even failed runs safely count toward codification
* The ablation that proves the recurrence gate: an 'Eager' version cost 47% more to build, scored worse, and left over half its tools never invoked
* Where the clean story has a seam: the headline result is really about ungated, trajectory-trained, unvalidated code on a single benchmark, not a law that 'code memory is bad'

* 01:57 - The brilliant employee with amnesia
  Frames the core problem: stateless agents lose everything they figure out, and the field hasn't examined how lessons should be stored.
* 03:01 - Text advice or a black-box tool?
  Lays out the fork between storing lessons as adaptable text versus callable code, and why the real difference is how the agent consumes each.
* 04:50 - The experiment that fixed every variable
  Describes the clean diagnostic on AppWorld, splitting executor and reflector models, and measuring construction cost, execution efficiency, and transfer reliability.
* 08:43 - The 22-point collapse
  Reveals the headline reversal: code memory looks great in-sample but collapses 22 points under realistic streaming, dropping below the no-memory baseline.
* 10:06 - Why the confident tool fails hard
  Explains the injection asymmetry through the coworker analogy and why trusted code suppresses an agent's own self-correction.
* 13:07 - Paving only the paths people walk
  Walks through Metis's three design choices (the plans/facts/pitfalls taxonomy, the recurrence gate, and query-only codification) using the desire-path analogy.
* 18:13 - Does the machinery actually pay off?
  Tests the predictions: Metis is more accurate and cheaper at once, and the Eager ablation proves the recurrence gate is a quality filter.
* 21:41 - The seam in the clean story
  The steelman critique: the real claim is about ungated, trajectory-trained code on a single benchmark, with the genuine edge limited to distribution shift.
* 24:30 - Don't pour the concrete too early
  Draws out the durable lesson (store knowledge in a form that follows its properties) and poses the closing question to listeners.

RECOMMENDED READING
* ReAct: Synergizing Reasoning and Acting in Language Models [https://arxiv.org/abs/2210.03629]
  The think-act-observe loop the episode names as the baseline floor every memory variant in Metis is measured against.
* AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents [https://arxiv.org/abs/2407.18901]
  The exact 457-API simulated benchmark all of the episode's accuracy and token numbers are run on.
* Voyager: An Open-Ended Embodied Agent with Large Language Models [https://arxiv.org/abs/2305.16291]
  The canonical 'agent builds a reusable skill library of callable code' approach this episode's text-first-code-earned policy is implicitly arguing against.
* Generative Agents: Interactive Simulacra of Human Behavior [https://arxiv.org/abs/2304.03442]
  A contrasting take where agent experience is stored and retrieved as natural-language memory, the 'soft advice' side of the episode's text-versus-code fork.

26 min · June 24, 2026

How Teaching an AI to Predict, Not Act, Made It a Better Actor

Source: Qwen-AgentWorld: Language World Models for General Agents [https://arxiv.org/abs/2606.24597]
Paper was published on June 23, 2026.
This episode was AI-generated on June 24, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Researchers trained a model to do one thing (guess what a computer would say back) with zero acting, no tool calls, no clicking. Then it got better at every multi-step agent task they threw at it, including a function-calling benchmark whose data it had never seen. The bet: prediction and action are the same muscle, and the field has only been training one side of it.

KEY TAKEAWAYS
* Why a model trained only to predict environment responses, never to act, transfers measurably into better agent behavior, with prediction accuracy rising from 70% to 78%
* The three-stage recipe (pre-train injects, fine-tune activates, RL sharpens) and how the reward function had to be redesigned to stop the model from flattering its own AI judge
* How a steered simulator beat a live search engine for training (50.3% vs 45.6%) by deliberately handing back partial answers: the 'stingy teacher' effect
* Why training agents inside entirely fictional worlds (a 2030 Mars colony) made them better at real search without contaminating their knowledge
* Where the marketing outruns the evidence: a sub-half-point frontier win, a fifth-place GUI ranking, an AI judge with a documented exploit, and a 'beats reality' claim resting on a single comparison
* Why environments, not model size, are the real bottleneck in agent training, and how a learnable simulator could unshackle it

* 00:00 - Two muscles or one?
  Sets up the central puzzle: a model trained only to predict, never to act, becoming a better actor across every task.
* 01:09 - The half of the loop nobody trained
  Explains the policy/world-model split, the theory that general agents must contain a world model, and why environments are the field's real bottleneck.
* 03:02 - Turning seven worlds into one problem
  How representing terminals, phones, and web pages all as text lets one model learn to be any environment under a single objective.
* 04:39 - Outsmarting a model that cheats the grader
  Walks through the three-stage training pipeline, the self-praise reward hack, and the clever loss-masking trick for boilerplate turns.
* 10:08 - Is the headline as big as it sounds?
  Examines the benchmark results (a razor-thin frontier margin versus a clean eight-point win over their own base model) plus the cross-domain transfer effect.
* 13:42 - When a fake world beats the real one
  The decoupled paradigm: training agents inside fictional worlds and against a steered simulator that beat a live search engine.
* 17:38 - Prediction with no acting in it
  The unified paradigm: a single-turn, tool-free warm-up that lifts agent performance on all seven multi-turn benchmarks, demonstrated with the Postfix mail server case.
* 20:59 - Where the marketing runs ahead
  Finn's three-part critique: the thin headline win, the gameable AI judge, and the 'beats reality' claim resting on a single narrow comparison.
* 24:14 - What survives the harshest read
  The lasting contribution (prediction as a trainable foundation skill that transfers to action) and what it could change about agent-training economics.

RECOMMENDED READING
* Robust agents learn causal world models [https://arxiv.org/abs/2402.10877]
  The Richens et al. result the episode cites as its theoretical spine, proving that any agent generalizing across enough tasks must have learned a world model.
* A Path Towards Autonomous Machine Intelligence [https://openreview.net/forum?id=BZ5a1r-kVsf]
  LeCun's manifesto for predict-before-you-act agents, the 'old vision' the episode invokes when explaining the unify paradigm where the agent simulates consequences before committing to an action.

26 min · June 24, 2026

A Router That Beats the Frontier Models It Calls

Source: Sakana Fugu Technical Report [https://arxiv.org/abs/2606.21228]
Paper was published on June 19, 2026.
This episode was AI-generated on June 23, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A system whose only skill is deciding which top model to call for each piece of a problem manages to beat GPT, Claude, and Gemini (the very models it's calling) on some of the hardest benchmarks we have. The paper argues orchestration is a second scaling axis hiding in plain sight, one that could put frontier performance within reach of teams that can't afford to train a frontier model. We dig into how it works, what's genuinely surprising, and where the evidence gets uncomfortably thin.

KEY TAKEAWAYS
* Why frontier models have stopped being interchangeable, and how a learned router exploits that specialization model-by-model and even step-by-step
* What 'model merging at the behavioral level' means, and why combining closed models by behavior sidesteps the open-weights requirement of classic merging
* The surprising finding that a model's standalone benchmark score does not predict how well it performs inside a real coding harness
* How the heavy 'Ultra' system avoids 'orchestration collapse' by isolating agents within a workflow while sharing memory across workflows
* The credibility seam: where the evidence is rigorous the effect is small (a fraction of a percent), and where the effect is huge it leans on provider-reported baselines and hand-picked examples
* Why the orchestration-as-scaling-axis framing matters for export controls and the compute race even if the headline numbers are softer than claimed

* 00:00 - The contractor who never picks up a hammer
  The core analogy and the headline claim: a system that only decides which model to call beats every model it calls, without training anything new.
* 02:20 - Why no model is best at everything anymore
  The paper's starting observation that frontier models have specialized, and that the scaffold wrapped around a model matters as much as its weights.
* 04:07 - Merging behavior, not weights
  How combining models by behavior rather than weights lets Fugu mix closed models from different providers and absorb new ones without retraining.
* 05:35 - Two systems, one trip-up to avoid
  The distinction between the fast Fugu router that picks one worker per turn and the heavy Fugu-Ultra that writes whole free-form workflows.
* 07:29 - How do you teach a thing to pick?
  The training recipe: supervised fine-tuning on soft score distributions, evolutionary refinement on whole-task success, and reinforcement learning for Ultra.
* 10:53 - The benchmark score that lies to you
  The finding that standalone benchmark scores don't predict in-harness behavior, and the orchestration-collapse failure mode Ultra had to solve.
* 14:52 - Does the routing actually adapt?
  The evidence: Terminal Bench trajectories, builder-and-debugger workflows, a shifting aggregator role, and the pie charts proving domain-specific routing.
* 20:24 - Where the impressive thing gets weak
  The steelman critique: self-computed scores versus provider-reported baselines, selected illustrative wins, and the rigorous experiment showing the smallest effect.
* 23:42 - A second path to the frontier?
  Why orchestration as a scaling axis could distribute frontier capability beyond the biggest training runs, and the closing question for listeners.

RECOMMENDED READING
* Evolutionary Optimization of Model Merging Recipes [https://arxiv.org/abs/2403.13187]
  The same lab's prior weight-level model-merging work that the episode explicitly contrasts with Fugu's behavioral merging of closed models.
* Mixture-of-Agents Enhances Large Language Model Capabilities [https://arxiv.org/abs/2406.04692]
  The fixed-aggregator multi-agent approach the episode names as the direct foil to Fugu's adaptive, task-dependent synthesizer role.
* GPTSwarm: Language Agents as Optimizable Graphs [https://arxiv.org/abs/2402.16823]
  Cited by the episode as prior multi-agent work whose fixed orchestration structure Fugu-Ultra's learned workflows aim to surpass.
* DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [https://arxiv.org/abs/2402.03300]
  Introduces GRPO, the critic-free reinforcement learning method the episode describes for training Fugu-Ultra's workflow generation.

26 min · Yesterday