The OS Trick That Makes Tree Search Practical for Coding Agents

Description

Source: DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback [https://arxiv.org/abs/2605.22781]
Paper was published on May 21, 2026.
This episode was AI-generated on May 22, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Almost nobody runs Monte Carlo tree search on real coding agents, even though it could add up to 30 points of accuracy on SWE-bench. The reason isn't the models; it's that sandbox checkpoint and rollback take seconds. A new paper from Shanghai Jiao Tong and Huawei closes that gap with a couple of clever OS tricks that hide checkpointing inside the LLM call you were already waiting on.

KEY TAKEAWAYS
* Why agent capability gaps are sometimes OS limits, not model limits, and how DeltaBox closes a 30-point accuracy gap on SWE-bench by making checkpoint/rollback cheap
* How DeltaFS hijacks OverlayFS plus XFS reflinks to version a filesystem at runtime without ever duplicating unchanged data
* The fork() + CRIU combination that gives you 5-millisecond rollback by keeping a frozen 'body double' of the process with almost no memory cost
* The inference-masking trick: hiding 15ms of checkpoint work inside the 1-20 second LLM call the agent was already waiting on
* Why RL training GPU utilization jumps from about 51% to 99% when you replace shutil.copytree with forked sandbox templates
* Where the design might creak: very large processes, faster LLM inference shrinking the masking window, and side effects that can't be rolled back

* 00:00 - The capability gap tree search leaves on the floor: Why MCTS adds 5-30 points of SWE-bench accuracy but almost nobody deploys it, and the 1.5-second-per-rollback OS cost that explains why.
* 02:59 - The diary and the room: why checkpointing is hard. Framing the core requirement that filesystem and process memory must be captured and restored atomically or tree search breaks.
* 05:59 - DeltaFS and the stack of acetate sheets: How the paper coerces OverlayFS into swapping layers at runtime and uses XFS reflinks so storage cost tracks actual edits.
* 08:59 - DeltaCR: fork() as a frozen body double. Combining CRIU dumps with a stopped, copy-on-write fork to get 5ms restores while keeping a durable disk-based safety net.
* 11:58 - Inference-masking: cooking while the microwave runs. Why hiding the 15ms checkpoint inside the LLM round-trip is what makes the architecture practical rather than just clever.
* 14:58 - End-to-end SWE-bench results: DeltaBox brings tree-search trajectory time to within 3-6% of the pure-LLM floor, versus 1.9x-4.3x for Firecracker and CubeSandbox.
* 17:58 - The RL training story: 51% to 99% GPU utilization. How the same fork-based template mechanism eliminates the sandbox setup idle time that wastes half a GPU during synchronous RL.
* 20:57 - Steelman critiques and where the design might creak: Honest pushback on process-size scaling, dependence on slow LLM inference, network side effects, MCTS-specific GC, and a reconstructed CubeSandbox baseline.
* 23:57 - The bigger reframe: OS substrates for agent workloads. Why this work fits a broader pattern of co-designing decades-old kernel primitives for high-frequency agent state, not just human users.

RECOMMENDED READING
* SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [https://arxiv.org/abs/2310.06770] - The benchmark the episode repeatedly anchors to when discussing the five-to-thirty-point accuracy gains tree search unlocks for coding agents.
* ReAct: Synergizing Reasoning and Acting in Language Models [https://arxiv.org/abs/2210.03629] - The linear agent loop the episode frames as the default that exists partly because richer OS-level branching was too expensive; useful context for why DeltaBox's substrate matters.
* Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models [https://arxiv.org/abs/2310.04406] - A concrete instantiation of the MCTS-style agent search that the episode argues was theoretically attractive but practically blocked by sandbox overhead.
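
The inference-masking idea from the takeaways can be illustrated with a small Python sketch. This is not DeltaBox's implementation: `fake_llm_call`, `fake_checkpoint`, and `masked_step` are hypothetical stand-ins that only sleep, and the latencies are made-up values chosen to keep the demo short. The sketch shows only the scheduling idea, which is to start the checkpoint in the background and then block on the slower LLM call, so the step costs roughly the LLM latency rather than the sum of both.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_llm_call(prompt: str, latency_s: float) -> str:
    """Hypothetical stand-in for a slow LLM round-trip; no real API is called."""
    time.sleep(latency_s)
    return f"reply to: {prompt}"


def fake_checkpoint(step: int, cost_s: float) -> str:
    """Hypothetical stand-in for a sandbox checkpoint; nothing is actually saved."""
    time.sleep(cost_s)
    return f"ckpt-{step}"


def masked_step(prompt: str, step: int,
                llm_latency_s: float = 0.3, ckpt_cost_s: float = 0.15):
    """Run the checkpoint in a background thread while waiting on the LLM call.

    The sandbox is idle while the agent waits for the model, so the checkpoint
    captures the same state it would have captured if run before the call.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as pool:
        ckpt_future = pool.submit(fake_checkpoint, step, ckpt_cost_s)
        reply = fake_llm_call(prompt, llm_latency_s)  # foreground wait
        ckpt_id = ckpt_future.result()                # usually already done
    elapsed = time.perf_counter() - start
    return reply, ckpt_id, elapsed


if __name__ == "__main__":
    reply, ckpt_id, elapsed = masked_step("fix the bug", step=7)
    print(reply, ckpt_id, f"{elapsed:.2f}s")
```

With these assumed latencies the step should take close to the LLM latency alone. The same sketch also hints at the critique raised in the episode: as the LLM call gets faster, the window available to hide the checkpoint shrinks.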

When Turning Experience Into Code Makes Your AI Agent Dumber

Source: Metis: Bridging Text and Code Memory for Self-Evolving Agents [https://arxiv.org/abs/2606.24151]
Paper was published on June 23, 2026.
This episode was AI-generated on June 24, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

An AI agent that distilled its hard-won experience into reusable code scored ten points worse than an agent with no memory at all. This episode unpacks why the sophisticated-looking move, freezing lessons into callable tools, is also the fragile one, and what the right fix turns out to be. You'll come away understanding the single most basic decision in building agents that learn on the job: when a lesson should stay as soft advice, and when it's earned the right to become code.

KEY TAKEAWAYS
* Why storing an agent's experience as callable code can drop it below an agent with no memory at all: a 22-point collapse the moment it has to generalize
* The 'injection asymmetry': text is consumed as adaptable advice you filter through reality, while code is a trusted black box whose flaws propagate to every caller and suppress the agent's own recovery behavior
* Metis's 'text first, code earned' policy: sorting experience into plans, facts, and pitfalls, and crystallizing only recurring plans into tools using the desire-path principle
* Why the codifier deliberately never reads the messy trajectory, building tools from the clean query pattern instead, and how that lets even failed runs safely count toward codification
* The ablation that proves the recurrence gate: an 'Eager' version cost 47% more to build, scored worse, and left over half its tools never invoked
* Where the clean story has a seam: the headline result is really about ungated, trajectory-trained, unvalidated code on a single benchmark, not a law that 'code memory is bad'

* 01:57 - The brilliant employee with amnesia: Frames the core problem: stateless agents lose everything they figure out, and the field hasn't examined how lessons should be stored.
* 03:01 - Text advice or a black-box tool? Lays out the fork between storing lessons as adaptable text versus callable code, and why the real difference is how the agent consumes each.
* 04:50 - The experiment that fixed every variable: Describes the clean diagnostic on AppWorld, splitting executor and reflector models, and measuring construction cost, execution efficiency, and transfer reliability.
* 08:43 - The 22-point collapse: Reveals the headline reversal: code memory looks great in-sample but collapses 22 points under realistic streaming, dropping below the no-memory baseline.
* 10:06 - Why the confident tool fails hard: Explains the injection asymmetry through the coworker analogy and why trusted code suppresses an agent's own self-correction.
* 13:07 - Paving only the paths people walk: Walks through Metis's three design choices (the plans/facts/pitfalls taxonomy, the recurrence gate, and query-only codification) using the desire-path analogy.
* 18:13 - Does the machinery actually pay off? Tests the predictions: Metis is more accurate and cheaper at once, and the Eager ablation proves the recurrence gate is a quality filter.
* 21:41 - The seam in the clean story: The steelman critique: the real claim is about ungated, trajectory-trained code on a single benchmark, with the genuine edge limited to distribution shift.
* 24:30 - Don't pour the concrete too early: Draws out the durable lesson (store knowledge in a form that follows its properties) and poses the closing question to listeners.

RECOMMENDED READING
* ReAct: Synergizing Reasoning and Acting in Language Models [https://arxiv.org/abs/2210.03629] - The think-act-observe loop the episode names as the baseline floor every memory variant in Metis is measured against.
* AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents [https://arxiv.org/abs/2407.18901] - The exact 457-API simulated benchmark all of the episode's accuracy and token numbers are run on.
* Voyager: An Open-Ended Embodied Agent with Large Language Models [https://arxiv.org/abs/2305.16291] - The canonical 'agent builds a reusable skill library of callable code' approach this episode's text-first-code-earned policy is implicitly arguing against.
* Generative Agents: Interactive Simulacra of Human Behavior [https://arxiv.org/abs/2304.03442] - A contrasting take where agent experience is stored and retrieved as natural-language memory, the 'soft advice' side of the episode's text-versus-code fork.
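
The 'text first, code earned' policy described in the takeaways can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Metis's code: the class `TextFirstMemory`, its `record` method, and the promotion threshold of 3 are all hypothetical. Every lesson is stored as text; only a plan whose query pattern keeps recurring crosses the gate and is marked for codification, and the gate looks only at the query pattern, never at the trajectory.

```python
from collections import Counter


class TextFirstMemory:
    """Toy recurrence gate: lessons stay text until a plan pattern recurs."""

    def __init__(self, promote_after: int = 3):
        # promote_after is an assumed threshold, not a value from the paper.
        self.promote_after = promote_after
        self.notes = []               # (kind, query_pattern, note) kept as text
        self.plan_counts = Counter()  # how often each plan pattern was seen
        self.tools = set()            # patterns that earned codification

    def record(self, kind: str, query_pattern: str, note: str) -> bool:
        """Store a lesson as text; return True only when a plan is promoted."""
        self.notes.append((kind, query_pattern, note))
        if kind != "plan":
            return False              # facts and pitfalls always stay as text
        self.plan_counts[query_pattern] += 1
        recurring = self.plan_counts[query_pattern] >= self.promote_after
        if recurring and query_pattern not in self.tools:
            self.tools.add(query_pattern)
            return True
        return False


if __name__ == "__main__":
    mem = TextFirstMemory(promote_after=3)
    for attempt in range(4):
        promoted = mem.record("plan", "pay bill", f"attempt {attempt}")
        print(attempt, promoted)
```

Because the counter is keyed on the query pattern alone, a failed run still counts toward recurrence without its messy trajectory ever reaching the tool, which mirrors the point made in the takeaways.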

June 24, 2026 · 26 min
