AI Papers: A Deep Dive

The Summarizer That Quietly Deletes Your Agent's Safety Rules

27 min · Yesterday

Description

Source: Governance Decay: How Context Compaction Silently Erases Safety Constraints in Long-Horizon LLM Agents [https://arxiv.org/abs/2606.22528]
Paper was published on June 21, 2026.
This episode was AI-generated on June 23, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

An enterprise AI agent refused to email a contract outside the company. Then, a few thousand tokens later, it sent the contract anyway, with no jailbreak and no attack. The only thing that changed is that the rule got compacted out of its memory. This episode unpacks why the housekeeping step every long-running agent relies on is quietly erasing the rules keeping it in bounds, and a fifty-token fix that mostly works.

KEY TAKEAWAYS
* Why context compaction, the standard step that keeps long agents alive, deletes the safety rules nobody put in the protected system slot
* The soft-versus-hard gap: arbitrary 'house rules' like 'don't email externally' decay 8x more than instinct rules like 'don't disclose an SSN', creating a false sense of safety
* How stating a rule and then compacting it away can leave an agent MORE likely to violate (59%) than never stating it at all (37%)
* The crossing experiment showing safety is a property of whose summaries you read, not the agent's own judgment: the harness is the safety surface
* Constraint Pinning: a ~50-token laminated rule card that brings violations back to zero and actually improves task completion, and the one impersonation attack it can't stop
* Why these failures are live today in LangGraph (65%), LangMem (95%), and AutoGen (100%) production frameworks

* 02:02 - Why a summarizer drops the one rule. Establishes that an agent's only memory is its context window, that frameworks constantly compact it to save space, and that a summarizer naturally drops an old compliance rule as off-topic.
* 04:27 - Doesn't a protected slot fix this? Shows that rules in the privileged system message survive, but most real rules arrive as user instructions, retrieved memory, or tool outputs that all live in the compactable context.
* 07:23 - Did the rule survive, or just get buried? Introduces the ConstraintRot benchmark and confronts the 'lost in the middle' objection: that the rule may still be present but overlooked rather than deleted.
* 08:05 - House rules vanish, reflexes survive. Explains the 8x decay gap between soft organizational rules and hard safety norms, and why built-in priors mask the problem and create a false sense of safety.
* 10:15 - Worse than if you'd said nothing. Reveals that for the worst model, stating a rule and compacting it away normalizes the forbidden action and pushes violation above the no-rule baseline.
* 11:28 - Proving it's the plumbing, not the model. Walks through the counterfactual-summary experiment that kills the length objection and the crossing experiment showing violations track the summarizer, not the agent.
* 14:17 - Compress harder, lose more rules. Presents the dose-response curve and notes that production guidance to compact aggressively pushes deployments toward the high-decay end.
* 15:52 - The attack that deletes instead of adds. Introduces the subtractive Compaction-Eviction Attack, its volume and summarizer-injection variants, the no-model-is-safe-on-both-axes result, and the search that breaks resistance from zero to 65%.
* 20:19 - A 50-token card that fixes it. Describes Constraint Pinning (re-stapling a protected rule card after every compaction), which costs under half a percent overhead, brings violations back to zero, and improves task completion.
* 22:11 - The forged operator update it can't stop. Lays out pinning's honest limit (in-context impersonation defeats it because operator authority asserted in the token stream can't be verified) and reframes context management as a first-class governance surface.

RECOMMENDED READING
* Lost in the Middle: How Language Models Use Long Contexts [https://arxiv.org/abs/2307.03172]: The 'lost in the middle' result the episode names directly as the skeptic's objection, and that the paper's counterfactual-summary experiment works to rule out as the cause.
* Prompt Injection attack against LLM-integrated Applications [https://arxiv.org/abs/2306.05499]: Grounds the 'additive' prompt-injection threat model that the episode contrasts against its 'subtractive' Compaction-Eviction Attack.
* Universal and Transferable Adversarial Attacks on Aligned Language Models [https://arxiv.org/abs/2307.15043]: Backs the episode's 'locksmith point' that robustness to a fixed probe is not robustness to search, the same gap that turned a 0% injection into 65% via gradient-level optimization.
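The pinning idea described above (re-attaching a protected rule card after every compaction) can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's implementation: `AgentContext`, `naive_summarizer`, and `compact_with_pinning` are hypothetical names, and the "summarizer" is a toy that keeps the last two messages rather than a real LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentContext:
    """Hypothetical container: pinned rules live outside the compactable history."""
    pinned_rules: List[str]
    history: List[str] = field(default_factory=list)


def naive_summarizer(messages: List[str], keep_last: int = 2) -> List[str]:
    """Toy stand-in for an LLM summarizer: keeps only the most recent messages."""
    dropped = len(messages) - keep_last
    if dropped <= 0:
        return list(messages)
    return [f"[summary of {dropped} earlier messages]"] + messages[-keep_last:]


def compact_with_pinning(
    ctx: AgentContext,
    summarizer: Callable[[List[str]], List[str]] = naive_summarizer,
) -> List[str]:
    """Compact the history, then re-attach the rule card verbatim at the front."""
    ctx.history = summarizer(ctx.history)
    rule_card = "RULES (pinned): " + " | ".join(ctx.pinned_rules)
    return [rule_card] + ctx.history


rule = "Do not email contracts outside the company."
ctx = AgentContext(pinned_rules=[rule])
ctx.history = [
    f"user: {rule}",
    "tool: fetched contract.pdf",
    "user: draft a reply",
    "user: now send it to the vendor",
]

unpinned_view = naive_summarizer(ctx.history)  # rule is summarized away
pinned_view = compact_with_pinning(ctx)        # rule card survives compaction
```

In this toy run the unpinned view no longer contains the rule, while the pinned view carries it as its first entry. As the episode notes, a card like this does nothing against text in the context that falsely claims operator authority.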

All episodes

161 episodes

A Router That Beats the Frontier Models It Calls

Source: Sakana Fugu Technical Report [https://arxiv.org/abs/2606.21228]
Paper was published on June 19, 2026.
This episode was AI-generated on June 23, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A system whose only skill is deciding which top model to call for each piece of a problem manages to beat GPT, Claude, and Gemini, the very models it's calling, on some of the hardest benchmarks we have. The paper argues orchestration is a second scaling axis hiding in plain sight, one that could put frontier performance within reach of teams that can't afford to train a frontier model. We dig into how it works, what's genuinely surprising, and where the evidence gets uncomfortably thin.

KEY TAKEAWAYS
* Why frontier models have stopped being interchangeable, and how a learned router exploits that specialization model-by-model and even step-by-step
* What 'model merging at the behavioral level' means, and why combining closed models by behavior sidesteps the open-weights requirement of classic merging
* The surprising finding that a model's standalone benchmark score does not predict how well it performs inside a real coding harness
* How the heavy 'Ultra' system avoids 'orchestration collapse' by isolating agents within a workflow while sharing memory across workflows
* The credibility seam: where the evidence is rigorous the effect is small (a fraction of a percent), and where the effect is huge it leans on provider-reported baselines and hand-picked examples
* Why the orchestration-as-scaling-axis framing matters for export controls and the compute race even if the headline numbers are softer than claimed

* 00:00 - The contractor who never picks up a hammer. The core analogy and the headline claim: a system that only decides which model to call beats every model it calls, without training anything new.
* 02:20 - Why no model is best at everything anymore. The paper's starting observation that frontier models have specialized, and that the scaffold wrapped around a model matters as much as its weights.
* 04:07 - Merging behavior, not weights. How combining models by behavior rather than weights lets Fugu mix closed models from different providers and absorb new ones without retraining.
* 05:35 - Two systems, one trip-up to avoid. The distinction between the fast Fugu router that picks one worker per turn and the heavy Fugu-Ultra that writes whole free-form workflows.
* 07:29 - How do you teach a thing to pick? The training recipe: supervised fine-tuning on soft score distributions, evolutionary refinement on whole-task success, and reinforcement learning for Ultra.
* 10:53 - The benchmark score that lies to you. The finding that standalone benchmark scores don't predict in-harness behavior, and the orchestration-collapse failure mode Ultra had to solve.
* 14:52 - Does the routing actually adapt? The evidence: Terminal Bench trajectories, builder-and-debugger workflows, a shifting aggregator role, and the pie charts proving domain-specific routing.
* 20:24 - Where the impressive thing gets weak. The steelman critique: self-computed scores versus provider-reported baselines, selected illustrative wins, and the rigorous experiment showing the smallest effect.
* 23:42 - A second path to the frontier? Why orchestration as a scaling axis could distribute frontier capability beyond the biggest training runs, and the closing question for listeners.

RECOMMENDED READING
* Evolutionary Optimization of Model Merging Recipes [https://arxiv.org/abs/2403.13187]: The same lab's prior weight-level model-merging work that the episode explicitly contrasts with Fugu's behavioral merging of closed models.
* Mixture-of-Agents Enhances Large Language Model Capabilities [https://arxiv.org/abs/2406.04692]: The fixed-aggregator multi-agent approach the episode names as the direct foil to Fugu's adaptive, task-dependent synthesizer role.
* GPTSwarm: Language Agents as Optimizable Graphs [https://arxiv.org/abs/2402.16823]: Cited by the episode as prior multi-agent work whose fixed orchestration structure Fugu-Ultra's learned workflows aim to surpass.
* DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [https://arxiv.org/abs/2402.03300]: Introduces GRPO, the critic-free reinforcement learning method the episode describes for training Fugu-Ultra's workflow generation.

Yesterday · 26 min

A Free-Lunch Tweak That Lets a Tiny Agent Beat Frontier Giants

Source: Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning [https://arxiv.org/abs/2606.22995]
Paper was published on June 22, 2026.
This episode was AI-generated on June 23, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Train an agent eight times on the same task and the standard algorithm throws away the fact that all eight kept walking through the same rooms. A new method called G2PO refuses to discard that overlap, and a 1.5-billion-parameter model jumps more than twenty points in success rate for under half a percent of extra compute. We trace exactly how re-reading rollouts you already paid for can let a small model roughly double the success rate of a frontier model hundreds of times larger.

KEY TAKEAWAYS
* Why standard agent training treats eight attempts at the same task as eight strangers, and how G2PO fuses overlapping situations into a single branching graph instead
* How averaging a situation's value across every attempt that passed through it cuts noisy value estimates the way visiting a restaurant eight times washes out one bad night
* Why scoring a move by its absolute progress across the whole map (not just its local neighbors) credits a brilliant move even in a run that ultimately lost
* The surprising variance result: subtracting two noisy, correlated value estimates cancels noise instead of compounding it
* The headline numbers (+22 points on ALFWorld, ~14 on WebShop, and a trained 1.5B model hitting 71% on WebShop versus Gemini 2.5 Pro's ~36%) for about one second of extra CPU bookkeeping per step
* Where the method weakens: on AppWorld, where states rarely repeat, the gap collapses to under three points, plus idealized proof assumptions and wide subtask error bars

* 00:00 - Eight strangers or one situation? Sets up the core blind spot (training treats eight attempts through the same kitchen as unrelated) and previews the outsized payoff from patching it.
* 01:46 - One bit at the end of forty moves. Explains the credit assignment problem at long horizons, where a single success-or-failure bit must be smeared back across every decision.
* 02:44 - How GRPO fired the referee. Walks through GRPO's grading-on-a-curve trick and the step-level training move that quietly assumed every attempt is its own universe.
* 04:38 - Stop drawing lines, draw a map. Introduces the graph reframe (fusing identical situations into shared nodes) using the airport-terminal analogy, while flagging that it all depends on real overlaps.
* 06:46 - Rating a restaurant from one visit. Shows how pooling every attempt through a node stabilizes value estimates and drops noise in proportion to how many runs are pooled.
* 08:51 - Grading the whole school, not one classroom. Explains edge-centric advantage (scoring a move by absolute progress against every edge in the graph) and the kitchen example where a real breakthrough lights up.
* 12:05 - Why the noise cancels instead of stacking. Unpacks the surprising result that subtracting two positively correlated value estimates keeps the sharper signal bounded by the same variance as the crude one.
* 14:09 - Does the free lunch show up? Presents the benchmark results, the tiny trained model doubling frontier giants, and the roughly four-tenths-of-a-percent compute overhead.
* 16:29 - Where the headline number gets shaky. Steelmans the critique (underspecified matching, the AppWorld collapse, idealized independence assumptions, and wide subtask error bars) and what survives it.

RECOMMENDED READING
* DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [https://arxiv.org/abs/2402.03300]: Introduces GRPO, the critic-free group-relative algorithm that G2PO inherits and extends; essential background for the 'grading on a curve' core of this episode.
* ALFWorld: Aligning Text and Embodied Environments for Interactive Learning [https://arxiv.org/abs/2010.03768]: The household-task benchmark where G2PO posts its largest 20+ point gains, and where state-overlap is richest; the best-case setting the episode dissects.
* WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents [https://arxiv.org/abs/2207.01206]: The e-commerce navigation benchmark behind the episode's headline contrast of a 1.5B trained model beating frontier giants on success rate.
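The two ideas this description leans on, pooling a situation's value across every attempt that passed through it and scoring a move by the change in pooled value, can be illustrated with a small sketch. The description does not give G2PO's actual estimator, state matching, or normalization, so everything here is an assumption: the function names are hypothetical, states are matched by exact equality, and a state's value is simply the mean final reward of the rollouts that visited it.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

# A rollout is (states visited in order, final reward). Hypothetical format.
Rollout = Tuple[List[Hashable], float]


def pooled_state_values(rollouts: List[Rollout]) -> Dict[Hashable, float]:
    """Mean final reward over all rollouts that passed through each state."""
    totals: Dict[Hashable, float] = defaultdict(float)
    counts: Dict[Hashable, int] = defaultdict(int)
    for states, reward in rollouts:
        for state in set(states):  # count each rollout once per state
            totals[state] += reward
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}


def edge_advantages(
    rollouts: List[Rollout], values: Dict[Hashable, float]
) -> Dict[Tuple[Hashable, Hashable], float]:
    """Score each observed move by the change in pooled value it produced."""
    advantages = {}
    for states, _ in rollouts:
        for src, dst in zip(states, states[1:]):
            advantages[(src, dst)] = values[dst] - values[src]
    return advantages


rollouts = [
    (["start", "kitchen", "fridge"], 1.0),
    (["start", "kitchen", "sink"], 0.0),
    (["start", "hall"], 0.0),
    (["start", "kitchen", "fridge"], 1.0),
]
values = pooled_state_values(rollouts)
advantages = edge_advantages(rollouts, values)
```

In this toy data the move from "start" to "kitchen" gets a positive score even inside the rollout that later failed at the sink, because the pooled value of "kitchen" draws on all three attempts that reached it.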

Yesterday · 22 min

Why Training Only on Perfect Solutions Cripples a Model's Reasoning

Source: Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently [https://arxiv.org/abs/2606.22938]
Paper was published on June 22, 2026.
This episode was AI-generated on June 23, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Everyone assumes clean, flawless examples are the best reasoning data, and a new theory paper proves that intuition is backwards. By formalizing reasoning as path-finding through a maze, two researchers show imitation learning provably can't teach backtracking, while reinforcement learning learns it for free from the model's own failures. The result is a clean, exponential gap that reframes what 'high-quality reasoning data' even means.

KEY TAKEAWAYS
* Why training on clean, backtracking-free solutions provably freezes a model's ability to retreat from dead ends: there's no gradient signal where there's no data
* How modeling reasoning as path-finding through a maze turns 'backtracking' into something you can prove theorems about
* The headline result: RL scales linearly with reasoning depth (W·K) while imitation blows up exponentially (W·L^K), from the identical starting model
* Why bolting a clever search wrapper onto a weak imitation model helps a lot but still can't fully close the gap
* The steelman critique: the central theorem is close to true by construction, and the exponential drama leans on a chosen graph topology and a deliberately pessimistic definition of SFT
* The practical payoff: why distilling from an RL-trained model works precisely because you inherit its messy recoveries, not just its answers

* 00:03 - Is clean data secretly the problem? The provocative claim that flawless solutions are the wrong training data, and why a new theory paper makes it more than a vibe.
* 01:28 - Two ways to train, one key difference. Setting up the fight between supervised fine-tuning and RLVR, with the crucial distinction that RL learns from the model's own failures.
* 03:41 - Turning reasoning into a maze. How the authors recast reasoning as path-finding through corridors with parallel lanes, making backtracking a measurable, provable quantity.
* 06:04 - No examples, no nudge. The simple gradient fact that dooms imitation learning: perfect solutions contain no dead ends, so backward-facing states never get any signal.
* 09:03 - Linear versus falling off a cliff. The exponential blowup of imitation versus the linear scaling of RL, and what that gap means concretely as reasoning gets deeper.
* 10:15 - How RL escapes the trap. Why reinforcement learning visits the exact dead-end states imitation never sees, and how its learning rule turns failure into the gradient that matters.
* 13:12 - Does it survive a real algorithm? Confirming the predicted optimum with PPO, a transformer, and asymmetric graphs, and why search scaffolding helps but still can't fully close the gap.
* 15:19 - How true by construction is this? The steelman critique: a pessimistic strawman SFT, a target-blind model, and an exponential that leans on a chosen topology and idealized RL analysis.
* 18:39 - The dead ends are the curriculum. The distillation fix and the big takeaway: quality reasoning data isn't clean data, it's data that keeps the struggle and the recoveries in.

RECOMMENDED READING
* Tree of Thoughts: Deliberate Problem Solving with Large Language Models [https://arxiv.org/abs/2305.10601]: The search-scaffolding approach the episode critiques; the authors show external orchestration helps but can't fully replace backtracking baked into the weights.
* DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [https://arxiv.org/abs/2501.12948]: The 'folklore' the episode says this theory paper finally formalizes, a flagship demonstration that RL with verifiable rewards produces genuine reasoning and backtracking behavior.
* Proximal Policy Optimization Algorithms [https://arxiv.org/abs/1707.06347]: The actual RL algorithm the paper uses to confirm its toy-model predictions on a real transformer; worth reading to understand the machinery behind the W-times-K result.
* Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [https://arxiv.org/abs/2201.11903]: The reasoning paradigm the paper models as path-finding through a graph; useful context for judging the gap between the episode's blind-search sandbox and chain-of-thought as actually practiced.
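To get a feel for the linear-versus-exponential gap quoted above, the two expressions W·K and W·L^K can simply be evaluated side by side. The description does not define W, L, or K precisely, so this sketch treats them as plain positive integers with illustrative values (W = 3, L = 2) and makes no claim about units or constants in the paper's theorems.

```python
def rl_scaling(W: int, K: int) -> int:
    """The linear expression W*K quoted for RL in the episode description."""
    return W * K


def imitation_scaling(W: int, L: int, K: int) -> int:
    """The exponential expression W*L^K quoted for imitation learning."""
    return W * L ** K


W, L = 3, 2  # illustrative values only, not taken from the paper
rows = [(K, rl_scaling(W, K), imitation_scaling(W, L, K)) for K in (1, 2, 4, 8, 16)]
for K, rl, sft in rows:
    print(f"K={K:>2}  W*K={rl:>3}  W*L^K={sft:>7}")
```

With these toy values the two expressions are within a factor of two at K = 1, but by K = 16 the exponential one is 196608 against 48 for the linear one.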

Yesterday · 22 min

The Empty-Lake Proof: Why More Rollouts Stop Helping Reasoning Models

Source: Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning [https://arxiv.org/abs/2605.05262]
Paper was published on May 06, 2026.
This episode was AI-generated on June 23, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

On the hardest problems, throwing more independent attempts at a reasoning model is almost useless past a point, and a May 2026 paper proves it in two lines of arithmetic. Then it borrows a fifty-year-old combinatorics theorem to fix the problem, and watches the field's favorite folk hack, the entropy bonus, fall straight out of the math. You'll come away understanding why budget is a weak lever, why hardness is a strong one, and where the paper's 'provable' spine quietly bends.

KEY TAKEAWAYS
* Why GRPO's relative-scoring signal goes to exactly zero when a group of rollouts all agree, and why easy and hard problems both collapse that way
* The napkin-sized proof that useful mixed groups grow only linearly with budget while difficulty pushes against you exponentially: independent sampling flatlines near 45%
* How a 1978 submodularity theorem hands the authors a near-optimal greedy selector for free, instead of a hand-tuned heuristic
* Why the long-used entropy bonus turns out to be a forced consequence of the math, not a tuning knob, under a stated linearization
* The eyebrow-raiser where a hand-derived formula beats a neural network trained specifically to beat it
* Where the 'provable' claim is actually proven about a proxy score, and why the guarantee weakens precisely on deep, long-horizon problems

* 00:00 - Casting into a nearly empty lake. Sets up the central metaphor and the paper's core claim that the waste in training reasoning models is structural, not just inefficient.
* 01:47 - When the learning signal goes to zero. Explains how GRPO scores rollouts relative to their group, and why uniform groups (all right or all wrong) produce exactly zero learning.
* 03:50 - The proof you can check on a napkin. Walks through the simple binomial argument showing budget grows usefulness only linearly while difficulty fights back exponentially, with brutal real numbers.
* 06:06 - What if the attempts shared a whiteboard? Introduces the pivot from independent sampling to growing a tree of attempts, and the hard question of which node to expand next.
* 07:43 - A fifty-year-old theorem does the work. Defines submodularity and how Nemhauser's 1978 result guarantees a greedy selector is near-optimal, built from coverage, novelty, and contrast.
* 10:32 - The entropy bonus falls out of the math. Derives the UUCB selection rule as three questions, showing the classic UCB exploration term and the entropy bonus appear as forced consequences, not hacks.
* 14:23 - Finding the fork in one math problem. Works through a single competition problem where the selector lights up the exact strategy-switch node and lands a sharp learning signal flat GRPO would smear away.
* 17:20 - Four times the nudge, same budget. Reports the benchmark results: InfoTree tracking the theoretical ceiling, roughly 4x gradient signal, an 11-point GAIA win, and the hand-derived formula beating a learned selector.
* 21:48 - A map of a slightly different city. Delivers the steelman critique: the guarantee is proven about a proxy, the strongest wins are over the weakest baseline, and it strains on the deep, long-horizon trees that matter most.
* 25:10 - Why the reframe outlasts the method. Lands the takeaway that the real contribution is turning a practitioner grumble into an impossibility result, and poses the closing question to the audience.

RECOMMENDED READING
* DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [https://arxiv.org/abs/2402.03300]: The paper that introduced GRPO, the group-relative training method whose collapse-on-hard-problems failure mode this episode is built around.
* An Analysis of Approximations for Maximizing Submodular Set Functions—I [https://doi.org/10.1007/BF01588971]: The 1978 Nemhauser–Wolsey–Fisher result the episode credits for the greedy near-optimality (the '63 percent') guarantee at the core of the selection rule.
* GAIA: a benchmark for General AI Assistants [https://arxiv.org/abs/2311.12983]: The web-search agent benchmark where the episode reports the method's biggest single win, useful for judging the eleven-point claim.
* Finite-time Analysis of the Multiarmed Bandit Problem [https://doi.org/10.1023/A:1013689704352]: The classic UCB exploration-bonus result that the episode shows reappearing, derived rather than bolted on, inside the tree-expansion selection rule.
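The zero-signal claim and the binomial argument summarized above can both be checked with a short sketch. It assumes binary rewards and independent rollouts with a fixed per-attempt success probability p, and it uses the common mean-and-standard-deviation form of group-relative scoring; the function names are hypothetical and the paper's exact normalization may differ. Under those assumptions, a group of size G contains both a success and a failure with probability 1 - p^G - (1 - p)^G.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each rollout relative to its group: (reward - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def prob_mixed_group(p: float, group_size: int) -> float:
    """Chance that independent rollouts include at least one success and one failure."""
    return 1.0 - p ** group_size - (1.0 - p) ** group_size


all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])  # hard problem
all_right = group_relative_advantages([1.0, 1.0, 1.0, 1.0])  # easy problem
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])      # informative group

p_hard = 0.01  # illustrative per-attempt success rate, not from the paper
small_budget = prob_mixed_group(p_hard, 8)
double_budget = prob_mixed_group(p_hard, 16)
```

Both uniform groups produce all-zero advantages, so they contribute no learning signal. On the toy hard problem, doubling the group from 8 to 16 rollouts slightly less than doubles the chance of a mixed group, which is the "budget is a weak lever" point in miniature.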

Yesterday · 27 min