AI Papers: A Deep Dive

Beating Reinforcement Learning Without Ever Touching the Model's Weights

22 min · June 9, 2026

Description

Source: Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents [https://arxiv.org/abs/2606.05296]
Paper was published on June 03, 2026.
This episode was AI-generated on June 5, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Two desktop GPUs matched, and on one task beat, a reinforcement learning method that needed eight A100s, all without computing a single gradient or fine-tuning anything. The trick is an old theoretical equivalence between RL and Bayesian inference that only became useful once frontier models got sealed behind APIs. We unpack how a cheap critic steering a frozen giant turns training into selection, and where the whole approach quietly falls apart.

KEY TAKEAWAYS
* Why a known equivalence between reinforcement learning and Bayesian inference, long stuck in theory papers, becomes useful only when you're locked out of a model's weights
* How the method replaces gradient-based training with Sequential Monte Carlo: running a population of agent trajectories, scoring them with a small critic, and 'breeding' the winners
* Why learning the value function is just a cheap offline regression problem (2-3 hours, ~$140) rather than expensive online RL, because the big model stays frozen
* The headline result: on SciWorld, the no-gradient method overtook GRPO, which had full access to the model's internals, and an 11B critic measurably improved GPT-5.1 without touching it
* The TextCraft failure case: when a model produces uniformly good, near-identical trajectories, there's nothing to select among and the method actually makes things worse
* The fine print behind 'beats GRPO': it depends on high trajectory counts, hand-tuned resampling schedules, parallel API calls that aren't free, and a value estimate that's only ever approximate

CHAPTERS
* 00:00 - The welded-shut hood problem: Why the most capable agents run on API-only models you can't compute gradients for, and why prompt-fiddling and tuning a smaller open model are both unsatisfying escape hatches.
* 02:48 - RL is secretly Bayesian inference: Building up the theoretical equivalence: the optimal leashed RL policy is exactly a Bayesian posterior, which means the question shifts from 'how do I train?' to 'how do I sample?'
* 05:37 - Particle filters and a beam of agents: How Sequential Monte Carlo runs a population of trajectories forward, culls the weak ones, and clones the strong ones to converge on the target distribution.
* 08:26 - The cheap coach steering the frozen athlete: Separating the frozen black-box agent from the small critic that learns a value function via offline regression, and the cost numbers that make it run on two desktop GPUs.
* 11:14 - The results that travel: Beating GRPO on SciWorld without gradient access, a cheap model with the method outperforming an expensive one on Best-of-15, and an 11B critic nudging up GPT-5.1.
* 14:03 - One decisive fork in WebShop: A concrete trace of the critic doing its job, pruning a stalled trajectory and cloning a promising one at a single resampling checkpoint.
* 16:52 - Where it gets softer than the headline: The honest caveats: GRPO comparisons that depend on trajectory count, the TextCraft failure when trajectories are too uniform, hand-tuned schedules, non-free API calls, and an approximate critic.
* 19:41 - Why the constraint created the use case: Why an always-true equivalence only became operationally decisive in the world of sealed models, and what that means for democratizing frontier-agent optimization.

RECOMMENDED READING
* Levine: Reinforcement Learning and Control as Probabilistic Inference [https://arxiv.org/abs/1805.00909] - The tutorial that lays out the exact RL-as-Bayesian-inference equivalence this episode's method exploits, explaining why the optimal leashed policy is a posterior.
* Sequential Monte Carlo Steering of Large Language Models using Probabilistic Programs [https://arxiv.org/abs/2306.03081] - Directly applies the particle-filter/SMC machinery the episode describes to steering frozen language models, the same 'sample the posterior instead of training' idea.
* DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [https://arxiv.org/abs/2402.03300] - Introduces GRPO, the gradient-based RL 'oracle' baseline the episode repeatedly measures the no-gradients method against.
* Training Verifiers to Solve Math Word Problems [https://arxiv.org/abs/2110.14168] - The work that popularized training a separate verifier/value model to score and select among a base model's trajectories, the conceptual ancestor of this episode's frozen-chef-plus-coach setup.
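
The population-and-resampling loop described in the takeaways can be sketched in a few lines. This is a toy illustration only, not the paper's implementation: `smc_select`, `toy_step`, `toy_value`, the exponential weighting, and every parameter value are assumptions made for the example. In a real setting `step` would be a call to the frozen black-box agent and `value` would be the small learned critic.

```python
import math
import random


def smc_select(step, value, n_particles=8, horizon=6,
               resample_every=2, temperature=1.0, seed=0):
    """Toy Sequential Monte Carlo over agent trajectories (hypothetical helper).

    step(trajectory, rng) -> next action, standing in for the frozen agent.
    value(trajectory)     -> float score, standing in for the small critic.
    """
    rng = random.Random(seed)
    particles = [[] for _ in range(n_particles)]
    for t in range(1, horizon + 1):
        # Advance every trajectory by one action from the frozen policy.
        particles = [traj + [step(traj, rng)] for traj in particles]
        if t % resample_every == 0:
            # Assumed weighting: proportional to exp(value / temperature).
            scores = [value(traj) / temperature for traj in particles]
            top = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(s - top) for s in scores]
            # Resample with replacement: strong trajectories get cloned,
            # weak ones tend to drop out of the population.
            chosen = rng.choices(particles, weights=weights, k=n_particles)
            particles = [list(traj) for traj in chosen]
    # Return the surviving trajectory the critic likes best.
    return max(particles, key=value)


def toy_step(trajectory, rng):
    # Stand-in policy: pick a "bad" (0) or "good" (1) action at random.
    return rng.choice([0, 1])


def toy_value(trajectory):
    # Stand-in critic: count of good actions so far.
    return float(sum(trajectory))


best = smc_select(toy_step, toy_value)
print(best, toy_value(best))
```

The sketch also makes the TextCraft caveat visible: if every particle receives nearly the same score, the weights are nearly uniform and resampling only discards diversity without adding information.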


All episodes

119 episodes


Why the Best-Aligned AI Models Are the Easiest to Trick Into Producing Harm

Source: Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack [https://arxiv.org/abs/2606.05614]
Paper was published on June 04, 2026.
This episode was AI-generated on June 5, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A new paper argues that the sharper a language model's judgment about what counts as harmful, the more reliably a single cheap prompt can make it produce that exact harm, and the correlation across thirty models is nearly a straight line. Worse, sorting those models by release date shows the field walking its best work toward maximal exploitability, one alignment improvement at a time. But the episode also finds a hopeful crack in the thesis: for some models, making them think before answering drives the attack to zero.

KEY TAKEAWAYS
* How the 'Posterior Attack' works: instead of fighting a model's reluctance, it asks the model to act as a safety classifier and produce an example of content it would flag, laundering the harm through a framing the model treats as safety work
* Why it's a real threat: one black-box query, no gradient access, about three cents, versus the dollars-and-hours cost of heavyweight gradient or iterative-rewrite attacks
* The eerie correlation across thirty models (Pearson ~0.80): the better a model is at judging harm, the more exploitable it is, and that diagonal is also a timeline pointing at the frontier
* The one-sentence theory: attack odds equal baseline odds of harm multiplied by the safety classifier's sharpness, so improving safety judgment directly inflates the attack, with a perfect classifier converging on guaranteed exploitation
* The causal proof and its limits: using reinforcement learning as a scalpel to move only the safety-judgment 'fader' on small models flips vulnerability up and down, but that experiment can't run on GPT-5 or Claude, so the frontier claim stays correlational
* The honest complication: test-time reasoning drives the attack to zero on some models (GPT-OSS) by reasoning back to the rule, does nothing for Claude Sonnet 4.6, and makes one Qwen model worse, suggesting the paradox may belong to reflexive guards, not to safety knowledge itself

CHAPTERS
* 00:00 - The paradox and the Posterior Attack: Introduces the core claim that sharper harm-recognition makes models more exploitable, and walks through how the single-query 'museum guard' attack tricks a model into generating forbidden content as a classifier example.
* 02:54 - Why this attack is different and cheap: Contrasts the Posterior Attack's one-shot, black-box, three-cents-per-query cost against the dollars and GPU-hours of gradient-optimization and iterative-rewrite jailbreaks.
* 05:48 - The thirty-model correlation and its timeline: Lays out the near-straight diagonal between safety-classifier accuracy and attack success, the frontier numbers up near 99%, and the unsettling fact that the same upgrades that defeat older attacks open this one wider.
* 08:43 - The one-line math behind it: Explains, without notation, how attack odds equal baseline harm odds times classifier sharpness, why the relationship is monotonic, and why a perfect classifier converges on guaranteed exploitation.
* 11:37 - Using reinforcement learning as a scalpel: Describes the controlled experiment that moves only a model's safety-judgment 'fader' up or down while holding capability fixed, showing vulnerability rises and falls in lockstep.
* 14:32 - Where the evidence runs out: Registers the honest caveats: the causal proof only runs on small models, attack success is graded by LLM judges, and the scope is English-only without production-level guardrails.
* 17:26 - The defense that complicates the thesis: Examines test-time reasoning, which drives the attack to zero on some models by reasoning back to the rule but fails on Claude Sonnet 4.6 and backfires on a Qwen model, suggesting the paradox may be a property of reflexive guards.
* 20:21 - What it all means: Frames the takeaway as an intellectual shift (awareness and vulnerability as the same quantity) and the open question of whether the fix is a smarter shield or a slower, more deliberate one.

RECOMMENDED READING
* Universal and Transferable Adversarial Attacks on Aligned Language Models [https://arxiv.org/abs/2307.15043] - The gradient-optimization jailbreak (GCG) the episode contrasts against: the 'dollars and hours' baseline that the cheap single-query Posterior Attack is positioned against.
* Deliberative Alignment: Reasoning Enables Safer Language Models [https://arxiv.org/abs/2412.16339] - The 'reason back to the rule' approach the episode credits for driving the attack to zero on GPT-OSS models, central to the defense complication the hosts dwell on.
* Jailbroken: How Does LLM Safety Training Fail? [https://arxiv.org/abs/2307.02483] - A foundational analysis of why safety training fails that frames jailbreaks as attacks on the model's reluctance, the exact framing the Posterior Attack departs from.
* Llama 2: Open Foundation and Fine-Tuned Chat Models [https://arxiv.org/abs/2307.09288] - Documents the safety alignment recipe for several of the open models plotted along the episode's vulnerability-vs-awareness diagonal, including the low-exploitability Llama 2.
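
The episode states the 'one-line math' without notation. One way to write the stated relation is given below as a sketch. The symbols are assumptions introduced here, not the paper's: $p_0$ is the baseline probability of harmful output with $0 < p_0 < 1$, $s > 0$ is the classifier's sharpness, and $p_{\text{atk}}$ is the attack success probability.

```latex
\frac{p_{\text{atk}}}{1 - p_{\text{atk}}} = \frac{p_0}{1 - p_0}\, s
\quad\Longrightarrow\quad
p_{\text{atk}} = \frac{p_0\, s}{1 - p_0 + p_0\, s},
\qquad
\lim_{s \to \infty} p_{\text{atk}} = 1 .
```

Under these assumptions $p_{\text{atk}}$ increases monotonically in $s$, which is the sense in which a sharper classifier inflates the attack and a perfect one converges on guaranteed exploitation.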

23 min · June 9, 2026

How an AI Agent Rewrites Its Own Tools, Without an Answer Key

Source: Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts [https://arxiv.org/abs/2606.05922]
Paper was published on June 04, 2026.
This episode was AI-generated on June 5, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

An AI coding agent jumped from solving 60% of hard software bugs to nearly 80% in a single round of self-improvement, and nobody graded its work along the way. This episode unpacks how it pulls a usable training signal out of unlabeled past failures, why comparative self-judgment can stand in for an answer key, and where that imperfect self-judge starts to show its limits.

KEY TAKEAWAYS
* Why 'harness' optimization (rewriting the cheap scaffolding around a frozen model rather than retraining the model) is where the gains in this paper actually come from
* How the method swaps the unanswerable 'is this correct?' for the answerable 'is this better than before?', using the agent's own comparative judgment instead of labels
* The 'wait, really?' ablation: picking only the hardest or only the most diverse tasks does worse than random selection; only balancing both lifts performance to 78%
* Why removing the self-consistency signal makes results worse than doing nothing at all, showing both diagnostic signals are load-bearing
* The Table 3 caveat: the agent's self-judge isn't good at picking the best candidate, only at avoiding the worst, so it floors the downside rather than maximizing the upside
* Why the 19-point jump is the best case, not the typical result: other benchmarks gain only 5-8 points, and the whole loop assumes cleanly re-runnable tasks

CHAPTERS
* 00:00 - What a harness actually is: Defines the key distinction between the frozen, expensive model and the cheap, rewritable scaffolding around it, using the fixed-chef-in-a-renovatable-kitchen analogy.
* 03:20 - The label problem and the self-preference bet: Lays out why deployed agents have endless trajectories but no answer key, and how the paper proposes using comparative self-judgment as a substitute for ground truth.
* 06:41 - The three-stage pipeline: Walks through how the method selects hard-and-diverse past tasks, diagnoses failures via self-validation and self-consistency, and holds a pairwise contest among candidate harnesses.
* 10:01 - What the agent actually learned: A concrete example of the agent writing itself executable tools to fix recurring failures, and why that beats prior memory-only self-improvement methods.
* 13:22 - The numbers and the headline's range: Reports the 60-to-80 jump on SWE-Bench Pro alongside the more modest 5- and 8-point gains elsewhere, framing the headline as the top of a range.
* 16:42 - The ablations that defend the design: Examines the coreset result where random beats single-axis selection, and the diagnosis result where dropping a signal falls below baseline, arguing the architecture is principled.
* 20:03 - The label-free vs. label-hungry showdown: Compares RHO against Meta-Harness, showing the label-free method matches the answer-key method at a fraction of the compute.
* 23:23 - Limitations and the imperfect self-judge: Sits with the Table 3 caveat, the reward-gaming risk, the clean-reset assumption, and other honest asterisks on the results.
* 26:44 - What survives, and what to watch: Closes on the durable reframing of how agents can self-improve, and the open tension that the loop is only as trustworthy as the judge at its center.

RECOMMENDED READING
* Self-Consistency Improves Chain of Thought Reasoning in Language Models [https://arxiv.org/abs/2203.11171] - The origin of the self-consistency idea this episode leans on for its diagnosis step: using disagreement across multiple sampled runs as a signal.
* Constitutional AI: Harmlessness from AI Feedback [https://arxiv.org/abs/2212.08073] - A foundational case for replacing human labels with model-generated preference signals, which is exactly the substitution-and-its-risks tension the episode dwells on.
* SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [https://arxiv.org/abs/2310.06770] - Introduces the real-repository bug-fixing benchmark family whose 'Pro' variant produced the headline sixty-to-eighty-percent jump discussed throughout the episode.
* GAIA: a benchmark for General AI Assistants [https://arxiv.org/abs/2311.12983] - The general-assistant benchmark behind the GAIA-2 results, useful for understanding the messier knowledge-work setting where RHO's gains were more modest.

30 min · June 9, 2026

How an Open AI System Verified 672 Hard Math Proofs for Under $300

Source: Goedel-Architect: Streamlining Formal Theorem Proving with Blueprint Generation and Refinement [https://arxiv.org/abs/2606.06468]
Paper was published on June 04, 2026.
This episode was AI-generated on June 5, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

An open-weight AI verified machine-checked proofs across an entire benchmark of Putnam-level math problems for $294, while a comparable system reportedly spent around $170,000 for a single run. The 500-fold gap doesn't come from a fancier model; it comes from a single architectural bet about how the system is organized. We dig into the mechanism that makes failure useful, and where the headline number quietly leans on a closed model whispering hints.

KEY TAKEAWAYS
* Why formal proving with Lean makes checking free and automatic, and why a single false sub-claim can make a system burn compute forever on something that was never true
* The blueprint approach: sketch the whole proof as a graph of inter-dependent lemmas, attack pieces in parallel, and repair the plan while keeping everything that already worked, in contrast to the recursive-decomposition method that dives down one branch and gets stuck
* The two-channel failure mechanism: a 'negation channel' that produces compiler-verified counterexamples (it caught its own false sub-claims on 292 of 672 problems) and a 'forfeit channel' that writes a structured post-mortem telling the next round how to split the problem
* Why the refinement loop scales log-linearly: steady, predictable gains that each require doubling compute, making cost a tunable dial rather than a coin flip
* Where the headline numbers get fuzzy: the controlled same-model comparison is partly missing on the hardest set, and the record-breaking scores rely on natural-language hints sometimes seeded by a stronger closed model
* How contamination testing on post-cutoff olympiad problems argues the system is reasoning, not regurgitating

CHAPTERS
* 00:00 - The $294 headline and the ground rules: The team verified an entire 672-problem benchmark for under $300 versus a reported $170,000, and the hosts establish this is an AI-generated show discussing the Goedel-Architect paper.
* 03:13 - What formal proof actually means: Why Lean turns a proof into a program the compiler judges automatically, and why a single false sub-claim makes the whole proof uncompilable, the trap that wastes compute on roads that don't exist.
* 06:26 - Recursion versus the blueprint: The contrast between depth-first recursive decomposition that gets stuck on one branch and the blueprint approach that lays out the whole strategy as a graph, attacks pieces in parallel, and preserves proven work during repair.
* 09:39 - Failure that hands you a plan, the two channels: The negation channel that proves a sub-claim false with a verified counterexample (the constant-zero function case study) and the forfeit channel that writes a resignation memo on how to split the problem (the Putnam 1985 case study).
* 12:52 - The scaling curve and the cost breakdown: How 16 refinement passes climb from 30% to 76% along a log-linear curve, and how that turns into roughly 44 cents per problem against the competitor's $244.
* 16:05 - Asterisks on the comparison: The controlled same-model experiment that supports the architectural claim, the missing head-to-head row on the hardest problems, and the closed-model natural-language hints behind the record-breaking scores.
* 19:18 - Contamination, candor, and the olympiad gap: Testing on post-training-cutoff problems to rule out memorization, the honest four-of-six olympiad result against a geometry-specialized competitor, and the paper's careful separation of autonomous from seeded numbers.
* 22:31 - Two kinds of significance: The immediate accessibility win that puts top-tier formal proving within reach of independent researchers, and the more speculative transferable idea of systems that fail informatively across domains beyond math.

RECOMMENDED READING
* Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs [https://arxiv.org/abs/2210.12283] - The Draft-Sketch-Prove lineage the episode explicitly credits as the ancestor of this paper's refinable natural-language seeding.
* DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search [https://arxiv.org/abs/2408.08152] - A leading open-weight Lean prover that uses compiler feedback and tree search, offering a concrete point of contrast to the episode's blueprint-versus-recursion framing.
* Solving olympiad geometry without human demonstrations (AlphaGeometry) [https://doi.org/10.1038/s41586-023-06747-5] - The kind of specialized geometry engine the episode cites as the reason the system loses the one IMO problem it misses, showing why dedicated tools still beat general provers there.
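
The cost and scaling figures quoted above can be checked with a few lines of arithmetic. This is a sketch under one assumption not stated in the notes: that the 30% and 76% figures are the scores at the first and sixteenth refinement pass, so the climb spans log2(16) = 4 doublings. The variable names are illustrative.

```python
import math

# Figures quoted in the episode notes.
total_cost_usd = 294
problems = 672
start_pct, end_pct, passes = 30.0, 76.0, 16

# Average cost per verified problem, in dollars.
cost_per_problem = total_cost_usd / problems

# On a log-linear curve, each doubling of passes adds a roughly constant
# number of points (assumption: 30% at pass 1 and 76% at pass 16).
doublings = math.log2(passes)
gain_per_doubling = (end_pct - start_pct) / doublings

print(round(cost_per_problem, 2))  # dollars per problem
print(gain_per_doubling)           # percentage points per doubling
```

Under that reading, the per-problem cost rounds to the 44 cents quoted in the chapter list, and each doubling of refinement passes buys about 11.5 percentage points.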

25 min · June 9, 2026

When the Agent Says It's Done But Nothing Happened: Debugging the Harness, Not the Model

Source: From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws [https://arxiv.org/abs/2606.06324]
Paper was published on June 04, 2026.
This episode was AI-generated on June 5, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

An AI agent confidently reports a task complete while the database shows nothing actually happened, and no prompt edit on earth can fix it. A new paper argues that for a huge share of agent failures, the model is already good enough, and the real bug lives in the deterministic scaffolding around it. The payoff: that scaffolding is just software, which means you can actually diagnose and repair it.

KEY TAKEAWAYS
* Why many agent failures are 'silent successes' (the harness marks a task complete even though nothing changed in the world) and why benchmark scores actively hide them
* How HarnessFix borrows a compiler trick (a normalized intermediate representation) to turn messy, framework-specific traces into something you can analyze uniformly
* The four-stage pipeline (abstraction, diagnosis, repair, validation) and why repairs draw from a fixed, vetted catalog distilled from real repo fixes rather than letting the agent rewrite itself freely
* Why a prompt-only version of the system gets zero improvement on Terminal-Bench while the full system fixes lifecycle, observability, and verification flaws prompts can't reach
* The honest limitations: the system largely grades its own diagnoses, raw gains are small (six tasks to nine), results are single-run on one model, and the 'beats human harnesses' comparison isn't a clean head-to-head

CHAPTERS
* 00:00 - The bill-splitting disaster: An agent sends Venmo requests, reports success, and yet zero payments exist, the opening example that lands the paper's whole thesis.
* 03:22 - Reframing the agent as model plus harness: Why the deterministic software wrapping the model (the harness) is the real culprit, and why that matters: it can be debugged.
* 06:44 - Taming the trace: How agent traces are a rambling mess with no common format, and how a compiler-style intermediate representation normalizes them and tags each step's role, success, and world-effect.
* 10:07 - The four-stage repair pipeline: Walking through abstraction, diagnosis, recurring flaw records, code patches, and validation as a named assembly line.
* 13:29 - Repair by catalog, not by improvisation: The opinionated design choice to fix flaws from a fixed menu of vetted operators (the surgeon analogy) and the regression-aware acceptance bar that gates every patch.
* 16:52 - Watching the pipeline fix the bill-splitter: How the system diagnoses three stacked harness flaws, consolidates them, and produces a completion guard no prompt edit could have delivered.
* 20:14 - The numbers and the prompt-only ablation: Held-out improvements of fifteen to fifty percent across four benchmarks, beating hand-built human harnesses, with concrete patches like blocking session-killing commands.
* 23:37 - Taking the knife to it: The critiques: the system largely grades its own diagnoses, the catalog can't reach novel flaws, gains are small and single-run on one model, and what the regression-guard ablation reveals about the design.

RECOMMENDED READING
* ReAct: Synergizing Reasoning and Acting in Language Models [https://arxiv.org/abs/2210.03629] - Defines the reasoning-and-acting loop that constitutes the core of the 'harness' this episode dissects, giving listeners the baseline agent architecture HarnessFix repairs.
* Reflexion: Language Agents with Verbal Reinforcement Learning [https://arxiv.org/abs/2303.11366] - A leading example of the trace-driven self-improvement methods the episode contrasts against, since Reflexion edits the model's prompt/memory rather than the runtime scaffolding.
* Voyager: An Open-Ended Embodied Agent with Large Language Models [https://arxiv.org/abs/2305.16291] - Represents the self-modifying-agent lineage the paper deliberately rejects, letting listeners weigh free self-rewriting against HarnessFix's constrained repair-operator catalog.
* SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [https://arxiv.org/abs/2310.06770] - Provides the repository bug-fixing benchmark style underlying one of the four evaluation domains, illustrating the kind of pass/fail scoring the episode critiques for hiding silent-success failures.
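
The 'completion guard' idea from the bill-splitting example can be illustrated with a minimal sketch. Nothing here is taken from HarnessFix itself: `StepResult`, `guarded_completion`, and the toy `payments` list are hypothetical names invented for the example. The point is only that the harness, not the prompt, decides whether a claimed success counts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StepResult:
    """What the agent reports at the end of a task (hypothetical shape)."""
    claimed_done: bool
    message: str


def guarded_completion(result: StepResult,
                       verify_effect: Callable[[], bool]) -> bool:
    """Accept 'done' only if an independent check of world state agrees."""
    if not result.claimed_done:
        return False
    # The harness inspects the world itself instead of trusting the claim.
    return verify_effect()


# Toy world state standing in for the payments database.
payments: List[Dict[str, float]] = []


def payments_exist() -> bool:
    return len(payments) > 0


claim = StepResult(claimed_done=True, message="All requests sent.")

# The agent says it is done, but no payment exists yet: the guard rejects it.
accepted_before = guarded_completion(claim, payments_exist)

# Once the effect is actually present in the world, the same claim passes.
payments.append({"amount": 12.5})
accepted_after = guarded_completion(claim, payments_exist)

print(accepted_before, accepted_after)
```

Because the check runs in deterministic harness code, it catches the silent-success case regardless of how confident the model's final message sounds.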

26 min · June 9, 2026