AI Papers: A Deep Dive
BEATING REINFORCEMENT LEARNING WITHOUT EVER TOUCHING THE MODEL'S WEIGHTS

Source: Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents [https://arxiv.org/abs/2606.05296]
Paper was published on June 03, 2026
This episode was AI-generated on June 5, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Two desktop GPUs matched, and on one task beat, a reinforcement learning method that needed eight A100s, all without computing a single gradient or fine-tuning anything. The trick is an old theoretical equivalence between RL and Bayesian inference that only became useful once frontier models got sealed behind APIs. We unpack how a cheap critic steering a frozen giant turns training into selection, and where the whole approach quietly falls apart.

KEY TAKEAWAYS
* Why a known equivalence between reinforcement learning and Bayesian inference, long stuck in theory papers, suddenly becomes useful only when you're locked out of a model's weights
* How the method replaces gradient-based training with Sequential Monte Carlo: running a population of agent trajectories, scoring them with a small critic, and 'breeding' the winners
* Why learning the value function is just a cheap offline regression problem (2-3 hours, ~$140) rather than expensive online RL, because the big model stays frozen
* The headline result: on SciWorld, the no-gradient method crossed over GRPO, which had full access to the model's internals, and an 11B critic measurably improved GPT-5.1 without touching it
* The TextCraft failure case: when a model produces uniformly good, near-identical trajectories, there's nothing to select among and the method actually makes things worse
* The fine print behind 'beats GRPO': it depends on high trajectory counts, hand-tuned resampling schedules, parallel API calls that aren't free, and a value estimate that's only ever approximate
* 00:00 - The welded-shut hood problem
  Why the most capable agents run on API-only models you can't compute gradients for, and why prompt-fiddling and tuning a smaller open model are both unsatisfying escape hatches.
* 02:48 - RL is secretly Bayesian inference
  Building up the theoretical equivalence: the optimal leashed RL policy is exactly a Bayesian posterior, which means the question shifts from 'how do I train?' to 'how do I sample?'
* 05:37 - Particle filters and a beam of agents
  How Sequential Monte Carlo runs a population of trajectories forward, culls the weak ones, and clones the strong ones to converge on the target distribution.
* 08:26 - The cheap coach steering the frozen athlete
  Separating the frozen black-box agent from the small critic that learns a value function via offline regression, and the cost numbers that make it run on two desktop GPUs.
* 11:14 - The results that travel
  Beating GRPO on SciWorld without gradient access, a cheap model with the method outperforming an expensive one on Best-of-15, and an 11B critic nudging up GPT-5.1.
* 14:03 - One decisive fork in WebShop
  A concrete trace of the critic doing its job: pruning a stalled trajectory and cloning a promising one at a single resampling checkpoint.
* 16:52 - Where it gets softer than the headline
  The honest caveats: GRPO comparisons that depend on trajectory count, the TextCraft failure when trajectories are too uniform, hand-tuned schedules, non-free API calls, and an approximate critic.
* 19:41 - Why the constraint created the use case
  Why an always-true equivalence only became operationally decisive in the world of sealed models, and what that means for democratizing frontier-agent optimization.
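For listeners who want to see the cull-and-clone loop in code, here is a minimal toy sketch of Sequential Monte Carlo steering. It is not the paper's implementation. Everything named here is a hypothetical stand-in: `frozen_agent_step` plays the role of one black-box API call (a random walk), `critic_value` plays the role of the small learned critic (distance to a made-up goal), and `beta`, `resample_every`, and plain multinomial resampling are illustrative assumptions.

```python
import math
import random


def frozen_agent_step(state, rng):
    """Hypothetical stand-in for one black-box agent call: move -1, 0, or +1."""
    return state + rng.choice([-1, 0, 1])


def critic_value(state, goal=10):
    """Hypothetical stand-in for the learned critic: higher near the goal."""
    return -abs(goal - state)


def smc_rollout(n_particles=16, horizon=20, resample_every=4, beta=1.0, seed=0):
    """Run a population of trajectories and resample them at checkpoints."""
    rng = random.Random(seed)
    particles = [0] * n_particles
    # Critic value of each particle at its most recent checkpoint.
    last_values = [critic_value(s) for s in particles]

    for t in range(1, horizon + 1):
        # The frozen agent proposes the next state; no gradients are involved.
        particles = [frozen_agent_step(s, rng) for s in particles]

        if t % resample_every == 0:
            values = [critic_value(s) for s in particles]
            # Incremental weight: exp(change in critic value / beta).
            log_w = [(v - v0) / beta for v, v0 in zip(values, last_values)]
            top = max(log_w)  # subtract the max for numerical stability
            weights = [math.exp(lw - top) for lw in log_w]
            # Multinomial resampling: clone strong particles, drop weak ones.
            particles = rng.choices(particles, weights=weights, k=n_particles)
            last_values = [critic_value(s) for s in particles]

    return particles


if __name__ == "__main__":
    steered = smc_rollout(n_particles=200, resample_every=2, beta=0.5)
    unsteered = smc_rollout(n_particles=200, resample_every=10**6, beta=0.5)
    print("mean final state, steered:  ", sum(steered) / len(steered))
    print("mean final state, unsteered:", sum(unsteered) / len(unsteered))
```

Setting `resample_every` larger than `horizon` switches selection off, which gives a baseline of the frozen agent on its own. The toy also hints at the TextCraft caveat: if every particle earns the same critic value, the weights are uniform and resampling has nothing to select.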
RECOMMENDED READING
* Levine: Reinforcement Learning and Control as Probabilistic Inference [https://arxiv.org/abs/1805.00909] - The tutorial that lays out the exact RL-as-Bayesian-inference equivalence this episode's method exploits, explaining why the optimal leashed policy is a posterior.
* Sequential Monte Carlo Steering of Large Language Models using Probabilistic Programs [https://arxiv.org/abs/2306.03081] - Directly applies the particle-filter/SMC machinery the episode describes to steering frozen language models, the same 'sample the posterior instead of training' idea.
* DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [https://arxiv.org/abs/2402.03300] - Introduces GRPO, the gradient-based RL 'oracle' baseline the episode repeatedly measures the no-gradients method against.
* Training Verifiers to Solve Math Word Problems [https://arxiv.org/abs/2110.14168] - The work that popularized training a separate verifier/value model to score and select among a base model's trajectories, the conceptual ancestor of this episode's frozen-athlete-plus-coach setup.
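For reference, the equivalence behind the 'leashed policy is a posterior' claim can be written in one line. This is the standard KL-regularized form, offered as a sketch rather than the paper's exact notation. The symbols are assumptions introduced here: \pi_0 is the frozen agent's trajectory distribution, R(\tau) is the reward of trajectory \tau, and \beta > 0 sets how tight the leash is.

```latex
\[
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{\tau \sim \pi}\big[R(\tau)\big] \;-\; \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_{0}\big)
\quad\Longrightarrow\quad
\pi^{*}(\tau) \;=\; \frac{\pi_{0}(\tau)\,\exp\!\big(R(\tau)/\beta\big)}{Z},
\qquad
Z \;=\; \mathbb{E}_{\tau \sim \pi_{0}}\!\Big[\exp\!\big(R(\tau)/\beta\big)\Big].
\]
```

Read as Bayes' rule, \pi_0 is the prior, \exp(R(\tau)/\beta) acts as the likelihood, and Z is the evidence. Sampling from \pi^{*} therefore needs only draws from \pi_0 plus a way to weight them, which is the opening that a particle method uses.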