AI Papers: A Deep Dive

An AI Just Solved a 1996 Erdős Problem—and the Simplest Agent Won

31 min · May 23, 2026

Description

Source: Advancing Mathematics Research with AI-Driven Formal Proof Search [https://arxiv.org/abs/2605.22763]
Paper was published on May 21, 2026.
This episode was AI-generated on May 22, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A Google DeepMind system autonomously cracked nine open Erdős problems, including one that sat unsolved for thirty years, for a few hundred dollars each, with proofs verified by the Lean compiler. The twist: the team's elaborate evolutionary search system was beaten on most problems by a twenty-line script that just iterates an LLM against a compiler. The implications for AI engineering go well beyond mathematics.

KEY TAKEAWAYS
* Why coupling an LLM to the Lean proof checker dissolves the trust problem in AI-generated mathematics, and where that guarantee actually ends
* How a 'Ralph loop' of LLM plus compiler plus retry matched a sophisticated evolutionary system with AlphaProof, tournament Elo ranking, and shared caches
* The actual proof idea behind Erdős problem 125, including how irrationality of log(4)/log(3) gets weaponized to crush sumset density to zero
* How the agent surfaced a thirty-year-old ambiguity in Erdős's original problem statement just by being forced to commit to a formal reading
* Where the verification guarantee leaks: LLM judges scoring proof sketches reward confident-sounding hallucinated citations, biasing the search upstream of the compiler
* Why the selection bias in the problem set, the cost of failed runs, and the human work of formalization make the headline numbers less clean than they look

* 00:00 - The trust problem in AI-generated math
  Why plausible-looking LLM proofs have been economically useless to working mathematicians, and how Lean's compiler is supposed to fix that.
* 03:52 - The Ralph loop and the basic agent
  A walkthrough of Agent A, the embarrassingly simple LLM-plus-compiler-plus-retry setup that did most of the work.
* 07:44 - Inside Erdős 125
  The metronome intuition behind the density-zero proof and how the agent decomposes subgoals and delegates to AlphaProof.
* 11:37 - The fancy system that mostly didn't win
  Evolutionary search with Elo-ranked proof sketches, a shared cache, and AlphaProof calls, and why it only paid off on the hardest problems.
* 15:29 - The ambiguity-surfacing side effect
  How formalizing Erdős 125 and 741 forced long-standing imprecisions in the informal statements into the open.
* 19:21 - A geometric proof that feels like a magic trick
  Erdős 846 and the agent's translation of a collinearity problem into graph-theoretic Ramsey territory.
* 23:14 - Steelmanning the skeptics
  Selection bias in the problem set, hidden costs of failed runs, the heavy lifting humans do in formalization, and the hallucinated-citation failure mode.
* 27:06 - What actually changed
  How the bottleneck shifts from verifying proofs to verifying problem statements, and what the 'simple loops beat scaffolding' finding might mean beyond math.

RECOMMENDED READING
* AlphaEvolve: A coding agent for scientific and algorithmic discovery [https://arxiv.org/abs/2506.13131]
  The evolutionary search ancestor of the Agent C/D system discussed in the episode, providing context for the 'fancy scaffolding' that the basic Ralph loop ended up matching.
* Mathematical discoveries from program search with large language models (FunSearch) [https://doi.org/10.1038/s41586-023-06924-6]
  The original DeepMind work establishing LLM-driven search for new mathematical results, which the episode positions as the lineage that Agent D descends from.
* Solving olympiad geometry without human demonstrations (AlphaGeometry) [https://doi.org/10.1038/s41586-023-06747-5]
  A useful contrast to the episode's framing of olympiad problems as 'the easier version': it shows what tightly-scaffolded, domain-specific provers achieved before frontier LLMs closed the gap.
* The Lean Mathematical Library (Mathlib) [https://arxiv.org/abs/1910.09336]
  The community formalization library whose maturity the episode credits as one of the four necessary ingredients for the paper's results.
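
The 'Ralph loop' named in the takeaways (LLM plus compiler plus retry) can be pictured with a short Python sketch. This is an illustration of the general pattern only, not the paper's script: `propose` and `check` are hypothetical stand-ins for an LLM call and a Lean compiler call, and the retry budget is an arbitrary assumption.

```python
from typing import Callable, Optional, Tuple


def ralph_loop(
    propose: Callable[[str, str], str],
    check: Callable[[str], Tuple[bool, str]],
    statement: str,
    max_attempts: int = 8,
) -> Optional[str]:
    """Retry until the checker accepts a candidate or the budget runs out.

    propose(statement, feedback) -> candidate proof text (hypothetical LLM call)
    check(candidate) -> (accepted, error_message)   (hypothetical compiler call)
    """
    feedback = ""
    for _ in range(max_attempts):
        candidate = propose(statement, feedback)
        accepted, feedback = check(candidate)
        if accepted:
            # Only compiler-accepted output is ever returned.
            return candidate
    return None  # budget exhausted without an accepted proof
```

The point of the design is that the loop never trusts the model's own claim of success: the checker's verdict is the only exit condition, and its error message is the only thing fed back.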

All episodes

83 episodes


When Reasoning Models Decide Before They Think: Detecting and Fixing Premature Confidence

Source: Understanding and Mitigating Premature Confidence for Better LLM Reasoning [https://arxiv.org/abs/2605.24396]
Paper was published on May 23, 2026.
This episode was AI-generated on May 26, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A new paper argues that much of the impressive-looking 'chain of thought' in reasoning models is decorative: the answer gets fixed at the first token and the rest is rationalization. The authors show how to detect this cheaply, turn the detection into a training signal that triples accuracy on hard problems, and, surprisingly, make models more honest about misleading inputs as a side effect.

KEY TAKEAWAYS
* A simple probing diagnostic: truncate a chain of thought at several points and check whether the model already commits to its final answer. Flat-high confidence from the start reliably indicates 'premature' reasoning with ~2.8x more logical flaws
* Why outcome-based RL converges on premature confidence as a local optimum, especially on hard problems where genuine reasoning rarely appears in the rollout distribution
* How the confidence trajectory itself can replace expensive process reward models, yielding 19% → 61% accuracy on hard Countdown and matching vanilla GRPO with half the sampling budget
* A striking scaling finding: larger pretrained Qwen3 models show monotonically more premature confidence, suggesting bigger models pattern-match harder rather than reason more
* Faithfulness improves as a free side effect: rates of acknowledging misleading hints rise from ~15% to ~22% on AIME, with implications for chain-of-thought oversight
* Honest limitations: the training reward uses the gold answer (partially an outcome signal in disguise), the weighting scheme assumes linear confidence growth, and absolute accuracies still leave large gaps

* 00:00 - The diagnostic: probing confidence along the chain
  How truncating chains of thought at evenly-spaced checkpoints reveals two distinct shapes: progressive reasoning versus flat, premature commitment.
* 03:04 - Evidence that premature chains are doing less work
  Across four benchmarks and two strong models, premature chains contain about 2.8x more logical flaws, even among chains that reach the correct answer.
* 06:08 - Turning the diagnostic into a training signal
  How the authors collapse the confidence trajectory into a scalar penalty and bolt it onto GRPO without needing step-level human annotations.
* 09:12 - Results: accuracy, reasoning quality, and sample efficiency
  Substantial gains on hard Countdown and AIME, a near-halving of flawed-chain rates, and effective doubling of sampling efficiency on math training.
* 12:16 - The scaling finding and why bigger may mean worse
  Pretrained Qwen3 models at 1.7B, 4B, and 8B parameters show premature confidence rising monotonically with scale, a possible reframing of how scale interacts with reasoning.
* 15:20 - Faithfulness as a side effect
  Why penalizing early commitment also makes models more likely to acknowledge misleading hints, connecting the result to chain-of-thought oversight debates.
* 18:24 - Pushing back: where the paper might be overclaiming
  Entanglement with outcome reward, the single fixed weight vector, monitor dependence, and the gap between multiplicative gains and absolute performance.
* 21:29 - The broader thread: models as their own supervisors
  How this paper fits into a growing line of work that uses a model's own intermediate behavior as a cheap, scalable supervision signal.

RECOMMENDED READING
* Measuring Faithfulness in Chain-of-Thought Reasoning [https://arxiv.org/abs/2307.13702]
  Tamera Lanham et al.'s foundational work on early-answering interventions and CoT faithfulness, which the episode explicitly names as conceptually adjacent to this paper's probing trick.
* Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning (MRT) [https://arxiv.org/abs/2503.07572]
  The concurrent work the episode mentions that also uses intermediate confidence as an RL signal, but aimed at test-time efficiency rather than reasoning faithfulness, making for an instructive contrast.
* DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [https://arxiv.org/abs/2402.03300]
  Introduces GRPO, the group-relative RL algorithm the episode's reward-shaping method modifies; useful background for understanding what 'progressive confidence shaping' is actually plugging into.
* Let's Verify Step by Step [https://arxiv.org/abs/2305.20050]
  OpenAI's canonical process reward model paper: the expensive annotation-heavy approach this episode's method tries to sidestep by mining the supervision signal from the model's own confidence trajectory.
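
The probing diagnostic described in the takeaways (truncate the chain at evenly spaced checkpoints and ask how committed the model already is) might look roughly like the Python sketch below. Everything here is an assumption for illustration: `confidence_in_final` is a hypothetical callable standing in for a model probe, and the 0.8 threshold and four checkpoints are arbitrary choices, not values from the paper.

```python
from typing import Callable, List


def confidence_trajectory(
    chain: List[str],
    confidence_in_final: Callable[[List[str]], float],
    checkpoints: int = 4,
) -> List[float]:
    """Probe confidence in the final answer at evenly spaced truncation points.

    confidence_in_final(prefix) is a hypothetical probe returning the model's
    probability of its eventual answer given only the first steps of the chain.
    """
    n = len(chain)
    cuts = [round(i * n / checkpoints) for i in range(checkpoints + 1)]
    return [confidence_in_final(chain[:cut]) for cut in cuts]


def is_premature(trajectory: List[float], threshold: float = 0.8) -> bool:
    """Flat-high shape: confidence is above the threshold at every probe,
    including the one taken before any reasoning step."""
    return min(trajectory) >= threshold
```

A progressive chain would show the trajectory climbing from low to high; a premature one is already high at the empty prefix.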

May 26, 2026 · 24 min

Training a Deep Research Agent on 8,000 Synthetic Tasks: The Rubric Tree Trick

Source: QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks [https://arxiv.org/abs/2605.24218]
Paper was published on May 22, 2026.
This episode was AI-generated on May 26, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

An academic lab just matched OpenAI's Deep Research on several benchmarks using only 8,000 training examples, a recipe small enough to fit on one figure. The key move is a single data structure, the rubric tree, that unifies fact-seeking and report-writing into one training signal. We dig into how it works, what it actually unlocks, and the proprietary teacher stack quietly sitting underneath the word "fully synthetic."

KEY TAKEAWAYS
* Why the rubric tree works as one primitive for both obscure-fact tasks and open-ended report writing, and how it doubles as a synthesis target, a supervision filter, and an RL reward
* The context condenser as an epistemic state machine: trusted facts, contradicted claims, and open leads paired with the next action
* Why supervised fine-tuning actually hurts open-ended report quality, and how RL with GRPO recovers it
* The capped reward design that prevents the agent from gaming citation credit against task completion
* Why "fully synthetic" really means "no human annotation", and why the open-weight model is effectively a distillation of an ensemble of closed frontier models
* A 2B-parameter model that beats o3 on GAIA and HLE fact-seeking, with the asterisk that it collapses on report writing

* 00:00 - What a deep research agent actually is
  How autonomous research loops differ from RAG, and why the field has been fragmented across three capabilities: fact seeking, citation grounding, and report synthesis.
* 03:53 - The rubric tree as a unifying primitive
  How a hierarchical, auto-checkable rubric replaces the 'single verifiable answer' format and lets one pipeline cover both objective and open-ended tasks.
* 09:31 - The context condenser
  Compressing hundreds of tool calls into a structured state of trusted, contradicted, and uncertain claims, and why training-time and inference-time artifacts have to match.
* 11:40 - Mid-training, SFT, and RL
  Walking through the three-stage pipeline and the capped reward formula that ties task completion to citation faithfulness.
* 15:34 - Ablations and the alignment tax
  Why SFT alone regresses report quality, why RL recovers it, and the unresolved tradeoff between open-ended performance and hard reasoning benchmarks.
* 19:27 - What didn't work
  The unsuccessful attempts section: pointwise judge bias, DPO collapse on long reports, and why relative scoring against a reference was the move that survived.
* 23:21 - The teacher dependency steelman
  What 'fully synthetic' actually means when the rubric generator, judge, fact-checker, and SFT teacher are all proprietary frontier models.
* 27:15 - What the paper unlocks
  Why the rubric tree is the durable contribution, what a laptop-deployable research agent enables, and how to read future papers in this line.

RECOMMENDED READING
* Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge [https://arxiv.org/abs/2506.21506]
  The Ohio State group's earlier hand-crafted precursor to QUEST, where the rubric-tree evaluation primitive was first developed before being automated in this episode's paper.
* DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [https://arxiv.org/abs/2501.12948]
  Popularized the GRPO algorithm (first introduced in DeepSeekMath) whose relative-advantage scoring QUEST uses, and which the episode connects to QUEST's relative-to-reference judging trick.
* Tongyi DeepResearch Technical Report [https://arxiv.org/abs/2510.24701]
  Alibaba's open deep research agent, which serves as QUEST's SFT teacher and is the fact-seeking-strong, report-weak baseline the episode contrasts against.
* GAIA: A Benchmark for General AI Assistants [https://arxiv.org/abs/2311.12983]
  The hard-reasoning agent benchmark the episode repeatedly cites when discussing QUEST's small-model results and the alignment-tax tradeoff from RL training.
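
To make the 'hierarchical, auto-checkable rubric' idea concrete, here is a minimal Python sketch of a rubric tree that scores a report. It is an illustration under stated assumptions, not QUEST's implementation: the `RubricNode` class, the string-matching leaf checks, and the unweighted averaging of children are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RubricNode:
    """One rubric criterion; leaves carry an automatic check (hypothetical design)."""
    name: str
    check: Optional[Callable[[str], bool]] = None  # set on leaves only
    children: List["RubricNode"] = field(default_factory=list)

    def score(self, report: str) -> float:
        if not self.children:
            # Leaf: 1.0 if the automatic check passes, else 0.0.
            return 1.0 if self.check is not None and self.check(report) else 0.0
        # Internal node: unweighted mean of child scores (an assumption).
        return sum(child.score(report) for child in self.children) / len(self.children)


rubric = RubricNode("root", children=[
    RubricNode("states founding year", check=lambda r: "2019" in r),
    RubricNode("grounding", children=[
        RubricNode("cites a source", check=lambda r: "[source]" in r),
        RubricNode("notes a limitation", check=lambda r: "limitation" in r),
    ]),
])
```

A single-fact task is then just a tree with one leaf, while a report task is a deeper tree, which is one way to read the claim that the same structure covers both.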

May 26, 2026 · 31 min

Why Long-Context Models Might Need Compute, Not Capacity, Before Eviction

Source: Language Models Need Sleep [https://arxiv.org/abs/2605.26099]
Paper was published on May 25, 2026.
This episode was AI-generated on May 26, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

For two years the long-context modeling community has been arguing about how much information you can squeeze into a fixed-size memory. A new paper says that's the wrong axis entirely: the bottleneck isn't how big the whiteboard is, it's how much thinking happened while writing on it. The fix is a 'sleep' phase that loops compute over context right before the cache gets cleared, with no cost at answer time.

KEY TAKEAWAYS
* The reframe at the heart of the paper: a hybrid model's fast weight isn't a storage device, it's the residue of a one-pass computation, and shallow computation produces shallow residue regardless of capacity
* Why the Rule 110 cellular automaton experiment is unusually clean: it holds stored information constant while varying required computation, isolating compute-for-reasoning from memory-for-storage
* The deployment win: extra 'sleep' compute is paid during ingestion, not at answer time, so inference latency is unchanged while training cost scales linearly with loop count N
* Concrete gains: two-operation GSM-Infinite problems jump from ~60% to ~90% accuracy with four sleep loops in the sliding-window setting; harder six-operation problems on Ouro go from ~42% to ~62%
* The honest limits: the real-task gains tangle 'reasoning' with 'retrieval under constrained windows,' comparisons are mostly against the no-loop version of the same architecture, and the method needs careful two-stage training to work
* Why the conceptual contribution may outlast the specific mechanism: it splits inference into a compute-rich ingestion phase and a latency-constrained answer phase, a framing likely to show up in other architectures

* 00:00 - The Polaroid problem and the notebook-vs-whiteboard setup
  A chess thought experiment introduces the gap between storing a position and computing forward from it, then frames how attention's exact notebook and SSMs' lossy whiteboard have been combined in hybrid models.
* 03:27 - The reframe: fast weights are computations, not storage
  The authors' core move: that the community has been optimizing capacity when the real bottleneck is how much thinking went into producing the compressed state.
* 06:55 - Sleep as depth-recurrence at eviction time
  How looping the network N times over a context chunk before clearing the cache buys reasoning depth, with hippocampal consolidation and kitchen-prep analogies for why offline work pays off.
* 10:23 - The Rule 110 experiment
  A walkthrough of the cellular automaton test bed that holds storage requirements constant while varying required computation, and why the result is unusually clean for deep learning.
* 13:50 - Does the result transfer to real tasks?
  Graph traversal and GSM-Infinite results on Jet-Nemotron and Ouro show the same pattern, with a candid look at how 'reasoning gain' starts to blur with 'retrieval gain' outside synthetic settings.
* 17:18 - The skeptic's checklist
  Where the evidence is weaker: tautology concerns on Rule 110, missing comparisons against alternative uses of the same compute budget, and a method that requires careful two-stage training warm-up.
* 20:46 - What changes about how we think about inference
  Why the conceptual contribution (splitting inference into compute-rich ingestion and latency-bound answering) may outlive the specific mechanism, and how it connects to related sleep-time compute work.

RECOMMENDED READING
* Universal Transformers [https://arxiv.org/abs/1807.03819]
  The canonical depth-recurrence paper the episode references: it loops transformer layers at inference time, which this episode contrasts with loops at ingestion time.
* Sleep-time Compute: Beyond Inference Scaling at Test-time [https://arxiv.org/abs/2504.13171]
  The Lin et al. work Bella name-checks as a parallel 'do offline work before queries arrive' proposal with a totally different mechanism.
* Mamba: Linear-Time Sequence Modeling with Selective State Spaces [https://arxiv.org/abs/2312.00752]
  Background on the state-space 'whiteboard' that the episode's hybrid models rely on, useful for understanding what the fast weight actually is.
* Deep Equilibrium Models [https://arxiv.org/abs/1909.01377]
  Another point of reference for depth-recurrent architectures, helpful for situating the paper's loop-until-converged framing within a broader research lineage.
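
For readers unfamiliar with Rule 110, the short Python sketch below shows why it separates storage from computation: the state is the same size at every step, yet knowing the row after k steps requires applying the update k times in sequence. The rule table itself is standard; the zero-padded boundary and the helper names are assumptions for this illustration and may differ from the paper's test bed.

```python
from typing import List

RULE = 110  # binary 01101110: bit p is the output for neighbourhood pattern p


def step(cells: List[int]) -> List[int]:
    """Apply one Rule 110 update; cells beyond either edge are treated as 0."""
    padded = [0] + cells + [0]
    return [
        # Pack (left, centre, right) into a 3-bit index and look up its bit in RULE.
        (RULE >> (padded[i - 1] << 2 | padded[i] << 1 | padded[i + 1])) & 1
        for i in range(1, len(padded) - 1)
    ]


def run(cells: List[int], steps: int) -> List[int]:
    """Apply the update `steps` times; work grows with steps, storage does not."""
    for _ in range(steps):
        cells = step(cells)
    return cells
```

For example, `run([0, 0, 0, 1, 0], 3)` passes through `[0, 0, 1, 1, 0]` and `[0, 1, 1, 1, 0]` before reaching `[1, 1, 0, 1, 0]`; each row has five cells, but the last one cannot be read off the first without doing the intermediate work.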

May 26, 2026 · 24 min

Terminal Agents Get Free Supervision From The Tokens We've Been Throwing Away

Source: ECHO: Terminal Agents Learn World Models for Free [https://arxiv.org/abs/2605.24517]
Paper was published on May 23, 2026.
This episode was AI-generated on May 26, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Standard agent RL throws away 85% of rollouts because the task didn't succeed, but the terminal's responses inside those failed runs contain dense, gradable supervision that nobody was using. A new Microsoft Research paper shows that adding a simple next-token loss on environment outputs roughly doubles task success, recovers most of the value of expensive expert demonstrations, and in some cases lets models improve with no reward signal at all.

KEY TAKEAWAYS
* Why agent RL's reward sparsity is partly an artifact of which tokens we compute loss on, not a property of the task
* How ECHO's one-line addition, cross-entropy on terminal output tokens, roughly doubles TerminalBench 2.0 pass rates at 8B and 14B scale
* The lambda=0.2 collapse: when the auxiliary weight is too high, models learn to issue boring commands whose outputs are easy to predict
* Why ECHO can substitute for the 'interaction prior' half of expert demonstrations but not the 'strategy prior' half
* The verifier-free result (improvement with no reward signal on some held-out tasks, and active regression on others) and what that tells us about when prediction-as-learning works
* Honest limits: small absolute numbers, untested at higher base capability, and a 'world model' claim that rests on a single transfer experiment

* 00:00 - The supervision that was already in the rollout
  Framing the core observation: failed agent trajectories contain thousands of environment tokens whose gradients GRPO masks out.
* 03:13 - What ECHO actually changes
  The one-line addition of next-token loss on terminal outputs, and the chess-student analogy for why predicting the environment forces understanding.
* 06:27 - The headline numbers, honestly
  Roughly doubled pass rates at 8B and 14B on TerminalBench 2.0, on a baseline of 2-5%, with timeouts cut in half and faster convergence.
* 09:41 - Which tokens to predict, and the lambda collapse
  Why warning messages had to be excluded, and how setting the auxiliary loss weight too high causes models to game the prediction objective with trivial commands.
* 12:55 - Substituting for expert demonstrations
  ECHO from a raw base model recovers most of the value of 15,000 GLM-4.6 demonstrations, but only the interaction-prior half, not the strategy half.
* 16:09 - Transfer evidence and the world-modeling claim
  ECHO models predict Qwen3-32B's trajectories far better than GRPO baselines, suggesting transferable knowledge of terminal dynamics, though what specifically transferred isn't probed.
* 19:22 - The verifier-free experiment
  Turning off the reward signal entirely and letting environment prediction alone drive improvement, which works on PyTerm, fails on TBLite, and reveals when the method needs action-linked feedback.
* 22:36 - Steelman, limits, and what to test next
  Five honest caveats about the result and the open question of whether ECHO generalizes beyond terminals and beyond low-capability base models.

RECOMMENDED READING
* Group Relative Policy Optimization (DeepSeekMath) [https://arxiv.org/abs/2402.03300]
  Introduces the GRPO algorithm that ECHO modifies; essential background for understanding what 'masking out the terminal tokens' actually means in the baseline.
* Curiosity-driven Exploration by Self-supervised Prediction [https://arxiv.org/abs/1705.05363]
  The canonical prior work on learning from prediction error as an intrinsic signal, which the episode's verifier-free result echoes in a language-model setting.
* Reinforcement Learning with Unsupervised Auxiliary Tasks (UNREAL) [https://arxiv.org/abs/1611.05397]
  A foundational example of adding auxiliary prediction losses to RL agents, useful for contextualizing ECHO against the deeper history of dense-supervision methods the paper doesn't directly compare to.
* SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [https://arxiv.org/abs/2310.06770]
  Sets the benchmark context for the kind of terminal-agent task ECHO is trying to improve, and frames why doubling a 5% pass rate matters even though the absolute numbers stay small.
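
The 'one-line addition' described in the takeaways, a cross-entropy term on terminal output tokens weighted by lambda, can be sketched in plain Python. This is an illustrative guess at the shape of the objective, not ECHO's code: the function name, the mean reduction over environment tokens, and the additive combination `policy_loss + lam * aux` are assumptions.

```python
from typing import List


def echo_style_loss(
    policy_loss: float,
    token_nll: List[float],
    is_env_token: List[bool],
    lam: float,
) -> float:
    """Combine an RL policy loss with an auxiliary next-token loss.

    token_nll[i] is the negative log-likelihood the model assigns to token i;
    is_env_token[i] marks tokens produced by the terminal rather than the agent.
    Only environment tokens contribute to the auxiliary term (assumed mean-reduced).
    """
    env_nll = [nll for nll, is_env in zip(token_nll, is_env_token) if is_env]
    aux = sum(env_nll) / len(env_nll) if env_nll else 0.0
    return policy_loss + lam * aux
```

Read this way, the lambda collapse mentioned above is intuitive: with a large `lam`, the cheapest way to shrink the total is to make the environment's output predictable, for instance by issuing trivial commands.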

May 26, 2026 · 25 min

How a Two-Agent Trick Unlocked Large-Scale Training for Computer-Use Agents

Source: CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents [https://arxiv.org/abs/2605.25624]
Paper was published on May 25, 2026.
This episode was AI-generated on May 26, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Computer-use agents have been stuck while math and code models have soared, and a new paper argues the bottleneck was never the algorithm, it was the missing data pipeline. The fix turns on one elegant design choice: put an information barrier between the AI that builds the training environment and the AI that writes the reward function. The result is the largest open verified dataset for GUI agents, big benchmark gains, and an unexpected behavior the agents picked up entirely on their own.

KEY TAKEAWAYS
* Why verifiable RL has scaled beautifully in math and code but stalled in computer-use agents, and why that's an environment problem, not an algorithm problem
* The Generator/Discriminator information-barrier trick that prevents AI-written reward functions from secretly checking the construction procedure instead of the task
* How 94 synthesized mock applications (Slack, Jira, Salesforce, EHR, etc.) get built and verified, grounded in real software-usage data rather than convenience
* An ablation suggesting environment diversity is its own scaling axis: the same trajectory count spread across more environments meaningfully outperforms
* An unprompted emergent behavior: trained agents learn which UI actions are safe to batch and which (like right-click) must stay atomic, cutting trajectory length 33–45% with no efficiency reward
* Where the paper's framing is hotter than its evidence: transfer to out-of-distribution benchmarks is modest, runs are single-seed, and reward functions verify end-state only

* 00:00 - Why GUI agents are harder than math problems
  The structural reason computer-use training data is orders of magnitude smaller than math or code: each example requires a task, an executable environment, and a programmatic reward, all coupled and expensive.
* 03:31 - The information barrier between Generator and Discriminator
  The paper's central design move: keeping the agent that builds the environment separate from the one that writes the reward, so the reward describes the outcome rather than the construction procedure.
* 07:03 - The inner loop and the reward-hacking scanner
  How the iteration between agents converges on a verified tuple, and the six forbidden code patterns a static scanner catches before a tuple is accepted.
* 10:34 - Synthesizing 94 mock applications at scale
  Why real websites can't host RL training, how the team picks which apps to build using occupational and software-usage data, and how a uniform state API lets the same pipeline drop in across all of them.
* 14:06 - Training results and the data-vs-model tradeoff
  The Qwen MoE backbones, the GRPO-style algorithm, the OSWorld-Verified gains, and the striking finding that a 10x smaller trained model matches a much larger untrained one.
* 17:37 - Environment diversity as a separate scaling axis
  The ablation showing that spreading the same number of trajectories across more environments beats concentrating them, and what that implies about hidden ceilings in prior work.
* 21:09 - The emergent action-batching behavior
  How trained agents spontaneously learn to bundle predictable action sequences while keeping unpredictable ones (right-click, double-click) atomic, with no efficiency signal in the reward.
* 24:40 - Limitations and honest caveats
  Modest out-of-distribution transfer, simplified mocks that omit auth and failure modes, single-seed RL runs, and reward functions that check end state but not process.
* 28:12 - Why this paper reframes the agenda
  The broader shift from algorithmic cleverness to environment infrastructure, and why the information-barrier idea is likely to keep reappearing wherever AI agents generate training data for other AI agents.

RECOMMENDED READING
* OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [https://arxiv.org/abs/2404.07972]
  The OSWorld-Verified benchmark that the episode's headline results are measured against; useful context for what these GUI agents are actually being tested on.
* WebArena: A Realistic Web Environment for Building Autonomous Agents [https://arxiv.org/abs/2307.13854]
  The out-of-distribution browser benchmark used to test transfer in the paper, relevant to the episode's discussion of how much the trained skills actually generalize.
* DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [https://arxiv.org/abs/2501.12948]
  A canonical example of the verifiable-rewards RL recipe the episode argues is now being ported to GUI agents, with the same group-relative algorithm family.

May 26, 2026 · 31 min