AI Papers: A Deep Dive
WHY MORE HUMAN DEMONSTRATIONS MADE A COMPUTER-USE AGENT WORSE

Source: ProCUA-SFT Technical Report [https://arxiv.org/abs/2606.17321]
Paper was published on June 15, 2026.

This episode was AI-generated on June 18, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

An NVIDIA team fed their computer-use agent the largest pile of real human demonstrations ever released, and watched its success rate fall from one task in four to one in ten. Then they threw the human data out entirely, let a single model generate its own training set, and nearly doubled the baseline. This episode digs into why the obvious fix backfired, and what the more defensible version of "synthetic beats human" actually is.

KEY TAKEAWAYS
* Why 22,500 real human demonstrations made the model substantially worse: too-easy single-app tasks, annotation noise, and negative transfer away from the cross-application reasoning the benchmark demands
* The structural fix at the heart of the paper: collapsing the planner and actor into one model so it never proposes goals it can't carry out, closing the capability gap by construction rather than by filtering
* How a 'mise en place' precondition-verification step stops the model from inventing tasks involving files and apps that don't exist, and why hallucinated tasks breed a hallucinating agent
* The counterintuitive diversity result: balancing training data by action type actively hurt, while balancing by application combination was the only strategy that beat the baseline
* Why the synthetic data teaches a more robust interaction style (more keyboard shortcuts, fewer brittle pixel-perfect clicks)
* The case for skepticism: the climb to 45% is really distillation from a strong teacher, everything is measured on OSWorld using data partly seeded from OSWorld's own configs, and the most novel idea, the verifier, has the least clean evidence behind it

* 00:00 - The collapse: gold-standard human data poisons the model
  Fine-tuning on 22,500 human demonstrations drops the agent from a 26% baseline to around 10% on OSWorld, setting up the puzzle the paper tries to solve.
* 02:11 - Why real human data caused negative transfer
  Three compounding reasons (tasks too easy, single-app, and crowd-sourced noise) explain why more real practice footage produced a worse agent.
* 04:23 - Infeasible tasks and the mise-en-place fix
  How a precondition checklist and verification pass stops the model from generating tasks involving files and apps that don't exist on the machine.
* 06:35 - Seeding realistic, cluttered desktops
  Loading machines with hundreds of messy real spreadsheets and thousands of clustered presentations to make hard cross-referencing tasks possible.
* 08:47 - Collapsing planner and actor into one model
  The central design move, in which one vision-language model proposes, judges, and executes, closes the planner-actor capability gap by construction.
* 10:59 - Turning one trajectory into many training samples
  Why each step of a run becomes its own example, matching the exact screen-and-history context the agent sees at inference rather than padding the dataset.
* 13:11 - The results and what diversity actually helps
  The model climbs to 45%, and a diversity experiment reveals that covering application combinations matters far more than balancing action types.
* 15:23 - Complexity, robustness, and the keyboard shift
  Why difficulty comes from cross-referencing patterns rather than app count, and why synthetic data's lean toward keyboard actions makes the agent more robust.
* 17:35 - The caveats: distillation, benchmark fit, and weak evidence for the verifier
  A measured critique that the gain may be distillation tuned to one benchmark, with the most novel idea, the precondition verifier, lacking a clean ablation.
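
The per-step expansion described in the 10:59 chapter can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: the `Step` and `Sample` types, the `max_history` window, and the example actions are all hypothetical names invented here to show the idea that every step of a run becomes its own training example, paired with the screen and the recent history the agent would see at that point.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    screenshot: str  # hypothetical: path or ID of the screen the agent saw
    action: str      # hypothetical: the action taken on that screen


@dataclass(frozen=True)
class Sample:
    task: str
    history: Tuple[str, ...]  # actions taken before this step (most recent last)
    screenshot: str           # the screen at this step
    target_action: str        # the action the model is trained to predict


def expand_trajectory(task: str, steps: List[Step], max_history: int = 3) -> List[Sample]:
    """Turn one trajectory into one training sample per step."""
    samples = []
    for i, step in enumerate(steps):
        prior = [s.action for s in steps[:i]]
        # Keep only the last `max_history` actions (an assumed context window).
        history = tuple(prior[-max_history:]) if max_history > 0 else ()
        samples.append(Sample(task, history, step.screenshot, step.action))
    return samples


# Usage: a three-step run yields three samples, each with its own context.
steps = [
    Step("s0.png", "hotkey ctrl+o"),
    Step("s1.png", "type report.xlsx"),
    Step("s2.png", "press enter"),
]
samples = expand_trajectory("Open report.xlsx", steps, max_history=2)
```

Each sample mirrors a single decision point, so the training input has the same shape as the inference input instead of presenting the whole run at once.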
RECOMMENDED READING
* OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [https://arxiv.org/abs/2404.07972]
  The exact benchmark this episode's results live and die on, essential for assessing the hosts' worry that gains might be fitting one test's app distribution.
* Distilling the Knowledge in a Neural Network [https://arxiv.org/abs/1503.02531]
  The foundational distillation paper behind Finn's central reframing that 'synthetic beats human' is really a strong teacher transferring competence into a smaller student.
* STaR: Bootstrapping Reasoning With Reasoning [https://arxiv.org/abs/2203.14465]
  A canonical example of a model generating its own training data and learning only from what it can successfully solve, the same self-bounded loop the episode credits as the structural fix.