AI Papers: A Deep Dive

Same Tokens, Same Cost, Wildly Different Results: What Actually Scales in AI Agents

25 min · Yesterday

Description

Source: Scaling Laws for Agent Harnesses via Effective Feedback Compute [https://arxiv.org/abs/2605.29682]
Paper was published on May 28, 2026.

This episode was AI-generated on May 29, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Two AI agent runs spend identical tokens, make identical tool calls, and cost the same to the penny, yet one succeeds 27% of the time and the other 90%. A new paper argues the resource that actually scales agents isn't compute at all, but feedback that's validated, novel, and remembered. If they're right, the reflex to throw more budget at a struggling agent is often just buying more waste.

KEY TAKEAWAYS

* Why counting tokens, tool calls, and cost measures activity, not progress, and on real agent traces actually predicts worse than guessing the average (negative R-squared)
* Effective Feedback Compute: the four-factor score (informative, valid, non-redundant, retained) that's multiplied, not averaged, so missing any one factor zeroes out the whole event
* The matched-budget experiment that makes the causal case: identical spend on every axis, quality varied alone, success jumps from 27% to 90%
* Why there's no universally best agent harness: the fanciest scaffolding wins on code tasks but loses to simpler ones on software-engineering tasks
* The honest limitations: author-constructed feedback conditions, a curated slice of real benchmarks, and fitted task-demand weights, plus the prospective holdout that defends against curve-fitting
* The forward-looking payoff: because the metric can be estimated mid-run from the trace, you could cut off agents that are spinning and pour budget into the ones genuinely learning

* 00:00 The 27-versus-90 puzzle: Two runs that are twins on every spending meter produce radically different success rates, setting up the central question of what the difference actually is.
* 02:32 Why training scaling laws don't transfer to agents: The clean, predictable scaling curves of pretraining break down once you wrap a model in a harness that loops through plans, actions, and tool calls.
* 05:04 Activity is not progress: Why counting tokens can't tell a learning agent apart from one churning in place, dooming raw spending as a predictor.
* 07:36 Effective Feedback Compute and the four-factor product: The paper's core metric scores each feedback event on being informative, valid, non-redundant, and retained, and multiplies them so weak links snap the whole chain.
* 10:08 Task demand, feedback relative to thirst: Dividing the feedback score by how feedback-hungry a task is turns raw quantity into sufficiency, letting easy and hard tasks share one axis.
* 12:40 From a cloud of dots to a clean curve: In a controlled sandbox, activity measures explain only a third of the variance while the single feedback scalar fits the data nearly perfectly, including a planted high-budget-but-useless harness.
* 15:12 The matched-budget causal test: Pairs of runs with identical spending but different feedback quality move success by 63 points, ruling out the 'they just spent more' explanation.
* 17:45 Surviving contact with reality: An estimated trace-only version, real mixed benchmarks where activity metrics go negative, and a pre-registered prospective holdout each close off an excuse, though real soft spots remain.
* 21:41 No universally best harness: Efficiency turns out to be a harness-task interaction: deep harnesses dominate on code, everyone struggles on terminal tasks, and simpler harnesses win on software engineering.
* 22:49 Practical upshot and the adaptive-budget dream: Why more budget often buys more waste, and how a mid-run feedback estimate could let systems cut off dead runs and feed the ones actually making progress.
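
To make the four-factor product and the task-demand division concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the class and function names are hypothetical, each factor is assumed to be a score between 0 and 1, and summing event scores over a run is an assumed aggregation rule. Only the multiplication of the four factors and the division by task demand come from the episode description.

```python
from dataclasses import dataclass


@dataclass
class FeedbackEvent:
    """One feedback event; each factor is an assumed score in [0, 1]."""
    informative: float
    valid: float
    non_redundant: float
    retained: float


def event_score(e: FeedbackEvent) -> float:
    # Multiplied, not averaged: a zero in any factor zeroes the event.
    return e.informative * e.valid * e.non_redundant * e.retained


def average_score(e: FeedbackEvent) -> float:
    # Shown only for contrast with the product above.
    return (e.informative + e.valid + e.non_redundant + e.retained) / 4


def effective_feedback(events: list[FeedbackEvent]) -> float:
    # Assumption: a run's total is the sum of its per-event scores.
    return sum(event_score(e) for e in events)


def sufficiency(events: list[FeedbackEvent], task_demand: float) -> float:
    # Feedback relative to how feedback-hungry the task is.
    if task_demand <= 0:
        raise ValueError("task_demand must be positive")
    return effective_feedback(events) / task_demand


# A fully informative, valid, novel observation that is never retained.
forgotten = FeedbackEvent(1.0, 1.0, 1.0, 0.0)
# A moderately useful observation that is novel and retained.
useful = FeedbackEvent(0.5, 0.5, 1.0, 1.0)

print(average_score(forgotten))  # averaging would still credit this event
print(event_score(forgotten))    # the product gives it no credit
print(sufficiency([forgotten, useful], task_demand=0.5))
```

In this toy run the forgotten event contributes nothing under the product, although an average would rate it highly, which is the "weak links snap the whole chain" behavior the episode describes.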

RECOMMENDED READING

* Scaling Laws for Neural Language Models [https://arxiv.org/abs/2001.08361]: The original pretraining scaling-law paper that this episode uses as its baseline analogy (predictable curves from spending more compute) before arguing harness scaling needs a different x-axis.
* Training Compute-Optimal Large Language Models [https://arxiv.org/abs/2203.15556]: The Chinchilla paper that refined how to read scaling-law curves and tradeoffs, useful background for the episode's discussion of putting the right quantity on the x-axis.
* ReAct: Synergizing Reasoning and Acting in Language Models [https://arxiv.org/abs/2210.03629]: Defines the plan-act-observe loop that the episode calls the 'harness,' making it the concrete agent architecture whose feedback quality this paper measures.
* Reflexion: Language Agents with Verbal Reinforcement Learning [https://arxiv.org/abs/2303.11366]: A closed-loop agent that explicitly retains feedback across attempts, a direct instance of the 'retained' and 'non-redundant' factors the episode argues are multiplicative.
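
For listeners unsure how an R-squared can be negative, as the takeaways say happens with activity metrics on real agent traces, the short Python sketch below works through the standard definition, R² = 1 - SS_res / SS_tot. The outcome and prediction values are made up for illustration and are not taken from the paper.

```python
def r_squared(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


# Illustrative outcomes: 1 = run succeeded, 0 = run failed.
outcomes = [0, 1, 0, 1]

# Baseline: always guess the average success rate.
guess_average = [0.5, 0.5, 0.5, 0.5]

# A predictor that is confidently wrong on every run.
misleading = [1, 0, 1, 0]

print(r_squared(outcomes, guess_average))  # baseline scores exactly zero
print(r_squared(outcomes, misleading))     # worse than baseline: negative
```

Guessing the average always scores zero, so any predictor whose squared errors exceed that baseline's lands below zero. A negative value therefore means the metric is actively misleading rather than merely uninformative.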
