
AI Papers: A Deep Dive

Podcast by paperdive.ai

English

Technology and science

About AI Papers: A Deep Dive

Long-form deep dives into new research on Artificial Intelligence, AI agents and the engineering practice of building them - one paper per episode. We unpack the motivating problem, how the method actually works, the math that matters, what the experiments do and don't show, and the strongest critique against the result. The goal isn't a five-minute summary; it's the kind of conversation you'd have with a colleague who actually read the paper. Topics span large language models, autonomous agents, agentic coding, reinforcement learning for agent training, evaluation and benchmarks, alignment, and the practical engineering decisions that make agentic systems actually work in production. Most papers are pulled from arXiv, often within days of release. Hosted by AI voices generated with ElevenLabs. Episode scripts are produced by a multi-stage Claude pipeline working from a close reading of the source paper. New episodes daily.

All episodes

69 episodes

When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions

Source: Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most [https://arxiv.org/abs/2605.22672]
Paper was published on May 21, 2026.
This episode was AI-generated on May 22, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Claude Opus 4.6 looked at Brazil's 1986 hyperinflation, correctly named the regime-change risk, then forecast a number seven million times too high. A new paper shows this isn't a fluke: it's a structural pattern across epidemics, housing bubbles, and decades of measles data, and the standard way the field grades LLM forecasts can't see it.

KEY TAKEAWAYS
* Why the same model outputs can earn opposite verdicts (capable models look best under Brier-style scoring and worst under CRPS), and what that means for every existing LLM forecasting benchmark
* The specific trigger for the inversion: superlinear growth followed by a regime change, confirmed by a clean linear-growth control where the effect vanishes entirely
* A within-family Llama experiment showing scale and post-training each independently make the overcommitment worse, and compound when combined
* The unselected pre-vaccine US measles cohort (1,339 state-seasons) that rules out the 'you cherry-picked the crashes' objection, plus flu as a pre-registered negative control
* Why naming the historical episode rescues calibration for COVID and housing but fails completely for hyperinflation: the knowledge is in the model, but it doesn't reach the tails
* The one-line fix: report a tail-integrating proper scoring rule alongside threshold metrics, using forecasts benchmarks have already collected

* 00:00 - The Opus 4.6 hyperinflation moment
  A frontier model articulates the regime-change possibility, weighs it on the page, and then commits to extrapolation anyway, overshooting reality by a factor of seven million.
* 03:47 - Two ways to grade a forecast
  Distributional forecasts, threshold-based Brier scoring versus tail-integrating CRPS, and the weather-map analogy that makes the asymmetry click.
* 07:35 - The Freeciv benchmark and the first crack
  A clean, unseen forecasting setup where binary and continuous versions of the same question yield opposite capability-accuracy correlations at long horizons.
* 11:22 - The synthetic epidemic and its linear-growth control
  An exponential-then-crash simulator reproduces the inversion, and swapping in linear growth makes it vanish, pinning the mechanism to the bend-then-break shape.
* 15:10 - Competence-driven overcommitment
  Per-quantile decomposition shows the lower tail stays flat while the upper tail balloons with capability, and the within-family Llama 2x2 confirms scale and post-training each contribute.
* 18:57 - Real-world replications and the measles test
  COVID, housing, and hyperinflation replicate the pattern, but the unselected measles cohort and a pre-registered flu negative control are what make the result hard to dismiss.
* 22:45 - The verdict flip and what the model knows
  The same forecasts graded two ways reverse the sign of the capability correlation, and a knowledge probe reveals models can name the crisis they're forecasting yet still produce extreme tail overshoots.
* 26:33 - Limitations, the steelman, and the fix
  Honest pushback on the capability axis, the bundled post-training treatment, and the small hyperinflation sample, followed by the embarrassingly simple methodological recommendation.

RECOMMENDED READING
* Are Emergent Abilities of Large Language Models a Mirage? [https://arxiv.org/abs/2304.15004]
  The Schaeffer et al. paper the episode invokes directly. It argues that metric choice can manufacture apparent emergent abilities, setting up this episode's darker mirror claim that metric choice can also hide failures.
* Inverse Scaling Prize: Second Round Winners (McKenzie et al.) [https://arxiv.org/abs/2306.09479]
  The original taxonomy of inverse-scaling failures the episode contrasts with. Useful for understanding why this paper's forecasting failure is structurally different from earlier adversarial cases.
* Inverse Scaling Can Become U-Shaped [https://arxiv.org/abs/2211.02011]
  Wei et al.'s follow-up showing many inverse-scaling tasks recover at frontier scale, the precise counterpoint to this episode's claim that the forecasting failure is monotonic all the way to the frontier.

Yesterday - 30 min
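
As a rough illustration of the Brier-versus-CRPS point in this episode, the sketch below scores two toy sample-based forecasts both ways. The sample values, the threshold, and the function names `brier_score` and `crps` are assumptions made up for illustration; none of them come from the paper. Both forecasts put the same share of samples above the threshold, so the threshold score cannot tell them apart, while the sample-based CRPS penalizes the one with the runaway upper tail.

```python
from itertools import product


def brier_score(samples, outcome, threshold):
    """Brier score for the binary event 'value exceeds threshold'."""
    p = sum(x > threshold for x in samples) / len(samples)  # forecast probability
    o = 1.0 if outcome > threshold else 0.0                 # what actually happened
    return (p - o) ** 2


def crps(samples, outcome):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    term1 = sum(abs(x - outcome) for x in samples) / n
    term2 = sum(abs(a - b) for a, b in product(samples, repeat=2)) / (n * n)
    return term1 - 0.5 * term2


# Toy forecasts (hypothetical numbers): identical except for one extreme sample.
cautious = [80, 90, 100, 110, 120]
overcommitted = [80, 90, 100, 110, 1_000_000]
outcome, threshold = 100, 95

for name, fc in [("cautious", cautious), ("overcommitted", overcommitted)]:
    print(name, brier_score(fc, outcome, threshold), crps(fc, outcome))
```

In this toy setup the two forecasts tie on the threshold metric while their CRPS values differ by several orders of magnitude, which is the kind of disagreement the episode describes.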
When Models Know the Answer But Say the Wrong Thing Anyway

Source: Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer [https://arxiv.org/abs/2605.22007]
Paper was published on May 21, 2026.
This episode was AI-generated on May 22, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A new paper shows that up to 47% of hallucinations from large instruction-tuned LLMs happen when the model already has the correct answer sitting in its probability distribution; it just commits to a wrong token instead. Even stranger: the bigger the model, the more often this happens, and the mechanism that causes it is the same one that makes these models feel helpful and decisive in the first place.

KEY TAKEAWAYS
* Why 'Saint,' 'St,' and 'Cathedral' splitting the vote can hand the wrong answer a plurality win under greedy decoding
* How P-mass (summing probability across spellings of the same concept) reframes hallucinations from a knowledge problem to a commitment problem
* Why the fraction of these 'commitment failure' hallucinations climbs monotonically with scale, from 16% to 47%
* The Instruct-vs-base comparison showing instruction tuning, not scale itself, is what sharpens confidence in wrong tokens
* Why uncertainty-based hallucination detection has a structural ceiling it cannot cross
* Why the obvious fix, concept-aware decoding, only recovers about 2% of failures in the Instruct models we actually deploy

* 00:00 - The Saint Basil's example
  A concrete case where the model's belief is concentrated on the right answer but greedy decoding picks the wrong token anyway.
* 02:41 - From token entropy to P-mass
  Reframing the analysis by pooling probability across surface forms of the same concept, and defining 'commitment failure' hallucinations.
* 05:23 - The dispersion mechanism
  Same total belief in the right concept, but in failures it's spread across spellings 26% vs 78%, a clean three-to-one signature.
* 08:04 - Instruction tuning as the causal variable
  Comparing Instruct and base models of identical size shows the sharpening of wrong tokens is driven by helpfulness training, not scale.
* 10:46 - The decisive witness and the high-gain microphone
  Two analogies for why confident correctness and confident hallucination are the same disposition viewed from different angles.
* 13:27 - Limitations and scope
  P-mass requires ground truth, the 0.2 threshold is a choice, and the setup is short-form QA with greedy decoding, the cleanest regime for the effect.
* 16:09 - A unifying account of the alignment tax
  How commitment sharpening offers a single mechanism behind accuracy loss, calibration loss, mode collapse, and confident hallucination.
* 18:50 - Phrase-level commitment and what comes next
  Evidence that sharpening cascades across tokens, why simple concept-clustering won't fix it, and what a meaning-level training objective might require.

RECOMMENDED READING
* Training language models to follow instructions with human feedback [https://arxiv.org/abs/2203.02155]
  The Ouyang et al. paper that introduced the 'alignment tax' framing the episode invokes when connecting instruction tuning to sharpened, sometimes-wrong commitments.
* Language Models (Mostly) Know What They Know [https://arxiv.org/abs/2207.05221]
  Anthropic's investigation of LLM self-knowledge and calibration, a natural counterpoint to the episode's claim that uncertainty-based detection has a structural ceiling.
* Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation [https://arxiv.org/abs/2302.09664]
  Kuhn, Gal, and Farquhar's proposal to cluster generations by meaning before measuring uncertainty, the closest existing analog to the 'count meanings, not tokens' move at the heart of P-mass.
* How Language Model Hallucinations Can Snowball [https://arxiv.org/abs/2305.13534]
  Zhang et al. on how early committed tokens lock models into wrong continuations, directly relevant to the episode's discussion of phrase-level commitment cascading past the first token.

Yesterday - 21 min
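
The vote-splitting idea in this episode can be sketched in a few lines. Everything below is a toy under stated assumptions: the probabilities are invented, the token-to-concept mapping is hand-written, and `is_commitment_failure` is only one plausible reading of how the 0.2 threshold mentioned in the episode might be applied. It is not the paper's definition or code.

```python
def greedy_token(token_probs):
    """Greedy decoding: pick the single most probable next token."""
    return max(token_probs, key=token_probs.get)


def concept_mass(token_probs, token_to_concept):
    """Pool probability over all surface forms that map to the same concept."""
    mass = {}
    for token, p in token_probs.items():
        concept = token_to_concept.get(token)
        if concept is not None:
            mass[concept] = mass.get(concept, 0.0) + p
    return mass


def is_commitment_failure(token_probs, token_to_concept, correct, threshold=0.2):
    """Assumed reading: the greedy token is wrong even though the correct
    concept holds at least `threshold` of the pooled probability."""
    picked = token_to_concept.get(greedy_token(token_probs))
    mass = concept_mass(token_probs, token_to_concept).get(correct, 0.0)
    return picked != correct and mass >= threshold


# Invented next-token distribution: the right concept is split three ways.
probs = {"Saint": 0.25, "St": 0.20, "Cathedral": 0.15, "Kremlin": 0.30, "The": 0.10}
concepts = {
    "Saint": "St. Basil's Cathedral",
    "St": "St. Basil's Cathedral",
    "Cathedral": "St. Basil's Cathedral",
    "Kremlin": "Kremlin",
}

print(greedy_token(probs))            # the single most probable token
print(concept_mass(probs, concepts))  # probability pooled per concept
```

With these made-up numbers the wrong concept wins the token-level plurality even though the pooled mass favors the right one, which is the situation the episode labels a commitment failure.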
The OS Trick That Makes Tree Search Practical for Coding Agents

Source: DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback [https://arxiv.org/abs/2605.22781]
Paper was published on May 21, 2026.
This episode was AI-generated on May 22, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Almost nobody runs Monte Carlo tree search on real coding agents, even though it could add 30 points of accuracy on SWE-bench. The reason isn't the models: sandbox checkpoint and rollback take seconds, and a new paper from Shanghai Jiao Tong and Huawei closes that gap with a couple of clever OS tricks that hide checkpointing inside the LLM call you were already waiting on.

KEY TAKEAWAYS
* Why agent capability gaps are sometimes OS limits, not model limits, and how DeltaBox closes a 30-point accuracy gap on SWE-bench by making checkpoint/rollback cheap
* How DeltaFS hijacks OverlayFS plus XFS reflinks to version a filesystem at runtime without ever duplicating unchanged data
* The fork() + CRIU combination that gives you 5-millisecond rollback by keeping a frozen 'body double' of the process with almost no memory cost
* The inference-masking trick: hiding 15ms of checkpoint work inside the 1-20 second LLM call the agent was already waiting on
* Why RL training GPU utilization jumps from about 51% to 99% when you replace shutil.copytree with forked sandbox templates
* Where the design might creak: very large processes, faster LLM inference shrinking the masking window, and side effects that can't be rolled back

* 00:00 - The capability gap tree search leaves on the floor
  Why MCTS adds 5-30 points of SWE-bench accuracy but almost nobody deploys it, and the 1.5-second-per-rollback OS cost that explains why.
* 02:59 - The diary and the room: why checkpointing is hard
  Framing the core requirement that filesystem and process memory must be captured and restored atomically or tree search breaks.
* 05:59 - DeltaFS and the stack of acetate sheets
  How the paper coerces OverlayFS into swapping layers at runtime and uses XFS reflinks so storage cost tracks actual edits.
* 08:59 - DeltaCR: fork() as a frozen body double
  Combining CRIU dumps with a stopped, copy-on-write fork to get 5ms restores while keeping a durable disk-based safety net.
* 11:58 - Inference-masking: cooking while the microwave runs
  Why hiding the 15ms checkpoint inside the LLM round-trip is what makes the architecture practical rather than just clever.
* 14:58 - End-to-end SWE-bench results
  DeltaBox brings tree-search trajectory time to within 3-6% of the pure-LLM floor, versus 1.9x-4.3x for Firecracker and CubeSandbox.
* 17:58 - The RL training story: 51% to 99% GPU utilization
  How the same fork-based template mechanism eliminates the sandbox setup idle time that wastes half a GPU during synchronous RL.
* 20:57 - Steelman critiques and where the design might creak
  Honest pushback on process-size scaling, dependence on slow LLM inference, network side effects, MCTS-specific GC, and a reconstructed CubeSandbox baseline.
* 23:57 - The bigger reframe: OS substrates for agent workloads
  Why this work fits a broader pattern of co-designing decades-old kernel primitives for high-frequency agent state, not just human users.

RECOMMENDED READING
* SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [https://arxiv.org/abs/2310.06770]
  The benchmark the episode repeatedly anchors to when discussing the five-to-thirty-point accuracy gains tree search unlocks for coding agents.
* ReAct: Synergizing Reasoning and Acting in Language Models [https://arxiv.org/abs/2210.03629]
  The linear agent loop the episode frames as the default that exists partly because richer OS-level branching was too expensive. Useful context for why DeltaBox's substrate matters.
* Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models [https://arxiv.org/abs/2310.04406]
  A concrete instantiation of the MCTS-style agent search that the episode argues was theoretically attractive but practically blocked by sandbox overhead.

Yesterday - 26 min
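
The inference-masking idea from this episode, doing checkpoint work while the agent is already waiting on the model, can be sketched with `asyncio`. This is only a scheduling illustration: `call_llm` and `checkpoint_sandbox` are hypothetical stand-ins that sleep, the durations are scaled-down placeholders, and nothing here touches OverlayFS, CRIU, or fork() the way DeltaBox is described as doing.

```python
import asyncio


async def call_llm(log, seconds=0.2):
    """Hypothetical stand-in for the slow model call (1-20 s in the episode)."""
    await asyncio.sleep(seconds)
    log.append("llm_done")
    return "next action"


async def checkpoint_sandbox(log, seconds=0.015):
    """Hypothetical stand-in for roughly 15 ms of checkpoint work."""
    await asyncio.sleep(seconds)
    log.append("checkpoint_done")


async def agent_step(log):
    # Start both at once so the checkpoint overlaps the wait on the model.
    llm_task = asyncio.create_task(call_llm(log))
    ckpt_task = asyncio.create_task(checkpoint_sandbox(log))
    action = await llm_task
    await ckpt_task  # already finished in the common case, so this adds no delay
    return action


def run_step():
    log = []
    action = asyncio.run(agent_step(log))
    return action, log


if __name__ == "__main__":
    print(run_step())
```

Because the checkpoint finishes well inside the model call, the step takes about as long as the model call alone. The same sketch also hints at the caveat raised in the episode: if the model call shrinks toward the checkpoint time, there is less waiting left to hide the work in.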
An AI Just Solved a 1996 Erdős Problem—and the Simplest Agent Won

Source: Advancing Mathematics Research with AI-Driven Formal Proof Search [https://arxiv.org/abs/2605.22763]
Paper was published on May 21, 2026.
This episode was AI-generated on May 22, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A Google DeepMind system autonomously cracked nine open Erdős problems, including one that sat unsolved for thirty years, for a few hundred dollars each, with proofs verified by the Lean compiler. The twist: the team's elaborate evolutionary search system was beaten on most problems by a twenty-line script that just iterates an LLM against a compiler. The implications for AI engineering go well beyond mathematics.

KEY TAKEAWAYS
* Why coupling an LLM to the Lean proof checker dissolves the trust problem in AI-generated mathematics, and where that guarantee actually ends
* How a 'Ralph loop' of LLM plus compiler plus retry matched a sophisticated evolutionary system with AlphaProof, tournament Elo ranking, and shared caches
* The actual proof idea behind Erdős problem 125, including how irrationality of log(4)/log(3) gets weaponized to crush sumset density to zero
* How the agent surfaced a thirty-year-old ambiguity in Erdős's original problem statement just by being forced to commit to a formal reading
* Where the verification guarantee leaks: LLM judges scoring proof sketches reward confident-sounding hallucinated citations, biasing the search upstream of the compiler
* Why the selection bias in the problem set, the cost of failed runs, and the human work of formalization make the headline numbers less clean than they look

* 00:00 - The trust problem in AI-generated math
  Why plausible-looking LLM proofs have been economically useless to working mathematicians, and how Lean's compiler is supposed to fix that.
* 03:52 - The Ralph loop and the basic agent
  A walkthrough of Agent A, the embarrassingly simple LLM-plus-compiler-plus-retry setup that did most of the work.
* 07:44 - Inside Erdős 125
  The metronome intuition behind the density-zero proof and how the agent decomposes subgoals and delegates to AlphaProof.
* 11:37 - The fancy system that mostly didn't win
  Evolutionary search with Elo-ranked proof sketches, a shared cache, and AlphaProof calls, and why it only paid off on the hardest problems.
* 15:29 - The ambiguity-surfacing side effect
  How formalizing Erdős 125 and 741 forced long-standing imprecisions in the informal statements into the open.
* 19:21 - A geometric proof that feels like a magic trick
  Erdős 846 and the agent's translation of a collinearity problem into graph-theoretic Ramsey territory.
* 23:14 - Steelmanning the skeptics
  Selection bias in the problem set, hidden costs of failed runs, the heavy lifting humans do in formalization, and the hallucinated-citation failure mode.
* 27:06 - What actually changed
  How the bottleneck shifts from verifying proofs to verifying problem statements, and what the 'simple loops beat scaffolding' finding might mean beyond math.

RECOMMENDED READING
* AlphaEvolve: A coding agent for scientific and algorithmic discovery [https://arxiv.org/abs/2506.13131]
  The evolutionary search ancestor of the Agent C/D system discussed in the episode, providing context for the 'fancy scaffolding' that the basic Ralph loop ended up matching.
* Mathematical discoveries from program search with large language models (FunSearch) [https://doi.org/10.1038/s41586-023-06924-6]
  The original DeepMind work establishing LLM-driven search for new mathematical results, which the episode positions as the lineage that Agent D descends from.
* Solving olympiad geometry without human demonstrations (AlphaGeometry) [https://doi.org/10.1038/s41586-023-06747-5]
  A useful contrast to the episode's framing of olympiad problems as 'the easier version'. It shows what tightly-scaffolded, domain-specific provers achieved before frontier LLMs closed the gap.
* The Lean Mathematical Library (Mathlib) [https://arxiv.org/abs/1910.09336]
  The community formalization library whose maturity the episode credits as one of the four necessary ingredients for the paper's results.

Yesterday - 31 min
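
The 'LLM plus compiler plus retry' loop this episode describes has a simple shape, sketched below. The names `ralph_loop`, `propose`, and `check` are hypothetical, and the toy checker merely compares strings. A real setup would call a language model and the Lean compiler, neither of which is shown, and this is not the script from the paper.

```python
def ralph_loop(propose, check, max_attempts=10):
    """Ask for a candidate, verify it, and feed the checker's error back on failure.

    propose(feedback) -> candidate   (stand-in for an LLM call)
    check(candidate)  -> (ok, msg)   (stand-in for a compiler or proof checker)
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = propose(feedback)
        ok, feedback = check(candidate)
        if ok:
            return candidate, attempt
    return None, max_attempts


# Toy stand-ins so the loop can run without any model or compiler.
def make_toy_proposer(drafts, seen_feedback):
    remaining = iter(drafts)

    def propose(feedback):
        seen_feedback.append(feedback)  # a real proposer would condition on this
        return next(remaining)

    return propose


def toy_check(candidate):
    if candidate == "draft-3":
        return True, ""
    return False, f"error: {candidate} rejected"


seen = []
result = ralph_loop(make_toy_proposer(["draft-1", "draft-2", "draft-3"], seen), toy_check)
print(result, seen)
```

The only state carried between attempts is the checker's last message, which is what makes the loop so short. Anything richer, such as ranking sketches or sharing a cache, is the extra scaffolding the episode contrasts it with.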
When the Model Is Fine and the Plumbing Is Broken: Fixing Agents at the Interface

Source: Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents [https://arxiv.org/abs/2605.22166]
Paper was published on May 21, 2026.
This episode was AI-generated on May 22, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A four-billion-parameter model can score 74% on olympiad math but fail half the time at microwaving a virtual apple, and a new paper argues the problem isn't the model, it's the layer between the model and the world. The authors build a harness that fixes the interface instead of retraining the model, then show it improves 116 out of 126 model-environment combinations, including beating a model specifically fine-tuned for the task. If they're right, a lot of the engineering we've been pouring into model weights actually belongs somewhere else.

KEY TAKEAWAYS
* Why agent failures are dominated by interface bugs (malformed tool calls, contract violations, and loops), not reasoning failures
* The four-layer harness taxonomy (action realization, environment contract, trajectory regulation, procedural skill) and which layer carries which environment
* How a harness evolved from one 4B model's failures transfers, unchanged, to seventeen other models from 7B to 70B
* The xLAM comparison: a base model with a good harness beats the same base model fine-tuned specifically for the benchmark, and generalizes better too
* Where the method's scope ends: deterministic, rule-governed environments yes; open-ended web browsing probably not
* The honest limits: environment-specific patches, untested robustness of the Codex-in-the-loop evolution, and ablations only run on the source model

* 00:00 - The apple gap and what failure actually looks like
  Concrete examples of how strong models fail at simple embodied tasks: prose instead of tool calls, malformed arguments, and repeated invalid commands.
* 02:34 - Reframing the agent as model plus environment plus harness
  The paper's core conceptual move: treating the plumbing between model and environment as a first-class system component.
* 05:09 - Classifying failures in priority order
  The four failure categories (action realization, contract, trajectory, reasoning) and why classifying in the right order matters for diagnosis.
* 07:44 - The four-layer harness architecture
  How each lifecycle moment gets its own intervention, with form-validation and GPS-recalculation analogies for the two most load-bearing layers.
* 10:19 - Evolving the harness with a coding agent in the loop
  How Codex generates patches from failed trajectories within the four-layer scaffolding, and why that structural constraint matters.
* 12:54 - The transfer result across 17 models and 7 environments
  Freezing the harness built on a 4B model and seeing it improve 92% of model-environment pairs, including a 15x jump on Llama-3.1-8B in ALFWorld.
* 15:29 - Beating a model trained for the task
  The xLAM comparison: base Qwen plus harness outperforms the specifically fine-tuned variant on its own benchmark and generalizes better off-distribution.
* 18:04 - Steelmanning the pushback
  Honest limits on benchmark scope, the environment-specificity of patches, robustness of the evolution process, and incomplete per-model ablations.
* 20:39 - Why this matters for where agent engineering goes next
  The broader shift toward taking the system around the model seriously, and what that implies for deployment economics and future work.

RECOMMENDED READING
* ReAct: Synergizing Reasoning and Acting in Language Models [https://arxiv.org/abs/2210.03629]
  The foundational paper establishing the LLM-agent loop that this episode's harness wraps around. Useful background for understanding what 'the model emits, the environment executes' actually means in practice.
* ALFWorld: Aligning Text and Embodied Environments for Interactive Learning [https://arxiv.org/abs/2010.03768]
  The household-tasks benchmark that opens the episode with the embarrassing apple-microwaving gap, and where removing the trajectory regulation layer crashes performance by 86%.
* τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains [https://arxiv.org/abs/2406.12045]
  The customer-service benchmark behind the episode's xLAM comparison, including the pass^k reliability metric the hosts flag as the bar that matters for production agents.
* GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning [https://arxiv.org/abs/2507.19457]
  The prompt-optimization baseline the episode contrasts with harness adaptation, illustrating the ceiling of what you can fix by rewriting prompts alone.

Yesterday - 23 min
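
To make the harness idea from this episode concrete, here is a deliberately small sketch of a wrapper that sits between a model's raw output and an environment. The `Harness` class, its rules, and the command strings are all assumptions for illustration. It covers only three of the four layers named in the episode (action realization, environment contract, trajectory regulation) in the crudest possible form, and it is not the paper's implementation.

```python
from collections import Counter


class Harness:
    """Toy interface layer: normalize, validate, and de-loop model actions."""

    def __init__(self, admissible, max_repeats=2):
        self.admissible = set(admissible)  # commands the environment accepts
        self.max_repeats = max_repeats     # how often one command may recur
        self.seen = Counter()

    def filter(self, raw_output):
        """Return (action, note); action is None when the output is rejected."""
        # Action realization: turn free text into a canonical command string.
        action = raw_output.strip().lower()
        # Environment contract: only admissible commands may pass through.
        if action not in self.admissible:
            return None, f"invalid action; choose one of {sorted(self.admissible)}"
        # Trajectory regulation: stop the agent from repeating itself forever.
        self.seen[action] += 1
        if self.seen[action] > self.max_repeats:
            return None, f"'{action}' repeated {self.seen[action]} times; try something else"
        return action, "ok"


harness = Harness(["open microwave", "heat apple"])
print(harness.filter("  Open Microwave "))
print(harness.filter("I think I should heat the apple now."))
```

The rejection note is meant to be fed back to the model as its next observation, so a malformed or looping step costs one turn instead of derailing the episode. Note that every rule here is specific to one environment, which echoes the limit on environment-specific patches that the episode itself flags.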