AI Papers: A Deep Dive

Podcast by paperdive.ai

English

Technology and science

About AI Papers: A Deep Dive

Long-form deep dives into new research on Artificial Intelligence, AI agents and the engineering practice of building them - one paper per episode. We unpack the motivating problem, how the method actually works, the math that matters, what the experiments do and don't show, and the strongest critique against the result. The goal isn't a five-minute summary; it's the kind of conversation you'd have with a colleague who actually read the paper. Topics span large language models, autonomous agents, agentic coding, reinforcement learning for agent training, evaluation and benchmarks, alignment, and the practical engineering decisions that make agentic systems actually work in production. Most papers are pulled from arXiv, often within days of release. Hosted by AI voices generated with ElevenLabs. Episode scripts are produced by a multi-stage Claude pipeline working from a close reading of the source paper. New episodes daily.

All episodes

72 episodes

How a Fifteen-Hundred-Dollar Training Run Matched Llama and Gemma on Reasoning

Source: HRM-Text: Efficient Pretraining Beyond Scaling [https://arxiv.org/abs/2605.20613]
Paper was published on May 20, 2026.
This episode was AI-generated on May 24, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A team at Sapient Intelligence and MIT trained a 1B-parameter model on 16 GPUs in 46 hours for about $1,500, and it goes toe-to-toe with Llama, Qwen, Gemma, and OLMo on math and reasoning benchmarks. The authors argue this isn't just a democratization story: it's evidence that the trillion-token pretraining race was solving a problem better architecture and a smarter objective could have partly avoided.

KEY TAKEAWAYS
* Why standard Transformers waste most of their depth, and how HRM-Text's fast/slow recurrent modules (L runs 3x for every H update, twice per forward pass) actually keep deliberating through the final layer
* The MagicNorm trick: how a single placement of normalization behaves like PreNorm on the backward pass and PostNorm on the forward pass, because the two horizons have different lengths
* Why grading the model only on response tokens, not on the question, concentrates the gradient signal and jumps MMLU from 40 to 48 with no other changes
* How PrefixLM attention lets the model read the prompt freely while still generating answers one token at a time, adding another 5 points on MMLU
* Three honest pushbacks: HRM-Text is trained directly on instruction-response pairs (not apples-to-apples with general foundation models), the curated data mixture isn't isolated in the ablation, and scaling beyond 1B parameters is unverified
* Why the right frame is 'existence proof, not new paradigm': the compute-to-performance ratio isn't a law of nature, and architectural questions are accessible to small labs again

* 00:00 - The fifteen-hundred-dollar headline. The setup: a 1B model trained for $1,500 matches models that cost 100-900x more, and why the two assumptions baked into standard pretraining make that possible.
* 02:38 - The H and L modules: fast and slow deliberation. How HRM-Text borrows the frontoparietal loop's fast-execution/slow-strategy split and reuses weights recurrently instead of stacking more layers.
* 05:16 - MagicNorm and the asymmetric tightrope. Why recurrent models are notoriously hard to train, and the clever normalization placement that exploits the gap between an 8-step forward pass and a truncated backward pass.
* 07:54 - Stop grading the model on the question. The exam-grader analogy: why computing loss only on response tokens, not the prompt, concentrates gradient signal where it matters.
* 10:32 - PrefixLM: reading freely, writing causally. How letting the question tokens see each other bidirectionally while keeping answer generation causal gives encoder-like reading behavior without a second model.
* 13:10 - The logit lens test: is the recurrence doing real work? Evidence that, unlike standard Transformers which lock in predictions early, HRM-Text's recurrent cycles keep meaningfully updating the answer to the end.
* 15:49 - Three honest pushbacks. Not apples-to-apples comparisons, uncontrolled data curation, and unverified scaling: what the headline numbers do and don't justify.
* 18:27 - What survives the critique. Why the narrower claim (that current pretraining leaves enormous efficiency on the table) holds, and what it means for who gets to do architecture research.

RECOMMENDED READING
* Universal Transformers [https://arxiv.org/abs/1807.03819] - The classic recurrent-Transformer paper that established the 'reuse the same block many times' idea HRM-Text builds on with its fast/slow split.
* Looped Transformers as Programmable Computers [https://arxiv.org/abs/2301.13196] - A more recent treatment of looped/recurrent Transformers that sharpens the case Bella makes for getting more computation per parameter.
* Scaling Laws for Neural Language Models (Kaplan et al.) [https://arxiv.org/abs/2001.08361] - The foundational scaling-laws paper whose 'just add tokens and parameters' worldview HRM-Text is implicitly arguing against.
* Training Compute-Optimal Large Language Models (Chinchilla) [https://arxiv.org/abs/2203.15556] - The other half of the scaling-orthodoxy story, useful context for evaluating the episode's claim that the trillion-token race left efficiency on the table.

May 24, 2026 - 21 min
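The two training choices this episode highlights, PrefixLM attention and a loss computed only on response tokens, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `prefix_lm_mask` and `response_only_loss`, the list-based mask, and the toy per-token losses are all hypothetical.

```python
from typing import List


def prefix_lm_mask(prompt_len: int, total_len: int) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Prompt positions (index < prompt_len) see each other bidirectionally;
    response positions stay causal (prompt plus earlier response tokens).
    """
    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            causal = j <= i
            both_in_prompt = i < prompt_len and j < prompt_len
            row.append(causal or both_in_prompt)
        mask.append(row)
    return mask


def response_only_loss(token_losses: List[float], prompt_len: int) -> float:
    """Average per-token losses over response positions only (prompt tokens are ignored)."""
    response = token_losses[prompt_len:]
    if not response:
        raise ValueError("sequence has no response tokens")
    return sum(response) / len(response)


# Toy sequence: 2 prompt tokens followed by 2 response tokens (made-up numbers).
mask = prefix_lm_mask(prompt_len=2, total_len=4)
loss = response_only_loss([1.0, 2.0, 3.0, 5.0], prompt_len=2)
```

In this toy case the first prompt token can attend to the second, no response token can see a later one, and only the last two losses contribute to the average.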

A Robot Made Graphene Without Help, And Caught Itself Hallucinating

Source: Qumus: Realization of An Embodied AI Quantum Material Experimentalist [https://arxiv.org/abs/2605.18407]
Paper was published on May 18, 2026.
This episode was AI-generated on May 23, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

For twenty years, every graphene flake in every lab has been made by a human with Scotch tape under a microscope. A new Princeton paper describes the first system to do it end-to-end autonomously, and the moment that matters isn't the transistor it built, but what happened when a researcher deliberately sabotaged the experiment.

KEY TAKEAWAYS
* Why the Nobel-winning Scotch-tape method is still the standard in 2026, and what makes the 'long tail' of 2D materials so hard to explore manually
* The architectural pattern Qumus uses: locked-down 'atom' primitives, LLM-composable 'molecule' workflows, and freely-designed 'assembly' procedures
* How forcing every factual claim through an external database makes LLM hallucinations recoverable rather than preventable
* The two back-to-back failures, a removed chip and a mislabeled material, that the system caught and replanned around
* Why the paper's 'scientific reasoning' framing deserves pushback: the open-ended demo is parameter tuning over well-documented variables
* The shift the authors flag: in autonomous experimentation, the bottleneck is now hardware speed, not machine intelligence

* 00:00 - Why graphene is still made with sticky tape. The van der Waals physics behind exfoliation, and why the labor doesn't scale to the thousands of layered crystals nobody has studied.
* 03:11 - The org chart: five agents, one model. How Qumus structures a PI, project manager, lab manager, designer, and technician as role-prompted personas of a single LLM.
* 06:22 - Atoms, molecules, and assemblies. The hierarchical workflow design that lets humans lock down the primitives where reliability matters and lets the LLM be creative on top.
* 09:34 - Perception at two scales. Standard object detection for the workspace, and a rule-based color-contrast pipeline that can generalize to new materials with a handful of images.
* 12:45 - The transistor demo. Ninety minutes, thirty steps, eighteen decision points, and one sentence of human input, plus the caveat that the device was never electrically measured.
* 15:57 - Sabotage and hallucination. The two failure modes the system recovered from autonomously, and why catching hallucinations downstream is more tractable than preventing them upstream.
* 19:08 - Six LLMs, seven traits, small samples. The cross-model 'personality' comparison, treated as flavor rather than as findings.
* 22:20 - Steelman: what the paper does and doesn't show. A clean statement of the careful claims versus the expansive framings, including reproducibility and robustness gaps.
* 25:31 - Where the bottleneck moved. Why the authors' line about instrumental rather than algorithmic limits captures a real shift in the field, and what it implies for the next decade of automation.

RECOMMENDED READING
* Autonomous robotic search for two-dimensional crystals [https://doi.org/10.1038/s41699-018-0084-0] - The 2018 Masubuchi et al. paper the episode cites as the prior art for robotic flake searching; useful context for what 'pre-LLM' automation in this field actually looked like.
* Autonomous chemical research with large language models (Coscientist) [https://doi.org/10.1038/s41586-023-06792-0] - Boiko et al.'s LLM-driven autonomous chemistry agent, a useful comparison point for the episode's discussion of LLMs orchestrating real-world experiments rather than just simulations.
* Unconventional superconductivity in magic-angle graphene superlattices [https://doi.org/10.1038/nature26160] - Cao et al.'s discovery of superconductivity in twisted bilayer graphene, the canonical example of why sub-micron-aligned van der Waals stacking, the kind Qumus aims to scale, matters.

May 24, 2026 - 28 min
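The layering described in this episode, fixed 'atom' primitives with LLM-composed 'molecule' workflows on top and factual claims checked against an external database, might look roughly like the sketch below. Everything here is hypothetical and for illustration only: the atom names, the `MATERIAL_DB` contents, and the helper functions are assumptions, not Qumus code.

```python
from typing import Callable, Dict, List

# Hypothetical "atom" primitives: fixed, human-written, not editable by the LLM.
ATOMS: Dict[str, Callable[[List[str]], None]] = {
    "pick_up_tape": lambda log: log.append("pick_up_tape"),
    "press_on_crystal": lambda log: log.append("press_on_crystal"),
    "peel_tape": lambda log: log.append("peel_tape"),
}

# Hypothetical external database that factual claims are checked against.
MATERIAL_DB: Dict[str, Dict[str, str]] = {
    "graphene": {"parent_crystal": "graphite"},
}


def validate_molecule(steps: List[str]) -> None:
    """Reject any workflow that references a primitive outside the locked-down set."""
    unknown = [s for s in steps if s not in ATOMS]
    if unknown:
        raise ValueError(f"unknown atoms: {unknown}")


def run_molecule(steps: List[str]) -> List[str]:
    """Validate, then execute each atom in order and return the execution log."""
    validate_molecule(steps)
    log: List[str] = []
    for name in steps:
        ATOMS[name](log)
    return log


def claim_is_supported(material: str, field: str, claimed: str) -> bool:
    """Return True only when the database holds the same value as the claim."""
    return MATERIAL_DB.get(material, {}).get(field) == claimed


# An LLM-proposed "molecule": a sequence built only from registered atoms.
exfoliate = ["pick_up_tape", "press_on_crystal", "peel_tape"]
log = run_molecule(exfoliate)
```

The design point is that a hallucinated step or fact does not have to be prevented at generation time: an unknown atom fails validation and an unsupported claim fails the lookup, so the planner can be asked to try again.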

When Three LLMs Talk to Each Other, Their Ideas Quietly Stop Moving

Source: Multi-LLM Systems Exhibit Robust Semantic Collapse [https://arxiv.org/abs/2605.17193]
Paper was published on May 16, 2026.
This episode was AI-generated on May 23, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Put three large language models in a room with no task and let them talk for a thousand rounds, and something striking happens: their vocabulary keeps growing, but the meaning of what they're saying barely moves. A new paper runs that experiment, tries twelve different ways to break the pattern, fails every time, and traces the cause to specific circuits inside the models, with real consequences for anyone betting on autonomous AI research pipelines.

KEY TAKEAWAYS
* Why multi-LLM conversations grow new vocabulary while their semantic content stays anchored near the starting point, about three times more anchored than human Reddit threads
* How twelve intervention categories (temperature, prompts, personas, model mixing, removing safety training, reducing sycophancy, scaling agents, external shocks) all failed to produce more semantic diversity
* The counterintuitive RL result: training models to be diverse made independent runs look more like each other, not less
* The induction-head mechanism: look-back-and-copy circuits that get louder as conversations lengthen, while rare tokens get systematically forgotten
* Why the Data Processing Inequality explains, in principle, why no closed-loop intervention can recover lost semantic diversity
* Where the paper's claims are strong (empirical collapse, mechanistic story in Llama) and where they overreach (civilizational implications, single RL recipe)

* 00:00 - Lovelace's question, reframed as an experiment. How an 1843 worry about whether machines can originate anything becomes a concrete test you can run on modern LLMs.
* 03:30 - The setup and the headline result. Three LLMs talking with no task, measured on lexical versus semantic diversity, and the gap between the two curves.
* 07:00 - Twelve ways to break the pattern, all failing. A tour of every plausible escape hatch the authors tested, from temperature and prompts to uncensored models and direct reinforcement learning.
* 10:30 - Opening up the model: induction heads and a vanishing tail. What teacher-forcing replay on Llama-3.1-8B reveals about the circuits driving the collapse and the rare tokens that disappear along the way.
* 14:00 - The Data Processing Inequality and why closed loops can't recover. The information-theoretic argument that connects the empirical finding to a much older intuition about closed channels.
* 17:30 - Caveats: the embedding model, the no-task setup, and the single architecture. Where a careful skeptic should push back on the paper's measurements, scope, and mechanistic generalization.
* 21:00 - Different models, different basins. Why collapse doesn't dissolve model identity but sharpens it, with a classifier reaching 94% accuracy at telling models apart late in conversations.
* 24:30 - What this means for autonomous AI science and model collapse. The implications for closed-loop research pipelines, the compounding of inference-time and training-time collapse, and the more speculative epistemic worries.

RECOMMENDED READING
* The Curse of Recursion: Training on Generated Data Makes Models Forget [https://arxiv.org/abs/2305.17493] - The Shumailov et al. paper on training-side model collapse that this episode positions as the upstream counterpart to inference-time semantic collapse.
* In-context Learning and Induction Heads [https://arxiv.org/abs/2209.11895] - The Anthropic paper characterizing the induction-head circuits that the episode identifies as the mechanistic culprit behind LLMs echoing their own conversational history.
* The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery [https://arxiv.org/abs/2408.06292] - A flagship example of the autonomous closed-loop AI research pipeline whose feasibility this episode's findings most directly challenge.

May 24, 2026 - 28 min
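The Data Processing Inequality invoked in this episode says that for a chain X -> Y -> Z, where Z depends on X only through Y, the mutual information I(X;Z) can never exceed I(X;Y). The sketch below checks that on a toy chain of two noisy binary channels. It illustrates the inequality itself, not the paper's argument; the channel model, the flip probability, and all names are assumptions chosen for the example.

```python
import math
from typing import Dict, Tuple

Joint = Dict[Tuple[int, int], float]


def through_flip_channel(joint: Joint, flip_prob: float) -> Joint:
    """Pass the second variable of a joint distribution through a binary symmetric channel."""
    out: Joint = {}
    for (x, y), p in joint.items():
        for z in (0, 1):
            p_z_given_y = flip_prob if z != y else 1.0 - flip_prob
            out[(x, z)] = out.get((x, z), 0.0) + p * p_z_given_y
    return out


def mutual_information(joint: Joint) -> float:
    """Mutual information in bits between the two coordinates of a joint distribution."""
    px: Dict[int, float] = {}
    py: Dict[int, float] = {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0.0
    )


# X is a fair coin; Y is a noisy copy of X; Z is a noisy copy of Y (toy numbers).
joint_xx: Joint = {(0, 0): 0.5, (1, 1): 0.5}
joint_xy = through_flip_channel(joint_xx, flip_prob=0.1)
joint_xz = through_flip_channel(joint_xy, flip_prob=0.1)

info_xy = mutual_information(joint_xy)
info_xz = mutual_information(joint_xz)
```

Each extra pass through the channel can only lose information about X, which is the intuition behind the claim that a closed loop of models cannot regenerate diversity it has already discarded.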

When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions

Source: Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most [https://arxiv.org/abs/2605.22672]
Paper was published on May 21, 2026.
This episode was AI-generated on May 22, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

Claude Opus 4.6 looked at Brazil's 1986 hyperinflation, correctly named the regime-change risk, then forecast a number seven million times too high. A new paper shows this isn't a fluke: it's a structural pattern across epidemics, housing bubbles, and decades of measles data, and the standard way the field grades LLM forecasts can't see it.

KEY TAKEAWAYS
* Why the same model outputs can earn opposite verdicts (capable models look best under Brier-style scoring and worst under CRPS), and what that means for every existing LLM forecasting benchmark
* The specific trigger for the inversion: superlinear growth followed by a regime change, confirmed by a clean linear-growth control where the effect vanishes entirely
* A within-family Llama experiment showing scale and post-training each independently make the overcommitment worse, and compound when combined
* The unselected pre-vaccine US measles cohort (1,339 state-seasons) that rules out the 'you cherry-picked the crashes' objection, plus flu as a pre-registered negative control
* Why naming the historical episode rescues calibration for COVID and housing but fails completely for hyperinflation: the knowledge is in the model, but it doesn't reach the tails
* The one-line fix: report a tail-integrating proper scoring rule alongside threshold metrics, using forecasts benchmarks have already collected

* 00:00 - The Opus 4.6 hyperinflation moment. A frontier model articulates the regime-change possibility, weighs it on the page, and then commits to extrapolation anyway, overshooting reality by a factor of seven million.
* 03:47 - Two ways to grade a forecast. Distributional forecasts, threshold-based Brier scoring versus tail-integrating CRPS, and the weather-map analogy that makes the asymmetry click.
* 07:35 - The Freeciv benchmark and the first crack. A clean, unseen forecasting setup where binary and continuous versions of the same question yield opposite capability-accuracy correlations at long horizons.
* 11:22 - The synthetic epidemic and its linear-growth control. An exponential-then-crash simulator reproduces the inversion, and swapping in linear growth makes it vanish, pinning the mechanism to the bend-then-break shape.
* 15:10 - Competence-driven overcommitment. Per-quantile decomposition shows the lower tail stays flat while the upper tail balloons with capability, and the within-family Llama 2x2 confirms scale and post-training each contribute.
* 18:57 - Real-world replications and the measles test. COVID, housing, and hyperinflation replicate the pattern, but the unselected measles cohort and a pre-registered flu negative control are what make the result hard to dismiss.
* 22:45 - The verdict flip and what the model knows. The same forecasts graded two ways reverse the sign of the capability correlation, and a knowledge probe reveals models can name the crisis they're forecasting yet still produce extreme tail overshoots.
* 26:33 - Limitations, the steelman, and the fix. Honest pushback on the capability axis, the bundled post-training treatment, and the small hyperinflation sample, followed by the embarrassingly simple methodological recommendation.

RECOMMENDED READING
* Are Emergent Abilities of Large Language Models a Mirage? [https://arxiv.org/abs/2304.15004] - The Schaeffer et al. paper the episode invokes directly. It argues that metric choice can manufacture apparent emergent abilities, setting up this episode's darker mirror claim that metric choice can also hide failures.
* Inverse Scaling Prize: Second Round Winners (McKenzie et al.) [https://arxiv.org/abs/2306.09479] - The original taxonomy of inverse-scaling failures the episode contrasts with; useful for understanding why this paper's forecasting failure is structurally different from earlier adversarial cases.
* Inverse Scaling Can Become U-Shaped [https://arxiv.org/abs/2211.02011] - Wei et al.'s follow-up showing many inverse-scaling tasks recover at frontier scale, the precise counterpoint to this episode's claim that the forecasting failure is monotonic all the way to the frontier.

Yesterday - 30 min
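The gap between threshold-based Brier scoring and tail-integrating CRPS discussed in this episode can be seen on a toy example. The sketch below scores two made-up sample forecasts that assign the same probability to crossing a threshold but differ in how far the upper tail reaches. The sample values, the threshold, and the function names are assumptions; the CRPS estimate uses the standard identity CRPS = E|X - y| - 0.5 * E|X - X'| applied to the samples.

```python
from typing import List


def brier_score(samples: List[float], threshold: float, outcome: float) -> float:
    """Brier score for the binary event 'value exceeds threshold'."""
    prob = sum(1 for s in samples if s > threshold) / len(samples)
    happened = 1.0 if outcome > threshold else 0.0
    return (prob - happened) ** 2


def crps(samples: List[float], outcome: float) -> float:
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    error_term = sum(abs(s - outcome) for s in samples) / n
    spread_term = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return error_term - 0.5 * spread_term


# Two made-up forecasts that differ only in how far the upper tail reaches.
outcome = 10.0
threshold = 50.0
moderate_tail = [8.0, 12.0, 20.0, 60.0]
extreme_tail = [8.0, 12.0, 20.0, 60000.0]

brier_moderate = brier_score(moderate_tail, threshold, outcome)
brier_extreme = brier_score(extreme_tail, threshold, outcome)
crps_moderate = crps(moderate_tail, outcome)
crps_extreme = crps(extreme_tail, outcome)
```

Both forecasts receive the same Brier score because the threshold event only asks whether a sample exceeds 50, while the CRPS of the extreme-tail forecast is far larger because it also penalizes how far the overshoot goes.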

When Models Know the Answer But Say the Wrong Thing Anyway

Source: Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer [https://arxiv.org/abs/2605.22007]
Paper was published on May 21, 2026.
This episode was AI-generated on May 22, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A new paper shows that up to 47% of hallucinations from large instruction-tuned LLMs happen when the model already has the correct answer sitting in its probability distribution; it just commits to a wrong token instead. Even stranger: the bigger the model, the more often this happens, and the mechanism that causes it is the same one that makes these models feel helpful and decisive in the first place.

KEY TAKEAWAYS
* Why 'Saint,' 'St,' and 'C-athedral' splitting the vote can hand the wrong answer a plurality win under greedy decoding
* How P-mass (summing probability across spellings of the same concept) reframes hallucinations from a knowledge problem to a commitment problem
* Why the fraction of these 'commitment failure' hallucinations climbs monotonically with scale, from 16% to 47%
* The Instruct-vs-base comparison showing instruction tuning, not scale itself, is what sharpens confidence in wrong tokens
* Why uncertainty-based hallucination detection has a structural ceiling it cannot cross
* Why the obvious fix, concept-aware decoding, only recovers about 2% of failures in the Instruct models we actually deploy

* 00:00 - The Saint Basil's example. A concrete case where the model's belief is concentrated on the right answer but greedy decoding picks the wrong token anyway.
* 02:41 - From token entropy to P-mass. Reframing the analysis by pooling probability across surface forms of the same concept, and defining 'commitment failure' hallucinations.
* 05:23 - The dispersion mechanism. Same total belief in the right concept, but in failures it's spread across spellings 26% vs 78%, a clean three-to-one signature.
* 08:04 - Instruction tuning as the causal variable. Comparing Instruct and base models of identical size shows the sharpening of wrong tokens is driven by helpfulness training, not scale.
* 10:46 - The decisive witness and the high-gain microphone. Two analogies for why confident correctness and confident hallucination are the same disposition viewed from different angles.
* 13:27 - Limitations and scope. P-mass requires ground truth, the 0.2 threshold is a choice, and the setup is short-form QA with greedy decoding, the cleanest regime for the effect.
* 16:09 - A unifying account of the alignment tax. How commitment sharpening offers a single mechanism behind accuracy loss, calibration loss, mode collapse, and confident hallucination.
* 18:50 - Phrase-level commitment and what comes next. Evidence that sharpening cascades across tokens, why simple concept-clustering won't fix it, and what a meaning-level training objective might require.

RECOMMENDED READING
* Training language models to follow instructions with human feedback [https://arxiv.org/abs/2203.02155] - The Ouyang et al. paper that introduced the 'alignment tax' framing the episode invokes when connecting instruction tuning to sharpened, sometimes-wrong commitments.
* Language Models (Mostly) Know What They Know [https://arxiv.org/abs/2207.05221] - Anthropic's investigation of LLM self-knowledge and calibration, a natural counterpoint to the episode's claim that uncertainty-based detection has a structural ceiling.
* Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation [https://arxiv.org/abs/2302.09664] - Kuhn, Gal, and Farquhar's proposal to cluster generations by meaning before measuring uncertainty, the closest existing analog to the 'count meanings, not tokens' move at the heart of P-mass.
* How Language Model Hallucinations Can Snowball [https://arxiv.org/abs/2305.13534] - Zhang et al. on how early committed tokens lock models into wrong continuations, directly relevant to the episode's discussion of phrase-level commitment cascading past the first token.

Yesterday - 21 min
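The vote-splitting effect behind P-mass can be made concrete with a toy next-token distribution. The sketch below is illustrative only: the probabilities, the token-to-concept mapping, and the function names are made up, and treating the 0.2 threshold as a cutoff on the correct concept's P-mass is an assumption about how the paper applies it.

```python
from typing import Dict

# Hypothetical next-token probabilities for a landmark question (made-up numbers).
token_probs: Dict[str, float] = {
    "Kremlin": 0.30,
    "Saint": 0.24,
    "St": 0.22,
    "Cathedral": 0.18,
    "Red": 0.06,
}

# Hypothetical mapping from surface forms to the concept they express.
token_to_concept: Dict[str, str] = {
    "Saint": "st_basils",
    "St": "st_basils",
    "Cathedral": "st_basils",
    "Kremlin": "kremlin",
    "Red": "red_square",
}


def greedy_token(probs: Dict[str, float]) -> str:
    """Greedy decoding: pick the single most probable token."""
    return max(probs, key=probs.get)


def p_mass(probs: Dict[str, float], concept: str) -> float:
    """Total probability over every surface form mapped to the concept."""
    return sum(p for tok, p in probs.items() if token_to_concept.get(tok) == concept)


def is_commitment_failure(probs: Dict[str, float], correct: str, threshold: float = 0.2) -> bool:
    """Greedy token is wrong although the correct concept holds at least `threshold` mass."""
    wrong = token_to_concept.get(greedy_token(probs)) != correct
    return wrong and p_mass(probs, correct) >= threshold


chosen = greedy_token(token_probs)
mass = p_mass(token_probs, "st_basils")
failure = is_commitment_failure(token_probs, "st_basils")
```

Here the correct concept holds most of the probability once its spellings are pooled, yet greedy decoding emits the wrong token because no single spelling beats it.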