AI Papers: A Deep Dive

About AI Papers: A Deep Dive

Long-form deep dives into new research on Artificial Intelligence, AI agents and the engineering practice of building them - one paper per episode. We unpack the motivating problem, how the method actually works, the math that matters, what the experiments do and don't show, and the strongest critique against the result. The goal isn't a five-minute summary; it's the kind of conversation you'd have with a colleague who actually read the paper. Topics span large language models, autonomous agents, agentic coding, reinforcement learning for agent training, evaluation and benchmarks, alignment, and the practical engineering decisions that make agentic systems actually work in production. Most papers are pulled from arXiv, often within days of release. Hosted by AI voices generated with ElevenLabs. Episode scripts are produced by a multi-stage Claude pipeline working from a close reading of the source paper. New episodes daily.

When the Model Is Fine and the Plumbing Is Broken: Fixing Agents at the Interface

Source: Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents [https://arxiv.org/abs/2605.22166]

The paper was published on May 21, 2026. This episode was AI-generated on May 22, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

A four-billion-parameter model can score 74% on olympiad math but fail half the time at microwaving a virtual apple, and a new paper argues the problem isn't the model but the layer between the model and the world. The authors build a harness that fixes the interface instead of retraining the model, then show it improves 116 out of 126 model-environment combinations and even beats a model specifically fine-tuned for the task. If they're right, a lot of the engineering we've been pouring into model weights actually belongs somewhere else.

KEY TAKEAWAYS

* Why agent failures are dominated by interface bugs (malformed tool calls, contract violations, and loops), not reasoning failures
* The four-layer harness taxonomy (action realization, environment contract, trajectory regulation, procedural skill) and which layer carries which environment
* How a harness evolved from one 4B model's failures transfers, unchanged, to seventeen other models from 7B to 70B
* The xLAM comparison: a base model with a good harness beats the same base model fine-tuned specifically for the benchmark, and generalizes better too
* Where the method's scope ends: deterministic, rule-governed environments yes; open-ended web browsing probably not
* The honest limits: environment-specific patches, untested robustness of the Codex-in-the-loop evolution, and ablations only run on the source model

CHAPTERS

* 00:00 - The apple gap and what failure actually looks like. Concrete examples of how strong models fail at simple embodied tasks: prose instead of tool calls, malformed arguments, and repeated invalid commands.
* 02:34 - Reframing the agent as model plus environment plus harness. The paper's core conceptual move: treating the plumbing between model and environment as a first-class system component.
* 05:09 - Classifying failures in priority order. The four failure categories (action realization, contract, trajectory, reasoning) and why classifying in the right order matters for diagnosis.
* 07:44 - The four-layer harness architecture. How each lifecycle moment gets its own intervention, with form-validation and GPS-recalculation analogies for the two most load-bearing layers.
* 10:19 - Evolving the harness with a coding agent in the loop. How Codex generates patches from failed trajectories within the four-layer scaffolding, and why that structural constraint matters.
* 12:54 - The transfer result across 17 models and 7 environments. Freezing the harness built on a 4B model and seeing it improve 92% of model-environment pairs, including a 15x jump on Llama-3.1-8B in ALFWorld.
* 15:29 - Beating a model trained for the task. The xLAM comparison: base Qwen plus harness outperforms the specifically fine-tuned variant on its own benchmark and generalizes better off-distribution.
* 18:04 - Steelmanning the pushback. Honest limits on benchmark scope, the environment-specificity of patches, robustness of the evolution process, and incomplete per-model ablations.
* 20:39 - Why this matters for where agent engineering goes next. The broader shift toward taking the system around the model seriously, and what that implies for deployment economics and future work.

RECOMMENDED READING

* ReAct: Synergizing Reasoning and Acting in Language Models [https://arxiv.org/abs/2210.03629] - The foundational paper establishing the LLM-agent loop that this episode's harness wraps around; useful background for understanding what 'the model emits, the environment executes' actually means in practice.
* ALFWorld: Aligning Text and Embodied Environments for Interactive Learning [https://arxiv.org/abs/2010.03768] - The household-tasks benchmark that opens the episode with the embarrassing apple-microwaving gap, and where removing the trajectory regulation layer crashes performance by 86%.
* τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains [https://arxiv.org/abs/2406.12045] - The customer-service benchmark behind the episode's xLAM comparison, including the pass^k reliability metric the hosts flag as the bar that matters for production agents.
* GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning [https://arxiv.org/abs/2507.19457] - The prompt-optimization baseline the episode contrasts with harness adaptation, illustrating the ceiling of what you can fix by rewriting prompts alone.
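For listeners who want something concrete for the "classifying failures in priority order" idea, here is a minimal Python sketch. It is an illustration only, not the paper's implementation: the `Step` fields, the `classify_failure` function, and the `loop_threshold` value are all hypothetical, and the paper's actual criteria are not given in these notes. The only thing taken from the episode is the order of the four categories: action realization, contract, trajectory, reasoning.

```python
from dataclasses import dataclass
from enum import Enum


class Failure(Enum):
    ACTION_REALIZATION = "action realization"
    CONTRACT = "environment contract"
    TRAJECTORY = "trajectory"
    REASONING = "reasoning"


@dataclass
class Step:
    # Hypothetical per-step signals a harness might log; not from the paper.
    parsed_ok: bool        # did the model output parse into a tool call?
    accepted_by_env: bool  # did the environment accept the call as valid?
    action: str            # normalized action string ("" if unparsed)


def classify_failure(steps: list[Step], loop_threshold: int = 3) -> Failure:
    """Label a failed episode with the first category that applies."""
    # 1. Action realization: the output never became a well-formed tool call.
    if any(not s.parsed_ok for s in steps):
        return Failure.ACTION_REALIZATION
    # 2. Contract: a well-formed call broke the environment's rules.
    if any(not s.accepted_by_env for s in steps):
        return Failure.CONTRACT
    # 3. Trajectory: the same action repeated back to back (a loop).
    run = 1
    for prev, cur in zip(steps, steps[1:]):
        run = run + 1 if cur.action == prev.action else 1
        if run >= loop_threshold:
            return Failure.TRAJECTORY
    # 4. Reasoning: only what is left after interface causes are ruled out.
    return Failure.REASONING
```

Checking the interface categories first means an episode is blamed on reasoning only when no cheaper explanation fits, which is the diagnostic point the episode makes about ordering.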

Yesterday - 23 min
