The Agent Failed — But Did the Instructions Deserve to Be Followed?

Description

Source: SkillAxe: Sharpening LLM-Authored Agent Skills Through Evaluation-Guided Self-Refinement [https://arxiv.org/abs/2606.10546]
Paper was published on June 09, 2026.
This episode was AI-generated on June 11, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

When human experts write instruction documents for AI agents, pass rates jump sixteen points. When the model writes its own, the improvement is exactly zero, even though the documents look great. Microsoft's SkillAxe paper diagnoses why, with a fault-attribution trick that separates 'the instructions were bad' from 'the agent ignored good instructions', along with an honest look at how much the fix actually buys.

KEY TAKEAWAYS
* Why LLM-authored agent skills score zero improvement despite looking fluent and detailed: the valuable content is failure-derived trivia the model can't generate from general knowledge
* How SkillAxe runs the agent twice (with and without the skill) and uses the skill's own stated rules as the grading rubric, so no external answer key is needed
* The fault-attribution principle: an identical failure demands opposite repairs depending on whether the violated rule was precise enough to deserve following
* The surprising decomposition: refined skills added nothing to per-attempt correctness; the entire gain came from helping the agent finish tasks at all, suggesting skills are institutional knowledge, not extra IQ
* In the streaming SpreadsheetBench experiment, refinement bought compression and discoverability (22 skills loaded twice as often) rather than accuracy; the naive 69-skill library hit the same 52% pass rate
* Where the headline claims weaken: under the benchmark's native scoring SkillAxe closes only ~11% of the gap to human skills, confidence intervals are wider than the effect, and the authors' own 'fair grader' is an LLM judge they built themselves

* 00:00 - The zero-improvement puzzle. Human-written skills lift agent pass rates by sixteen points, but LLM-authored skills, fluent and plausible-looking, help exactly as much as nothing.
* 03:47 - What a skill is, and why one bit of feedback isn't enough. Skills are runtime documentation the agent may consult or ignore, and a single pass/fail signal collapses four distinct failure modes, making naive refinement actively erode good content.
* 07:35 - The two-run differential diagnosis. SkillAxe runs each task with and without the skill and grades the difference against the skill's own rules, asking four questions: did it help, did it fire on the right tasks, was it followed, and does it cover all valid solution paths.
* 11:23 - Trigger geometry: measuring targeting with embeddings. Skill descriptions are plotted on a semantic map to check activation zones and exclusion boundaries, revealing that humans almost never write exclusion clauses, while refined skills end up with three-times-wider discrimination margins.
* 15:11 - Fault attribution: whose fault was the wrong shade of yellow? The paper's centerpiece: when a rule is violated, the system asks whether the instruction was precise enough to deserve following, producing separate compliance and skill-quality scores instead of one muddled signal.
* 18:58 - Results: skills don't make agents smarter, they keep them from tripping. The headline 28% relative gain comes entirely from task completion, not correctness, illustrated by a Word placeholder trap that no amount of reasoning solves without procedural trivia.
* 22:46 - The flywheel experiment: compression, not accuracy. Streaming 200 tasks into a self-organizing skill library more than triples the bare agent's pass rate, but a naive 69-skill library matches it, so refinement's real win is fewer, sharper skills the agent actually loads.
* 26:34 - The steelman critique and the wrong-way flywheel. Tyler unpacks how the gap-closing claim shrinks from 67% to 11% under native scoring, the statistical power problem, the LLM-judge calibration issue, and the authors' own warning that imperfect diagnostics could bake errors into persistent documents.

RECOMMENDED READING
* Voyager: An Open-Ended Embodied Agent with Large Language Models [https://arxiv.org/abs/2305.16291] - The skill-library agent Tyler cites directly in the episode as the contrast case. Voyager could refine its skills because Minecraft provided free verification, exactly the oracle SkillAxe has to do without.
* Self-Refine: Iterative Refinement with Self-Feedback [https://arxiv.org/abs/2303.17651] - The foundational paper on having an LLM critique and rewrite its own outputs, which SkillAxe's evaluation-guided refinement loop extends from single responses to persistent skill documents.
* Large Language Models Cannot Self-Correct Reasoning Yet [https://arxiv.org/abs/2310.01798] - A skeptical look at self-improvement without external feedback that directly echoes the episode's opening puzzle: why LLM-authored skills scored zero until a structured diagnostic signal was added.
* Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [https://arxiv.org/abs/2306.05685] - The key study of LLM-judge positional and calibration biases that underpins Tyler's central critique of SkillAxe's self-built 'fair grader' and judge-driven diagnostics.
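
The two-run comparison and the fault-attribution split described in the 07:35 and 15:11 segments can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names RuleCheck, skill_effect, and attribute_faults, the example rules, and the scoring formulas are invented and are not taken from the SkillAxe paper. In the paper's setup the 'followed' and 'precise' judgments would come from a grader; here they are hand-filled booleans.

```python
from dataclasses import dataclass


@dataclass
class RuleCheck:
    rule: str       # a rule quoted from the skill document (invented examples below)
    followed: bool  # did the with-skill run comply with this rule?
    precise: bool   # was the rule specific enough to deserve following?


def skill_effect(passed_with_skill: bool, passed_without_skill: bool) -> int:
    """Two-run differential: +1 if the skill helped, -1 if it hurt, 0 if no change."""
    return int(passed_with_skill) - int(passed_without_skill)


def attribute_faults(checks):
    """Split violated rules into agent faults and skill faults.

    A violated rule that was precise counts against the agent (compliance).
    A violated rule that was vague counts against the skill (skill quality).
    The 1 - faults/total scoring is a hypothetical choice, not the paper's.
    """
    violated = [c for c in checks if not c.followed]
    agent_faults = [c.rule for c in violated if c.precise]
    skill_faults = [c.rule for c in violated if not c.precise]
    total = len(checks) or 1  # avoid dividing by zero on an empty rubric
    return {
        "compliance": 1 - len(agent_faults) / total,
        "skill_quality": 1 - len(skill_faults) / total,
        "fix_agent_side": agent_faults,
        "rewrite_in_skill": skill_faults,
    }


# Invented rubric: two rules were violated, but only one of them was vague.
checks = [
    RuleCheck("Save the output as .xlsx", followed=True, precise=True),
    RuleCheck("Fill flagged cells with hex color FFFF00", followed=False, precise=True),
    RuleCheck("Highlight flagged cells in yellow", followed=False, precise=False),
    RuleCheck("Keep the header row unchanged", followed=True, precise=True),
]
report = attribute_faults(checks)
print(skill_effect(passed_with_skill=True, passed_without_skill=False))
print(report)
```

The point of keeping two scores is the one the episode makes: the same failed task leads to opposite repairs. A precise rule that was ignored is an agent-side problem, while a vague rule that was "violated" is a reason to rewrite the skill.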

What Diffusion Language Models Were Missing: A Map, Not an Algorithm

Source: TextLDM: Language Modeling with Continuous Latent Diffusion [https://arxiv.org/abs/2605.07748]
Paper was published on May 08, 2026.
This episode was AI-generated on June 11, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A team built two text compressors with reconstruction accuracy identical to the second decimal place, and one produced a generative model eight times better than the other. The difference was invisible to every obvious metric, and the fix came from an unexpected place: borrowing the internal geometry of a frozen pretrained language model. The result is the first continuous latent diffusion model to pull level with GPT-2 on text continuation (trained from scratch in three days on eight GPUs), and a lesson about latent spaces that applies far beyond text.

KEY TAKEAWAYS
* Why 'can I reconstruct the data?' is the wrong test for a latent space: a representation can be a flawless lookup table and a hopeless landscape for a diffusion model to navigate
* How a single added loss term (REPA) that aligns the VAE's latents to a frozen Qwen language model's third-from-last layer boosts the hardest benchmark's MAUVE score from 2.5 to 20.4, without improving reconstruction at all
* The Stable Diffusion 3 recipe (DiT, flow matching, classifier-free guidance) transfers to text with zero modification, generating an entire paragraph in 50 fixed denoising steps instead of one forward pass per token
* TextLDM's 768M-parameter model beats size-matched GPT-2-large on most metrics, with the whole system trained from scratch in about three days on eight GPUs
* Where the claims reach: GPT-2 is a seven-year-old baseline, the evaluation only tests text continuation, and the paper's own appendix samples show fluency developing while factual fidelity doesn't
* The 'trained from scratch' asterisk: no pretrained component runs at inference, but the system distilled a foundation model's organization during training, and that borrowed geometry is the whole contribution

* 00:00 - The puzzle: two identical compressors, wildly different generators. A VAE that recovers 99.6% of words from compressed text turns out to be dramatically worse at generation than a twin with identical reconstruction numbers, the question that drives the whole episode.
* 03:18 - Why force language into diffusion at all. Generative AI is split between autoregressive text and diffusion-based images, and continuous latent diffusion is the only route to a single shared architecture, plus a fixed 50-step inference cost regardless of output length.
* 06:36 - The TextVAE bridge, and where reconstruction saturates. Stage one compresses each token into a continuous vector so diffusion has something to denoise, and reconstruction accuracy maxes out almost immediately across every configuration tried.
* 09:54 - Warehouse vs. library: why retrieval isn't navigation. An analogy for the paper's central insight: reconstruction only requires distinguishable addresses, but a diffusion model is a browser that needs meaningfully arranged neighborhoods to wander toward coherent text.
* 13:12 - REPA: a frozen language model as geometry teacher. A single loss term pulls the VAE encoder's representations into alignment with a frozen 1.7B Qwen model (at its third-from-last layer, not its final one), reshaping the latent space without touching reconstruction.
* 16:30 - Running the image recipe unmodified. Flow matching trained as a 'which way is home in the fog' direction field, plus classifier-free guidance lifted straight from image generation with a sweet-spot guidance scale of seven.
* 19:48 - Results: matching GPT-2, crushing prior diffusion LMs, and the eightfold ablation. TextLDM beats earlier diffusion language models, edges size-matched GPT-2-large on most metrics, and the REPA-versus-no-REPA comparison (20.4 vs 2.5 MAUVE) closes the loop on the opening puzzle, all on three days of compute.
* 23:06 - Watching prose condense from static, and what doesn't develop. The appendix denoising progressions go from word salad at step ten to fluent biography at step fifty, but the facts in those fluent outputs are frequently invented.
* 26:24 - The steelman critique and what actually endures. Dated baselines, a continuation-only evaluation, metric disagreements, and the 'from scratch' asterisk get weighed honestly, and the durable lesson lands: navigable latent geometry, not a better algorithm, was the missing ingredient.

RECOMMENDED READING
* Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think [https://arxiv.org/abs/2410.06940] - The original REPA paper from image diffusion: the 'load-bearing innovation' this episode spent its core segment on, here in its original vision-domain form before TextLDM repurposed it to shape a text VAE's latent geometry.
* Diffusion-LM Improves Controllable Text Generation [https://arxiv.org/abs/2205.14217] - The pioneering continuous text diffusion work in the 'frustrating lineage' the episode described, useful for seeing what the field tried before the latent-geometry ingredient was identified.
* Large Language Diffusion Models (LLaDA) [https://arxiv.org/abs/2502.09992] - The flagship of the discrete diffusion branch the episode contrasted with TextLDM's continuous approach, the competing answer to whether diffusion can absorb language.
* Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (Stable Diffusion 3) [https://arxiv.org/abs/2403.03206] - The exact recipe (DiT, flow matching, timestep sampling, classifier-free guidance) that the episode said TextLDM transplanted to text with zero modification.
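
The 'single added loss term' from the 13:12 segment can be pictured with a small Python sketch of a representation-alignment penalty added to a reconstruction loss. This is a sketch under stated assumptions, not TextLDM's implementation: the function names, the 1 - cosine form of the penalty, the weight of 0.5, and the premise that the latents have already been projected to the teacher's dimension are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_loss(latents, teacher_states):
    """Mean of (1 - cosine similarity) over token positions.

    latents:        one vector per token from the VAE encoder, assumed to be
                    already projected to the teacher's hidden size.
    teacher_states: one vector per token from a frozen language model layer.
    The loss is 0 when every pair points the same way and grows as they diverge.
    """
    sims = [cosine(z, h) for z, h in zip(latents, teacher_states)]
    return 1.0 - sum(sims) / len(sims)


def total_loss(recon_loss, latents, teacher_states, weight=0.5):
    """Reconstruction loss plus the weighted alignment term (weight is hypothetical)."""
    return recon_loss + weight * alignment_loss(latents, teacher_states)


# Toy two-token example with 2-dimensional vectors.
aligned = alignment_loss([[1.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [0.0, 1.0]])
misaligned = alignment_loss([[1.0, 0.0]], [[0.0, 1.0]])
print(aligned, misaligned)
```

Note that the alignment term depends only on the direction of the latents relative to the teacher, not on whether the text can be decoded. That is consistent with the episode's observation that such a term can reshape the latent space while reconstruction accuracy stays where it was.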

June 12, 2026 · 29 min
