GPT-5.5 vs Reality: Do Benchmarks Lie?

1 h 0 min · 25 Apr 2026

Description

Tim and Paul dissect the GPT-5.5 launch, weighing state-of-the-art benchmark scores against real-world user vibes and token efficiency to decide whether the upgrade is worth the higher cost for developers running production workloads at scale. They also unpack the HTML-in-Canvas proposal, which aims to bridge the gap between DOM and canvas rendering and to open up new possibilities for accessibility, interactive web graphics, and shader-driven transitions without fragile hacks. Finally, Tim shares exclusive results from a creative AI benchmark that tests model taste and planning. The results surface surprising winners beyond the standard leaderboards, show how far real-world performance can diverge from the spec sheet, and highlight which models have the creative judgment needed for complex multi-step tasks without hand-holding.

All episodes

18 episodes

Fable 5 Banned: The Multi-Model Escape Plan

Anthropic launched Claude Fable 5 with huge expectations, only to see the US government order it pulled globally three days later. Tim and Paul dig into the swirling conspiracy theories: was it retaliation for refusing to arm the Pentagon? Did a competitor exploit a jailbreak report to kneecap a rival? And did Anthropic’s own transparency accidentally hand over the rope? Then the conversation pivots to token anxiety, ballooning API costs, and the open-source models like GLM 5.2 and DeepSeek V4 Pro that now rival proprietary giants at a fraction of the price. The episode’s core insight: a three-stage workflow—planning with a flagship model, implementing with a cheap or local one, and reviewing with a third—lets developers escape single-point-of-failure risks and spiraling bills, and it's already taking shape across the coding community.

19 June 2026 · 1 h 0 min

AI Didn't Invent These Problems

Tim and Paul break down Anthropic's Fable 5 pricing disconnect, the dav1d assembly decoder that outraced higher-level implementations, and why Agile's 2001 playbook stumbles when agents build apps in hours. They critique the hype around autonomous agent loops, highlighting the real constraints—budgets, tests, and decision quality—that determine whether AI accelerates value or just incinerates tokens. It's a tight hour on the shifting boundaries of craft, process, and the problems AI reveals but can't solve on its own.

12 June 2026 · 1 h 0 min

Why 95% of AI Pilots Fail (And How to Fix It)

Tim kicks things off with an AWS agent nightmare that couldn't tell dev from prod, sparking a deep dive into where deterministic pipelines end and true LLM reasoning begins. Using a clever flight-tracking case study, the hosts map out when to use frontier models, local open-weight models, or no AI at all—then connect it all to an MIT study showing 95% of generative AI pilots fail to deliver profit, often because companies treat the API bill itself as a success metric. If you're wrestling with agentic vs. scripted workflows, bloated AI spend, or just an editor that can't keep up, this conversation offers a clearer lens for building with intention.

5 June 2026 · 1 h 0 min

The Tools That Got Us Here Won't Get Us There

Tim and Paul dissect why CI/CD pipelines are buckling under the speed of AI-generated code, sharing strategies like pre-commit hooks, intelligent test selection, and ephemeral preview environments to survive the new velocity. Then Paul makes the case that Cursor’s reliance on VS Code could doom it, and they debate whether SpaceX’s $60 billion option to acquire the editor will be a lifeline or a chaos bomb. This episode is a candid look at how the entire developer toolchain, from pipelines to editors, is being forced to reinvent itself for the AI era.

29 May 2026 · 1 h 0 min

RAG Isn't Dead, You're Just New Here

In this episode, Tim and Paul dismantle the popular 'RAG is dead' and 'MCP is over' hot takes, revealing what these declarations actually say about the author's place on the AI learning curve. They explore the overlooked depth of retrieval-augmented generation and the Model Context Protocol, weigh CLI-first agent workflows against structured tool use, and share practical tips like using tmux as an agent dashboard. The takeaway: every tool has limits, but declaring a paradigm dead often means you've stopped learning, and there's a smarter way forward.

23 May 2026 · 1 h 0 min
