GLM 5.2: why I’m replacing Opus in Claude Code with this new model

Description

I put GLM 5.2, the open-weight coding model from Z.AI, through four real tasks inside my actual codebase: a codebase architecture audit, an HTML architecture and roadmap page, a UI redesign, and a 45-minute autonomous bug-hunting session pulling from Sentry and Vercel logs. Total cost: $3.36 for roughly 6 million tokens. What I got for it: a prioritized bug-fix dashboard I’m actually shipping from and a landing page redesign that matched ChatPRD’s design system on the first try.

What you’ll learn:
1. What “open-weight” actually means and why it matters for cost and vendor independence
2. How to connect GLM 5.2 to Cursor and Claude Code
3. How it performs on codebase exploration and autonomous architecture summarization in a real production Next.js app
4. Whether GLM 5.2 can match an existing design system
5. How the model handles a 45-minute long-running autonomous task
6. Where GLM 5.2 stumbled
7. The actual cost breakdown

Brought to you by:
* Mercury (https://mercury.com/): Radically different banking loved by over 300K entrepreneurs

In this episode, we cover:
(00:00) What open-weight models are and why GLM 5.2 is worth testing
(01:38) GLM 5.2 model overview
(04:02) Capabilities and benchmark results
(06:02) How to set up GLM 5.2 in Cursor
(08:37) How to set up GLM 5.2 in Claude Code
(11:04) Live test 1: codebase exploration and architecture audit on ChatPRD
(12:43) Live test 2: generating an HTML architecture and roadmap page
(16:37) Live test 3: redesigning the How I AI landing page in Cursor
(20:57) Live test 4: 45-minute autonomous task, pulling Sentry errors and Vercel logs
(22:35) Where it struggled
(23:49) My verdict on the output
(25:23) Cost breakdown

Tools referenced:
* z.ai: https://z.ai/
* GLM 5.2: https://z.ai/blog/glm-5.2
* OpenRouter: https://openrouter.ai/
* Cursor: https://cursor.com/
* Claude Code: https://docs.anthropic.com/en/docs/claude-code
* Sentry: https://sentry.io/
* Vercel: https://vercel.com/

Other references:
* SWE-Bench Pro leaderboard (coding benchmark scores referenced in episode): https://www.swebench.com/
* Frontier Suite and Post-Train Bench (additional benchmarks cited): https://scale.com/leaderboard
* Use Claude Code with OpenRouter: https://openrouter.ai/docs/cookbook/coding-agents/claude-code-integration

Where to find Claire Vo:
* ChatPRD: https://www.chatprd.ai/
* Website: https://clairevo.com/
* LinkedIn: https://www.linkedin.com/in/clairevo/
* X: https://x.com/clairevo

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email jordan@penname.co.
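The headline cost figure can be turned into a blended per-token rate. This is a minimal sketch of that arithmetic, assuming the quoted $3.36 and roughly 6 million tokens are totals across all four tasks and that one flat rate applies (the variable names are illustrative, and any input/output or caching price split is ignored):

```python
# Figures quoted in the episode description; the token count is approximate.
total_cost_usd = 3.36
total_tokens = 6_000_000

# Blended rate: treats every token as costing the same,
# which ignores any input/output or cached-token pricing differences.
cost_per_million = total_cost_usd / (total_tokens / 1_000_000)

print(f"~${cost_per_million:.2f} per million tokens")
```

Under those assumptions the session works out to about $0.56 per million tokens.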

How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal

In this episode, I sit down with Ankur Goyal, founder and CEO of Braintrust, the AI evals and observability platform used by teams like Notion, Stripe, Vercel, and Zapier. This one is for the senior engineers, staff engineers, VPs of engineering, and CTOs in my audience. We get into how coding agents can take on deeply technical architecture and infrastructure work that no single human engineer could tackle before, and then we demystify evals so you can use them to make your AI products better without touching the implementation.

What you’ll learn:
1. How Ankur uses Codex to run week-long benchmark experiments across database indexes, column store formats, and execution engines to speed up slow queries
2. Why he argues there’s no excuse to skip rigorous benchmarking now that agents can run them tirelessly
3. The “agent line” framework: how to decide which decisions, directions, and interactions you can hand off to an agent
4. How I think about the practical vs. theoretical quality of AI on hard technical problems, and why human attention decays on tedious work
5. Why evals are the modern version of a PRD, and how to encode “what good looks like” so a model can figure out the “how”
6. How to build a scoring function live and let an agent improve your prompt inside a safe playground
7. How Ankur turned his designer David’s taste into a repeatable eval so quality scales beyond one person
8. Why fixing your CI is the highest-leverage way to speed up engineering velocity

Brought to you by:
• Guru (https://www.getguru.com/?utm_source=howi_ai_podcast&utm_medium=podcast&utm_campaign=q1): The AI layer of truth
• Persona (https://withpersona.com/lp/howiai): Trusted identity verification for any use case

In this episode, we cover:
(00:00) Introduction to Ankur Goyal
(03:00) Using AI agents for database optimization
(06:10) Running exhaustive benchmarks with coding agents
(09:03) Why staff engineers are wrong about AI limitations
(11:30) The “agent line” framework for delegation
(14:00) Ankur’s workflow: running 4 to 6 concurrent agents
(17:16) Technical setup: foreground agents, background agents, and cloud environments
(20:32) Spending time with AI tools
(23:06) Demystifying evals
(26:02) Live demo: Building an eval for documentation answers
(30:20) The alternative to evals: vibe checks and whack-a-mole
(32:09) Capturing designer taste in scoring functions
(33:13) Quick recap
(33:44) Managing velocity and throughput
(35:40) Why CI/CD investment is critical for AI-accelerated teams
(37:30) Ankur’s prompting strategy when agents fail
(39:10) Closing thoughts and how to connect

Tools referenced:
• Braintrust: https://www.braintrust.dev/
• Codex: https://openai.com/codex/
• GPT 5.4: https://developers.openai.com/api/docs/models/gpt-5.4
• Claude: https://claude.ai/

Other references:
• GPT 5.5 just did what no other model could: https://www.lennysnewsletter.com/p/gpt-55-just-did-what-no-other-model
• Paul Graham’s Maker vs. Manager Schedule: http://www.paulgraham.com/makersschedule.html
• tmux: https://github.com/tmux/tmux
• Chris Tate at Vercel: https://www.linkedin.com/in/ctatedev/

Where to find Ankur Goyal:
• LinkedIn: https://www.linkedin.com/in/ankrgyl/

Where to find Claire Vo:
• ChatPRD: https://www.chatprd.ai/
• Website: https://clairevo.com/
• LinkedIn: https://www.linkedin.com/in/clairevo/
• X: https://x.com/clairevo

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email jordan@penname.co.
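The idea of encoding “what good looks like” as a scoring function can be sketched in a few lines. This is an illustrative example only, not Braintrust’s API and not the scorer built in the episode: the function names, the required-terms criterion, and the word limit are all assumptions chosen to show the shape of an eval for documentation answers.

```python
from typing import Callable


def score_answer(answer: str, required_terms: list[str], max_words: int = 120) -> float:
    """Hypothetical scorer: fraction of required terms present, or 0.0 if the answer is too long."""
    if len(answer.split()) > max_words:
        return 0.0
    if not required_terms:
        return 1.0
    lowered = answer.lower()
    hits = sum(1 for term in required_terms if term.lower() in lowered)
    return hits / len(required_terms)


def run_eval(cases: list[dict], generate: Callable[[str], str]) -> float:
    """Average the score over all cases; `generate` stands in for the model or prompt under test."""
    scores = [score_answer(generate(c["question"]), c["required_terms"]) for c in cases]
    return sum(scores) / len(scores)


# Made-up cases and a stub generator, purely for illustration.
cases = [
    {"question": "How do I authenticate?", "required_terms": ["API key", "header"]},
    {"question": "How do I paginate?", "required_terms": ["cursor", "limit"]},
]


def stub_generate(question: str) -> str:
    return "Send your API key in the Authorization header."


print(run_eval(cases, stub_generate))
```

The point of the design is that the scorer says nothing about how the answer is produced, so a person or an agent can change the prompt freely and rerun `run_eval` to see whether the average moved.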

June 15, 2026 · 40 min
