How I AI

GLM 5.2: why I’m replacing Opus in Claude Code with this new model

27 min · Yesterday

Description

I put GLM 5.2, the open-weight coding model from Z.AI, through four real tasks inside my actual codebase: a codebase architecture audit, an HTML architecture and roadmap page, a UI redesign, and a 45-minute autonomous bug-hunting session pulling from Sentry and Vercel logs. Total cost: $3.36 for roughly 6 million tokens. For that I got a prioritized bug-fix dashboard I’m actually shipping from and a landing page redesign that matched ChatPRD’s design system on the first try.

What you’ll learn:
1. What “open-weight” actually means and why it matters for cost and vendor independence
2. How to connect GLM 5.2 to Cursor and Claude Code
3. How it performs on codebase exploration and autonomous architecture summarization in a real production Next.js app
4. Whether GLM 5.2 can match an existing design system
5. How the model handles a 45-minute long-running autonomous task
6. Where GLM 5.2 stumbled
7. The actual cost breakdown

Brought to you by:
Mercury (https://mercury.com/): Radically different banking loved by over 300K entrepreneurs

In this episode, we cover:
(00:00) What open-weight models are and why GLM 5.2 is worth testing
(01:38) GLM 5.2 model overview
(04:02) Capabilities and benchmark results
(06:02) How to set up GLM 5.2 in Cursor
(08:37) How to set up GLM 5.2 in Claude Code
(11:04) Live test 1: codebase exploration and architecture audit on ChatPRD
(12:43) Live test 2: generating an HTML architecture and roadmap page
(16:37) Live test 3: redesigning the How I AI landing page in Cursor
(20:57) Live test 4: 45-minute autonomous task, pulling Sentry errors and Vercel logs
(22:35) Where it struggled
(23:49) My verdict on the output
(25:23) Cost breakdown

Tools referenced:
* z.ai: https://z.ai
* GLM 5.2: https://z.ai/blog/glm-5.2
* OpenRouter: https://openrouter.ai
* Cursor: https://cursor.com
* Claude Code: https://docs.anthropic.com/en/docs/claude-code
* Sentry: https://sentry.io
* Vercel: https://vercel.com

Other references:
* SWE-Bench Pro leaderboard (coding benchmark scores referenced in episode): https://www.swebench.com
* Frontier Suite and Post-Train Bench (additional benchmarks cited): https://scale.com/leaderboard
* Use Claude Code with OpenRouter: https://openrouter.ai/docs/cookbook/coding-agents/claude-code-integration

Where to find Claire Vo:
ChatPRD: https://www.chatprd.ai/
Website: https://clairevo.com/
LinkedIn: https://www.linkedin.com/in/clairevo/
X: https://x.com/clairevo

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email jordan@penname.co.
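As a rough check on the cost figure in the description, the snippet below works out the blended price per million tokens from the two numbers quoted there ($3.36 and roughly 6 million tokens). It is a back-of-the-envelope sketch: the token count is approximate, input and output tokens are not priced separately, and the function name `cost_per_million` is made up for illustration.

```python
def cost_per_million(total_cost_usd: float, total_tokens: int) -> float:
    """Return the blended cost in USD per one million tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return total_cost_usd / (total_tokens / 1_000_000)


if __name__ == "__main__":
    # Figures quoted in the episode description; the token count is "roughly" 6M.
    blended = cost_per_million(3.36, 6_000_000)
    print(f"${blended:.2f} per million tokens")  # prints "$0.56 per million tokens"
```

The same function can be pointed at any other session by swapping in the totals from a provider's usage page.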

All episodes

87 episodes

How Claude Mythos found a 15-year-old bug in Mozilla Firefox | Brian Grinstead

Brian Grinstead is a distinguished engineer at Mozilla, where he’s worked on Firefox and the web platform since 2013 (he joined to help launch Firefox DevTools). Recently he and his team pointed an agentic bug-finding pipeline at Firefox, a codebase with tens of thousands of files and tens of millions of lines of code, and shipped a record month of security fixes. The viral chart everyone saw gave the credit to Anthropic’s new Mythos model. Brian’s take is that the harness and pipeline did just as much of the work, and he walks through exactly how it runs and how anyone can build a starter version.

What you’ll learn:
1. How to build a basic bug-finding harness by running Claude Code or Codex with one prompt and the -p flag, no SDK required
2. Why pointing an agent at a whole codebase fails, and how an LLM judge can score and rank files before you spend any compute
3. How a verifier subagent kills false positives by catching the agent when it cheats
4. The goal-loop pattern: give an agent a tightly scoped problem, a clear pass/fail signal, and let it retry far past the point a human would quit
5. Why teams that already invested in fuzzing, CI, and dev tooling are so far ahead
6. How to weigh model versus harness, and why Brian splits the credit close to 50-50
7. How a non-engineer can reuse the same score, verify, and fix loop for design quality, conversion rate, or tech debt
8. Why AI-generated patches still can’t ship on their own, and where humans stay in the loop

Brought to you by:
WorkOS (https://workos.com/?utm_source=lennys_howiai&utm_medium=podcast&utm_campaign=q22025): Make your app enterprise-ready today
Metaview (https://www.metaview.ai/home/how-i-ai): The agentic recruiting platform for winning teams

In this episode, we cover:
(00:00) Introduction to Brian Grinstead
(02:43) The viral chart: Firefox Security Bug Fixes by Month
(05:32) How the custom harness works
(10:22) Goal loops and guardrails
(14:45) How they built it
(16:55) Real bugs, including a 15-year-old one
(23:00) Open-sourcing it
(26:26) Why humans still review every fix
(32:30) Live demo and prioritizing files
(40:18) Mobilizing the team and recap
(42:33) Lightning round

Tools referenced:
• Claude Code: https://claude.ai/code
• Claude Agent SDK: https://code.claude.com/docs/en/agent-sdk/overview
• Codex: https://openai.com/index/openai-codex/
• OpenAI Agent SDK: https://developers.openai.com/api/docs/guides/agents
• VS Code: https://code.visualstudio.com/
• Docker: https://www.docker.com/
• Firefox: https://www.mozilla.org/firefox/
• Address Sanitizer: https://github.com/google/sanitizers
• RLBox: https://rlbox.dev/

Other references:
• Mozilla Bug Bounty Program: https://www.mozilla.org/security/bug-bounty/
• Mozilla GitHub: https://github.com/mozilla

Where to find Brian Grinstead:
LinkedIn: https://www.linkedin.com/in/bgrins/
GitHub: https://github.com/bgrins

Where to find Claire Vo:
ChatPRD: https://www.chatprd.ai/
Website: https://clairevo.com/
LinkedIn: https://www.linkedin.com/in/clairevo/
X: https://x.com/clairevo

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email jordan@penname.co.
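The starter harness described in this episode (rank files with a judge first, then run an agent with one prompt and the -p flag) might look roughly like the sketch below. This is not Mozilla's pipeline: `toy_judge` is a hypothetical keyword heuristic standing in for an LLM judge, the prompt wording and file paths are invented, and the command is only built here, not executed.

```python
from typing import Callable


def rank_files(paths: list[str], judge: Callable[[str], float], top_k: int) -> list[str]:
    """Score every path with the judge and keep the top_k highest-scoring ones."""
    # sorted() is stable, so ties keep their original relative order.
    return sorted(paths, key=judge, reverse=True)[:top_k]


def toy_judge(path: str) -> float:
    """Hypothetical stand-in for an LLM judge: favors native code that parses input."""
    score = 0.0
    if path.endswith((".c", ".cpp", ".h")):
        score += 1.0
    if "parser" in path or "decoder" in path:
        score += 1.0
    return score


def build_command(path: str) -> list[str]:
    """Build a one-prompt, non-interactive agent invocation using the -p flag."""
    prompt = (
        f"Look for memory-safety bugs in {path}. "
        "Report only issues you can reproduce."
    )
    return ["claude", "-p", prompt]


if __name__ == "__main__":
    # Invented example paths, for illustration only.
    files = [
        "docs/readme.md",
        "src/image/decoder.cpp",
        "src/ui/button.cpp",
        "src/net/parser.c",
    ]
    for path in rank_files(files, toy_judge, top_k=2):
        # Each list could be handed to subprocess.run(...) to launch the agent.
        print(build_command(path))
```

Ranking before running is the point of the design: the expensive agent call only happens for the files the cheap judge considers most promising.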

Jun 22, 2026 · 48 min
How to design AI agent loops: schedules, goals, and subagents in Claude Code and Codex

I break down every loop type from scratch: what a heartbeat, cron, hook, and goal loop actually are, when each one fits, and the five things any effective loop needs before it touches production. Then I build two live loops: a daily aging-PR reviewer in Claude Code that schedules itself at 10:15 a.m. and spins off its own subagents, and a weekly skills-identification loop in Codex that spawns goal-based subagents to validate its own output in real time.

What you’ll learn:
1. The plain-English definition of a loop, and why it’s just an automated prompt, not a scary new paradigm
2. The four loop types (heartbeat, cron, hook, and goal) and when each one actually fits your workflow
3. How to think about loop design using the “onboarding an employee” mental model
4. The five things every effective loop needs: work trees, skills, plugins/connectors, subagents, and state tracking
5. How to build a scheduled PR-review routine in Claude Code that babysits aging PRs and alerts your team
6. How to set up a weekly skills-identification automation in Codex that spawns its own validating subagents
7. Why goal-based loops are the hardest to write well, and where most people burn tokens for nothing
8. The two warning signs that your loop is going to get expensive before it gets useful

Brought to you by:
WorkOS (https://workos.com/?utm_source=lennys_howiai&utm_medium=podcast&utm_campaign=q22025): Make your app enterprise-ready today
Runway (https://runwayml.com/howIAI): The creative AI platform for images, video, and more

In this episode, we cover:
(00:00) Prompts are out and loops are in
(02:30) Defining a loop
(03:03) The four ways to automate a prompt: heartbeat, cron, hooks, and goals
(06:03) Five things every effective loop needs
(09:26) The “onboarding an employee” framework for designing loops
(11:58) Live build #1: Daily aging PR loop in Claude Code
(17:08) Subagents inside loops
(19:00) Live build #2: Weekly skills identification loop in Codex
(22:57) Watching subagents spin up in real time
(25:28) Warning signals around loops
(27:31) What listeners are doing with loops

Tools referenced:
• Claude Code: https://claude.ai/code
• Codex: https://chatgpt.com/codex
• OpenClaw: https://openclaw.ai/

Other references:
• Claire’s article “Why OpenClaw Feels Alive Even Though It’s Not”: https://x.com/clairevo/article/2017741569521271175
• Addy Osmani’s article on loop engineering: https://addyosmani.com/blog/loop-engineering/
• Using Goals in Codex: https://developers.openai.com/cookbook/examples/codex/using_goals_in_codex

Where to find Claire Vo:
ChatPRD: https://www.chatprd.ai/
Website: https://clairevo.com/
LinkedIn: https://www.linkedin.com/in/clairevo/
X: https://x.com/clairevo

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email jordan@penname.co.
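Of the four loop types, the goal loop is the one that most benefits from seeing its shape in code. The sketch below is a minimal, assumption-heavy version: `attempt` stands in for whatever an agent does on each try, `passes` is the pass/fail check, and `max_attempts` is a hard cap meant to keep a badly specified goal from burning tokens forever. None of these names come from Claude Code or Codex; they are illustrative only.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def goal_loop(
    attempt: Callable[[int], T],
    passes: Callable[[T], bool],
    max_attempts: int,
) -> Optional[tuple[int, T]]:
    """Retry until the check passes or the attempt budget runs out.

    Returns (attempt_number, result) on success, or None if no attempt passed.
    """
    for n in range(1, max_attempts + 1):
        candidate = attempt(n)
        if passes(candidate):
            return n, candidate
    return None


if __name__ == "__main__":
    # Toy stand-ins: each "attempt" produces n squared; the goal is a value >= 10.
    result = goal_loop(lambda n: n * n, lambda value: value >= 10, max_attempts=5)
    print(result)  # prints "(4, 16)"
```

The cap matters as much as the check: with `max_attempts=3` the same toy goal returns `None` instead of looping on.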

Jun 17, 2026 · 29 min
How Braintrust uses AI agents, evals, and CI to ship better software | Ankur Goyal

In this episode, I sit down with Ankur Goyal, founder and CEO of Braintrust, the AI evals and observability platform used by teams like Notion, Stripe, Vercel, and Zapier. This one is for the senior engineers, staff engineers, VPs of engineering, and CTOs in my audience. We get into how coding agents can take on deeply technical architecture and infrastructure work that no single human engineer could tackle before, and then we demystify evals so you can use them to make your AI products better without touching the implementation.

What you’ll learn:
1. How Ankur uses Codex to run week-long benchmark experiments across database indexes, column store formats, and execution engines to speed up slow queries
2. Why he argues there’s no excuse to skip rigorous benchmarking now that agents can run them tirelessly
3. The “agent line” framework: how to decide which decisions, directions, and interactions you can hand off to an agent
4. How I think about the practical vs. theoretical quality of AI on hard technical problems, and why human attention decays on tedious work
5. Why evals are the modern version of a PRD, and how to encode “what good looks like” so a model can figure out the “how”
6. How to build a scoring function live and let an agent improve your prompt inside a safe playground
7. How Ankur turned his designer David’s taste into a repeatable eval so quality scales beyond one person
8. Why fixing your CI is the highest-leverage way to speed up engineering velocity

Brought to you by:
Guru (https://www.getguru.com/?utm_source=howi_ai_podcast&utm_medium=podcast&utm_campaign=q1): The AI layer of truth
Persona (https://withpersona.com/lp/howiai): Trusted identity verification for any use case

In this episode, we cover:
(00:00) Introduction to Ankur Goyal
(03:00) Using AI agents for database optimization
(06:10) Running exhaustive benchmarks with coding agents
(09:03) Why staff engineers are wrong about AI limitations
(11:30) The “agent line” framework for delegation
(14:00) Ankur’s workflow: running 4 to 6 concurrent agents
(17:16) Technical setup: foreground agents, background agents, and cloud environments
(20:32) Spending time with AI tools
(23:06) Demystifying evals
(26:02) Live demo: Building an eval for documentation answers
(30:20) The alternative to evals: vibe checks and whack-a-mole
(32:09) Capturing designer taste in scoring functions
(33:13) Quick recap
(33:44) Managing velocity and throughput
(35:40) Why CI/CD investment is critical for AI-accelerated teams
(37:30) Ankur’s prompting strategy when agents fail
(39:10) Closing thoughts and how to connect

Tools referenced:
• Braintrust: https://www.braintrust.dev/
• Codex: https://openai.com/codex/
• GPT 5.4: https://developers.openai.com/api/docs/models/gpt-5.4
• Claude: https://claude.ai/

Other references:
• GPT 5.5 just did what no other model could: https://www.lennysnewsletter.com/p/gpt-55-just-did-what-no-other-model
• Paul Graham’s Maker vs. Manager Schedule: http://www.paulgraham.com/makersschedule.html
• tmux: https://github.com/tmux/tmux
• Chris Tate at Vercel: https://www.linkedin.com/in/ctatedev/

Where to find Ankur Goyal:
LinkedIn: https://www.linkedin.com/in/ankrgyl/

Where to find Claire Vo:
ChatPRD: https://www.chatprd.ai/
Website: https://clairevo.com/
LinkedIn: https://www.linkedin.com/in/clairevo/
X: https://x.com/clairevo

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email jordan@penname.co.
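To make the idea of a scoring function for documentation answers concrete, here is a small sketch in plain Python. It does not use the Braintrust SDK and is not the eval built in the episode: the scoring rule (required-term coverage, halved for over-long answers), the sample cases, and the `canned_answers` stub that replaces a real model call are all assumptions for illustration.

```python
from typing import Callable


def score_answer(answer: str, required_terms: list[str], max_words: int = 80) -> float:
    """Return a 0..1 score: share of required terms mentioned, halved if too long."""
    text = answer.lower()
    hits = sum(1 for term in required_terms if term.lower() in text)
    coverage = hits / len(required_terms) if required_terms else 0.0
    length_factor = 0.5 if len(answer.split()) > max_words else 1.0
    return coverage * length_factor


def run_eval(cases: list[dict], generate: Callable[[str], str]) -> float:
    """Average the score of generate(question) over all cases."""
    scores = [
        score_answer(generate(case["question"]), case["required_terms"])
        for case in cases
    ]
    return sum(scores) / len(scores)


# Invented sample cases encoding "what good looks like" for two questions.
CASES = [
    {"question": "How do I rotate an API key?", "required_terms": ["settings", "rotate"]},
    {"question": "How do I export logs?", "required_terms": ["export", "csv"]},
]

# Stub in place of a real model call, so the sketch runs offline.
canned_answers = {
    "How do I rotate an API key?": "Open Settings and click Rotate key.",
    "How do I export logs?": "Use the Export button.",
}

if __name__ == "__main__":
    print(run_eval(CASES, lambda q: canned_answers[q]))  # prints 0.75
```

Once the score is a number, a prompt change can be judged by whether the average goes up, rather than by a vibe check.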

Jun 15, 2026 · 40 min
Claude Fable 5 review: what the new Mythos model gets right (and very wrong)

Claude Fable 5 is the first Mythos-class intelligence model to be generally available, and I got early access to test it before launch. In this episode, I walk through what Anthropic is promising, what actually stood out when I used it on real work, and where I think it fits in your AI stack.

In this episode, we cover:
(00:00) Introduction: Fable 5 is finally here
(00:31) What Anthropic says about the model
(05:14) Token-intensive by design
(06:28) Safety classifiers and the new fallback concept
(07:46) Is this or is this not Mythos?
(08:30) New product launches: Managed Agents and more
(09:20) Crushing benchmarks
(09:55) What it’s actually like to use (the good and the bad)
(11:40) Test 1: product graph spec
(12:56) Test 2: designing a skills registry
(14:04) Conservative on execution
(14:43) Test 3: multi-agent orchestration
(15:39) My takeaways

Tools referenced:
• Claude Fable 5: https://www.anthropic.com/news/claude-fable-5-mythos-5
• Claude Managed Agents: https://platform.claude.com/docs/en/managed-agents/overview

Other reference:
• SWE-Bench Pro benchmark: https://www.swebench.com/

Where to find Claire Vo:
ChatPRD: https://www.chatprd.ai/
Website: https://clairevo.com/
LinkedIn: https://www.linkedin.com/in/clairevo/
X: https://x.com/clairevo

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email jordan@penname.co.

Jun 9, 2026 · 17 min