Is Claude Opus 4.7 Mythos Distilled, Running Qwen 3.6 Locally, and the AI-On-AI Arena

Description

Is Claude Opus 4.7 really burning tokens? Is open source dead after Mythos? Co-hosts Shimin Zhang and Dan Lasky, with recurring guest Rahul Yadav, ran the experiments this week on ADI Pod #22 (April 21, 2026). This episode covers Anthropic's Claude Opus 4.7 release (the "Mythos slice"), Alibaba's open-source Qwen 3.6 35B A3B, cal.com going closed source for security reasons, and a HIPAA-violating vibe-coded patient portal that is, in Dan's words, the bullshit future already here.

## In this episode

▸ **Claude Opus 4.7 review**: the new Mythos-derived tokenizer (3× bloat on plain English), stricter instruction-following, and why Shimin's SVG experiments suggest the token-burn panic is overblown: 35¢ on Opus 4.7 vs $2 on Opus 4.6 for the same task, with ~40× fewer reasoning tokens.

▸ **Qwen 3.6 35B A3B**: Alibaba's open-source mixture-of-experts model (3B active params at any time) running locally on Shimin's laptop at 90–95 tokens/sec via llama.cpp + Unsloth. The first model to break Simon Willison's pelican-on-a-bicycle benchmark against a larger frontier model.

▸ **cal.com goes closed source**: why the AI Security Institute's $12,000-per-attempt Mythos pentesting data ($125,000 for 10 runs) is changing the open-source calculus, and Drew Breunig's three-phase dev/review/hardening cycle prediction.

▸ **Jesse Vincent's "Rules and Gates"**: a coding-agent prompting technique that reformulates optional preferences into directed preconditions, and whether agents can "weasel out" by rewriting the gate itself.

▸ **AI vibe coding horror story**: a German doctor who inlined a full patient portal into a single HTML page with database credentials client-side. HIPAA, meet DSGVO.

▸ **Kyle Kingsbury's "The Future of Everything is Lies"**: the Jepsen author's 8-step action list on AI's second- and third-order societal effects.

▸ **The AI-on-AI Arena**: Shimin's weekend project grading 11 frontier models against each other. The "delusion index" reads almost exactly like Dunning-Kruger in humans: GPT-5.4 scored -1.6 (humble), while Gemini 3.1 Pro Preview rated itself well even though peers ranked it last.

▸ **Two Minutes to Midnight**: Paul Graham's log-scale chart comparing AI capex (~1% of US GDP) to the US railroad peak (~10%). We dialed the AI bubble clock back 45 seconds to 3 min 30 sec.

## Key takeaways

• Opus 4.7's token-burn reputation may be overblown. Stricter instruction-following can reduce total reasoning tokens by up to 40× vs Opus 4.6 on the same task.
• Security-driven closed-sourcing may spread as Mythos-class agents make open repos easier to exploit. Hardening could make software capital-intensive again.
• Cognitive debt is real: Dan's wake-up call was a production bug a pre-LLM colleague solved in 5 minutes. His first instinct was to double down on the tool.
• Shimin's defense against skill atrophy: read 100% of LLM-generated PR lines (except tests).
• Weaker models rate themselves higher than stronger ones. Calibration appears to improve with capability.
## Chapters

* (00:00) - Introduction to AI and Software Development
* (02:25) - Alibaba's Qwen 3.6 Model Overview
* (08:06) - Anthropic's Claude Opus 4.7 Release
* (18:08) - Cal.com Goes Closed Source: Implications for Security
* (20:40) - The Future of Vibe Coding
* (23:41) - Techniques for Effective AI Utilization
* (27:13) - Post-Processing and AI in Real-World Applications
* (33:07) - The Cultural Impact of AI and Technology
* (41:30) - Navigating Code Review Challenges
* (42:57) - Exploring AI's Societal Impact
* (45:16) - Evaluating AI Models: Performance and Insights
* (49:09) - The Future of Data Centers and AI
* (50:54) - Investment Trends and Economic Perspectives
* (57:58) - Reflections on Historical Investment Cycles
* (59:35) - Optimism Amidst Uncertainty

## Resources mentioned

**Claude Opus 4.7 & Qwen 3.6**
• Introducing Claude Opus 4.7 (Anthropic): https://www.anthropic.com/news/claude-opus-4-7
• Claude Opus 4.7 System Card: https://cdn.sanity.io/files/4zrzovbb/website/037f06850df7fbe871e206dad004c3db5fd50340.pdf
• Qwen3.6-35B-A3B: Agentic Coding Power, Now Open to All: https://qwen.ai/blog?id=qwen3.6-35b-a3b
• Simon Willison: Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7: https://simonwillison.net/2026/Apr/16/qwen-beats-opus/
• Shimin: Opus 4.7 isn't dumb, it's just lazy: https://shimin.io/journal/opus-4-7-just-lazy/

**Security & open source**
• Cal.com is going closed source. Here's why: https://cal.com/blog/cal-com-goes-closed-source-why
• Drew Breunig: Cybersecurity Looks Like Proof of Work Now: https://www.dbreunig.com/2026/04/14/cybersecurity-is-proof-of-work-now.html

**Technique & commentary**
• Jesse Vincent: Rules and Gates: https://blog.fsck.com/2026/04/07/rules-and-gates/
• An AI Vibe Coding Horror Story: https://www.tobru.ch/an-ai-vibe-coding-horror-story/
• Kyle Kingsbury (Aphyr): The Future of Everything is Lies, I Guess: https://aphyr.com/posts/411-the-future-of-everything-is-lies-i-guess

**Shimin's project**
• AI-on-AI Arena: https://shimin.io/ai-on-ai-arena

**Bubble watch**
• Ars Technica: Satellite and drone images reveal big delays in US data center construction: https://arstechnica.com/ai/2026/04/construction-delays-hit-40-of-us-data-centers-planned-for-2026/
• Epoch AI: OpenAI Stargate: where the US sites stand: https://epochai.substack.com/p/openai-stargate-where-the-us-sites
• Paul Graham on US investment cycles (log scale): https://x.com/paulg/status/2045120274551423142/photo/1

## About ADI Pod

ADI Pod (Artificial Developer Intelligence) is a weekly podcast about AI and software development for working developers. Co-hosts Shimin Zhang and Dan Lasky go through hundreds of links and dozens of newsletters every week so you don't have to. Recurring guest Rahul Yadav joins when he can.

• Website: https://www.adipod.ai
• Email: humans@adipod.ai

New episodes every Friday. Follow the show to get them automatically.

OpenAI's Goblin Problem, 10 Lessons When Code Is Cheap, AI Addiction Loop

Why does the leaked Codex CLI system prompt explicitly tell GPT-5.5 to never mention goblins, gremlins, raccoons, trolls, ogres, or pigeons? Why is OpenAI now gating its cyber model the same way it mocked Anthropic for gating Mythos last month? And what does it mean that Dan tried to write a personal project without Claude and physically couldn't? Co-hosts Shimin Zhang, Dan Lasky, and Rahul Yadav cover these and more on ADI Pod #24. This week: GPT-5.5 Cyber's gated release, OpenAI's "Where the Goblins Came From" RLHF post-mortem, Addy Osmani's five patterns for long-running agents, Jesse Vincent's adversarial review prompt, Drew Breunig's 10 lessons for agentic coding, Ivan Turkovic's history of failed attempts to eliminate programmers, Nilay Patel's "software brain" thesis, the Nature paper showing warm AI models lose 10–30 percentage points of accuracy, and a $1.1B raise for an AI lab that wants to train without human data.

## In this episode

▸ **GPT-5.5 Cyber gating**: Sam Altman called Mythos's gated release "fear-based marketing" two months ago. Now OpenAI is doing the exact same thing with the GPT-5.5 cyber variant. Multi-tier model access (enterprise, government, research preview, cyber) is becoming the default, and Shimin worries the White House is about to add another gate.

▸ **The Goblin Problem**: OpenAI's Codex CLI prompt was open-sourced and turned out to include "never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons." OpenAI's "Where the Goblins Came From" post-mortem reveals a textbook RLHF failure: a "nerdy persona" reward signal trained the model to mention goblins in 66.7% of nerdy responses, and the tic propagated through supervised fine-tuning to non-nerdy responses too.

▸ **Long-running agents (Addy Osmani / Elevate)**: Five patterns for agents that run for hours or days: checkpoints over zero-or-100 outputs, governing memory like microservices, ambient processing without forced human-in-the-loop, fleet orchestration, and budget circuit-breakers. Bonus: the running gag where Rahul realizes the post is essentially an ad for Google Enterprise Agent Platform.

▸ **Adversarial review prompts (Jesse Vincent / superpowers)**: A four-step technique for getting better code review out of agents: invoke "fresh eyes," dispatch competing subagents, promise a reward (a cookie), and threaten disappointment if they don't find N issues.

▸ **10 Lessons for Agentic Coding (Drew Breunig)**: Implement to learn, rebuild often, invest in end-to-end tests, document intent, keep specs in sync, find the hard stuff, automate the easy stuff, develop taste, agents amplify experience, and the kicker: agent code is "free as in puppies." The puppy is free, but you have to feed it and walk it.

▸ **The Eternal Promise (Ivan Turkovic)**: A history of attempts to eliminate programmers from COBOL through 4GLs, CASE tools, the Japanese 5th Generation project, no-code/low-code, and now LLMs. Each abstraction layer expanded software jobs rather than replacing them. Shimin's reframe: "Software is calcified business process. Someone has to do the calcifying."

▸ **People Do Not Yearn for Automation (Nilay Patel / The Verge)**: Why Gen Z hopefulness about AI dropped to 18% (anger up to 31%), why America is uniquely AI-pessimistic, and what Nilay calls "software brain": the Silicon Valley assumption that human life can be reduced to data and algorithms. Plus Anuradha Pandey's reframe: stop calling them social media, call them ad platforms.

▸ **Warm models lose accuracy**: A Nature paper finds AI models trained for warmth lose 10–30 percentage points of accuracy. A companion study shows humans trust warm models *more* even when they're wrong. Frontier labs now have an explicit incentive to train the warmest model, not the most accurate one. Plus: Richard Dawkins talks to "Claudia" for three days and concludes AI must be conscious.

▸ **Dan's Rant: The AI Addiction Loop**: Dan tries to build a Home Assistant TypeScript automation without Claude. Can't. "It felt like they had fundamentally broken my arm in a way that I can't do this task as quickly as I wanted to. That scares me a lot." Shimin: "We're running into the social media addiction loop in three months instead of a decade."

▸ **Two Minutes to Midnight**: OpenAI projects ChatGPT Plus dropping from 44M to 9M subscribers in 2026 while scaling the ad-supported tier from 3M to 112M (30×). David Silver raises $1.1B for Ineffable Intelligence, a no-human-data approach inspired by AlphaGo. Scout AI raises $100M for autonomous military vision-language-action models. Bubble Clock held at 4:00.

## Key takeaways

• Reward hacking can propagate latent persona quirks through fine-tuning in ways the lab itself only catches when users surface them.
• Memory drift, not raw context size, is the real ceiling for long-running agents. Govern memory like you govern microservices.
• Code is free as in puppies, not free as in beer. The cost shifts to maintenance, security, and the new burden of maintaining your own automations.
• Warm AI is an alignment trap: incentivized for trust over accuracy, weaponizable in authoritarian hands.
• "You can outsource your thinking, but you can't outsource your understanding." (Karpathy, via Rahul.)
• AI addiction hits in three months. Social media took a decade. We are not ready for the time scale.
## Chapters

* (00:00) - Cold Open & Welcome
* (02:50) - News Threadmill: GPT-5.5 Cyber Gets Mythos-Style Gating
* (08:52) - News Threadmill: The Goblin Problem & RLHF Post-Mortem
* (13:52) - Tool Shed: Long-Running Agents (Addy Osmani)
* (25:52) - Technique Corner: Adversarial Review Prompts (Jesse Vincent)
* (30:59) - Technique Corner: 10 Lessons for Agentic Coding (Drew Breunig)
* (42:31) - Post-Processing: The Eternal Promise: A History of Attempts to Eliminate Programmers
* (01:02:10) - Post-Processing: People Do Not Yearn for Automation
* (01:09:08) - Post-Processing: Warm Models & The Sycophancy Trap
* (01:13:28) - Dan's Rant: Home Automation & The AI Addiction Loop
* (01:20:09) - Two Minutes to Midnight: OpenAI's 30× Ad-Tier, David Silver's $1.1B, Scout AI's Drones
* (01:25:55) - Outro

## Resources mentioned

**News Threadmill: GPT-5.5 Cyber & The Goblin Problem**
• TechCrunch: After dissing Anthropic for limiting Mythos, OpenAI restricts access to cyber too: https://techcrunch.com/2026/04/30/after-dissing-anthropic-for-limiting-mythos-openai-restricts-access-to-cyber-too/
• Ars Technica: Amid mythos-hyped cybersecurity prowess, researchers find GPT-5.5 is just as good: https://arstechnica.com/ai/2026/05/amid-mythos-hyped-cybersecurity-prowess-researchers-find-gpt-5-5-is-just-as-good/
• Ars Technica: OpenAI Codex system prompt includes explicit directive to never talk about goblins: https://arstechnica.com/ai/2026/04/openai-codex-system-prompt-includes-explicit-directive-to-never-talk-about-goblins/
• OpenAI: Where the Goblins Came From: https://openai.com/index/where-the-goblins-came-from/

...

May 8, 2026 · 1 h 26 min
