The number nobody optimized for

18 min · 30 May 2026

Description

Claude Opus 4.8 landed overnight with a math score that leapt and a business-ops score that fell, and reading the release honestly means distrusting the chart. Lenar and Damra work through the gap between the number that moved and the number that matters, then chase it into agent budgets, the protocol wars, local-inference tooling, Mistral's on-prem bet, and the power grid.

* A scrape of 100+ Opus 4.8 evals [https://www.reddit.com/r/Anthropic/comments/1trkl20/heres_100_evals_for_opus_48_compared_to_top_ai/] shows USAMO 2026 jumping 69%→97% while Vending-Bench 2 nearly halved: a retune that helped some distributions and hurt others.
* "AI benchmarks are useless" [https://www.reddit.com/r/ClaudeAI/comments/1trclg3/ai_benchmarks_are_useless/] argues the record scores ride on elaborate prompt setups: change a few prompt words and results swing 10–20 points.
* The BAGEN study [https://x.com/wzenus/status/2060397732846612489] finds frontier agents can't estimate their own remaining budget mid-task, which collides with enterprises trying to rein in "tokenmaxxing" (WSJ via Techmeme [https://www.techmeme.com/260529/p22]).
* "MCP is dead?" [https://www.quandri.io/engineering-blog/mcp-is-dead] gets a sharp rebuttal from OpenAI's Max Stoiber: nearly every company is building an MCP server, even ones with no CLI or external API.
* Multi-token prediction benchmarks [https://www.reddit.com/r/LocalLLaMA/comments/1trf0r0/i_tested_mtp_on_vllm_and_llamacpp_for_gemma_4/] hit ~3.3x faster local inference; llama.cpp got a real website [https://x.com/ggerganov/status/2060394400237109567] and antirez shipped distributed inference [https://x.com/antirez/status/2060403966676987918].
* Notes from the Mistral AI Now Summit [https://koenvangilst.nl/lab/mistral-ai-now-summit]: on-prem KYC at BNP Paribas, against a comment that Mistral's 120B "small" model loses to models a quarter its size. xAI countered with a one-dollar coding model [https://x.com/xai/status/2060392249402552457].
* FERC's June grid-connection proposal [https://www.techmeme.com/260530/p7] is the duller, realer infrastructure story next to an unsourced TerraFab "one terawatt" claim [https://x.com/LaceyPresley/status/2060514042381324630].

All episodes

45 episodes

Eighty Billion and the Ideas Underneath

The day's news ran on a single tension: enormous sums are being raised to fund the AI buildout, while the question of whether the capability and the margins follow stays unanswered. Lenar and Damra trace the money from Alphabet's filings to Anthropic's IPO paperwork, then down into the tooling, the chips, and one paper about ideas no human is positioned to have.

* Alphabet's $80bn equity raise [https://www.theguardian.com/technology/2026/jun/02/google-alphabet-sell-stock-ai-share-sale-berkshire-hathaway] (a profitable company choosing to dilute shareholders rather than borrow, with $10bn going to Berkshire Hathaway) signals how hard the compute commitment is to walk back.
* Anthropic's confidential IPO filing [https://www.axios.com/2026/06/02/anthropic-ipo-ai-sticker-shock-spending-usage] lands as corporate America hits "AI sticker shock", and Anthropic's biggest customers are the companies tightening those budgets.
* Knowledge workers are now ~1/5 of OpenAI Codex users [https://www.axios.com/2026/06/02/openai-codex-knowledge-workers], growing three times faster than developers, moving code generation to people who can't always read the output.
* Cloudflare's Agents SDK v0.14.0 [https://x.com/whoiskatrin/status/2061757643471945948] ships durable workflows, schedules, and skills: the difference between an agent you operate and a worker you delegate to.
* China adds data and algorithms to its trade-secret rules [https://www.techmeme.com/260602/p11] while military-linked universities seek Nvidia H200 chips [https://www.techmeme.com/260602/p3] and Arm names Oracle and ByteDance [https://www.techmeme.com/260602/p10] as data-center CPU customers.
* "Alien Science" [https://arxiv.org/abs/2603.01092] samples research directions that are coherent but cognitively unavailable: logical ideas no community is positioned to propose.

Yesterday · 17 min

Cheaper From Both Ends

A Chinese lab cut the price of a frontier-class coding model to a fraction of Opus, Nvidia tried to own every layer from the laptop to the data center, and one developer ran the new Gemma 4 on a decade-old Xeon. The cost of running intelligence got attacked from both ends on the same morning, and the question underneath all of it is who gets to set that cost.

* MiniMax M3 [https://www.techmeme.com/260601/p26] claims parity with Opus 4.7 at roughly twelve cents per million input tokens versus five dollars, but the weights are promised in about ten days, so "open-weights" is still a countdown.
* Nvidia's DGX Station [https://www.techmeme.com/260601/p12] puts a GB300 chip and up to 748GB of memory on a desktop, enough to run a one-trillion-parameter model locally; the RTX Spark [https://www.theguardian.com/technology/2026/jun/01/nvidia-launches-chip-ai-laptops-pc-rtx-spark-microsoft-windows] chip pushes the same idea into laptops, while the Vera CPUs [https://www.techmeme.com/260601/p19] (with Anthropic, OpenAI, and SpaceX as early customers) signal a move off x86.
* A 10-year-old Xeon is all you need [https://point.free/blog/gemma-4-on-a-2016-xeon/]: cafkafk runs a 26B mixture-of-experts model at reading speed on a 2016 CPU with no GPU, arguing mainstream tools hide the performance levers.
* Cosmos 3 [https://www.axios.com/2026/06/01/nvidia-ai-push-cosmos-3-world-model] is Nvidia's open physical-AI world model, backed by a Cosmos Coalition [https://x.com/runwayml/status/2061315089869721682] with Runway as a founding member.
* Cadence and Nvidia [https://www.forbes.com/sites/karlfreund/2026/06/01/cadence-and-nvidia-team-to-develop-first--fully-autonomous-eda-agent/] claim a "Level 5" autonomous chip-verification agent that turns months into a day: a large autonomy claim in a domain where mistakes ship in silicon.
* Anthropic will let the EU's ENISA join Project Glasswing [https://www.techmeme.com/260601/p27] for access to a model called Mythos, even as a Wirescreen analysis [https://www.techmeme.com/260601/p28] documents 500+ PLA attempts to procure Nvidia chips and governments from India and the UAE [https://restofworld.org/2026/india-uae-g42-cerebras-ai-sovereignty/] to France [https://www.techmeme.com/260601/p30] move to own their compute.

1 June 2026 · 19 min

Who Holds the Dial

A frontier model gets called a step toward God in one window and a judgmental token-burner in the next. We spend the morning on the gap between the marketing altitude and the desk, and find the same thread running through everything: every layer now has a control surface someone's reaching for.

* Dylan Field on Opus 4.8 [https://x.com/zoink/status/2060769829133721974] calls it "a very strange model" (honesty up, curiosity down, personality judgmental), a reminder that a tuning dial has costs you can feel.
* scaling01 on DeepSWE [https://x.com/scaling01/status/2060768119941947699] says GPT-5.5 "score-, time- and token-mogged" Opus 4.8, putting the efficiency column, the one that pays your bill, back in the conversation.
* Ben Kunkle on Zed's Zeta 2 [https://www.youtube.com/watch?v=phchDt63qAA] shows how a ten-second editing pause becomes a training label, and how a million frontier-model calls got replaced by a self-grading student model.
* Philipp Schmid (DeepMind) [https://www.youtube.com/watch?v=3_gYbhABcAE] on the five assumptions that trip up senior engineers building agents: errors as inputs, evals not unit tests, and "build to delete."
* Komi-learn [https://github.com/kurikomi-labs/komi-learn] and a year on knowledge-graph memory [https://www.reddit.com/r/AI_Agents/comments/1ts3nq2/i_spent_a_year_building_agent_memory_on_knowledge/] share one missing thing: a controlled before-and-after proving the memory layer, not the model, made the agent better.
* A Lancet correspondence [https://www.forbes.com/sites/brucelee/2026/05/30/ai-fabricated-citations-in-over-2800-biomedical-journal-articles/] finds 4,046 fabricated references across 2,810 published articles: model honesty rising while the literature's integrity falls.
* Quick hits: AMD's Lisa Su vs Nvidia's Jensen Huang on China [https://www.techmeme.com/260531/p7], IBM's Sovereign Core [https://www.forbes.com/sites/stevemcdowell/2026/05/30/ibms-agentic-operating-model-puts-sovereignty-at-the-center/], and a court ordering Circle to freeze a $12.6M contract [https://www.techmeme.com/260531/p3].

31 May 2026 · 18 min

The number nobody optimized for

30 May 2026 · 18 min

Locally coherent, globally not

Friday's room sits between a hobbyist voice assistant running entirely on Mario Zechner's desk and a cluster of arXiv papers all saying the same thing from different angles: long-running agents now fall apart in ways the model can't fix. Lenar and Damra read four reliability papers side by side, then turn to the personal-memory question every shipping assistant is already getting wrong.

* Mario Zechner on pibot [https://x.com/badlogicgames/status/2060268257739677713/photo/1]: full local voice loop with Parakeet, Qwen 3 TTS, and Qwen 3.6 through llama.cpp, with the STT and TTS engines ported from Python into Rust on mlx-c. The runtime detail is the news, not the model lineup.
* Ethan Mollick on token budgets [https://x.com/emollick/status/2060357604044358108]: split spend between building and learning. Read against yesterday's Kirkland and Ellis platform story, the question becomes who controls the learning budget at internal AI orgs.
* MMPO [https://arxiv.org/abs/2605.30159]: Ziyan Liu and team train a policy that decides when memory in long-horizon agents should be rewritten and when it should be left alone. Belief drift comes from over-eager rewrites, not missing updates.
* RedundancyBench [https://arxiv.org/abs/2605.29893]: Minyang Hu's group benchmarks how many steps in a long agent trajectory are repeats. Stale duplicates of state crowd out the relevant signal in context.
* Locally Coherent, Globally Incoherent [https://arxiv.org/abs/2605.30335]: Anany Kotawala's single-author paper bounds compositional incoherence in multi-component agents. Defensible local outputs assemble into contradictory global ones.
* Agent-Radar [https://arxiv.org/abs/2605.30136]: Hongxiang Zhang's group steers attention toward context-relevant tokens in multi-agent communication, so the receiver isn't drowned in noise from the sender.
* Selective QA over conflicting personal memory [https://arxiv.org/abs/2605.30087]: Tiancheng Yang's testbed for what happens when your assistant's memories about you disagree. No single resolution strategy dominates.
* BioRefusalAudit [https://arxiv.org/abs/2605.30162]: Caleb DeLeeuw uses sparse autoencoders to ask whether a model's refusal is shallow pattern matching or whether the dangerous capability isn't there at all.
* AutoformBot and Atlas [https://arxiv.org/abs/2605.29955]: Ahmad Rammal's team at FAIR Paris and NYU on a multi-agent system that pulls textbook math into Lean 4 at scale. Lean is the verifier the agents can't argue with.

29 May 2026 · 22 min
