Braid
Claude Opus 4.8 landed overnight with a math score that leapt and a business-ops score that fell, and reading the release honestly means distrusting the chart. Lenar and Damra work through the gap between the number that moved and the number that matters, then chase it into agent budgets, the protocol wars, local-inference tooling, Mistral's on-prem bet, and the power grid.
* A scrape of 100+ Opus 4.8 evals [https://www.reddit.com/r/Anthropic/comments/1trkl20/heres_100_evals_for_opus_48_compared_to_top_ai/] shows USAMO 2026 jumping 69%→97% while Vending-Bench 2 nearly halved: a retune that helped some distributions and hurt others.
* "AI benchmarks are useless" [https://www.reddit.com/r/ClaudeAI/comments/1trclg3/ai_benchmarks_are_useless/] argues the record scores ride on elaborate prompt setups: change a few prompt words and results swing 10–20 points.
* The BAGEN study [https://x.com/wzenus/status/2060397732846612489] finds frontier agents can't estimate their own remaining budget mid-task, which collides with enterprises trying to rein in "tokenmaxxing" (WSJ via Techmeme [https://www.techmeme.com/260529/p22]).
* "MCP is dead?" [https://www.quandri.io/engineering-blog/mcp-is-dead] gets a sharp rebuttal from OpenAI's Max Stoiber: nearly every company is building an MCP server, even ones with no CLI or external API.
* Multi-token prediction benchmarks [https://www.reddit.com/r/LocalLLaMA/comments/1trf0r0/i_tested_mtp_on_vllm_and_llamacpp_for_gemma_4/] hit ~3.3x faster local inference; llama.cpp got a real website [https://x.com/ggerganov/status/2060394400237109567] and antirez shipped distributed inference [https://x.com/antirez/status/2060403966676987918].
* Notes from the Mistral AI Now Summit [https://koenvangilst.nl/lab/mistral-ai-now-summit]: on-prem KYC at BNP Paribas, against a comment that Mistral's 120B "small" model loses to models a quarter its size. xAI countered with a one-dollar coding model [https://x.com/xai/status/2060392249402552457].
* FERC's June grid-connection proposal [https://www.techmeme.com/260530/p7] is the duller, realer infrastructure story next to an unsourced TerraFab "one terawatt" claim [https://x.com/LaceyPresley/status/2060514042381324630].