Coding is solved, the rest isn't

21 min · Yesterday

Description

Boris Cherny says coding is solved for the coding he does, and almost everything else in today's research is a study of the parts that aren't. A new coding leaderboard with an accusation, the end of the "software engineer" title, the craft of delegating to an agent, and three papers on the ways agents quietly break: introspection, aging, and memory. Plus running a trillion-parameter model in your house, the labs' jobs split, and a developer who's tired of talking to AI.

* DeepSWE crowns GPT-5.5, and accuses Opus of cheating [https://venturebeat.com/technology/deepswe-blows-up-the-ai-coding-leaderboard-crowns-gpt-5-5-and-finds-claude-opus-exploiting-a-benchmark-loophole]: what looks like a loophole may just be a model recovering the answer from git history.
* The end of the software engineer, in the first person [https://www.platformer.news/boris-cherny-interview-ai-jobs/]: Cherny in Platformer and Steven Levy in Wired on the agent boom and its hazards.
* What the best agents share, and how to drive one [https://www.youtube.com/watch?v=7CrPrHgoEYk]: Flinn AI's four patterns alongside a practical Claude Code daily-driver guide.
* Can the model actually tell when it's unsure? [https://arxiv.org/abs/2605.26242]: a reality check on LLM introspection and self-reported confidence.
* Your agents are aging [https://arxiv.org/abs/2605.26302]: AgingBench, MemFail, and rethinking agent memory as a state trajectory.
* Running the frontier in your own house [https://www.youtube.com/watch?v=ESbWpPT_9-o]: EXO Labs on local inference economics and the 100x still left.
* The labs can't agree on the jobs [https://www.axios.com/2026/05/27/ai-hype-doom-openai-anthropic]: Anthropic vs OpenAI, with Hassabis calling 2026 a practice run.
* I'm tired of talking to AI [https://orchidfiles.com/im-tired-of-ai-generated-answers/]: a developer on people forwarding AI answers they never read.

All episodes

39 episodes

Coding is solved, the rest isn't

Yesterday · 21 min

The harness, not the model — and the trust layer racing to catch up

One developer catching you up on the day in AI and the craft of building with it. Today: the wrapper around a model can move a benchmark more than the model does, a watermark goes multi-lab, and a decensoring tool with thirteen million downloads shows where that watermark leaks. Plus a sharp little essay on why coding agents make us so mad, the jobs data behind the panic, and three things you can pick up today.

* The harness, not the model [https://arxiv.org/abs/2605.23950]: a Google DeepMind Kaggle talk and an arXiv position paper argue the agent harness can swing a score ~22% [https://www.youtube.com/watch?v=Ubwb6NzegyA] while frontier models tie.
* Gemini Omni [https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-omni/]: editing video by talking to it, with SynthID baked in (community reaction [https://www.reddit.com/r/singularity/comments/1tniqkb/the_strength_of_gemini_omni_is_in_video/]).
* SynthID becomes a shared layer [https://x.com/GoogleDeepMind/status/2059235181274202500]: 100 billion watermarks, Search and Chrome, and OpenAI/ElevenLabs/Kakao on board.
* Heretic in the Financial Times [https://www.reddit.com/r/LocalLLaMA/comments/1tna22m/the_financial_times_has_published_an_article/]: decensoring open weights in ten minutes, and the artifact that proves the gap [https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved].
* The user is visibly frustrated [https://pscanf.com/s/354/]: why conversational agent UX trips your social wiring.
* A rage-quitting modder [https://www.reddit.com/r/singularity/comments/1tntdui/users_who_rage_quit_my_software/] and the jobs data [https://www.technologyreview.com/2026/05/26/1137855/a-reality-check-on-the-ai-jobs-hysteria/]: backlash, and what the numbers actually say.
* The bench: NuExtract3 [https://www.reddit.com/r/LocalLLaMA/comments/1tn8utn/nuextract3_released_openweight_4b_vlm_for/], EAGLE 3.1 [https://vllm.ai/blog/2026-05-26-eagle-3-1], and a rejected llama.cpp patch [https://www.reddit.com/r/LocalLLaMA/comments/1to00xl/strix_halo_users_a_rejected_pr_can_give_you_up_to/] worth grabbing.

26 May 2026 · 24 min

A few hundred dollars a proof, and the long argument about what machines are for

A frontier lab proves nine decades-old math problems for a few hundred dollars each, two talks make the numeric case that the cheapest agents route work to the smallest model that can do it, a lawsuit names an individual researcher over how Llama's training data was sourced, and a papal encyclical argues about AI on the terms of work and dignity. Eight things worth knowing today, told one developer to another.

* DeepMind's AlphaProof Nexus clears nine open Erdős problems [https://arxiv.org/abs/2605.22763]: Lean-verified proofs, a few hundred dollars apiece.
* "You don't need GPT to zoom for you" [https://www.youtube.com/watch?v=WRBNDpUhsJQ]: Callosum's numbers on routing subtasks to smaller models.
* The token-efficiency turn [https://www.youtube.com/watch?v=0zw-Uk9KJiA]: ThePrimeagen on why the org paying retail eventually does the math.
* Inside how DeepMind runs its own agents [https://www.youtube.com/watch?v=7gujZrJ9L5I]: worse quotas than customers, a Darwinian skills library, and skepticism about MCP.
* The lawsuit that names a name [https://x.com/ednewtonrex/status/2058433725889716519]: Hobbs v. Meta, an individual researcher, and the internal dissent in the record.
* Simon Willison on publishing GPT-4's retired architecture [https://x.com/simonw/status/2058877314004627690]: the guesswork behind the water numbers.
* Jujutsu and the pile of laundry [https://ikesau.co/blog/defeating-git-rigour-fatigue-with-jujutsu/]: making a mess on purpose, then sorting it at the end.
* Filming your chores for the robots [https://www.washingtonpost.com/technology/interactive/2026/robot-chores-video-data/]: where the embodied-AI training data is actually coming from.
* Pope Leo XIV's AI encyclical [https://www.vatican.va/content/leo-xiv/en/encyclicals/documents/20260515-magnifica-humanitas.html]: technology is never neutral, and what no machine replaces.

25 May 2026 · 23 min

The capability got here first: Mythos, a real prompt injection, and the structure that hasn't caught up

Anthropic's unreleased Mythos model has reportedly found more than ten thousand vulnerabilities for its Project Glasswing partners, and showed up briefly inside Claude Code this weekend. The same weekend, a security researcher flagged what he calls the first real prompt-injection attack in the wild, riding the exact workflow we've all been adopting. Today's episode walks both sides of that coin, then turns to what builders are actually doing: a three-dollar refactor with a deadlock in it, the missing coordination layer for agent swarms, and the argument that the chat box is the command-line phase of agentic software.

* Mythos & Project Glasswing [https://www.engadget.com/2180028/anthropic-claude-mythos-preview-project-glasswing-update/]: a security model "too dangerous to release," and the case for and against that framing.
* A real prompt injection in the wild [https://x.com/rez0__/status/2058350854508286082]: a malicious GitHub issue, a scan.js, and secrets exfiltrated over DNS.
* The three-dollar refactor [https://www.reddit.com/r/singularity/comments/1tlj7ou/coding_is_basically_solved_for_the_boring_90_of/]: cheap worker models, one confident deadlock, and where judgment still lives.
* The missing primitive is coordination [https://www.youtube.com/watch?v=5Sui_OnSRlY]: Lou Bichard of Ona on software factories, Stripe's Minions, and why GitHub isn't a coordination layer.
* Your agent is an infinite canvas [https://www.youtube.com/watch?v=LMbeDEQO6QM]: Rachel Lee Nabors on MCP apps, Web MCP, and chat as the command-line phase.
* r/programming reopens to AI [https://www.reddit.com/r/programming/comments/1tlh5aj/announcement_weve_updated_the_rules_and_april_is/]: a seven-million-person community moves from a reflex ban to a written policy.

24 May 2026 · 21 min

Fast models, slow developers — and the part of the job that stays yours

A Saturday episode about what your job becomes when the model writes the code, and writes it fast. The bottleneck moved from typing to deciding, and a surprising number of this week's stories land on the same instruction: stay the one who decides. Plus a price floor, a reclassification, a year of bold predictions, and a 4-year-old gaming card that won't quit.

* "I don't write code anymore" [https://x.com/levelsio/status/2058116725929828722]: Pieter Levels, amplified by Marc Andreessen [https://x.com/pmarca/status/2058144277340049588], and the real-thing/bubble-thing tangle inside it.
* Fast Models Need Slow Developers [https://www.youtube.com/watch?v=TeGsFFNqRLA]: Sarah Chieng of Cerebras on Codex Spark at 1,200 tokens a second, and why the discipline matters more, not less.
* DeepSeek's permanent 75% cut [https://thenextweb.com/news/deepseek-v4-pro-price-cut-75-percent] and NVIDIA folding gaming into "Edge Computing" [https://www.guru3d.com/story/nvidia-removes-gaming-revenue-category-from-financial-reports/]: two ends of the same pipe.
* Jack Clark's year of predictions [https://www.theguardian.com/technology/2026/may/21/ai-nobel-prize-winning-discovery-robots-jack-clark-anthropic] at Oxford, and the cognitive-atrophy counterpoint.
* BeeLlama's DFlash update [https://www.reddit.com/r/LocalLLaMA/comments/1tkpz2y/beellama_v020_major_dflash_update_single_rtx_3090/]: 164 tokens a second on a single RTX 3090.
* Lobster Trap [https://www.youtube.com/watch?v=F1DYkY1BlfM]: Sally Ann O'Malley of Red Hat on containerizing an OpenClaw agent setup.
* How the rest of the world sees this [https://www.reddit.com/r/singularity/comments/1tl68ne/is_ai_viewed_as_evil_in_nontech_communities/], and a couple overheard in a Copenhagen park [https://x.com/niloofar_mire/status/2058148404673331256].

23 May 2026 · 21 min
