The Human in the Loop

Counting Keystrokes to Prove the Team Can Write

24 min · 17 May 2026

Description

Counting accepted Copilot suggestions to prove AI works is like counting keystrokes to prove the team can write. It is the cleanest number on the dashboard. It is also the one that tells you nothing. Forty years ago Fred Brooks split software work into two parts. The accidental: syntax, boilerplate, scaffolding. The essential: what to build, why, for whom, what to trade off. The accidental is what AI tools are good at. That is why the dashboards look spectacular. Lines generated. Suggestions accepted. Prompts sent. The tools were always going to win that part. The numbers that should actually move sit one layer deeper. Cycle time. Change failure rate. Time to first PR review. Defect density. These were already telling you whether the team was shipping good software, long before AI showed up. AI either bends them or it does not. If cycle time has not moved, suggestion-acceptance is a vanity stat. If change failure rate has not dropped, you are not shipping faster. You are writing more code, faster. If time to first review has not shortened, your reviewers are the bottleneck and Copilot cannot fix that. GitHub shipped team-level Copilot metrics this week. It made the wrong question easier to answer and the right one easier to ignore. Which second-order metric has actually moved on your team since you rolled out an AI coding tool? Full breakdown in this week's episode of The Human in the Loop.
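
To make the second-order metrics concrete, here is a minimal Python sketch that derives three of them from pull request records. Everything in it is an assumption for illustration: the `PullRequest` fields, the definition of cycle time as opened-to-deployed, and the use of medians. Your own tracker will have different fields and definitions.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class PullRequest:
    # Hypothetical record shape; real data would come from your own tooling.
    opened: datetime
    first_review: Optional[datetime]
    deployed: Optional[datetime]
    caused_failure: bool = False


def median_hours(deltas):
    # Median of a series of timedeltas, expressed in hours.
    return median(d.total_seconds() / 3600 for d in deltas)


def second_order_metrics(prs):
    # Assumes at least one deployed and one reviewed PR in the window.
    deployed = [p for p in prs if p.deployed is not None]
    reviewed = [p for p in prs if p.first_review is not None]
    return {
        "cycle_time_h": median_hours(p.deployed - p.opened for p in deployed),
        "time_to_first_review_h": median_hours(
            p.first_review - p.opened for p in reviewed
        ),
        "change_failure_rate": sum(p.caused_failure for p in deployed)
        / len(deployed),
    }
```

Run it once on a window before the AI rollout and once on a window after, then compare the two dictionaries. If they look the same, the acceptance count is not telling you much.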

All episodes

30 episodes

The $36,000 Engineer: When Agentic AI Stops Being a Subscription

Uber blew its whole 2026 AI budget in four months. Then it set a $1,500 monthly cap on each coding tool, per engineer. Claude Code, Cursor, a dashboard to watch the spend, an approval step to go over. Simon Willison did the math. Two tools, and one engineer runs about $36,000 a year. For years AI was a flat subscription. You paid once a month and you knew the number. Agentic coding turned that into a metered bill. And a metered bill does not warn you politely. It surprises you. This is the cloud invoice all over again. A team turns something on, forgets it is metered, and finds out at the end of the month. A paper this week put numbers on the risk. 63 real cases where agents blew past their limits. Often a single retry loop, quietly burning thousands before anyone looked. A cap you cannot enforce in code is just a wish. So before you scale agents across a team, the real question is not what the budget is. It is what happens, automatically, the second someone hits the ceiling. Most teams can answer the first. Almost none can answer the second. Full breakdown in this week's episode of The Human in the Loop.
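
A rough Python sketch of both halves of that argument: the arithmetic behind the $36,000 figure, and a cap that is enforced in code instead of on a dashboard. The names `SpendGuard`, `BudgetExceeded`, and `run_with_retries` are hypothetical, and the per-call cost is assumed to be known up front, which real metered APIs may only report after the call.

```python
def annual_cost(monthly_cap_usd, tools):
    # The episode's figure: $1,500 per tool per month, two tools, twelve months.
    return monthly_cap_usd * tools * 12


class BudgetExceeded(RuntimeError):
    """Raised when the next charge would push spend past the cap."""


class SpendGuard:
    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent = 0.0

    def charge(self, cost_usd):
        # Refuse before spending, so the ceiling holds automatically.
        if self.spent + cost_usd > self.cap_usd:
            raise BudgetExceeded(
                f"cap {self.cap_usd} reached at {self.spent} spent"
            )
        self.spent += cost_usd


def run_with_retries(step, guard, cost_per_call_usd, max_retries=3):
    # Every attempt, including retries, is charged against the guard first.
    for _ in range(max_retries + 1):
        guard.charge(cost_per_call_usd)
        if step():
            return True
    return False
```

The design choice is that the retry loop cannot run without going through `charge`, so a stuck loop stops at the ceiling instead of at the end of the month.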

Yesterday · 23 min

AI Ships Faster Than Anyone Can Review It

Meta says AI writes 80% of new code. Their own reviewers can't keep up with their own AI. Straight from their engineering blog. They built RADAR to auto-review low-risk diffs because "the share of diffs receiving timely review has declined." Their words. AI-generated code outpaced human review capacity. Read that with the rest of the week's news. Cognition says Devin merged 7x more PRs year-over-year. AI-written commits inside customer codebases jumped from 16% to 80%. Anthropic shipped Opus 4.8 on Wednesday, and every IDE, gateway, and agent runner had it the same day. They also disclosed a $47B revenue run-rate. The "is this a real business" debate is over. But here is what keeps coming back to me: Shipping more code faster is only a win if the systems that catch problems scale at the same rate. This week, the evidence says they don't. A new arXiv study of 20,574 real coding-agent sessions documents how often agents do something other than what was asked. ITBench-AA, the first serious benchmark for agentic IT work, scored every frontier model below 50%. Adoption is real. The guardrails are not. This week's episode of The Human in the Loop covers all of it: the shipping wave, the cost-control backlash starting inside eng departments, and why ITBench-AA matters more than the score suggests.

31 May 2026 · 22 min

Why Your Coding Agent's PRs Keep Getting Rejected

The model isn't the problem. I went back through 20 of my agent's pull requests and the failures looked exactly like a junior's first month. 3 of them tried to rewrite things nobody asked them to rewrite. 5 skipped the test, or wrote a test that would have passed either way. 4 fixed the bug but broke something else in the process. I used to assume model quality was the main driver. It isn't. The agent doesn't ship a one-line fix. It opens a change touching twelve files, half of them unrelated to the bug. It writes the code and skips the test. Or it writes a test that proves nothing. But notice that none of these are model problems. They're the same review failures a junior would ship. Just at higher volume and with more confidence. The practical move: stop logging "the agent failed" and start logging why. Counting changed how I prompt and how I scope tickets. It told me more about my codebase than any benchmark score ever has. If your top three match mine, you're seeing what everyone else is seeing. If they differ, that's signal about your codebase or your prompts. What's your top rejection reason? Full breakdown in this week's episode of The Human in the Loop.
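
Logging why instead of "the agent failed" can be as small as a tally. A minimal Python sketch, assuming a hypothetical `Reason` enum and a log of (PR label, reason) pairs. The three counts are the ones quoted above; the description does not break down the other PRs, so they are left out.

```python
from collections import Counter
from enum import Enum


class Reason(Enum):
    # Hypothetical labels for the three failure modes described above.
    UNREQUESTED_REWRITE = "rewrote things nobody asked for"
    MISSING_OR_EMPTY_TEST = "skipped the test or wrote one that proves nothing"
    REGRESSION = "fixed the bug but broke something else"


def top_reasons(log, n=3):
    # log is an iterable of (pr_label, Reason) pairs.
    return Counter(reason for _pr, reason in log).most_common(n)


# Placeholder PR labels; only the counts come from the episode description.
log = (
    [("pr", Reason.UNREQUESTED_REWRITE)] * 3
    + [("pr", Reason.MISSING_OR_EMPTY_TEST)] * 5
    + [("pr", Reason.REGRESSION)] * 4
)

for reason, count in top_reasons(log):
    print(f"{count:2d}  {reason.value}")
```

A fixed set of labels is the point of the enum: free-text rejection notes are hard to count, and the counting is what changes how you prompt and scope.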

24 May 2026 · 20 min

The Vulnerability Your Agent Merged

The unit tests pass. The PR merges. And you won't find the problem for six months. Two papers landed this week — one on LLM-generated code, one on GitHub Actions workflows. Different researchers. Same finding. When agents write code, they pin library versions that trained well. Not versions that are safe. The mechanism is simple. A model has seen one popular version of a library thousands of times. It reaches for that version because it minimizes prediction loss. Pin-by-popularity and pin-by-safety are different jobs. The model only knows one of them. The GitHub Actions paper found the same shape. Right syntax. Wrong threat model. So the code looks clean. The tests pass. The PR merges. And six months later a security audit finds a CVE that was public before the agent ever touched the file. This is not a model problem. It is a workflow problem. Human PRs go through SCA. Agent PRs often don't. That gap is where the bill arrives. The fix is not complicated. Put pip-audit, npm audit, or OSV-Scanner between the agent and main. Same gate you'd use for any contributor. The agent has not finished the work when it merges. It has finished its part. Your security pipeline was designed for human contributors. Has anything changed since you started using agents?
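
One way to picture that gate is a small Python wrapper that runs pip-audit on a requirements file and blocks on a non-zero exit. This is a sketch under assumptions: the function names are hypothetical, the file is assumed to be a pip requirements file, and it relies on pip-audit's `-r` option and on it exiting non-zero when it reports vulnerabilities. Check both against the version you install.

```python
import subprocess


def audit_command(requirements_path):
    # pip-audit's -r option audits the pins in a requirements file.
    return ["pip-audit", "-r", requirements_path]


def dependencies_pass_audit(requirements_path, run=subprocess.run):
    # `run` is injectable so the gate can be exercised without the real CLI.
    result = run(
        audit_command(requirements_path),
        capture_output=True,
        text=True,
    )
    # Treat any non-zero exit as a failed gate: block the merge.
    return result.returncode == 0
```

In a pipeline, the agent's PR would only be allowed to merge when `dependencies_pass_audit("requirements.txt")` returns True, the same condition a human contributor's PR would face.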

10 May 2026 · 13 min