The Human in the Loop
The model isn't the problem. I went back through 20 of my agent's pull requests, and the failures looked exactly like a junior's first month. Three tried to rewrite things nobody asked them to rewrite. Five skipped the test, or wrote a test that would have passed either way. Four fixed the bug but broke something else in the process.

I used to assume model quality was the main driver. It isn't. The agent doesn't ship a one-line fix. It opens a change touching twelve files, half of them unrelated to the bug. It writes the code and skips the test. Or it writes a test that proves nothing.

None of these are model problems. They're the same failures a junior would ship for review, just at higher volume and with more confidence.

The practical move: stop logging "the agent failed" and start logging why. Counting rejection reasons changed how I prompt and how I scope tickets. It told me more about my codebase than any benchmark score ever has.

If your top three match mine, you're seeing what everyone else is seeing. If they differ, that's signal about your codebase or your prompts.

What's your top rejection reason? Full breakdown in this week's episode of The Human in the Loop. Link in the comments.
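If you want to try the "log why, then count" habit, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Reason` labels are lifted from the three failure types in the post, the `Rejection` record and `top_reasons` helper are hypothetical names, and the PR identifiers are placeholders that only reproduce the 3 / 5 / 4 split mentioned above.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Reason(Enum):
    # Hypothetical labels, taken from the three failure types in the post.
    UNREQUESTED_REWRITE = "rewrote things nobody asked for"
    MISSING_OR_EMPTY_TEST = "skipped the test or wrote one that proves nothing"
    REGRESSION = "fixed the bug but broke something else"


@dataclass
class Rejection:
    pr: str          # whatever identifier you use for the pull request
    reason: Reason   # why it was rejected, not just that it was


def top_reasons(log, n=3):
    """Return the n most common rejection reasons with their counts."""
    return Counter(entry.reason for entry in log).most_common(n)


# Placeholder log that mirrors the counts in the post (3, 5, 4).
log = (
    [Rejection(f"pr-{i}", Reason.UNREQUESTED_REWRITE) for i in range(3)]
    + [Rejection(f"pr-{i}", Reason.MISSING_OR_EMPTY_TEST) for i in range(3, 8)]
    + [Rejection(f"pr-{i}", Reason.REGRESSION) for i in range(8, 12)]
)

for reason, count in top_reasons(log):
    print(f"{count:>2}  {reason.value}")
```

A fixed set of reasons keeps the tally comparable from week to week. Free-text notes are richer, but they are much harder to count.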
28 episodes