When Craft Meets Non-Determinism

Description

Superhuman built its reputation on a number: 100 milliseconds. Every interaction in the product has to feel instantaneous. Not fast. Instantaneous. That’s the threshold where the human brain stops perceiving lag and starts feeling like the software is an extension of thought. They’ve been engineering to that constraint for years, and it has shaped everything: the architecture, the hiring bar, the way even a billing email gets crafted like a product.

Then they added AI. And for the first time, they were shipping something they couldn’t fully control.

The feeling that built a company

“Every single interaction needs to be below 100 milliseconds, because this is when you feel that things are instantaneous,” Loic says. The number didn’t come from a product spec. It came from game design. Rahul Vohra, Superhuman’s CEO, studied how games create the feeling of flow, and bet that people hate email because of how email works, not because email is email.

The architecture follows from that constraint. Superhuman assumes the network will slow you down, so they build as if the network isn’t there: local-first, syncing in the background, optimistic UI throughout. “You need to build without a backend. How do you do that across multiple devices and make it crazy fast?”

People pay $40 a month for email and feel it’s worth it. Their users, mostly executives and salespeople who average three hours a day in their inboxes, describe the experience the way people describe good tools: the software stops mattering and the work takes over.

How taste becomes infrastructure

Loic joined at the beginning of 2025 as an outsider. “I came in with genuine curiosity. I was blown away.” What surprised him wasn’t the rule but how thoroughly it had been internalized. “Even a backend engineer will think about the latency of their API and how this will reflect in the experience.” In most engineering organizations, backend engineers think about correctness and throughput. At Superhuman, they think about how the user will feel.

It starts in hiring: product sense is a criterion for every role, not just product and design. The finance team applies the same scrutiny to the email a customer gets when they’re being told what they owe as the product team applies to the inbox. The offer letter is a product experience. “The offer is a ceremony. It’s not transactional; it’s already an experience.” Candidates who got that treatment show up acting like it.

Rahul reviews everything going into production. “Within the organization, this is building a muscle in every single engineer, designer, product manager. Everyone knows the bar is that high.” You can’t work at Superhuman long without developing an eye for when something feels off: a slightly slow animation, a misaligned pixel, an API call that’s a few milliseconds slower than it ought to be.

Loic calls it sensation transference. Packaging changes how you experience the product inside. They take that idea seriously enough that the bill you get from the finance team is treated like part of the product.

The part they can’t control

For ten years, everything in Superhuman’s stack was deterministic. Same input, same output. That’s what made the 100ms promise keepable: you could engineer to it, measure it, hold it. AI broke that.

“The consistency we were used to is not there anymore,” Loic says. “We all face the surprising change of behavior of a model that is technically not changing its version.” A model API doesn’t update its version number, but its outputs shift. The same query returns different results this week than last week.

For most products, this is annoying. For Superhuman, it’s a more serious problem, because their users aren’t tolerant of inconsistency. “We are similar to Apple in the sense that people expect the best. They pay a bunch, so they always expect the best.”

The specific problem is what happens when AI meets user-generated input. Superhuman can engineer every designed interaction. They cannot engineer how users phrase search queries. “We were controlling every single part of the interaction, feels fast, feels right, feels correct, and all of a sudden, the outcome of the search box is not what I was looking for. Garbage in, garbage out. But how do you control the garbage in?”

There’s no bug to fix and no perf target to chase. The product was built on consistency, and now consistency is the thing they can’t fully promise.

What the numbers don’t say

Superhuman’s AI adoption numbers look good: 90% of engineers using AI daily, 70% of PRs AI-augmented, 90% of those interactions net positive, some engineers claiming 40% velocity gains. Loic is careful about how he explains this.

The numbers work partly because of who their engineers are. “We have a very senior team, over-optimized on seniority. Those people tend to use AI with care. They know the outcome they want, and they just use AI to get faster to that outcome.”

The 40% gains aren’t coming from code generation. They’re coming from everything before the code. “Coming into a new codebase, trying to understand what this library is doing: before, you had to find the entry point, map the dependencies, build your own mental model. Now Claude Code does that so much faster.” The win is in comprehension and orientation, not typing speed.

But the same playbook doesn’t transfer automatically. “If you have a lot of junior engineers, vibe coding’s impact on code quality might be real. It’s not a problem for us; it’s not part of our DNA.”

Taste filters the output. Senior engineers with strong judgment about what “right” looks like can catch what the model gets wrong. Engineers without that judgment can’t. Teams celebrating big AI velocity gains may be doing so because they have enough experienced judgment to catch the mistakes. Teams where most of the engineers are still building that judgment may be accumulating comprehension debt they don’t know about yet.

The acquisition test

The Grammarly acquisition tests the same question at a different scale: can Superhuman’s taste survive contact with mass distribution?

Grammarly has the opposite profile. They’re embedded in Google Docs, Word, email clients, browsers. They have AI capabilities built over years of NLP work. What they’ve optimized for is breadth: supporting every kind of user, every context. Superhuman has been doing the opposite, going deep on one persona and refusing to compromise.

Loic frames the challenge clearly: “How do we make Superhuman not this niche, very fancy application, but something brought to the mass, while keeping our identity?” He reaches for Apple as the reference point. “Learning from Grammarly’s scale and AI capabilities, keeping our culture and taste, and bringing that to the mass, that would be really interesting.”

It’s a genuinely hard problem. Making things simple is hard. Linear built something delightful for small engineering teams, then got successful, then came the bigger companies, the feature requests, the complexity. The focus that made it work is what success makes hardest to maintain.

What this means for you

Superhuman is hitting a wall any product with a quality bar will hit. Three things their experience suggests are worth borrowing.

Make your implicit promises explicit. Superhuman’s was 100ms and determinism. They had ten years of architecture built around it before AI made determinism optional. Most teams have a similar promise they’ve never said out loud: accuracy, consistency, availability, something. Find yours before the model finds it for you, because you can’t defend a contract you haven’t named.

Treat the prompt box as a UX surface, not a backend problem. The moment that surprised Loic wasn’t a model bug. It was the search box. Users phrase queries badly. Prompts are now part of the interface the user sees, and “garbage in, garbage out” is no longer an engineering excuse. Better prompts and evals matter, but if the search box returns the wrong thing, the design team owns that, not the ML team.

Don’t credit the tools for what your senior engineers are doing. Superhuman’s 40% velocity gains work because the people using AI know what right looks like and catch what the model gets wrong. If your team is junior, the same playbook will produce comprehension debt instead of speed. Once you can’t tell the tool’s contribution from the engineer’s, you’re not measuring AI productivity. You’re measuring how much taste you happened to hire.

Loic spent time before tech in contexts where craft standards weren’t optional and the feedback was immediate: a French Navy vessel that had to be back at sea in six weeks, no extensions. The discipline from that kind of constraint is different from the kind you get from a style guide. You learn it because you have no choice, and then it doesn’t really leave. He thinks that’s what Superhuman has built. He’s been there less than a year. Whether the taste travels at Grammarly scale is the thing he’s actually being paid to find out.

High Output is brought to you by Maestro AI [https://getmaestro.ai]. Loic’s AI numbers look good (90% daily adoption, 40% velocity gains), but he’s the first to say the metrics don’t explain themselves. They work because his senior engineers have the judgment to catch what the model gets wrong. Most engineering leaders have no way to see that layer. You can see PR counts and cycle time. You can’t see whether your engineers are using AI well or just generating output faster. Maestro’s daily briefings reveal where your team’s time and energy actually go: not just what shipped, but the quality of the judgment behind it. Visit https://getmaestro.ai to see how we help engineering leaders understand what their AI adoption numbers actually mean.
This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit maestroai.substack.com [https://maestroai.substack.com?utm_medium=podcast&utm_campaign=CTA_1]

Open the Barn Door

Twenty minutes into our conversation, I asked Charity Majors how engineering leaders should be finding good junior engineers right now. “God, I don’t fucking know.” She apologized, then doubled back. “Sorry. Excuse me. You do need them. They’re not hard to find.”

That answer is the whole interview in miniature. How a junior breaks into engineering today, Charity will tell you, is genuinely unresolved. None of the paths that worked for her exist anymore. How an engineering org builds a healthy pipeline, on the other hand, is not particularly hard. The two questions sit next to each other, and she refused to collapse them into a tidy answer.

Charity is the co-founder and CTO of Honeycomb, twenty years into the industry, two O’Reilly books behind her and the second edition of one in progress. Her career has been built on distributed systems: production engineering at Linden Lab, then Parse, then founding an observability company. But most of what she said over the next half hour was about people, and she came back to one idea four or five times: engineering teams are not social systems and they are not technical systems. They’re sociotechnical systems, and the way you reason about one shapes the way you have to reason about the other.

Idaho

Charity grew up in the backwoods of Idaho. No computers, no phone line for most of her childhood. She got to college on a classical piano scholarship and noticed something there. “People who studied music were still hanging out working minimum-wage jobs in their thirties, forties, and fifties. And I was like, I grew up being poor. I am not going to be a poor adult. And so I switched lanes.”

She got into tech in the late nineties. “Any smart kid who is willing to work weird hours and try a lot of stuff could make a go of it.” She doesn’t romanticize that. Tech was a toy then, she said, and now powers nuclear power plants, so the bar going up is correct. But twenty years on, she’s worried about what’s happened to the door behind her. “I think we really risk it becoming the sort of ivory tower where we keep out anyone who has a non-traditional background. You need to think harder about crafting paths into technology to meet the moment.”

I asked how she got into management. “I was a reluctant manager.” She drew a line between management and leadership before I could follow up. These are sociotechnical systems, she said, “they’re not social or technical, or we could just take the great managers from Starbucks and put them in charge of engineering teams.”

The reason she ended up doing the job at all was anger. “I got into people management the same way a lot of people do, which was enraged, because I didn’t like the way it was being done. And I was like, *god damn it, I guess I will do it differently. I will not make any of these mistakes.* So I made different mistakes, of course.” The self-correction is constant in conversation with her. She said something close to it three more times over the next half hour.

The freeze

When I brought up the AI-killing-the-junior-pipeline discourse, she pointed to something specific. She’d just read a piece by Annie Lowrey in the Atlantic that morning. The Job Market Is Hell [https://www.theatlantic.com/ideas/archive/2025/09/job-market-hell/684133/]. Unemployment is around 4.7%, which is historically fine, but nobody is leaving their jobs and nobody is hiring. On both sides of the resume, AI is doing the talking. Recruiters feed inbound applications into screening tools. Candidates feed job listings into chatbots. “The result is there are no people talking to people. Nobody’s figured out how to do this.”

The framing she rejected was the one that treats this as inevitable. “What I don’t like about the way people talk about bringing juniors into tech is they talk about it like it’s some force of nature that we have no control over, which is absolute horseshit. This is a world we create. It’s a world that we reinforce.” It’s a sequence of decisions made by people in rooms. And the people most responsible for those decisions, she would argue later, aren’t the ones the org chart suggests.

Make friends with the discomfort

Before she got to the operational claims, she walked me through what she thinks her generation of managers got wrong, because the failure mode shapes everything else. “My generation swung the other way and was like very rigorous about, *you should have work-life balance. Nobody should be pinging you after hours.*” The intent was correct. She was managing in reaction to the era of people sleeping under their desks. But she watched it overcorrect. “I see some managers being like, *you’re working more than 40 hours, stop.* And honestly, we live in a very complex, fast-changing world, and if you’re intrinsically motivated to be working, if you’re learning, if you’re having fun, nobody should be stopping you, because that really is the path to success.”

She isn’t arguing for the swing back, either. I brought up 996, the Chinese nine-to-nine, six-days-a-week framing that’s been making the rounds on Hacker News. She had nothing nice to say about the swing-back. “It all swings back. It all swings back, doesn’t it?” Then, more bluntly: “That’s bullshit.” She read the cycle as a generational pattern, and she was harder on her own generation than on either pole. “If anyone had told me that, if I had followed that advice, I would not be where I am.”

The piece of this that connects to junior hiring is the part most management writing skips. “You need to learn to make friends with the discomfort. You need to learn to find joy in the pain.” None of us, she said, evolved to handle data structures and algorithms, and the early years of an engineering career are genuinely agonizing. The juniors who make it through are the ones who learn to like the agony. A lot of senior engineers, looking back, have forgotten that they once lived through it.

It’s a humanistic argument, not just an operational one. She talked for a while about school stamping out the curiosity children are born with. Twelve, twenty, twenty-five years of report cards, conditioning us to associate learning with extrinsic reward. What she loves about adulthood is the chance to rediscover the original instinct. Engineering is one of the few careers that pays you for it.

50 to 1

Her first operational claim was about team composition. “For every staff engineer that you have, let alone principal engineer, you need 50 intermediate engineers.” The number is a gesture. The shape of the argument is specific. Most companies have over-corrected toward senior hiring on the theory that they’ll get more leverage per dollar. The people who actually ship the bulk of features, she said, aren’t seniors. They’re intermediates.

“Some of the most productive engineers that I’ve ever worked with have been intermediate engineers. They can just put on their headphones, beginning of the day, go deep, and just pound out the features and the bug fixes.” Heads down, pattern matching, finishing things. “Nobody who’s been in engineering for seven, ten years wants to do that. They’re sick of that.”

The bored staff engineer is not a leverage win. “When people get bored, you do not get great work out of them. You get the best work out of people when they are working at that place that’s right on the edge of their ability.” And the supply chain only runs one direction. “Nobody stays a junior engineer for long, two years at most. So you’ve gotta keep feeding the system. You’ve gotta keep bringing new blood in.”

Opening the barn door

I asked what she’d recommend to companies that are paranoid about hiring right now. “I would advocate for opening the barn door a bit wider, giving more people a shot. Understanding that it means you will have to fire more of them. You will have to let more of them go. But I feel like it’s worse to never give people a shot.”

The second half is the part she emphasized. A wider door costs you in faster, more honest performance management, and most engineering managers are bad at that part. “Nothing demoralizes a team more than when someone that they work with every day, who’s not pulling their weight, just hangs around forever.” The unsalvageable cases weren’t the ones that escalated. They were the ones that drifted. “Some of the most heartbreaking situations I’ve ever been in as a manager are when a person’s being let go after years of them doing exactly the same thing, and they’re legitimately dumbstruck.”

There’s a side benefit she pointed out that I hadn’t considered. Junior engineers audit your systems in a way nobody else can. “If you’re an engineer joining a team where there is very low turnover, where people never join, where people never leave, that is not likely to be a very high functioning team either.” Old docs. Idiosyncratic mental models locked in three people’s heads. A dev environment that takes a month to set up because nobody’s tried in six. “If you’re used to bringing on junior engineers, oh boy, those kids will audit your systems like no one else.”

That’s the sociotechnical argument in plain language. The team isn’t separable from the systems it owns, and the hiring policy isn’t separable from the operational health of the codebase. Both improve together or neither does.

What she watches for in a junior

The most optimistic moment came when I asked what she watches for in her own juniors. “Some of our junior engineers talk about how they are in conversation with Claude all day long. By the time they bring a question to their senior engineer, which they do very often, they have tried all the low-hanging fruit, they’ve tried a bunch of stuff, they’ve asked a lot of questions. So it is very well worth that senior engineer’s time.”

That’s not the threatened-junior story most engineering leaders are telling right now. The juniors she described are using the model to exhaust the obvious before they ask, and arriving at the senior with the harder version of the question.

I asked what the leading indicator is for a junior who’s going to make it. “Are they asking good questions? Are their questions getting better? Do they have a good sense of how to use their time and how to use their mentor’s time? That is the best leading indicator.” Not output. Not commit volume. Question quality, over time.

She added, almost in passing, that her management chain handles the day-to-day evaluation. “I really trust Emily and all of them.” The broader discipline she described combines two things engineers tend to mistrust: the data, and the conversations. “It’s actually really important that there be data in addition to conversations, because the data and the conversations are bookends. They help you understand each other.” Lean on either alone, she said, and you get either a “people manager” with no technical judgment, or a manager who reads PR counts as a personality assessment. Both fail in different ways.

Consent of the governed

Near the end, I asked who she thought was actually responsible for fixing the junior pipeline. The pattern she described is counterintuitive. “The places that I know of that actually are successfully recruiting, hiring, bringing in junior engineers, and making them successful, it was *not* the engineering managers who pushed for that program. It was the senior engineers. They were the ones who were like: *we know what it takes to have a healthy, high-performing team. It takes a steady influx of new blood, and we feel this conviction so strongly that we’re gonna go make it happen ourselves.*” The senior ICs went to bat. The managers ran the mechanics afterward.

Then the line that anchored the whole conversation: “There is no engineering leadership without the consent of the governed.” Charity has been an executive long enough to watch a lot of decisions get made about engineers, by engineers, with or around engineering management’s input. “If there’s anything that I have learned being in senior management, it’s how much power individual ICs have when they choose to flex it.”

I asked whether she meant it literally. Were the senior ICs really the deciding force? She walked through the pattern again. The companies hiring juniors successfully are the ones where the senior engineers made it their problem. The ones not hiring are the ones where they didn’t.

I came in expecting a programs-and-processes answer. Recruiting funnels, intern conversions, the mechanics of a pipeline. What I got back was about consent. The senior ICs in your org, the ones who don’t have manager in their title but have weight in every staffing conversation, are the people who decide whether the next generation gets in. Without their buy-in, no pipeline exists. With it, almost any pipeline works.

The feedback loop of feedback loops

I asked at the end what she’s working on now. She’s writing the second edition of *Observability Engineering*, and she was honest about how it’s going. “It’s not going super great.” She read the first edition recently and found it embarrassing, which is not how most authors I’ve talked to describe their own work. “But now I think my co-authors and I, we know who we’re writing for and we know what they need to hear.”

Then she connected the book to the show in a way I wasn’t expecting. “It’s a true fact reality that high-performing engineering teams are about fast feedback loops, and observability is the feedback loop of feedback loops. It is the sense-making apparatus of engineering teams.”

That landed for me. A lot of what she’d argued for over the prior half hour started looking like a feedback-loop argument. Open the barn door, but tighten the loop on managing out, so performance information moves fast. Watch question quality, because it’s a faster signal than output. Bring juniors in, because they shorten the loop on every undocumented assumption your team has accumulated. The senior ICs are the deciding force because they’re the only people positioned to keep all those loops short.

A team is a sociotechnical system. The systems that team owns are sociotechnical systems too. The discipline of running both well is the same: short feedback, an honest signal, and the willingness to look at uncomfortable data.

Charity’s question, the one I’ve been sitting with since we hung up: in your engineering org, who is actually deciding whether the next generation gets in?

High Output is brought to you by Maestro AI [https://getmaestro.ai]. The thing Charity said that stuck with me most was about leading indicators. The junior worth investing in isn’t the one shipping the most code. It’s the one whose questions are getting better. The juniors using Claude well at Honeycomb are showing up to their senior engineers having already exhausted the obvious. That’s a different trajectory than the one most dashboards see. PR counts and cycle time can’t pick that distinction up. The work that builds judgment, or fails to, happens in the back-and-forth between an engineer and an AI agent, before any PR is opened. Your Anthropic bill tells you something is happening. Maestro tells you what. Maestro plugs into Claude Code and Cursor and looks at the work itself: how engineers scope a problem before they prompt, what they verify, what they accept on faith. Scored against shipped outcomes, not vibes. You can see which engineers are leveling up and which are accumulating comprehension debt. Visit https://getmaestro.ai to see how we help engineering leaders spot which engineers are developing real AI craft, and which are just generating more output.

Jun 3, 2026 · 33 min

Why AI Productivity Gains Are Context-Dependent | With Raju Matta

Some engineering teams are seeing real, measurable AI productivity gains. Cursor is transforming how frontend developers build React apps. AI-assisted code review is catching bugs before deployment. Prototypes that took weeks now take days. But not everyone’s seeing the same results.

Raju Matta [https://www.linkedin.com/in/raju-matta-4067a7/] runs engineering for Cambridge Mobile Telematics [https://www.linkedin.com/company/cambridge-mobile-telematics/]: 200+ engineers, three countries, petabytes of real-time sensor data processing driver safety. Six months ago, he formed a tiger team to systematically track AI tool adoption. Status reports every two weeks. Multiple tools tested: Copilot, Cursor, PR review bots.

His finding? “I’ve not seen the measurable velocity increase that people are saying out in the market—but that doesn’t mean I have totally written off LLMs yet.”

This isn’t skepticism. It’s measured evaluation. And the pattern Raju’s seeing reveals something important about when AI tools deliver and when they don’t.

Where AI Tools Excel

As part of their evaluation, CMT ran an internal hackathon to see what AI tools could do in practice. The results told a clear story. Eighteen projects, all using AI. Teams built fully working web apps, complete with datasets, in 2-4 hours. “For that purpose, it’s great. It’s not bad at all,” he says.

The pattern: AI coding tools work brilliantly for rapid prototyping with established patterns, web development using well-documented frameworks, mechanical coding tasks like boilerplate and test generation, and quick experiments to validate product ideas.

These are real productivity gains. The people claiming 2x-3x aren’t exaggerating. They’re working in contexts where AI capabilities align perfectly with task requirements. When your bottleneck is writing React components or generating CRUD endpoints, AI tools deliver measurable acceleration. But CMT’s production systems are different.
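
One way to make the bottleneck argument concrete is Amdahl's law: if only the coding share of the delivery cycle gets faster, the end-to-end speedup is capped by everything that does not. The sketch below is illustrative only. The function name and the coding shares (90%, 50%, 20%) are hypothetical, not CMT measurements; the 3x figure is the upper end of the gains quoted above.

```python
def overall_speedup(coding_share: float, coding_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only the coding share gets faster.

    coding_share   -- fraction of total cycle time spent writing code (0..1)
    coding_speedup -- how much faster that fraction becomes (e.g. 3.0 for 3x)
    """
    if not 0.0 <= coding_share <= 1.0:
        raise ValueError("coding_share must be between 0 and 1")
    if coding_speedup <= 0:
        raise ValueError("coding_speedup must be positive")
    # Time left = untouched work + accelerated coding work.
    remaining_time = (1.0 - coding_share) + coding_share / coding_speedup
    return 1.0 / remaining_time


if __name__ == "__main__":
    # Hypothetical shares: a prototype, a mixed team, a design-heavy system.
    for share in (0.9, 0.5, 0.2):
        gain = overall_speedup(share, 3.0)
        print(f"coding is {share:.0%} of the cycle -> {gain:.2f}x overall")
```

Under these assumed shares, the same 3x coding speedup yields 2.5x when coding is 90% of the work, 1.5x at 50%, and about 1.15x at 20%, which is the shape of the gap between hackathon results and production velocity.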
The Complexity Multiplier They’re processing petabytes of data from gyroscopes, accelerometers, GPS sensors, video streams. They’re distinguishing potholes from crashes, sharp corners from reckless driving. They’ve been using AI and machine learning for this work for 13 years—long before LLMs became everyone’s productivity obsession. The engineering challenge isn’t writing code. It’s architecting systems that handle sensor fusion at scale, debugging why clusters fail under load, ensuring accuracy when lives depend on your classifications, and managing tech debt across distributed teams in six countries. “You can outsource your engineering and coding with AI tools, but not your thinking,” Raju explains. In complex production systems, the thinking is where the time goes. Code generation helps, but it’s not the bottleneck. The productivity multiplier drops from 3x to “incrementally helpful” because the constraint isn’t in the typing—it’s in the architectural decisions, the system design, the understanding of how everything fits together. This doesn’t make AI tools useless. They still catch bugs in PRs. They still help prototype solutions. They still accelerate certain tasks. But the overall velocity gain is modest because code generation often isn’t the long pole. The Tiger Team Approach Here’s what makes Raju’s perspective valuable: he’s not guessing. Six months ago, CMT’s CTO gathered the engineering leaders. “How are you guys thinking of AI?” The response: treat it like a first-class citizen. They formed a dedicated tiger team. Three people producing status reports every two weeks on tool adoption, usage patterns, and measurable impact. “We have about three or four tools that we are using all the way from PR review tools to tools like Copilot, Cursor.” This is systematic evaluation, not anecdotal impressions. 
And the data shows results that differ from the market narrative: “My general experience is that it’s good, it’s doing its job, but I haven’t seen the measurable velocity increase as much as what people are saying out in the market.” His peer conversations confirm the pattern isn’t unique to CMT: “Even other leaders and my peers that I speak with, who are working at big tech companies, have said similar things. So it’s not uncommon.” But Raju’s not dismissing the technology. “The tools are progressing at a very fast pace. I wouldn’t be surprised if it’s another six months or a year where we get to exhaust more pieces of the tool and get more done.” That “yet” matters. He’s still tracking, still evaluating, still expecting improvement. When Mistakes Have Consequences When Raju says “we have to save people’s lives,” he’s not being dramatic. CMT’s technology directly impacts driver safety. Their telematics platform processes sensor data to detect dangerous driving, assess risk, and potentially prevent accidents. This creates a different bar for “move fast and break things.” “We are a little bit more diligent because at the end of the day, we have to save people’s lives. So for us, we’d rather spend the time beforehand than reactively trying to address it.” The stakes are high—both financially and ethically. When your technology directly impacts human safety, you can’t afford to ship fast and fix later. The constraint isn’t just technical complexity—it’s consequence of failure. “AI tools can take you north, but with the same speed, they can take you south.” In safety-critical systems, the review time, the testing time, the verification time doesn’t compress even if code generation does. You can’t ship and iterate rapidly when mistakes could harm people. The overall productivity gain shrinks accordingly because the non-coding portions of the development cycle remain unchanged. This applies beyond telematics. Financial systems. Healthcare platforms. 
Infrastructure control. Any domain where errors have serious consequences faces the same limitation: AI can accelerate code generation, but it can’t compress the necessary validation and testing cycles. Where AI Struggles AI’s limitations show up in unexpected places. CMT uses AI to filter thousands of resumes for each job opening. The results? “50% makes sense. And 50% don’t make sense.” This split illustrates a broader pattern. AI works brilliantly for well-defined, repeatable tasks. It struggles with judgment calls, context-dependent decisions, and situations requiring nuanced understanding. The tool saves time on mechanical filtering. But the judgment about who’s actually right for the role? Still human. And critically, the humans can immediately spot when AI recommendations miss the mark—they don’t trust it blindly. This mirrors the coding experience. AI generates boilerplate quickly. But understanding whether the generated code fits the broader system architecture, handles edge cases properly, and follows team conventions? That requires human judgment that doesn’t compress. Where This Leaves Engineering Leaders The mistake isn’t believing AI tools work—they demonstrably do in many contexts. The mistake is assuming your context will see the same gains as someone in a completely different situation. Raju’s systematic evaluation reveals the variables that matter: Your problem domain determines gains. Web apps and prototypes with established patterns can see significant productivity improvements. Complex distributed systems with unique requirements tend to see incremental improvements. The difference isn’t the tool quality—it’s how much of your bottleneck typically sits in code generation versus system design. Your constraint defines the impact. If implementing features is your rate-limiting step, AI delivers massive value. If architectural decisions and system design are your constraint, AI helps less. 
Most production systems fall into the second category after the initial prototyping phase.

Your risk tolerance changes the math. If you can ship and iterate rapidly, AI accelerates that cycle. If mistakes have serious consequences, the review and testing time doesn’t compress proportionally. The overall velocity gain depends heavily on how much of your process can safely be accelerated.

Your system complexity matters. Greenfield projects with established patterns see huge gains. Legacy systems with unique constraints and interconnected dependencies see modest gains. The complexity of your codebase directly impacts how useful AI-generated code becomes.

The Honest Assessment

Raju isn’t claiming AI tools are overhyped. He’s providing the nuanced reality: they work extremely well for specific contexts and deliver modest improvements in others. His 6-month tiger team experiment with dedicated tracking hasn’t found a productivity revolution. It has found incremental gains with clear constraints. That’s the honest picture engineering leaders need for planning.

“LLMs can help us experiment and prototype features faster. They can help developers catch mistakes in our pull requests. They can help us find answers faster, and we are constantly evaluating,” he explains. “But I’ve not seen the impact that people are saying out there.”

This doesn’t mean ignoring AI tools. It means understanding your context, measuring systematically, and setting realistic expectations. For rapid prototyping and web development? The 2-3x gains are real. For complex production systems with safety requirements? The gains exist but are much more modest. Both can be true simultaneously; the difference is context.

What This Means for You

First, measure systematically rather than relying on anecdotes. Set up dedicated tracking like Raju’s tiger team: assign ownership, establish regular reporting, and gather actual usage data.
The hype cycle around AI tools means everyone has an opinion, but data reveals what actually works in your specific context.

Second, understand where your bottleneck actually sits. If architectural decisions and system design consume most of your time, AI tools will help less than if code generation is your constraint. Be honest about what’s actually slowing you down before expecting AI to solve it.

Third, adjust expectations based on risk profile. If your domain allows rapid iteration and tolerable failure rates, AI tools can deliver significant acceleration. If mistakes have serious consequences, the non-compressible validation cycles will limit overall gains regardless of how fast code gets generated.

Fourth, keep evaluating as tools improve. Raju expects capabilities to expand significantly over the next 6-12 months. Today’s limitations may not be tomorrow’s. But base your current planning on current capabilities, not projected future states.

The question every engineering leader should ask: What’s actually constraining my team’s velocity, code generation or everything else? Because if it’s everything else, AI coding tools will help incrementally, not transformationally. And that’s okay: incremental gains compound over time.

Raju’s measured approach provides the reality check the market needs. AI tools deliver real value, but the magnitude depends entirely on your specific context. Understanding that context is how you set realistic expectations and make smart adoption decisions.

High Output is brought to you by Maestro AI [https://getmaestro.ai]. Raju talked about forming a tiger team to systematically track AI tool adoption with biweekly status reports, but that measurement challenge extends beyond just AI tools. When your 200+ person engineering team is distributed across four countries and multiple tools, it becomes impossible to see what’s actually happening without systematic tracking.
Maestro cuts through that complexity with automated reporting and metrics that show where your team’s time and energy actually go, so you can spot patterns and make data-driven decisions about everything from AI adoption to resource allocation. Visit https://getmaestro.ai to see how we help engineering leaders get actually useful insights into their teams.

Running systematic evaluations of new tools and processes? We’d love to hear your approach. Schedule a chat with our team → https://getmaestro.ai/book

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit maestroai.substack.com [https://maestroai.substack.com?utm_medium=podcast&utm_campaign=CTA_1]

Dec 11, 2025 · 36 min
