Episode 100: Normalization of Deviance

Description

This episode of the Acima Development Podcast centers on "normalization of deviance" — the pattern where small anomalies get repeatedly ignored until they cause catastrophic failures. Mike opens with the Space Shuttle Challenger disaster as the anchoring example: engineers warned that cold O-rings could fail, but their concerns were drowned out by schedule pressure and accumulated tolerance for small deviations. The crew connects this to the Columbia disaster years later, where the same organizational lesson went unlearned, and to NASA's own "Elements of Engineering Excellence" report, which lists not questioning anomalies as a major root cause behind their biggest failures. The conversation then wrestles with the tension between safety culture and velocity. Will pushes back on pure risk-aversion, arguing that heavy regulation has real costs and that tech's "move fast and break things" ethos has produced enormous value. Dave introduces the META framework (Mitigate, Eliminate, Transfer, or Accept) and contrasts NASA's culture with SpaceX, which celebrates blowing up unmanned rockets because the risk was already accepted and the explosion yields data. Mike reinforces this with an analogy from his kid's rocket-themed birthday party, where different risk levels (model rockets, sugar rockets, thermite) warranted very different safety boundaries — treating everything as maximum-risk would have obscured where the real dangers actually lived. The group lands on a key reframe: rather than trying to control everything, build a monitoring culture that instruments heavily, tests to failure, and pays attention to the signal inside the noise. The final stretch applies these ideas to current software practice, including AI-assisted development. Matt and Dave debate whether vibe coding will dominate production code soon, with everyone agreeing humans must remain accountable for what ships. 
Will gives concrete examples of normalized deviance developers live with daily: thousands of ignored compiler warnings (some of which are genuinely dangerous), bloated mobile web performance, and test suites nobody expects to run clean. He notes AI could finally make the ditch-digging cleanup work economically viable. Mike closes by tying it back to the opening theme: entropy is the default, letting things slide is easy, but flipping the culture toward actively watching the data is what prevents small deviations from becoming the next tragedy.

Transcript

MIKE: Hello, and welcome to another episode of the Acima Development Podcast. I'm Mike, and I'm hosting again today. With me, we've got Will Archer, Dave Brady, and Kyle Archer.

DAVE: Howdy, howdy.

MIKE: I'm going to start with a story, as I typically do, actually two stories, but one funny and one not at all funny. I'll start with the funny one. My wife, when she was in her late teens, decided to drive with her sister to college. She wasn't going to college yet, but she decided to road trip with her sister to college. And they made sure the car was good the day before, had been doing some maintenance, and they cracked the case of the cooling fan for the engine.

WILL: Oh.

MIKE: So, when the fan was running, it was bumping against this cracked part of the case, so you can imagine the sound of that, not good, right? They actually took it to a mechanic and got kind of a loose sign-off that, "Yeah, well, this isn't going to make the car break, but it's going to sound terrible, and you should get it fixed soon," like, "Okay." And they drove cross-country [chuckles] with that thing rubbing the whole way. And what they did is they just turned up the radio, so full volume, full road trip. They drove for, like, two days [laughs] with the volume cranked up, just ignoring it. And she's told the story for years. It's funny, you know, everybody in the family laughs.
You can just imagine, just turn up the volume, and the problem goes away. That is one way to make a problem go away. The other story is related to what we talked about in our last episode. And we're going to continue with the topic we talked about in the last episode, which is the Space Shuttle Challenger disaster, which happened in, let me check my dates here, '86. I believe this happened in...

DAVE: '86?

MIKE: '86. That's the date that I was remembering, so 1986. There it is: 1986. So, I actually looked this up. I read about it on Wikipedia. As a kid, I remember watching this [chuckles] in school, and it was, you know, horrifying. So, they had O-rings around the booster engines that, you know, like rubber or rubber-like material. And they had had record cold, I guess, at the launch pad the night before. And that cold caused the O-rings to, you know, shrink and stiffen. And so, in the launch, they lost integrity, so air started getting into the fuel. Eventually, that caused a catastrophic explosion, and the entire spacecraft disintegrated. I remember the horror of seeing those booster engines just randomly wandering, and there was not anything left of the main craft. It was a tragedy, you know, a horrible tragedy. Anybody who was around that time remembers. I was talking to somebody else like, "Oh, that was our JFK moment," you know? Everybody remembers that. Where were you when that happened? And it turns out the engineers had warned this might happen, and they were ignored, because there was enough noise in the data. They're like, "Oh yeah, well, there are so many things that can go wrong [inaudible 03:18]

DAVE: Not just ignored, though, right? They were told to stay in their lane.

MIKE: I think that's right.

DAVE: If I recall. Yeah. They were told to be quiet, yeah.

MIKE: The interesting thing about that is there was another space shuttle disaster some 20 years later or so, where Columbia broke up in re-entry.
And the diagnosis afterward essentially said, "We didn't learn from the last time." There were likely problems that were pointed out by engineers, and there was just so much pressure to make this thing work that the concerns were ignored, and people died as a result. And in the document that we started talking about in our last episode, which is...certainly you can look it up yourself. It's titled...this was published by NASA titled Elements of Engineering Excellence. It was published in 2012. They made a list of root causes behind the major problems that NASA had had over the decades previous. Last time, we talked in depth about the importance of hands-on experience, that unless you have people who really have, you know, kind of gotten their hands in the work and understand it deeply, then you're going to miss stuff. The second point is what they call normalization of deviances. They also refer to it as not questioning anomalies. I'll quote from the report, "As was evidenced in the Challenger failure, we see deviations, and they're not quite normal, but seem to have no major consequence. After seeing these deviations a few times, we accept them as normal and ignore them. The result is a major failure where the deviation becomes catastrophic." So, that's our main topic for today, is the importance of questioning those anomalies and being able to see that signal inside, you know, a bunch of noise, because there's always noise [crosstalk 05:18]

DAVE: There are a couple of interesting extrapolations on that as well.

WILL: So, I have some thoughts about, like, sort of, like, these sort of, like, normalization of deviances and ways that it can go wrong. But, like, I suppose, like, and maybe this is just my priors, but I'm very much a believer in, like, a move fast and break things sort of ethos. Like, I'm familiar with heavily regulated, heavily controlled industries where, rightly or wrongly, there are high stakes, people die, right?
And let me tell you right now that there is a cost to that. There's a substantial cost to that. And I do think that technology, in general, is pretty out of control in terms of, like, accountability, right? I mean, if you look at, like, you need a license to braid hair [laughter]. But, like, I didn't even need to graduate high school to do the job I'm doing. I just needed to convince somebody to give me a shot and then not get fired for long enough, and then you're in.

DAVE: And we're writing software that handles people's money for them.

WILL: Yeah, to the tune of billions of dollars, you know? And it's just like, "Yeah, you know, he sounded like he knew what he was doing. Let's roll," you know, which is fun. I think that's wrong. But, I mean, you can't argue with results of the industry that we've been in, right? And I think there's benefits there, and there's a lot of stuff where, yeah, you can let it slide until it blows up. You could do that. That's a strategy. It's a valid [crosstalk 06:56]

MIKE: Well, and not only is it a strategy. It's a critical one.

WILL: Initially, right?

MIKE: Yeah. Well, absolutely. And even in regular life, you can't pay attention to everything. Attempting to do so would not end well, right? Our brain is very good at removing extraneous information. You can't pay attention to everything. So, you have to prioritize what you actually give attention to, and that better be the important stuff.

DAVE: There's a rule in insurance, which is if you can afford to replace it, don't buy the insurance. But if you can't afford to replace it, don't even ask how likely it is that you're going to lose it. You have to get the insurance. The entire science of risk assessment is getting people to stop thinking about reducing the likelihood of a catastrophic fault and dealing with the case of when it is catastrophic, right?
It's like, if you're going to take, "Oh, this hash collision can happen one in 10,000 times, and it will bankrupt the company," and I turn the PR back to you, and you say, "Okay, well, I've reduced the likelihood to one in a million times, and it still bankrupts the company." No, I'm not going to approve that PR. You're trying to reduce the likelihood of something that will end us all, when what I need you to do is mitigate it, right? The META rules for...M-E-T-A: Mitigate, Eliminate, Transfer, or Accept on any given risk, right? And if we can't accept it, then you have to mitigate, eliminate, or transfer. And the thing about the Challenger discovery that I love is that it's mirrored by SpaceX. One of their first unmanned rockets went up, and they start cheering. It gets off the launch pad; they're screaming; they're going nuts, and then it explodes. And somebody opens champagne, and they keep screaming and cheering. And the reason...it was unmanned. There were no people on it. And they were normalizing science. They were saying, "This is successful collection of a data point, and the risk that we assumed was entirely mitigated." Once it was off the launch pad, they said it was all icing on the cake. This is absolutely 100%. We accept this risk. We are in the black for days, right? We can burn this rocket, and it's fine. And they used that to normalize that and create a culture of psychological safety, and let's move forward with this. But the normalization of deviance is kind of based on this weird thing where humans tend to reach for confirmation, and when you're trying to prove that a rule doesn't hold, the only thing you care about is exceptions to the rule. Is there a thing that violates the rule? Like, I can't remember the name of the rule. There's a really cool psychological test that I learned last week, where you set out four cards with numbers and letters on them. I'll dig it up for later in the call if it's relevant. 
But the important thing is, you're like...if I tell you every person in this bar that's drinking alcohol must be over 21 and I ask you, "Tell me if that's true or not," you know that if somebody is over 21, you don't need to know what they're drinking. And if somebody's drinking a soda pop, you don't need to know how old they are, right? But we go for that. You're like, "You're 35. Are you drinking a beer?" That's the confirmation case. You need to be looking for counter cases. Are you underage and drinking booze? If you're drinking booze, are you underage, right? Those are the counter cases, and that's the only thing you care about when you're trying to prove a negative. When you normalize deviance, you are throwing away the counter cases and grabbing confirmation, confirmation. And, eventually, your META rule, you end up accepting a risk, and it's catastrophic.

WILL: So, one of the things that I worry about, right, is this sort of psychological need for control, right, and, like, people's psychological need for control on emergent systems that are nearly impossible to fully model inside your brain. And, like, all of us can think of examples of really catastrophic failures. We've all blown things up. We have blown the rocket up. And we've blown the rocket up to a degree where it's like, "Hey, you know, we could lose the company. This company could not be a company, and we could all have to work somewhere else very soon." We could all think of those. And so, the question becomes, right, is there a productive safety culture that can really eliminate, like, really, like, look at these deviances to a specific and scoped way where you can get a level of certainty where the juice is worth the squeeze, and you're not just sort of navel-gazing and being, you know, petty and fooling yourself, right? You're trading velocity for the illusion of control.

DAVE: Right. The fun police or the policy wonks on your team they just want to slow things down.
It's the foolish consistency is the hobgoblin of little minds. It's like, we followed every checkbox, and we did absolutely nothing wrong, and that's why the company went out of business, because we didn't make any money. Yeah. You're focused on the wrong things.

WILL: Well, yeah, absolutely, absolutely. And it's just, like, killing your productivity, killing your velocity. Like, I have run into this in many respects where people will...One thing that I've seen go dangerously awry is people's focus on shallow indicators of code quality, you know? Like, where every [inaudible 12:31]

DAVE: 82% C0 code coverage.

WILL: Yeah. Well, no, I mean, I don't know. Like, where you'll go through, like, three rounds of code reviews with no substantive architectural improvements, right, like lateral moves because people don't know what's important and what's not important, but they definitely want to put their stink on it, you know?

MIKE: Well, I've been thinking a lot about this importance thing that you brought up, because it matters. So, I've been thinking about analogies. So, I haven't thought about this in a while. When my oldest was around eight, we threw a rocket-themed birthday party, and we had fun. Everybody who came made little model rockets and launched them. And we made what they call rocket candy. You mix sugar and potassium...

DAVE: The sugar rocket fuel, potassium nitrate?

MIKE: Yeah, potassium nitrate.

DAVE: Sugar rockets, yeah.

MIKE: Yep. And we made some of that, and lit it on fire and watched a big fire [laughs]. And we made some thermite.

WILL: Oh! [laughs]

DAVE: I want to go to your birthday parties. Holy crap.

MIKE: [laughs] And lit that with magnesium ribbon, and that was fun having molten metal [laughs] in the backyard. And I'll tell you, so the model rockets, heavily controlled. Those things have been, like, standardized for I don't know how many decades. They made them.
They put their own engines in those, and they were able to launch them, and everybody laughed, and it was great. So, you know, eight-year-olds wandering around with rockets in their hands. The rocket candy, everybody was at least 10 feet away, right [chuckles]? It was enough. And the thermite, everybody was at least, like, 30 feet away, right [chuckles]? They were on the other side of the property. They could see it, and they could see the fire. There was no eight-year-old anywhere close because not safe. And there were very different rules applied for each of those different risk levels, and that was important to identify because if I had tried to make all the eight-year-olds obey those thermite rules with their rockets, they'd have had no fun, and they wouldn't have known where the boundaries really were, right? They wouldn't have known, "Well, I actually need to be careful of this," because you're treating me like this rocket is super dangerous that's not actually that dangerous, you know, there's a lot of controls around it. And so, I don't really know where the boundaries are. And it was really important to establish ahead of time, well, what's sensitive and what's not? Because that allowed us to pay attention to the things that really mattered.

MATT: Thermite and firearms, favorite games.

DAVE: Thermite and firearms, yep. There's an interesting...It's not the reverse of it. It's not even the countercase. It's an agreement in it in, like, the shadow of it, which is kind of going back to, like, the Challenger disaster. We had normalization of deviance, and it isn't that cold O-rings kill people. It isn't that foam is falling off a shuttlecraft is deadly. It's the bigger thing hiding behind it that you're normalizing this thing. But if you're not paying attention to this, what over here is even bigger that you're not paying attention to?
And I had two things...I learned this lesson really well because I had it happen in two places kind of at the same time, information that came in. One of them was I was in a fast food restaurant. We walked in, and it wasn't busy. All the tables were filthy, and it wasn't, like, immediately after the lunch rush. And the guy that I was with, he's like, "No, we're going." And he turned around. I'm like, "But, I mean, what are you talking about? The food here is pretty good." And he says, "No, come." So, we went to another restaurant, and I'm like, "What was all that about?" And he says, "Here's the thing. If they're not cleaning the tables out in front where we can see them, what do you think the kitchen is like where we can't see it?" I'm like, "Oh, that's a really good point." That same week, I had started a new job at a company in Salt Lake City that...they're like a Groupon clone. They were doing financial, you know, manipulation stuff, like, batching together coupons and stuff and deals for people. And the CTO had a really cool hip line of like, "We move fast, and we break things, and we only fix it... We don't polish rivets here. We only fix things good enough to ship it." And I'm like, "All right. Yeah, let's get some money. We're a startup. Let's absolutely do that." So, I hired on, and my first day, he's like, "Okay, cool. We need to get you on the board for the pager." I'm like, "Okay, well, pager isn't my big thing." And I said, "How often does the pager go off?" He says, "Oh yeah, you're going to need the pager every night because every night, you have to reboot the server at two o'clock in the morning." "Your production server goes down every single night, and you consider this business as normal?" "Oh, yeah, it's totally fine." I turned in my badge. I walked out. I quit the same day, the first day. And the pager was the thing that did it. 
And it wasn't the pager; it was, if this is okay to you, you've just told me a lot of things about how much you value my good night's sleep and my value as a person, but also everything else in the company. And when I found out a year later that the CEO was on trial for financial fraud, I'm like, "This surprises me exactly not at all." Like, everything in that company was move fast and take what you want, and hope you don't get caught.

WILL: You already said Utah startup.

DAVE: Yeah, yeah, exactly. Utah startups, yeah, yeah. Tell people --

MATT: I feel like [inaudible 17:48]

MIKE: Acima was a Utah startup [laughs].

MATT: I feel like we've crossed paths 20 years ago. And I'm sure of it now. Because I built one of those companies that was doing the same thing back at the same time.

DAVE: Oh wow.

MATT: That was acquired by the other company.

DAVE: Oh, interesting. We'll have to talk offline.

MATT: And the founder also happened to end up, I believe, in prison. So, yes.

DAVE: Yeah. We'll have to talk afterwards. That's fantastic. That's the other fun thing is that the Ruby community is a small, small world. Yeah.

MATT: Going back to what Mike said and then you extended on, it's constraints, right? And then we're talking about architecture. And while things may not fail on their own, when you put them together in a systems architecture, and then you apply pressures, that's when you start to see failures, right, of those constraints. And I think a lot of people overlook that architecture as a whole and are losing sight because they can't see the forest above the trees, or through the trees rather.

WILL: Yeah, but, you know, I guess, like, and I don't know whether we'll be able to, like, resolve this to, like, a satisfying conclusion. But, like, people talk about, like, the Challenger disaster. This guy was talking about, right, I think it was foam, right? Like, foam was falling down and damaging the heating tiles, right?
And that compromised the thermal integrity of the space shuttle, which made it blow up, right? And they told him, "Hey, shut up. Shut up. We got to get this thing in the air, you know, whatever. The project's got to go," right? And so, we all, because of the tragedy and the benefit of our beautiful hindsight vision, we're all like, "Oh, well, obviously, this man is a hero. These evil, greedy executives were the villains," you know what I mean? And one guy wears the white hat, one guy wears the black hat, and then boom, [claps], put a bow on it, ship it. But, I mean, just because, like, I am the way I am, I always think about like, okay, well, how many times did the executives say, "Shut up. It's fine," and they were right? You know? You know what I mean? And, like, I don't have that information, but I do know for certain that there is this temptation for all of us to assume that if we just do everything well enough, it's not going to blow up in our faces because we can have it under control.

MIKE: So, I want to take that and combine it with the SpaceX example that was brought up before. It's going to blow up. If we're talking about rockets, yeah, they're going to blow up. You're not going to start a rocket company or, you know, a government rocket program that's not going to have a lot of things blow up. If you go into it with that mindset and start blowing things up on purpose, say, "Yeah, I'm going to have blow...these things are going to blow up," it changes your approach to the problem versus saying, "I'm going to try to control absolutely everything so that nothing will blow up."

MATT: Try to test your failures. Push the [inaudible 21:15].

MIKE: Exactly. Yeah, fail on purpose. Learn from it.

MATT: Yes. And I'm wondering, you know, and I don't, again, like you just stated, Will, I don't have the information. However, that foam may have very well passed temperature testing, right? However, you add velocity to that; did they test it at velocity?
Because things get fed more oxygen. They ignite more quickly, and, exponentially, things go bad. So, it also illustrates test your edge cases, right? That's an important thing. You can't always predict edge case. But as Mike just stated, you need to try, right? Try to determine your failures. Try to test those failures, and you're going to have much better success than just saying, "Okay, no, we want to make it perfect. Here's our MVP. This is best-case scenario. Everything's successful. Let's send it."

MIKE: So, the testing to failure is very different from testing that it meets certain parameters. If you test, "Oh yeah, it didn't fail within these parameters," and the failure point was, like, 1% away from that, you have no idea whether it's 1% away or you've got, you know, tons of headroom, you know.

MATT: That's right. Test it till it breaks.

MIKE: Yeah. And that approach, that change in mindset, that very fundamental change in mindset is a big deal. And it's kind of the difference between waterfall-style software development and agile development is, in one case, you try to control everything and inevitably fail [laughs]. And the other approach you say, "I can't control everything, so I'm not going to try to. Instead, I'm going to take an alternative approach where I build a small prototype, test it out, and go into a loop so that I know far more about the process as I'm going on." So, you plan...You're still planning, but you're doing just-in-time planning rather than attempting to cover all your variables before you could possibly know all the details.

WILL: Yeah. And, like, some stuff you got to test in prod. That's one of the things, I mean, like we talk about, like, SpaceX, right? I believe, you know, we return to the analogy, right, where they blew up that rocket, and they were so happy about it. Like, they were pretty sure that rocket was going to blow up. They didn't want the rocket to blow up. I don't think they were trying to blow the rocket up.
They were trying really hard to not blow the rocket up. But even still, they were like, "There's no way fucking way this makes it all the way," you know? And so, when it got off the launch pad or whatever and it blew up, you know, on the first stage decoupling, and it was just like, "That's a great win." I think that's...they've embraced, like, you know, the futility of the illusion of control, where, like, you just can't test a rocket on the ground. You can't do it.

MATT: No. And you can't predict everything. I mean, let's face it, this is reality. There is no way to predict every variable. And, you know, some of us on this call witnessed Challenger. You know, I remember sitting with a group of children watching it live on TV and then watching it happen, and I will never forget it. I can picture everyone next to me, their face, the reaction, you know, similar to 9/11, same thing. But you can't predict everything. But you can force failure.

MIKE: So, if you set up a [inaudible 24:56]

DAVE: Yeah. So, at the end of the day, we're all testing in prod every single day.

MIKE: Well, yeah. But if you accept a culture of monitoring where you are looking at the anomalies and paying attention to them [laughs] and doing something about it, this is kind of where we launched this conversation, right? Then rather than trying to...It's the opposite of control, right? You assume, I don't have control, so I'm going to watch everything I possibly can to see when things start going out of bounds. So, you develop a monitoring culture rather than a control culture. And I think that's a big deal. Like, we talked about SpaceX. I'm sure they had all kinds of instruments on that rocket that blew up to figure out what went wrong in every possible way [chuckles]. They didn't know what would go wrong, but they knew something would, so they instrumented that thing to death. "Let's look for all the anomalies we can see." And the next rocket, I bet they did something for almost all of them.
MATT: I think this speaks to culture as well, you know, NASA versus SpaceX. And I will admittedly say that I am a fan of what Elon's doing. Like, I will not hide that because he is innovation king. But you operate under government regulation, bureaucracy, constraints, and then you go privately held with someone who's a visionary, wants to push boundaries. You see the success rates, right? And those success rates are exponential with what SpaceX can do versus what NASA can do. We haven't... I mean, we're, as far as I know, we're still on x86 architecture on the space shuttle, and I can guarantee you SpaceX is not.

MIKE: Well, and that culture there, you know, we can ascribe it all to one guy, but it's not, right? I mean, there's Gwynne Shotwell [inaudible 26:45]

WILL: I'll give Elon credit.

MATT: He's the visionary behind it.

MIKE: I'll give him credit. I'm not taking away all the credit. But --

MATT: He's the visionary behind it.

MIKE: You can't just have one person. One person is not culture.

MATT: No, but it starts at the top.

DAVE: Well, and we live in a world of identity politics right now, and I think a peace offering we can say on both sides of the aisle is that the people that don't like the identity have a problem with that person, right? With Elon. I don't hear anybody on either side being upset about electric cars, or about having a space program, or maybe getting off the planet and saving humanity, or dealing head-on with the AI potential extinction of the race. We like what's going on. We like what's coming out of there. So...

WILL: Well, you know what I mean? Like, I actually, like, I really like this analogy because I think as you look at other things, you can see this culture. You can also see the limitations of it in that he wanted to build out a rocket company from scratch, right? And he wanted to do it in the most capital-efficient way that he could.
And he, I think, correctly ascertained that, like, okay, the fastest way to get to orbit is to test it in prod, right? So, blow up a lot of rockets, right? Like, I'm not going to spend years and years and years in the wind turbine, you know, like all this stuff. We're going to shoot some rockets off, and we're going to blast them into space. We're going to see how things go. We're going to learn and iterate very rapidly, right? And they're all going to be unmanned rockets, which... initially at least, right, NASA couldn't do that. That wasn't on the menu for NASA. Or well, I mean, I guess that's not true --

DAVE: Well, because, like, Sputnik and the [inaudible 28:33] stuff, sure they did, yeah.

WILL: Yeah. Well, no, no. When NASA got started, they had a lot of acquired experience with one-way rockets from...

DAVE: That's fair.

WILL: Very [inaudible 28:34]

MATT: Yes, yes, yes. I know where you're going with that one.

DAVE: Yes, yes.

WILL: They were doing one-way trips almost from the very beginning. But regardless, right, the point I want to make, though, is there comes a point where this move fast and break stuff thing and the complexity around the emergent system starts to consume you, starts to swallow you whole. And what we have seen, like, I'll go ahead and call it. I don't want this to be the Elon show, but, like, they've been advertising robotaxis for a very, very long time. And I think that system, the complexity has gotten out of hand on it. And I don't think those robotaxis are coming because I think the move fast and break things, iterate quickly, kind of messy architecture culture has...I think the tech debt around autonomous driving has completely stalled out their progress. I think they're stuck, frankly.

MATT: Well, I think...and you kind of led me to a perfect segue here. And I'm going to go extremely, extremely old school and maybe a little bit off topic, but it takes visionaries to change the way we do things.
I'll go back centuries: Leonardo da Vinci pushing the boundaries, trying things that everyone else thought he was absolutely insane. Next, Nikola Tesla and how he was obsessive, and it destroyed his life. He died broke and alone. But he changed absolutely everything for the world, right? And we need that. You can't get stuck in technology, bureaucracy. You need innovation. You need to push boundaries. You need to test outside of those boundaries to really make progress. And I think, to me, and, y'all know me, that's the most important thing there is to me when it comes to the world of technology and the things I do and what I'm trying to push. And sometimes I'm going to be wrong, but you have to be wrong to become right.

MIKE: Well, let me take that. So, you talk about the robotaxi and the visionary. Yeah, I think you have to be a visionary, and sometimes you have to admit that you're wrong. I think that, yeah, the robotaxi's not been successful, and part of that is it's been thus far technologically impossible [chuckles]. There are challenges to making that happen that nobody has solved yet. And --

WILL: What are you talking about? They're done. You can ride in one.

MIKE: You can, with Waymo, because they didn't say, "Hey, we're going to end-to-end learn this." They said, "It's not within modern tech, so we are going to have a really sophisticated 3D map of the environment we're going to work in. We're going to use LiDAR on top, and so we're going to use some algorithms to locate where that vehicle is every time given that 3D map. And all that vehicle's going to do is follow the map." And that's what they do, and then they just use a little bit of the machine learning. I mean, they still have to have the vision to look for a collision. So, they're doing some collision avoidance, but they're solving a different problem. They decided this tech isn't here, so let's solve the problem with technology that actually does work, and so they're successful.
So, they've approached the problem a different way. Now, you need to try stuff that fails sometimes, right? So, I think it was great to say, "Hey, let's try this with end-to-end learning. Let's see what we can do." At some point, you might need to realize it's going to bankrupt your company trying to do it because it's not going to work. Sometimes it works; sometimes it doesn't. Yeah, you need to experiment, and sometimes it's not going to work. MATT: That started something revolutionary, though. Yes, they have constraints, right? They can only do it in x amount of cities where roads are in certain conditions, because of LiDAR, and, you know, collision detection is vector maps and machine learning gets a little scary, to me, because probability versus determinism. But you have to start somewhere. MIKE: You do have to start somewhere. MATT: And what they're doing is going to revolutionize the industry, and it's going to change the way we navigate the roads. WILL: Tesla? DAVE: I think we're solving it from the other end, much, much farther than we've ever been. I overheard Uncle Bob Martin talking. He's got a Cybertruck. Now, I don't like the Cybertruck. That's a personal aesthetic thing for me. Honest, I joke with people that somebody designed a nice, beautiful SUV, and they modeled it in 3D, and they accidentally sent the bounding box to the fab of the render [laughter]. And that's what they got back. But Uncle Bob owns a Cybertruck, and he put on Twitter a little while ago that he's put, like, 100,000 miles on it in three years, and 80% of it has been auto drive, and, like, he won't live without it. For somebody to be that much of a road warrior and to straight up say, "80% of this is solved," we've never been that far, and every year we get closer and closer. And Waymo, they have to solve that other 20%, and, like, 5% of it is, like, road construction problems that LiDAR can't deal with, so they cut that off. 
They probably just won't deliver you to those areas, right? So, they stay within that. And we're getting it closer and closer, and that's what kind of excites me. Circling back to AI a little bit, I said, like, five or six years ago when the self-driving cars were coming, I joked to somebody that, like, we look at AI, and we say, "It'll never be there. It'll never da, da, da, da, da." But our kids are going to talk to each other and go, "Can you believe Grandpa got in that 2,500-pound machine of death and controlled it by hand at 100 feet per second? Are you nuts?" right? We're seeing this with vibe coding. We're past the tipping point. There are companies now that are literally saying, "Why would you let a human touch the crypto code?" Or, "Why would you let them touch this piece of the security stuff?" And by next year, like, 80% of software...I don't know if it's 80%. WILL: [laughs] DAVE: But anytime I make these bets, I always take the under, and I always win if I take the under because it's going to hit, and it's going to hit faster than I think it's going to. And I'm calling it, like, next year that over half of the code that we do in prod is going to be...it's not going to be vibe code. We're not going to use that word because it's a four-letter word, but reliable automation in prod that's handled by an automated system with, you know...And I don't want to get into that. It's a different podcast. WILL: Not a chance. Not a chance. MATT: However, I will get on that bet with you. WILL: No way. Not, not -- MATT: Just based on some of the things I'm aware of. WILL: I would say, like, I don't know, I mean, maybe it's just the domain that I work in, but, like, 80% of the code that I write these days is generated by an LLM. But there's not a snowball's chance in hell that that LLM is ever going to replace me. The LLM cannot exist without me. DAVE: Oh, yeah. MATT: No. WILL: I can exist without the LLM. MIKE: And nobody's arguing otherwise. MATT: Yeah. 
I don't -- DAVE: I think we're all in violent agreement here, yeah. MATT: Yes. Nobody is replacing humans. At the end of the day, there has to be a human accountable for what goes out. Accountability and ownership is 100% important. Like, you cannot avoid that; otherwise, we end up in chaos, right? And then we see drift everywhere, and hallucinations, and Wild Wild West, worse than we've ever seen in the history of humanity. Like, there has to be accountability. DAVE: You've just named the next problem that we have to solve, yeah. Every time somebody says, "AI can't draw a hand with fewer than six fingers," the AI community says, "All right, bet. We'll see you in two more model revisions." MIKE: Well, and I think you just -- MATT: And every week, it changes. MIKE: You just brought it back to where we started. If we have a culture without accountability, then bad things happen. But if you -- DAVE: That's normalization of deviance, yeah. MIKE: Normalization of deviance. But if you're watching this thing, and you're developing all kinds of metrics to say, "I want to make sure this code has high quality," and you're establishing those standards and building the constraints to make sure that it is high quality, then you can get to that confidence because you watched it. WILL: Well, I mean, and, like, one thing that I'm very, very excited about AI in that one thing that it is good at, really good at, is, like, just petty ditch-digging work that people cannot stand. And I'll give you an example of, like, sort of, like, a normalization of deviance in ways that I think big and small, right? We need to, like, you know, we're asking, like, is the juice worth the squeeze? Well, in certain capacities, absolutely, it's worth the squeeze. I'll give you one easy and one hard example. 
Like, one thing, I'm looking at a build for a product that is getting a lot of usage, and right now I see 3,874 compiler warnings, of which I am certain at least 3,000 of which are completely spurious, completely nonsensical, totally useless. 500 are nice to have, get out ahead of this deprecated API, you know, you can get around to that. And 174 are a big, big problem. And I don't know which ones are which. And it'll be a real hairball, and, like, you know, in all honesty, right, because we've normalized deviance to think builds, to think [inaudible 39:02] that pass QA, right? And it's just like, "Ah, that warning's just a warning. It's just nothing," you know what I mean? "Let's tape over that check engine light because I got to get this release out on time." And that's a big problem because I guarantee you there are at least 100 warnings in there that are bombs waiting to go off. And everybody's build is like that. Everybody's test suite is like that. Everybody's...There's nobody who's like, "Okay, I'm going to run my test, run a real, you know, rake test," and, like, that output comes out squeaky clean. You know it doesn't. You know it doesn't. But it could, and maybe it [inaudible 39:43]. And that's a normalization of deviance, like a real serious problem. DAVE: Broken window syndrome, yeah. WILL: Yeah. Well, I mean, like, there's no reason that we couldn't clean these out. I wouldn't even know. There's a really good reason. Development time is expensive, and the juice isn't honestly probably worth the squeeze. DAVE: It's not always. Yeah, it's always a long tail or Pareto's rule, yeah. WILL: But my helper monkey, he doesn't get tired, you know? As long as the lights are on at the data center, he'll clock into work. DAVE: As long as I've got tokens. WILL: I have to check the GB thing, which is going to be burdensome. But I'm saying that's an easy one, right? And that's an easy win of like, "Oh, look, we can fix this. We can work this out." 
I think a more significant issue is due to the continual degradation and debasement of my intellectual and emotional health, I don't keep a lot of apps on my phone for consuming media, and I use the internet as it stands, right? Like, if there's, like, a Reddit article that somebody sends me, I just look at it on a mobile website. And I don't know if you guys have noticed this, but, like, the mobile web, like, the internet in general, is in a dire dumpster fire state. It's not like the internet does new things. It's just terrible. It's terrible, and it's gotten bigger and bigger and bigger, and worse and worse and worse. Page sizes have gotten larger, and APIs have gotten slower and more bloated, and we're loading more stuff for no reason. And, like, all of our performance is degrading and degrading and degrading and degrading and degrading. That has absolutely direct financial business impacts. And every single organization I have ever been a part of or interacted with on any level has suffered from this. Normalization -- MATT: Yeah, and I would -- WILL: "Oh, well, it's only 10 milliseconds slower," you know? "It's only a megabyte bigger." MATT: I won't get into the psychological and physiological aspects of that, but there's definitely an impact. Synapses are being reprogrammed, and attention spans are 30 seconds when they used to be hours. And, you know, the world has changed. MIKE: Well, we're kind of scratching on this into a different topic. So, I'd like to bring this together, this idea of normalizing, you know, these deviances. We've talked about changing culture, right? To having a culture where we pay attention to the data, and that changes things when you do so. And it's easy to let it slide. The default is to let it slide, and the entropy happens, right? But if you flip that and say, "Well, we're going to focus on watching stuff," it makes a fundamental difference, and can even, in some cases, lead to avoidance of tragedies. 
And I think that's a good place to end this. Until next time on the Acima Development Podcast.

Episode 98: Standups

This episode of the Acima Development Podcast starts with a discussion about the frustration of U.S. tax filing and uses it as a metaphor for poorly run standup meetings in software development. The hosts argue that many teams repeat painful, unnecessary processes simply because “that’s how it’s always been done.” From there, they unpack the most common standup failures: meetings turning into status reports, running too long, involving too many people, or becoming impromptu debugging sessions where only a few participants are engaged while everyone else checks out mentally. The panel emphasizes that these problems are usually symptoms of poor communication and coordination happening outside the standup itself. A major theme throughout the conversation is that standups should focus on coordination rather than status reporting. Dave Brady argues that if teams properly maintain tools like Jira or Kanban boards, everyone should already know the project status before the meeting begins. The standup’s real purpose is identifying blockers, avoiding collisions between teammates’ work, and quickly coordinating handoffs. The hosts debate alternatives like “Slack-ups” and asynchronous updates, with some arguing they fail to replace the human interaction and spontaneous coordination that happens in live meetings. They also discuss ideal team size, meeting frequency, time zones, and how distributed teams create additional coordination challenges, especially when work is handed off between regions. As the conversation evolves, the podcast becomes less about standup mechanics and more about human connection in remote work. Will strongly advocates for cameras and microphones being on during meetings, arguing that face-to-face interaction helps managers recognize burnout, disengagement, or personal struggles that text updates can easily hide. 
The hosts criticize workplace cultures that dehumanize remote or offshore workers by treating them as interchangeable resources rather than teammates. By the end, the group concludes that the biggest failure in bad standups is not inefficiency alone, but the loss of genuine human connection. Good standups, they argue, are ultimately about building trust, communication, and healthy relationships within a team, not simply exchanging status updates. Transcript: MIKE: Hello, and welcome to another episode of the Acima Development Podcast. I'm Mike, and I am hosting again today. With me, I've got a great panel. I've got Thomas Wilcox, Dave Brady, Justin Ellis, Eddy Lopez, and Kyle Archer (I think we're all returning crew here [chuckles]) to talk about our topic today. So, you're probably not listening to this, like, exactly when we're recording it. You're probably not even listening to it right when it comes out. There's always a recording period and then a publishing, you know, a week later, or a few weeks later, after it's gone through editing. And we have a bit of a queue in case we miss some time. It all works out. But we are recording this in tax week in the U.S. This was the week that taxes were due, and everybody has hopefully completed their annual suffering and has submitted those numbers to the IRS. I read about this before, and I read about it again this week. Articles are often published this time saying, "Why do we do this?" Well, it's a good question, because United States is actually fairly unique in the world in that we have to submit all these taxes every year. In many countries, most people don't have to do anything at all because if you're working for an employer, they've been submitting tax information to the government all year, right? They've been paying your taxes, and as long as you don't have anything funny going on, that's enough. 
The government knows about you, you know, they probably know how many dependents you have, you know, you've reported that. I mean, you reported it with your business. The information's there. And in much of the world, people just receive a letter saying, like, "Yeah, thank you. Everything's good." And they receive, you know, there's no refund or non-refund because it just works, right? They don't have to do anything. The cycle that we go through of pain every year doesn't need to happen. Now, the reasons for that have to do with...Well, I want to be careful here. Our purpose here is not to criticize large corporations who lobby heavily [chuckles] to keep the tax code as it is, well, to keep the tax submission process as it is. But such is our life, right? But where I was going with this is that we go through all the suffering because it seems normal, and everybody we know around us do it because it seems normal to go through all of this process of reporting something we've already been reporting with every paycheck for the entire year. It's just a rehash that we have to do in excruciating detail because that's how it's always been. But there are examples of people who do it differently, and they don't go through the same pain that we do. And I imagine that in their blissful lives, they have extra time around this season to do things other than pay their taxes or, you know [inaudible 03:14] DAVE: Must be nice. MIKE: It must be. Why do I talk about this? Most of you, if you're an engineer, have probably been in a lot of standups, which is a sometimes daily, sometimes weekly, some regular interval typically meeting where you have a chance to touch base and connect with other people on your team. And they can range from actually pretty good to something far from that [chuckles] to something that makes you want to quit your job right [chuckles]? Like, well, not another standup. 
The idea, you know, comes from this agile process where...and I think it's not even just engineering. You get together in a room. You want the meeting to be so short that nobody sits down, right? You go through the key things to make sure that everybody can touch base. Now, we have all kinds of communication channels, right? We've got, you know, our messaging platforms that we use. We've got the ability to go and walk over to people. There's lots of ways to communicate, but we decided that we're going to pay this cost of bringing a whole team. And this can happen at lots of levels. You can have executives getting together for a standup. You can have the team that reports to the executives getting together for a standup. So, you have a bunch of people, and that's an expensive meeting, right? Imagine the executives getting together for a standup. I don't know how many dollars that costs, right? I'd have to do the math, but it's not few. It's an expensive meeting where people have chosen to do that because they think the coordination is so important. But it can be done right, and it can be done wrong. It can be a yearly suffering, a period of suffering, like the taxes, that reports stuff that's already been known. Or maybe it's a meeting that ends quickly and touches on key information that not everybody knew because it was late-breaking, and it was a good opportunity to share. We're just going to talk about standups. It's something that we all live with, so it's worth talking about. So, I'm going to ask (I've given the intro), what have you seen? Well, actually, let me start. Let's start with the bad side. What is it that makes a bad standup? DAVE: Turning into a status meeting, for me. The thing that makes a standup go bad...and I will reveal the point that I wanted to make in today's podcast right out of the gate. 
The thing that makes a standup go bad is when you are not taking care of the things that you need to take care of outside of standup, and so they have to get taken care of in standup. When your standup runs really, really long and turns into a gigantic status meeting, it's because you're not communicating status outside of the meeting. And, actually, I don't need to put any more on that point. That's just like, if you don't take care of it elsewhere, it's going to hit here. When I look at a standup that's running long, I don't look at it as, like, this meeting is bad, I mean, it kind of is. But I look at it as, okay, what is the unmet need that is screaming at us so loudly that it's cratering our standup meetings? That is frequently a very helpful thing. If it's a status meeting, you maybe need to, you know, do better in, you know, one of your other practices. If you're arguing about cleaning up code, maybe your retro needs to be better. Yeah, that kind of stuff. MIKE: Okay. So, you said there are some other venues where this should be happening, that the status reporting should not happen in standup. I've been in a lot of standups that were about status reporting, so, you know, you're bringing up a common failure case. If that's the bad case...well, I want to come back to [inaudible 06:46] JUSTIN: There's more bad cases. We got more [crosstalk 06:50] DAVE: We should go through the counter good case to the status -- MIKE: Yeah. So, let's go to the other bad cases, but let's put a pin in that one because you said, status report: bad [inaudible 06:59]. So, what are the other bad cases? What are other bad cases around standup? JUSTIN: When they run long, and there's not a good reason for it. I mean, basically, when you go back to your summary, you talked about how everybody is standing up, and they don't want to, like, sit down, and you just want to quickly go through things and be done. 
If it's going more than 15 minutes, or it's going more than 20 minutes, whatever you have allocated, and it shouldn't be more than 20 minutes probably, that means everybody's looking at their watch. They're wondering about what other meetings they have to go to. They aren't focused. And you, all of a sudden, the only person who is paying attention is the person you're talking to directly, and everybody else's mind is just like, pshhh [laughter]. DAVE: And you're 100% guaranteed at that point...if your meeting's running that long because somebody says, "Well, I've got this problem," and then everybody dives in, too, you're now doing mob programming in your status meeting. Everybody's trying to debug it. You're no longer talking about what I did yesterday, what I did today, or what I'm doing today, and what are my blockers, right? You've definitely departed the format into something else. KYLE: Well, and it's mob programming at best, right? Because a lot of the time, what I see -- DAVE: At best. KYLE: Is it's one or two people programming. DAVE: Yeah, and everyone else is disengaged. KYLE: And then the other eight are just kind of sitting around twiddling their thumbs. MIKE: [laughs] DAVE: Mm-hmm. Mm-hmm. JUSTIN: Yeah. That actually brings up the other part to this, is, like, if there are too many people in your status meeting, sorry, in your standup. Personally, I think four, maybe five, is the absolute max you should have in your standup. You have any more and, all of a sudden, you run into that same problem. It's, like, you know, one person is talking, and everybody else is looking elsewhere. EDDY: Okay, but how do you manage that when you have a team of 10? JUSTIN: You have two standups. EDDY: But then don't you deviate from, like, status reports, in a sense? Like, isn't it important to also -- JUSTIN: What was it? Amazon? 
No, no it's a good point, and it's becoming really hard these days where, you know, you have the flattened hierarchy, right, where you have a lot of people reporting up to a single manager. But I think it was Amazon or somebody that said, "Hey, you shouldn't have a team that's larger than you can feed with one large pizza." If you are having status meetings with larger groups, it's not as effective. MIKE: And you can do it hierarchically, that is, you have your team of five people do a standup, and one delegate from that person, whoever's leading that meeting, themselves goes to a standup [laughter]. JUSTIN: Eddy just typed in the chat, "I could eat a whole pizza." Eddy, you are a team of one. You are very effective [laughter]. MIKE: I was actually thinking the same thing. Not today, but back in my heyday of eating, you know, like, late teens, I could put down two [laughter]. JUSTIN: Sorry, I derailed that but [laughter]. DAVE: The other thing that kills a standup meeting, and this is the one that if your workplace has the fun police guy in it, it's when standup turns into a BS session, when it turns into a water cooler type thing. And I stand by my earlier point that that's an unmet need. You've got a team that is not being properly socialized. When I worked at Cover My Meds, they had a really great policy that if you were remote, you had to fly into the head office every quarter for a week and just spend a week rubbing elbows with your teammates. We talked about this when we were talking about radical candor, that you basically had to make friends with your coworkers and get to know them. And we spent a whole week just playing card games and, you know, goofing around, and we'd go work, that sort of thing. But we overinvested in socializing and goofing off time, so that when we broke up and went back, the socialization now was just, like, a quick touch base of, like, hey, how are you doing, or how are your llamas? 
That was a real question from a real coworker, for a real coworker who really had llamas. You know, how's this going, or how's, you know, that side thing going? And if you don't have that investment in the socialization, it will come out at standup because humans are gregarious creatures. MIKE: So, what other failure cases do you see with standups? What about when there's a lack of psychological safety? DAVE: Hmm. They tend to run pretty quick. MIKE: They do [laughs]. DAVE: I worked on this. I'm going to work on this. I have no blockers. That's my report. Yep. Yep. MIKE: Every time. They run quick and accomplish nothing. DAVE: Nothing. Yep. The thing that I thought was interesting as I dug into...I dug a little bit into standup, like, history today before coming on the show. And I thought...it was kind of interesting because the three questions, like, what I did yesterday, and what I'm doing today, and what are my blockers, is not necessarily actually the point of the meeting. It's actually the scaffolding or a ceremony to draw people in. But the point of a standup is not status. The point of a standup is coordination. It's to make sure that you're not stepping on somebody else, or that this feature's going to be in play before my feature needs it, that sort of thing. And so, standup is arguably going well when somebody says something and then three people start to argue with them, you know, "What about this?" that kind of thing, as long as you don't spend 20 minutes, you know, diving into that. But as long as the pushback is, "Wait, wait, wait, my piece is up the pipeline for you, and it's not going to be in until..." you know, that sort of thing, that kind of discussion, that's coordination, and that's the point of a standup. And that's something you can't get...we'll probably talk about Slack-ups and Slack-based standups and that sort of thing before we talk about that today. 
But that kind of coordination is pretty hard to do in just, like, an RSS feed, where you just...here's what I worked on, here's what I worked on, here's what I worked on. And, unfortunately, in most waterfall-based or enterprise timekeeping systems, we just want to know what budget code to put your time against, and so we're not interested in coordination at all. So, the business is trying to extract...they're trying to extract your status from that meeting, which is a terrible countervailing force. It pushes the meeting into a status meeting. MIKE: Anybody else want to jump into the failure cases? DAVE: Yeah, I've got a sore throat, guys. I need you guys to take over the show [laughter]. MIKE: Well, we've covered some good stuff here. So [chuckles], there's nothing wrong with our current list. We've talked about just a status meeting, too big, too long, unsafe. Go ahead. JUSTIN: Not prepared, and by that I mean the best status meetings I've seen, or the best standup meetings, sorry, I've seen are ones that have been led by somebody who basically knows everything that's going to be talked about. And that goes back to, you know, communication by other channels and things like that. But, you know, if the leader goes in there and he's got a checklist of things that he needs to find out and he doesn't know clarity on all these items, I don't know if he's going to be able to find all the answers that he wants during standup and have it be as short as he needs to be. MIKE: That's great. And if we put all these together, imagine going to a standup where the leader's not prepared, has no idea what's going on, is going to likely mistreat the people on the team, so they don't want to go into any depth, but are mandated to share a long status. And so, that's what happens. You stand there within a large meeting, for hours, hearing everybody give a status that they could have reported. You know, basically, they're just reading out what happened in Jira. 
Does that basically cover it? DAVE: Mm-hmm. MIKE: Nightmare fuel [laughs]? DAVE: Or Tuesday [laughter]. MIKE: Yeah. I worked with somebody who had recently been promoted to management, and he called Tuesday poosday because he had so many meetings [laughs] from having these sorts of experiences. Okay. So, we've identified a set of problems, and we're engineering folk. What do we do to address these problems? And maybe to start, go back to the beginning. If a status report is the most common failure case, and how these often fall into, you know, how...I say...I'm not sure my preposition works there [laughs]. Standups often collapse into just a status meeting, instead of being something effective. Well, and we talked about how they can be useful, right? There are means to communicate information that's not being communicated elsewhere, to quickly resolve problems, make sure nobody's blocked, and take things elsewhere. It's not where you do the major problem-solving. It's where you set up the later coordination to address problems. Dave, you said you have lots of thoughts on how to address these things. And you say that it becomes a status meeting because that's an unmet need elsewhere, you know, it could have been done elsewhere. So, where should it be done? DAVE: That's a good question. Anywhere else. It should be done anywhere is actually a fair point. Standup is just the least good place for it to happen. In my career, we all have a love-hate relationship with Jira, and I definitely love to hate on Jira. But the best teams I've ever been on, for managing process-wise anyway, we could go look right at the board, and we could all tell where we were as a team. We all knew how this thing was. We all knew what feature we were working on. We knew what the customer was going to receive when we delivered it. So, we had kind of that high-level... I realize this almost sounds like I'm not answering the question, but I really am. 
We had this higher-level visibility that, like, I'm not just writing lines of code here. I'm actually...I'm shipping this feature, which is part of this, you know, this larger, you know, thrust that we're trying to get out to the customer in this next round of deploys. And when everybody has that status of this is what I'm getting at or this is what I'm headed towards, now any tasks that you pick up are focused towards this, and anything that you're working on is either in line with it or isn't. This feels a little nebulous, but if you can see where the team is at and where you are at, and you know what you're working at, you don't need a status meeting. And if you've got that on a big board somewhere, or if you've got it on, like, a Kanban board, if you've got it up on a wall, if you've got post-it notes anywhere, or if you've got, you know, CRC cards, it doesn't matter. It can be a burndown chart. It can be a burnup chart. That's actually the same thing, just upside down, however you do it. But the key thing is, do you know what your teammates are working on, and do your teammates know what you are working on? A lot of that gets bled away in pair programming because you're swapping pairs, especially if you're doing promiscuous pairing where you swap partners every day. Because I pair with you for the day, the next day I know what you worked on yesterday because I worked on it with you. And so, that part of the status communication goes away. It slowly weaves its way through the team, one partnership pairing at a time. So, yeah, I'm going to answer your question with your own question, which is, you know, where do we take care of those things? Anywhere and everywhere that we can take care of them. We just need to be intentional about what the need is. And I think that's what kills us in standup, is that we go in just assuming that, well, I'm here because it's 9:30 in the morning, and that's when we do standup meeting. 
And you're cargo culting the ceremony at that point, right? It's like, I'm going to go to this meeting. I'm going to do my three questions. And if you've got a good scrum master, then when somebody asks you a question, the scrum master will say, "Okay, stop. Kick that out to after the meeting." And that's how you keep your meeting short, just by punting that out. But that's all just ceremony. The whole point of the ceremony is to get people coordinating so everybody knows where we're all at together. MIKE: Well, I heard in what you said, if you're using your project management software, and it might not even be software, it could be your project management process that you handle on a board. Either way, it's your project management process. Then you remove the need for a status report because you're using your system to do it. So, if you're not maintaining Jira hygiene, if Jira is what you're using, if you're not keeping that up to date, the tool that your company is paying for and is intending to use for that, then you're going to be forced to do it somewhere else, which is worse. Is that a fair summary? DAVE: Yeah. And as you described that, I just realized there's another failure mode of standups, which is dissemination of knowledge, which is normally taken care of in your pairing. But this has certainly happened to me even here, where I will say, "Hey, I'm going to work on this piece, but I'm not sure where to attach into it." And someone else in the standup meeting will go, "Oh, well, you're going to have to grab this service class, and then plug it in with this thing over in the utilities directory." "Oh, okay. So, if I do, you know, can I mock that out this way?" And, all of a sudden, it becomes a technical meeting, right? And what's really happening is I'm pairing with another programmer. I'm just wasting everybody else's time while I do it. 
MIKE: So, failure is when maybe good things happen, but they happen with everybody else as spectators and forced spectators where they don't want to be there. That's not the movie they paid for. DAVE: Yeah. What it is, is it's the least efficient way to accomplish the necessary thing. It's not necessarily bad; it's just a terribly inefficient way to do it. We'd rather you just go pair off with one of the other people on the team and, you know, knock this out. But if you're not going to do it that way, it has to get done somewhere. So, standup's the next time you're going to see each other. MIKE: Just say, "You two, go work that out [inaudible 20:22] [chuckles]." Yeah, so, effective way to address that. So, we've talked about failure modes of standups, how those can involve just being status reports. They can be the meeting being too big, too long, unsafe, having the wrong things in them. We've talked now about avoiding status reports. And, Dave, you really focused on using your project management so that that is all in everybody's mind. They can just glance whether that's your Kanban board, or Jira, or wherever it is. DAVE: Right. Exactly. MIKE: So that you know that information ahead of time, so nobody even tries to make your standup about that, because why would you? We already have that information at our fingertips. One thing that I've seen done is Slack-ups, or, you know, name your messaging tool of choice. Slack is widely used in software as well as other industries. So, we'll talk about Slack, but, you know, if you're a user of something else, Microsoft Teams, for example, which we also use, that's fine. I'm referring to both. Is that a good replacement? I mean, is that really a good replacement? DAVE: Hard no. Hard no. At least for the coordination part, I say it's a hard no. We use Slack-ups here, and Mike probably has lost sleep over the number of times that I forget to do my Slack-ups. 
If I go through my Slack history, I've probably got 20 kilobytes of Mike going, "Hey, Dave, would [chuckles] turn in your Slack-up, please?" But that goes to what we were talking about though, or what I said earlier, that somebody is trying to extract reporting information and status information, and that's how you knew that my Slack-ups were getting forgotten. And the reason I was forgetting to do them was because I didn't have any coordination to get done. And ADD is like, if it's not right in front of me, it doesn't exist. So, in my opinion, I think, a Slack-up does not solve the problem of a standup. And that's why I tend to push back sometimes when we say, "Well, let's not do standup. Let's do Slack-up instead." I'm, like, no, these are completely different things, and it might be worth doing both. Because the next devolution of that argument will be, well, why don't we just use Jira instead of the Slack-up? And because that's, like, an obviously provable thing. Like, well, if your Jira board is accurate and everybody's keeping it up to date, then you don't need the Slack-up because you just go look at the board, and it'll be up to date. And [chuckles] silly anecdote, [SP] Gerardo is our product manager, I think, is the title that we're working with. And I love him because, in standup today, I'd gotten behind in my Jira reporting. And I keep a list on my laptop of the tickets that I'm working on, like, all the statuses they're in, and it literally generates my Slack-up for me. This is how I got to the point where I was able to do my Slack-ups on time because I made the computer do it. And I pulled up my Slack-up, and it didn't match Jira. And I started lining them up, and Jira was correct, and that was all Gerardo's doing. He literally just, like, one of the PRs had updated in GitHub, and he'd fired the hook, and it had gone through. That's, you know, when you've got it really working well, right? 
So, anyway, the point of that is that Jira can absolutely replace the point of a Slack-up in terms of, like, status distribution. And this is why I push people away from, please don't replace standup with Slack-up, because you'll end up in this morass of, like, well, what about Jira? You're now fighting about the best way to not solve the problem. You're not even talking about the right problem anymore because there's no coordination involved. MIKE: So, there seems to be a recurring theme here that use your project management software, and if you don't like it, then solve that problem because that's the underlying problem. DAVE: Right. MIKE: Okay. And one thing I want to make sure we don't miss, and this came up in our side chat. We haven't talked at all about frequency yet. If we're talking about the failure cases, the same awful meeting I talked about earlier, twice a day [laughter]. If you have remote teams in different time zones, you got to catch them up to speed too, right? Do it twice a day, or maybe once for every time zone you have somebody in. I'm saying the opposite, the opposite of the good thing [laughs]. This is a bad thing [laughter] I'm describing. But -- JUSTIN: That actually brings up a really good point. Like, I've managed teams that are in India, and for them, it's, like, 10 o'clock at night when they're checking in with the rest of us. But we got to have that standup because we got to make sure that they are not blocked for their next day. So, I think the time of day really depends on your time zones, things like that. And for me personally, my ideal is, like, everybody's in the same time zone. We have it in the morning, not first thing, but, like, at 9 o'clock, maybe 9:30. That's my ideal. It's a good way for people to get in, check their emails, kind of try to remember what they did yesterday, and then they can come in and do standup. Doing it at the end of the day has never been really appealing to me. 
I've done it before at the end of the day, and a lot of people are checked out already, and then they forget what they said they were going to do by the time the next day comes around. DAVE: Yeah. You've had shower time to think about it. JUSTIN: Yeah. So, I prefer it in the morning, I don't know. But I am open to other thoughts. And, again, if you're dealing with multiple time zones, you just got to do what's best for your team. MIKE: You suggested that reality, which is if you have groups in very different time zones, and I've seen this with people in Europe, people in India, Philippines [chuckles], people in Vietnam, you know, where you have very different versus the United States. You are right. That makes the standup even more important to not be a status meeting, because it's handoff time, right? You're passing the baton. And when you're passing the baton, you don't want to say, "Hey, here's what I worked on today," then the race stops. You stand there and chat for a few minutes, and it's no longer a relay race. You're not handing off the baton. Who knows what you're doing? But if you're handing off the baton and say, "You know, careful, it's slippery up there," or they're supposed to hand off the baton, and they're not there yet because they're blocked back somewhere, right, then you know something. And that's an important thing to recognize, and using that opportunity is a big deal. It's a good opportunity, a really useful opportunity to actually make that handoff and make sure that you're not doing a status meeting because that's, like, the least valuable thing you can do when somebody's showed up at 10 o'clock at night. You don't want to hear what they worked on that day. You want to hear about what you're working on today, because they're handing it off to you. DAVE: You said something a minute ago, and I think I misheard you, but I like the way I misheard it. You talked about time. You said, like, what time? 
Because Justin then jumped in with, you know, like, evening for the India team, and that sort of thing. But what I heard was how many times. And I was just imagining, like, the horror of having standup more than once a day. Or, you know, do we have it three times a week? And that sort of thing. And I have actually worked on a team that had standup twice a day, and it's because the team was extremely agile. We were all in a bullpen working together. There were six of us, and we would pair up in three pairs, and no ticket ever lasted longer than four hours in theory; sometimes they did. But, like, at lunchtime, if you weren't done with your ticket from the morning, you had to trade pair partners. And the next day, if it still wasn't done, your ticket got thrown back in the backlog as being too big, you know, too problematic. And what I'm realizing, I've got this crazy...this is just a bat-poo-crazy Dave Brady hypothesis. Show me how fast your deploy cycle is, and that is how often you need to be having standup meetings. I'm on a team right now that we meet three times a week, oh, sorry, yeah, three times a week, every other day. And what you just told me is that what you synchronized on yesterday, you don't need to synchronize about today because you're not moving fast enough to bump into each other with yesterday's coordination or with just that information. If you are changing lanes very quickly or hopping from feature to feature to feature, then you need more and more coordination, because you're a lot more volatile. You're jumping around. You're bumping into more things. So, that's my crazy theory is, from the time you go code complete to the time you go deploy, that sets a pace and a rhythm. It's not necessarily good or bad. I mean, agile says that should be very, very small, but, like, it's a reality that, like, the more enterprise your system is, the longer that's going to be. 
If you've got, like, a validation or an auditing step, or that sort of thing, or compliance, then that's going to take longer. And, I think, as far as coordination goes, that can verify the need for a standup meeting. There's just not that much need for everyone to come together and say, "Hey, I'm going to be working in this area. Who do I need to coordinate with to make sure I don't break your stuff?" So... WILL: I don't know, man. I do not agree with that in the slightest. I'm a hard, hard, hard no on that. DAVE: Awesome. Awesome. WILL: Well, deployment cycles, like, it's...I think of, like, these standups as, like, more, like, inter-process communication. I work in, like, native mobile for the moment. And native mobile deploy cycles are very slow because it's a whole song and dance you got to do with Apple, with Google rolling it out to a bunch of, like, third-party devices you don't own, all this kind of stuff. But we need to do more coordination and not less, because, like, we've got all these teams coordinating on the same app that we really don't want to screw up. You know, clawing back on mobile release is really painful. And it's a function of, like, how many cooks do you have in the kitchen? Not like, how many times you're serving the meals, you know what I mean? DAVE: Okay, so same principle, but opposite conclusion. Okay. Yeah, that's fair. That's fair. MIKE: Well, going back to the relay race analogy I was saying before, if you need to pass that baton to somebody, then that's a coordination point, right? If you're working on something largely alone for three days, there may be nothing changing there. DAVE: Yeah, that's fair. MIKE: And if nothing's changing, you don't have to pass the baton, right? The environment you talked about, where you changed tickets twice a day, well, there's a major coordination point there where it was mandated. And that really wasn't necessarily the deploy cycle per se, although it could be. It's the points of communication. 
Or in Justin's example of very different time zones, there's a real need for that coordination where there's a handoff between one group and another at the time. It seems like those coordination points, where the coordination is required, seem to be driving it. And I think that's where there's overlap between where you're headed. Will, you're saying, well, you need to have these coordination points at, you know, the communication need is what drives those points of coordination. And yeah, for your mobile app, maybe you're only releasing once a month, but you better be coordinating more often than that, or else you're going to have a horrid mess. DAVE: I withdraw my claim because you're right. I was trying to conflate the speed at which you deploy a feature. If everything is atomic and everybody has their arms in, then that is linked pretty closely to the rate at which you need to coordinate. But that is the actual driving variable is, how fast do you need to coordinate? How fast can things change? Absolutely. I agree. KYLE: I was just thinking of two use cases, and they might be more niche than the average developer. But I've worked in a situation where I was Dev QA. And what that meant was I sat with my devs, and that was my main responsibility. My secondary responsibility was to my QA team. So, we talked about, how many times do we have standup a day? I had two. I had one in the morning with my dev guys, and then I had one in the afternoon with my QA guys, both of them managed very differently, different scales. And then the other scenario that I'm looking at is kind of where I'm at right now is I'm on a team... I facilitate multiple lines of interest or lines of business. I have one line of business we're deploying 10, 15, 20 times a day. I have another line of business we're deploying once a week, you know what I mean? So, I guess, in that, like...this was more towards your comment, Dave, and we've kind of rectified it a bit now. 
But that would be very convoluted to be like, oh yeah, well, we need to do it once a week here and 20 times a day here [laughs]. DAVE: Yeah. Yeah. Well, and, actually, that's another proof that my hypothesis is wrong, that if the team that's churning every single day, if they're pushing changes into other people, everyone else has to beat that often as well, because they are causing coordination conflicts, yeah. WILL: I've got a different read on it. So, I had to leave right around the Slack-ups, right? And I've got a real serious problem about Slack-ups because my experience with Slack-ups, that Slack-ups are...I can't think of an exception to Slack-ups not being ultimately rooted in devs being busy under the gun and wanting to skip a meeting that they saw as extraneous. MIKE: True. WILL: And while I have seen inefficient and non-productive standups many times in many, you know what I mean, iterations, I have never in my life witnessed a team that was devoting too much time to keeping everybody on the same page. I think Slack-ups are foundationally not...It's not the right tool for the job if it's just, like, hey, everybody, update your tickets, right, so that everybody has visibility or whatever. Put it in the Jira ticket, throw a comment in there. You know, that's a good thing to do just in general, you know. Like, if I have this thing that's on my desk, when I close out for the day, here's what's going on. And if somebody cares, right, some PMs like, "Hey, what's the status of this thing?" they can just go look at it. And they don't actually need to bother me at all. They will, but they didn't need to [laughs]. But at least they're more informed when they bother me on Slack. I think it's devs thinking that this meeting is a waste of time. And I haven't seen it yet. Every day's a new world, but that day has not yet dawned for me. I think you can keep it tight. There's nothing wrong with keeping it tight and then breaking out. 
But even the act of just spending a minute, 60 seconds, to articulate what I'm doing and why and how is a worthy investment of time for me, even if I'm working on something in complete autonomy that I'm not going to hand in or coordinate with anybody for a week or two or a month. Just, like, doing that, I think, is a worthy exercise. It's a worthy investment. But you do need to keep it tight when people are busy. Put your camera on, and have everybody look at your face, so that when Mike says, "Yeah, it's okay," you know, but his eyes don't say that, then we have an opportunity to say, like, "There's so many subtle shades and variations of okay, you know, like, I just want to see it." And if I'm the manager, if I'm the coordinator, if I'm the PM, then just give me an opportunity. If somebody isn't necessarily as out and proud and boisterous as me...There are a lot of devs...could I blow you guys' minds? There are a lot of devs that are not excellent and expressive verbal communicators. And they could say to you, "I'm okay," when they're not okay. And if all you have is a Slack message, right, or even a cams-down, you know, meeting, and they give you, like, "I'm okay," right, things might actually not be okay. It might not be cool at all. And denying yourself the opportunity to get that feedback, you know, if I'm a people manager, if I'm trying to keep this team, like, healthy, and happy and productive, I think it's a glaring unforced error. MIKE: I talked to a teacher once about online meetings. I think Zoom is what he was using, but the tech doesn't matter. He talked about teaching a large class where nobody had their cameras on. And it was a nightmare [chuckles] because he'd say something, and it was just dead space, right, just throwing it into the void. You lose all of that nonverbal communication, and he had no idea whether what he was saying was landing at all. And it threw off his whole teaching, like, the whole rhythm was gone. He couldn't make it work. 
At that point, it's almost a mandatory lecture, where it's why not just record a video? Why do I even bother? WILL: Cams up, mics up, every meeting, every day, every time. And if you got to keep it tight, keep it tight. There's no sin in keeping standup tight, really. And I say this, like, it's full mea culpa maxima. I am the problem because I like to yap. MIKE: Well, we talked about this earlier, before you were able to join, but we talked about failure cases, and one of them we talked about was going too long. We came to the conclusion that a lot of this comes down to what happens outside the standup. And, Dave, I think, expressed this really well. Like, all the failure modes in a standup are because of something that didn't happen outside. And if you haven't done your good coordination beforehand, then you're going to have failures in there, and then you're going to have the long reporting, because nobody knows what's going on, and you have to get caught up. So, absolutely, short. I have run standups before. I've seen them sometimes work, and sometimes they don't, where you start with the blockers. So, instead of saying, "Here's what I was working on, here's what I'm going to work on, and here's what's blocking me," we start with the blockers, and if there's nothing else, you move on. You come in and say, "This is what's blocking me," and if there's nothing blocking you, you go on. Now, to Will's point, having some expression of what you're working on seems to be valuable. Just the act of speaking, you know, like the rubber ducky that you talk to on your desk to clarify your thoughts, probably does have some value on its own. So, there is something to be said for saying, "Well, this is what I'm working on," but that can go last. You can say, "These are the things that are blocking me. This is what I'm working on today. This is what I worked on yesterday. This is what I'm working on today." It changes the focus. So, you start with the most important stuff. 
"Here's the thing I need to coordinate on that I know I need to coordinate on. I don't want to waste your time. And then here's my take on where things are at." And maybe somebody's going to pick up on something from where you're at. "Oh, I need to talk about that." But you change the order, and I think that does help. WILL: I would want it in reverse order, because I know how these meetings tend to go. And, generally speaking, if you've got a bomb that's going off, if I've got...the kind of people on my team that I want on my team, everybody wants to defuse the bomb, and that's going to [inaudible 40:27] MIKE: [laughs] WILL: But, and here's the thing, right, like, there's people who...I am pro communication and pro efficiency, and so I want the green light projects. Get them off the list. Let me look at your face and spend a minute describing what you're doing, just because, you know, I want to make sure you're okay. And I want to make sure that everything is really okay, and you're not just sort of, like, you know, walking down this open elevator shaft unknowingly, right? So, I could just kind of pull you back by the collar of your shirt. That's fine. But you can get off the call, man. If everything is cool, like, I know what I'm doing; I know what I need; I don't have a blocker; I need to get back to work, then let's get these guys out of the call. And then if the world is melting down, everybody who isn't, like, actively with a bucket, you know, you could get off and, like, get back to the work, you know. Because sometimes you'll have a standup, and it'll roll right into a crisis planning meeting, and that is standard, par for the course. But everybody whose light is green, yeah, bail. Sorry. MIKE: Well, you can take, you know, if you have a conscientious host for your meeting, anytime one of those bombs does show up, you say, "Okay, we are going to talk about that." You create the meeting. You assign it to somewhere else. 
We are going to set that aside and make sure we finish. And then you get to the end, dismiss everybody who doesn't need to be there, and then you go on. I think that you have to be conscientious about that, or else you will have the failure case. WILL: Yeah. I just go with the flow, you know, my natural... I go with...I would prefer...Both require discipline, and I would prefer to just sort of, like, not do it the hard way on purpose, because people in meetings are naturally going to have a tendency to be, like, this is not relevant to my interest; I'm out, right? Like, you usually don't need to tell people [laughter], you know. KYLE: So, I had a QA manager, and it was...I think it was about the size of 12 people on the team. And I did like the way that he ran the meetings, just because it was one of those things where you would say what you accomplished or what you touched, and then what your blocker was and who you needed. And doing that, the actual standup portion generally took about 5 to 10 minutes. And then afterwards, it was allotted that you could go and communicate with who you needed to. I just thought it was an interesting way for him to manage it that way. And then if you didn't have a blocker and you didn't have anybody you needed to go talk to, you were done. You could go back to your desk and continue doing your work. And I liked that because, then a lot of the time, I could do exactly that. And I wasn't necessarily, you know, in there twiddling my thumbs, which is the most frustrating portion of a standup for me. And I just thought that aligned with kind of what Will was saying a bit, maybe not perfectly, but... MIKE: Well, that pivots a little bit. We talked about psychological safety some before. What does a good meeting host do to cultivate a good standup where it doesn't devolve into fisticuffs, but [laughs] rather [laughs], you know, that there may be heated conversations, but they're productive and not personal? WILL: Balance on all things. 
I mean, you know, do be clear about what's going on with you. Don't waste everybody's time. Do be accountable. Don't call people out. You know, it's you got to...I don't know, cams up, mics up, all the time. No muting, you muters. If your dog's barking, you know, if your kids are running around the room screaming, like, you know, as much as I can understand that you would want to suppress that, that is actually relevant to your team's performance. And your manager has both the right and the obligation to, you know, inquire as to how their distributed team is performing while you're on the clock at least – DAVE: Some days I'm really glad I don't work for you, Will [laughs]. THOMAS: Because I'm feeling very attacked right now, Will [laughs]. WILL: Yeah, no, no, no apologies. Most people who worked for me really, really liked it. But, you know, I'm also not exactly nice. DAVE: I'm sitting here going, that would be productive, so yeah. MIKE: Knowing that somebody has the dogs barking and all the kids running around once a month is relevant. You can say, "Oh, something's going on today [chuckles]," right? It's different than saying, "Wow, that's an environment that hasn't changed in a month [chuckles]." There may be some challenges there. There is some value, I think, in what you're saying to gathering that information and learning about what the baseline is and where there may need some assistance or changes. WILL: You know, this is a really wild tangent from running a standup, right? I am not a bad person, and so, as a result of this thing, right, where I have sort of, like, you know, I had these fairly rigid, dogmatic rules, like, I know things about my distributed team in other countries that I have not seen any of these people who are using distributed offshore resources. And they could not give a greasy hillbilly f**k [chuckles] about what's going on with any of these people in their actual lives. 
The level of dehumanization and, like, just no shit given that we treat distributed workers with in the IT industry is disgraceful. And so, like, yeah, man, like, I know that my buddy has a kid with terrible asthma in New Delhi. And they need to seek medical treatment for their daughter who I know, because when she's on the call, I bring her on the camera and I say, "Say hi to everybody." Or they have roving packs of street dogs that are roaming through their...up and down their block in the middle of the night in New Delhi because this person is on a call with me. And I want them to be sitting at the conference table as close to physically present as possible. And this is, yeah, okay, you know, this is, like, weird stuff that I do, right, and nobody else does. But I do that not with an eye to, like, you know, be an intrusive, you know, megalomaniacal d**k, though I am that. But this time it's about having a human interaction and a human connection. And I've seen how other people do it, and they're wrong. And it's a dystopian, screwed-up, dehumanizing thing, and everybody hates it. So, as much as I'm willing to take criticism for, like, you know, me being a little bit of a psycho, it comes from a good place. And while I will have to, like, you know, kind of be a little bit of a door kicker to make these things happen, because people hate it, the proof of the pudding is in the eating, and it works. I'm not wrong. MIKE: I heard about a dysfunctional team recently where they were all overseas. Most of the team members were overseas, and it turned out that the contract shop that was running them had explicitly told them that nobody should go on camera. Nobody was allowed to talk except for one person on the team. DAVE: Wow. MIKE: That's awful. And it was because, you know what happened? One person on the team spoke, and the communication was horrible for years on this team, because how could it possibly be good if you'd limited it that way? 
The contract shop that was doing this had...I don't know what they were thinking [chuckles], but it was a company policy. DAVE: Wow. MIKE: You think about what it means to be somebody on the team. And they didn't tell the company that they're working with, right? So, nobody knew, and nobody on the standup talked. They just thought, I just can't get a response out of these people. And so, it forced dehumanization, because you thought, wow, these people don't know what they're doing. They're not even willing to talk or show their faces. And there was no human connection made. They were just faceless. And I think they may have done it so it'd be easy to swap people out [chuckles]. But that's exactly what it is. It makes people just machines, like, fungible, irrelevant. WILL: Disposable. But that's not how the business works. MIKE: No, it's not. WILL: You can't do that. We are not bolting doors on Hondas. As much as every MBA from here to New Delhi would like it to be that way, unfortunately, no. I'm sorry. This is, like, a medieval guild, you know. That's just how it is, and I see no indications that any of that is going to change. So, like, okay, man. Like, just deal with us like human beings [laughs]. You know, I'm not wrong about this. I've tried it every which way. And if I'm being blunt, right, even the proposition that you would be able to attend a meeting, right, behind a one-way mirror [laughter], if you try to pull this off in actual physical reality, you would be a psychotic, you know. It's like, what's the one-way mirror? What's that one-way mirror for in the conference room? It's like, we don't talk about that. Really guys? Come on. MIKE: [laughs] DAVE: Every time I run a skills clinic, somebody will ask me, "Did you guys record it?" And I'm, like, "No, no, we didn't." And I'm not trying to be a jerk, but one of the reasons I don't record them is because I want you to attend. When people come to Skills Clinic, I get stuff out of it. 
If I didn't need to get anything out of Skills Clinic, I would just drop a one-hour video of me talking. I love to hear myself talk. I can, you know, film it and drop that into the thing once a week. But if you're just lurking all the time, then when you get bored and distracted, you're going to go play Minecraft. You're going to go surf Twitter. You're going to go doom scroll. That's the exact opposite of making human connection with the people that you're working with. Will, you were talking about kind of, like, the forced thing. Like, if this is the noise and the distraction that you hear, let's all experience it together. And I realize now, from a realistic human connection point, there's value in that. When I pair program with people, you're like, oh yeah, this is going to keep you from getting distracted and going on Twitter. No, it doesn't. It's just that when I'm pair programming with you, if I have an idea and it needs to be on Twitter, we both go tweet. And, like, I literally tab over, pull up Twitter. "This thing that my coworker just said is really funny," and send, and then we go back to work. And that is hugely valuable. For me, it's valuable because I didn't spend 40 minutes scrolling Twitter after I did the quick distraction. But for my partner, they got to experience, I won't tout this as virtuous, but they got the full David Brady experience. People say things that need to be on Twitter around me all the time. WILL: Well, I mean, like, you could do stuff. You could do stuff like, I don't know, I'm on a call with my offshore team, and it's really noisy. And it's like, "Hey man, what's going on?" It's, like, oh, it's this giant religious festival. They're having a giant fireworks display in the park outside. And then we can all just take a minute, after we finish our work, you could take your laptop up to your roof, and we can just sort of look at this huge Indian religious festival that is happening literally outside your door. 
And we can be on a team and have a human connection. And that is possible. DAVE: It's the opposite of balkanization. WILL: Well, and, like, if we are on a call and somebody is on their phone, right, the whole time or they're very busily typing in some other screen the whole time, and their cam's up, and their mic's up, you can see it. As somebody who is ultimately responsible for the health and well-being and care and feeding of this entire team, you have an avenue and an opportunity to check in and be like, "Hey, what's going on, man?" Because, is somebody depressed? Is somebody pissed off? Is somebody having, like, some kind of a moment? Which is absolutely going to fall at your doorstep. It's coming for you. DAVE: It's relevant, yeah. WILL: People don't work good clinically depressed. You're going to feel that, and you have an opportunity for leadership, as opposed to just being a manager, that you wouldn't get, or it would be harder to ascertain, if you were just sort of, like, a W on a Zoom call. DAVE: I have t-shirts from the best teams I've ever been on, where, ironically, it was the team that I was flying out to Ohio with to be on site with. We would go do an escape room and get a group photo. And then my team lead would print t-shirts for us to take back, you know, from this. And I cherish those because I had a lot of really good friends. Whenever there was a reorganization that divvied up our teams, it would break our hearts because we were best friends being divided back up from each other. And that's awesome, to actually care about the people that you work with so much that when you get reorged away from them that it's a tragedy. And, hopefully, both halves of that team, you know, it works like sourdough starter. You separate the two teams, and that culture then permeates both lumps of the new teams. 
I love how everything, when you get really into the meat of, like, agile and about, like, good methodology, it ends up being about people when you're done before it's over. We started out with, like, what's wrong with your standup? And Mike, you put your finger on it really well. The dehumanization, that's what's wrong with your standup. MIKE: Honestly, that's probably a great place to tie this up. WILL: Yeah [laughs]. Cut. There you go. MIKE: Exactly. DAVE: Mic drop. Yeah. MIKE: We're human beings. This is a chance to connect. Use it for that. DAVE: Yeah. Fantastic. MIKE: With that, let's end. Let's be humans. Until next time on the Acima Development Podcast.

May 13, 2026 · 56 min

Episode 97: Database Indexes

The episode of the Acima Development Podcast centers on database performance, using the concept of indexing as its foundation. Mike opens with a story about discovering Google in the early 2000s to illustrate how powerful indexing systems transformed access to information. That same principle applies to databases: indexes act as shortcuts that make retrieving data dramatically faster, especially in large datasets. The discussion emphasizes that while indexes can feel like a technical detail, they are fundamental to how modern systems function efficiently, much like search engines reshaped how people find information. Bill Coulam then dives into the technical side, explaining that indexes improve read performance but come with trade-offs, particularly slower writes because both the table and index must be updated. A key rule of thumb is that indexes are most beneficial when queries return a small subset of data, typically under about 25% of rows. The group explores how poor indexing strategies, like over-indexing or missing indexes on key relationships, can quietly degrade performance over time. Bill shares a striking real-world example where adding missing indexes reduced a process from taking 24 hours per record to processing millions in just a couple of hours, highlighting how impactful proper indexing can be. The conversation broadens into database design philosophy and performance tuning. The team discusses different index types in PostgreSQL, when to use them, and how to balance read vs. write performance depending on use cases like bulk inserts or high-frequency queries. They also touch on when relational databases fall short, such as full-text search or massive write-heavy workloads, where NoSQL or specialized systems may be better suited. 
Ultimately, the takeaway is that effective database performance comes from understanding your data, access patterns, and trade-offs, combined with ongoing maintenance and thoughtful design rather than relying on defaults or assumptions. Transcript: MIKE: Hello, and welcome to another episode of the Acima Development Podcast. I'm Mike, and I'm hosting again today. I'm going to start by introducing Bill Coulam, who's with us today. He comes to us from the data team. And he's been here before, but we're going to focus on some information that he has to share. So, he's kind of the star of the show today. Also with us, returning, we've got Eddy, Travis, Justin, Dave. Mr. Perez, great having you with us. We've got Mike Perez here with us, and Ramses. As usual, I'd like to start with something a little bit outside of our topic in order to bring it in and tie it into the outside world. And I was thinking about a story I think I've shared before. The importance of this moment early in my career keeps, like, growing as I look back to it, like, wow, that was a big deal, and I didn't realize it at the time. So, in the early 2000s, somewhere in the early 2000s, early, early 2000s, I was working for a guy [chuckles]; I'm going to say that. He had some projects, and he didn't have enough resources to do some freelance projects, and so I was doing some of his stuff. He was outsourcing his freelance work to me [laughs]. And he had a project that was in Windows, and there was something they wanted to accomplish through the API. And I started looking through the documentation, trying to use Microsoft's tools to search the documentation, and I spent hours. I looked everywhere I could [chuckles], and I couldn't find it. I came to the conclusion, maybe this doesn't even exist. And I came back to him, and he got back to me, like, 30 minutes later. He said, "You know, there's this new tool called Google, and I use that, and it's amazing. 
You should start using it because it works really well, and it led me to this documentation." Like, wow, well, I know what I'm going to use now. I'm going to use this Google thing [laughs] because that works way better than actually going through the table of contents, and the index, and the documentation, because that's really hard to search through. Those older forms of indexing were insufficient. Now, Google had this brilliant idea, you know, the founders of Google, that, okay, we'll index the internet. And even back then, that was, like, an impossible goal [chuckles]. And there were other sites that were doing it. There were indexes out there. What they would do is they'd look at the words on a website, and they would create an index based on those. And so, if you look for a word, they'd look for a website that had a lot of those words. Well, people really quickly figured out how to game that [chuckles], and, of course, they did. So, they were useless almost immediately because people would go into their meta tags, and they'd just write the same word a hundred times for something that the site was really not very applicable for. What Google did is they came up with a different sort of index, where they would index words in the links that linked back to a site, and also give extra weight if there were a lot of them, right? And so, by building a more appropriate index that suggested popularity, rather than self-determined, a self-stated importance of the page for a specific topic, they were able to come up with something way more effective. And you don't always think about indexes, you think, index? Like, I remember going to the library. It had, like, the Dewey Decimal System, which is really kind of weird and awkward and hard to find things with, but it was way better than the alternative, which would have been nothing. 
You don't usually think about indexes changing the world, but that index, that PageRank index, you know, the PageRank algorithm that they use to just create an index, that's all it is, right? Link this word, map this word to a website, so that when you're searching this word or phrase, then we can find it. It literally, like, fundamentally changed culture. It's now a verb [laughs]. Like, you Google something, even if you're using Bing for those of you out there who use Bing [laughs]... DAVE: Use Bing to Google, yeah. MIKE: Exactly. You use Bing to Google, because information now is accessible, and that is something that didn't exist before that. For all the digital natives who've grown up in this world, like, how did you find things before? Well, you didn't [laughs]. You suffered. You wandered through libraries. DAVE: We just got used to not knowing things, yep. MIKE: Exactly. That's exactly what you did. You got used to not knowing things. It changes everything when you have an effective index. And I could talk about all the times in my career when something's missing from the database, and yeah, it was the index. It's always the index. There's always a missing index somewhere. It solves all of your performance problems. And there probably is an exception, but I can't think of it [laughs]. It's always the index. That's what we're going to talk about today. We're going to talk about database performance. And we've been wanting to, you know, Bill's been preparing this and thinking about this for a while. If we're talking about database performance, indexes are going to come up over and over again. And this could seem really dry, and this is going to be a technical deep dive, right, we're going to very much going to talk about indexes. We're probably going to be focusing on PostgreSQL. But this idea of indexes is not a trivial one. It's how we operate in the modern world. Our culture, our commerce has been fundamentally transformed. 
Our ability to know things and outsource, you know, to this Library of Alexandria that we've got in our pockets all depends on indexes, and it's amazing. There's my introduction, Bill. And I wanted to lead out with some weight behind what you're going to be talking about today. BILL: I love it. That was a fantastic segue. All right. Hi, everyone. I am Bill, Bill Coulam. I've been doing this work for about 30 years now. I started as a software engineer using COBOL and mainframes, but I don't put that on my resume because I don't want anyone to ever call me back to help with that. So, I tell people I started with C and C++. I was actually one of the first users of Java back in 1995. My company that I worked for at the time, Andersen Consulting, they wanted me to go around to their clients and tell them what I thought of Java. And, at the time, I felt like it really wasn't ready for primetime, and so I kind of voted myself out of working on that platform. But that's okay because I ended up, on every project that I worked on, working with Oracle, and, at the time, Oracle was the 800-pound gorilla. And I was in the telecom industry, where we had some of the largest volumes of data in the world, and so I learned a lot of great lessons working on those big systems. It's a whole other world jumping between databases that have 10,000 to a hundred thousand rows to databases that have 500 million, a billion. Performance tests in your copy of production can take three hours. It's a completely different world. Anyway, so you learn a lot of good lessons working on data that big. I ended up sticking with Oracle for a long time. It became my bread and butter. And went from San Francisco to Denver to Houston, and then back here to Utah where I grew up. I've been here longer than I spent time in my own hometown. So, I've been here in the northern central part of Utah since 2007. Anyway, let's go ahead and jump into it. 
We're going to be talking about four areas: the fundamentals of indexing, some guiding principles, the two shared tendrils, index types that are available to us using Postgres as our source database, and some indexing dos and don'ts. Firstly, some fundamentals. An index is a shortcut to get at the data. However, because an index is a separate structure from the actual table containing the data, it requires at least two I/Os to get at the data: one to search through the index, then one to access the rows in the table. Because of this, indexing can and usually does save time when querying large tables, but it can take longer than a full table scan if the number of matching rows is greater than around 25%. That is a rule of thumb, not a hard rule. I did a bunch of testing back in 2024 on our setup here, and it was right around 25%. So, if the number of rows you anticipate matching your query being less than 25%, an index will typically make sense. Ultimately, an index is stored in a file. And updates of index columns, keep in mind, must modify and manipulate the table and the index. That's important when you start thinking about how many indexes your table has and the effect that that will have on write time. And, lastly, matching index and table results will get cached in case the same request is made later. MIKE: So, I've got a couple of questions about that. Firstly, how often do you see in...and this depends on systems, right, so maybe there is no universal answer. But how often do you see indexes harm performance? Because there's this index that we probably didn't need, but now we have to write to it every time, or somebody went in and indexed 20 columns, right? There are certainly bad use cases. Have you seen cases where there was a clear performance hit, and, you know, seeing data to show that? Is there some sort of rule of thumb where I should think, ah, well, actually maybe the database is a bad idea here? I'm also curious about those caching results. 
Do you sometimes get...in data sets that are growing really fast or something, do you end up with weird results from that caching? BILL: Let me answer the second one first. The answer is no. Phil Karlton of Netscape, may he rest in peace, he was famed for saying something like, "There are only two hard things in computer science: naming things and cache invalidation." And there was some wisecrack that added to that, where it said, "There are only two hard things in computer science: naming things, cache invalidation, off-one-by errors." But yeah, cache invalidation is tricky, but the database engine teams tend to have done that very well. So, I've never had funky results from mainstream relational database engines, so that tends to work pretty well. The answer to your first question, the quick answer to that question is no. I have not seen indexes cause immediate harm. Like, the old analogy of, you know, the frog in the pot of water that eventually gets too hot and cooks it, adding even crazy indexes, indexes with lots of multiple columns in them, and so forth, I've never seen an immediate and obvious degradation. It's even been hard to detect it when that pot is fully boiling. When a table has 30 indexes on it, and inserts are taking two milliseconds per row, generally, you don't notice it. Over time, as these indexes are added, the team that works with that data tends to believe this is the way things are. DAVE: Oh, it does that. BILL: And they don't really question: could this be three times faster if we got rid of all the unused indexes? So, yeah, to answer your question, I've never seen it immediately [inaudible 11:39] performance. MIKE: [crosstalk 11:39] three times faster. That's probably loosely data-driven, right? BILL: Yeah, that's very loose. MIKE: But you didn't say a thousand times faster. But there are very much cases where if you're missing an index, it could be a thousand or a million times faster [laughs]. BILL: Yes, mm-hmm. 
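
A toy model may help make the fundamentals Bill describes concrete: the index is a separate structure, an indexed lookup touches the index and then the table, and every insert has to maintain each index. This is a minimal Python sketch, not how PostgreSQL is implemented; the `Table` class, its counter, and the sample data are hypothetical names invented for illustration.

```python
from collections import defaultdict


class Table:
    """Toy heap table with optional secondary indexes (illustrative only)."""

    def __init__(self, indexed_columns=()):
        self.rows = []  # the "table": row id is the list position
        # One separate structure per indexed column: value -> list of row ids.
        self.indexes = {col: defaultdict(list) for col in indexed_columns}
        self.structures_written = 0  # counts table writes plus index writes

    def insert(self, row):
        self.rows.append(row)
        self.structures_written += 1  # the table itself
        row_id = len(self.rows) - 1
        for col, index in self.indexes.items():
            index[row[col]].append(row_id)
            self.structures_written += 1  # every index is maintained too
        return row_id

    def find(self, col, value):
        """Return (matching rows, number of rows examined)."""
        if col in self.indexes:
            # Step 1: probe the index. Step 2: fetch only the matching rows.
            row_ids = self.indexes[col].get(value, [])
            return [self.rows[i] for i in row_ids], len(row_ids)
        # No index on this column: full table scan.
        matches = [r for r in self.rows if r[col] == value]
        return matches, len(self.rows)


def load(table, n=1000):
    """Insert n rows; one row in every hundred is in state UT."""
    for i in range(n):
        table.insert({"id": i, "state": "UT" if i % 100 == 0 else "CA"})
    return table


plain = load(Table())
indexed = load(Table(indexed_columns=("id", "state")))

plain_result = plain.find("state", "UT")
indexed_result = indexed.find("state", "UT")
```

In this sketch the indexed lookup examines 10 rows instead of 1,000, while the same load performs three structure writes per row instead of one. That is the read-versus-write trade-off behind the 30-index example, just at toy scale.
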
And -- DAVE: I actually have seen a case of this, but I think I'm actually in agreement with Bill, where the definition of a database, right, developers we always talk about toy databases, right? But the database...and you don't think of this as a database, but the system log on Linux, it's a log file, and we think of it as a log file. But you can also think of it as an unindexed table where you want to write a row, right...Insertion has to be very, very fast, and you can't spend any time indexing. Well, if you have to write fast and you're saturating the hard drive...also, this is back in the days when a hard drive seek was 12 milliseconds, and so updating the index and the file was very painful, right? If you take that blurry view of, like, is a log file a database? It's easier to think of that when you start realizing that, well, everyone's now streaming their log files up to Splunk and Datadog, and these things that are, like, pulling their log files together. And time series databases like Grafana now exist where you're supposed to log, log, log, log, log, log, and then, over time, they start compressing the old stuff. Like, they start batching it up historically, and you start losing data. It's kind of like compacting context for an AI. So, like, 100%, I agree that, like, if you're talking to a real database, you've usually got a lot of structure, and everything's, like, really, really solid. But I have, back from the battle days when I was doing a lot of MySQL, we would have to sit down and go, is this table fast write, and we don't care about search? Or do we need fast search on this? And if so, can we pay the cost to index it? MATT: I think it's specialized, right? DAVE: Mm-hmm. Mm-hmm. MATT: There's certainly cases where you will see this. If you have one insert and that insert is inserting four and a half million rows, and I see this, it's a problem. But I think it's a more specialized case and more one-off. 
But if you have 30 indexes on columns in a table that you're inserting 4 million rows at a time, you definitely see performance degradation for sure. EDDY: So, that's kind of interesting, right? Because I think, historically, what we've done is we try to remove unused indexes, right, like when they become unnecessary. I think the rule of thumb is clean up to avoid degradation, right? But then it's kind of interesting because I think Bill's response to that is, I haven't seen that in practice slow down any inserts or writes, right? So, I'm curious, like, is it just, like, the thought process is clean up after yourself always, regardless of whether that slows down degradation? Or is it... MIKE: So, I asked the question because I think that there's so much power in indexes and, you know, it seems like that cost to write isn't that high, but I think there are other costs. Even if it was negligible, even if there was no impact at all, I think if you want to understand what's going on and you've got 30 indexes, you're in trouble. Like, I think that that cleanup matters just for, you know, we don't write in machine code because code is written for humans, right, and then compiled for the machine to understand. And I think the same thing applies here. There's a human aspect to this that unless you actually needed those, I think that you're doing active harm to the users of that, you know, to understanding or even trying to fix if there's a problem, if you've got a bunch of junk there that you don't understand. I think that regardless of whether it doesn't have much impact, I think that there is still, like, a reason to keep it clean. Well, that's my thought. What are your thoughts, Bill? BILL: You really need to know your data and know your anticipated access patterns. Matt was talking about a scenario where you want to insert 4 million rows in the telecom industry, or sometimes you'd need to bulk load 300 million rows. 
That's a different access pattern than inserting a single row. And you need to approach things differently. In some of those bulk load scenarios, it was much faster for us to drop all the existing indexes, do the load, then re-add the indexes than it would've been to leave the indexes in place and let the database engine do all the maintenance. So, everything is an it depends answer, right? You need to think through those things. But the second thing that I want to talk about, which is highly related to the indexing fundamentals I just went over...And these aren't, you know, industry standards. These are just, according to me, some guiding principles of index design. The first one we've kind of already talked about...actually, both the first and the second. That is that an index will make data access faster, but an index comes with a cost to write times. So, just like Goldilocks and the Three Bears [chuckles], you know, you can't have too many or too little. It needs to be just right. In order to make it just right, you really need to understand your application, the business requirements of that application, the data that you're working with, the quality of the content of it. You need to understand the access patterns that are anticipated on that data, queries that you already know about, or anticipate. And by having all of that context, you can design a much better database schema and indexing strategy. DAVE: One of the things that kind of blew my mind, like, 5, 10 years ago as I was getting into, like, NoSQL, and schemaless databases, and document databases, and that sort of thing, was somebody pointed out to me that SQL is actually a terrible language for reporting. It's not built for reporting. SQL is an ad hoc query language. It's how you get at your data when you have no plan to get at your data. 
And all the cool stuff, all the bells and whistles, like the query planner and indexes, are basically trying to get around the fact that you weren't prepared to do this. And if you do know what you want and you've got that report well defined, you can make it so, so slick, whether it's indexing in advance, or materializing views, or shoveling everything over to the data team and letting them stick it in gigantic vertical tables. EDDY: So, I think I've always just gone hand in hand with saying, "Oh, you want reports? You want SQL because that is the most efficient language used in any sort of database," right? Are you suggesting, or do I understand that correctly, that you're saying that that isn't its original intention? Because you're blowing my mind if that's the case. BILL: If you want something that's super efficient, it's NoSQL, because a collection is built to match an anticipated access pattern; it will blow relational out of the water. I love that David brought that up because that was the next line of my presentation. DAVE: Yes! BILL: Is that one of the greatest strengths of relational databases is its ability to handle ad hoc queries. So, you try and totally understand your system and try and anticipate the queries that will be required of it. And some you will know right off the bat because they're right there in the requirements documents, but there are still plenty of ways in which that data may be used in the future. And you just use your experience and your gut instinct to say, "Okay, we're probably going to need an index for these three columns here, because of what they're named, and what I expect, you know, the end users to want. This JSON data column, it's unlikely they're going to be indexing off of that, you know," so you make your best guesses. But yeah, a relational database is really good at ad hoc queries. And if you have done your very best to index intelligently, then it will be able to handle most of those ad hoc queries well. 
And the ones that don't will be generally immediately apparent, unless you're on a tiny system, you know, trivial system. And then you can redesign things; maybe toss an older index and redesign a new one that's composite or partial or, you know, fancy in some way that matches your needs. MIKE: I've got an anecdote related to this as well when you talk about NoSQL. BILL: Yeah [inaudible 19:27] MIKE: Some years ago in my career, I was working at a place that did content management, largely for newspapers. And you think about what an online newspaper page is; it is effectively searching the latest content. You don't usually think about that. Like, that's search? Well, yeah, it is. You want to just get the latest things. It's a feed, right, and then the oldest things drop off. That's more obvious to get something like the social media, where literally is this feed where it comes in from the top. The newspapers, even old paper newspapers, you have the latest content, and the most important stuff comes to the top, and other stuff flows down to page eight. And we organized our presentation that way. It really was doing searches. And it got slower because a lot of that relied on text. You know, you're looking for this kind of text, well, this is the weather, right? We want to look at the weather stuff. And to make our system efficient, we had to get out of relational databases. And we used a full-text index; we were using Solr at the time. It's similar to Elasticsearch or OpenSearch, all these descendants of the old Java Lucene library that allow you to efficiently build an index into your data. But it also is effectively a NoSQL database because it's searching the data. In fact, you can even cache the data in that index, and so you never even hit your database. And we would do that sometimes, where we'd never even hit our relational database at all. We just used that to store the data before we indexed it, and then it came out of that other system. And we could not run. 
We absolutely could not run off our relational database because it was way too slow; it was unworkable. We had to use, you know, that NoSQL database in order to work. BILL: Yep. And there was a telecom company I worked for in 2021 where they needed wickedly fast writes. And so, there we went with Cassandra. We weren't really worried about indexing or reading. Now that I'm clear that this is not a presentation, I'm going to possibly just fly through the middle portion of it, which is where I train the listeners on the different sorts of indexes available to us in Postgres. Maybe I'll just mention them briefly and then get to the dos and don'ts at the very end. In Postgres, we have a number of basic indexing types available to us, many of which are covered in your basic computer science courses, like B-tree and hashes. We also have some index types called GIN, which stands for Generalized Inverted Index, and GiST, which is a Generalized Search Tree. The GIN indexes can support many different user-defined indexing strategies out of the box. We typically use them when indexing columns that are arrays on a Postgres table. But it is also used to aid in the implementation of full-text search on Postgres. And it comes built in with various operators that let you do things like nearness, and contains, and stemming, and some other things that a typical B-tree doesn't allow you to do. GiST Index is fairly similar in that it allows you to build your own indexing strategies. It supports nearest neighbor searches, geography, spatial, and other very specialized types of indexing. This is the index most typically used for features within an application that have mapping features, allowing you to see how far away you are from the pizza origination point and things like that. MIKE: Well, you know, that's interesting. 
And even I think, well, that's a very specialized case, but the most common query in our biggest application is one of those geographic queries, in order to find your [crosstalk 23:26] BILL: That's in merchant portal, right? MIKE: Exactly. BILL: We're using that there. Then there's a really specialized version, which I have never used, called Space-Partitioned GiST, SP-GiST. The official documentation says it's non-overlapping. It lets you build your own indexing strategies, just like GIN and GiST. It's very flexible. It permits implementation of a wide range of different non-balanced disc-based data structures, such as quad trees, k-d trees, and radix trees. If you guys know what those are, that's awesome. I did not go through computer science, so I'm not even sure when those would be helpful. But, apparently, it is also used in geolocation-type applications. The other sorts are a little bit more used. I've only used BRIN once. BRIN stands for Block Range Indexes. These are best for columns whose values correspond with their physical order in the rows of the table, so think of, like, now-serving number at a queue or a kiosk. As that number is monotonically increasing and going to some database column, that number is very close to the value of the row that preceded it. That is a perfect column for a BRIN index. And the reason you would want to use it is it uses far less space than a typical B-tree because it uses ranges instead of individual values. Uh, let's skip over that. MIKE: Do those ever get used for, like, primary keys or anything like that, for efficiency, or not really? BILL: No [chuckles]. Most every system I've ever worked on defaults to B-tree for a numerically based primary key or surrogate key. But, in theory, it could be used for one of those. Yeah, I've only ever used it once, and, typically, the B-tree is fast enough. I don't need to eke out another hundredth of a millisecond by using a BRIN. So, typically, the default is used. MIKE: Interesting. 
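
The block-range idea Bill describes for BRIN can be sketched in a few lines: keep one (min, max) summary per block of rows instead of one entry per row, and scan only the blocks whose range could contain the value. This is a simplified Python illustration under assumed names (`build_block_ranges`, `lookup`, a block size of 4), not PostgreSQL's actual BRIN implementation.

```python
BLOCK_SIZE = 4  # rows per block; tiny so the example is easy to follow


def build_block_ranges(values, block_size=BLOCK_SIZE):
    """Summarize each block of rows as (min, max): one entry per block, not per row."""
    summaries = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        summaries.append((min(block), max(block)))
    return summaries


def lookup(values, summaries, target, block_size=BLOCK_SIZE):
    """Return (row positions equal to target, number of blocks scanned)."""
    hits, blocks_scanned = [], 0
    for block_no, (low, high) in enumerate(summaries):
        if low <= target <= high:  # only blocks whose range could hold target
            blocks_scanned += 1
            start = block_no * block_size
            for pos in range(start, min(start + block_size, len(values))):
                if values[pos] == target:
                    hits.append(pos)
    return hits, blocks_scanned


# Ticket numbers stored in the order they were issued (physically ordered).
ordered = list(range(100, 116))  # 16 rows -> 4 blocks
ordered_summary = build_block_ranges(ordered)

# Same values, but the storage order no longer tracks the value.
shuffled = [115, 100, 108, 103, 111, 101, 114, 106,
            109, 102, 113, 104, 110, 105, 112, 107]
shuffled_summary = build_block_ranges(shuffled)

ordered_hits, ordered_blocks = lookup(ordered, ordered_summary, 109)
shuffled_hits, shuffled_blocks = lookup(shuffled, shuffled_summary, 109)
```

With the ordered data, four summary entries cover sixteen rows and the lookup scans a single block. With the shuffled data every block's range contains the target, so all four blocks get scanned, which is why this kind of index only pays off when values follow the physical row order.
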
Maybe if you had some massive data set of sequential data or something. BILL: Yeah, because this happened to me in Telecom, where we were trying to eke out every millisecond we could find. This might have been one of the strategies we tried. DAVE: We had a really fun...out here in Utah, we used to have Mountain West RubyConf for about 15 years, and it was a fantastic conference that got put on. And James Golick came out. He was the CTO of FetLife. Don't Google that. It's an adult website, social media for naughty stuff. And because it's naughty stuff, it was very, very popular, and he was running, like, a terabyte of notifications through their database, like, every single day. And this was in, like, 2009, so, like, a terabyte was a lot back then. So, imagine somebody shoveling around a petabyte or, you know, half an exabyte trying to get that through. And they were using a popular document database; it's not my fight to have, so I won't say which one. They bogged down so hard. They kept backing up and backing up. They bogged down so hard that he had to physically pull the cord on the server. Like, he couldn't shell into it to stop the server. He couldn't, like, I don't know if he had a bash prompt. He couldn't get the keyboard to respond. Could not get ACPI power button, that's when you hold down the power button on the front of the case, could not get that to respond. The document database was just spooling everything; it had just backed up and backed up and spooled out. He ended up writing Friendly ORM, which is based on FriendFeed. And if you want to know how a document database works, go tear Friendly ORM apart, if you like Rails, because it's built on SQL. It runs on MySQL or runs on Postgres anything. And your data, your documents go in a table that has an ID and a blob column. And all of your indexes are tables, and every table has an index, a record ID, and whatever data you want to go look up, and it's got an index on it. 
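
The layout Dave describes (documents in an id-plus-blob table, and each index as its own small table with an ordinary SQL index on it) can be sketched as follows. This is a rough illustration only: it is not Friendly's actual schema or API, it uses Python's built-in `sqlite3` rather than MySQL, and the table and function names (`documents`, `index_users_email`, `save`, `find_by_email`) are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Documents live in one table: an id plus an opaque blob (JSON text here).
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")

# Each "index" is just another table: the value to look up plus the document id,
# with an ordinary SQL index on the lookup value.
conn.execute(
    "CREATE TABLE index_users_email (email TEXT NOT NULL, doc_id INTEGER NOT NULL)"
)
conn.execute("CREATE INDEX idx_users_email ON index_users_email (email)")


def save(doc):
    """Store the document, then maintain the index table by hand (the ORM's job)."""
    with conn:  # one transaction, so the blob and its index row stay in step
        cur = conn.execute(
            "INSERT INTO documents (body) VALUES (?)", (json.dumps(doc),)
        )
        conn.execute(
            "INSERT INTO index_users_email (email, doc_id) VALUES (?, ?)",
            (doc["email"], cur.lastrowid),
        )
    return cur.lastrowid


def find_by_email(email):
    """Look up ids in the index table, then fetch and decode the blobs."""
    rows = conn.execute(
        "SELECT d.body FROM index_users_email AS i "
        "JOIN documents AS d ON d.id = i.doc_id WHERE i.email = ?",
        (email,),
    ).fetchall()
    return [json.loads(body) for (body,) in rows]


save({"email": "ada@example.com", "name": "Ada", "tags": ["admin"]})
save({"email": "bob@example.com", "name": "Bob"})

found = find_by_email("ada@example.com")
```

The documents themselves stay schemaless (the second one has no `tags` key), and only the fields someone chose to index are searchable without decoding every blob.
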
And he just handled that in the ORM. And when you talk about writing something in anger, he ripped out that document data store that same day. Like, it was on a Thursday or a Friday, and on Monday, they were running on Friendly ORM in MySQL. It was insanely angry. So, yeah, if you want to know how NoSQL works, like, under the hood, it's fantastic because you get into it, and you go, wait, is this all there is to it? And yeah, that's all there is to it. All the stuff about, like, crawling over a database and indexing it and then searching back through, like, the problems of searching a document database when you don't have an index, it's very obvious because there's only, like, four moving parts. It's really, really cool. BILL: Fabulous anecdote. I saved the B-tree index type for last because it's the most common; it's covered in computer science courses. But I just wanted to cover just a couple of nuances in Postgres, a couple of which I had to learn the hard way a year or two into using Postgres. One of which is that the B-tree is really good for less than, greater than, less than or equal to, greater than or equal to, and equals. It can also support some other equality and range comparisons, like the LIKE operator, BETWEEN, IN, IS NULL, and IS NOT NULL. But there are a couple of operators it doesn't support out of the box, so one of it is the LIKE operator. If you do a LIKE comparison and you feed it a pattern that starts with a wildcard, it can't use that. It nullifies the use of the index for that comparison and will do a full scan on the table. In order to do that, in order for the database to be able to index the first few characters of a word, you would need to use the GIN index with the trigram ops. I think it's called an operator. Anyway, each of these index types has the basic default syntax for creating an index of that type, and then it has a whole bunch of optional things. 
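
To illustrate why a leading wildcard defeats a B-tree while a trigram index can still help, here is a small Python sketch. A sorted list stands in for the B-tree and a dictionary of three-character fragments stands in for a GIN trigram index. It is a conceptual model only; real trigram indexing (for example PostgreSQL's `pg_trgm` extension) differs in detail, and the sample names and helper functions are made up for the example.

```python
from bisect import bisect_left
from collections import defaultdict

names = ["alvarez", "anderson", "brady", "coulam", "henderson", "perez", "sanders"]

# A B-tree keeps keys in sorted order, so a sorted list is a fair stand-in.
sorted_names = sorted(names)


def prefix_search(prefix):
    """LIKE 'prefix%': jump to the first candidate, then walk while it matches."""
    start = bisect_left(sorted_names, prefix)
    out = []
    for name in sorted_names[start:]:
        if not name.startswith(prefix):
            break  # sorted order guarantees there is no later match
        out.append(name)
    return out


def trigrams(text):
    """All 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


# Inverted index: trigram -> set of names containing it.
trigram_index = defaultdict(set)
for name in names:
    for gram in trigrams(name):
        trigram_index[gram].add(name)


def contains_search(fragment):
    """LIKE '%fragment%' for fragments of 3 or more characters."""
    grams = trigrams(fragment)
    if not grams:
        raise ValueError("fragment must be at least 3 characters")
    # Intersect the posting sets to get candidates, then recheck each one.
    candidates = set.intersection(*(trigram_index.get(g, set()) for g in grams))
    return sorted(n for n in candidates if fragment in n)


prefix_matches = prefix_search("an")
contains_matches = contains_search("nder")
```

Sorted order gives the prefix search a place to start, but a pattern like `%nder%` has no starting point in that order: the matches here ("anderson", "henderson", "sanders") are scattered across the list, so only the fragment-based index avoids checking every name.
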
If you want to really know your stuff, get into the Postgres documentation and look at those options sometime. That's where you see some of the richer things, like, for the B-tree, it has some operator classes called text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops that I didn't even know existed until about three years ago. I won't go into those right now. But just know that there's a range of flavors of these indexes that you can activate by knowing what those options are and knowing when they'd be useful. So, with B-tree, there are a variety of flavors of the B-tree index. There's the one that we use the most often, which is a single-column default B-tree. I won't talk more about that. The second flavor is a multi-column one. This can be used for indexes, sometimes referred to as keys, which are composed of 2 to 32 columns. You're limited to 32. I've honestly never seen any with more than 5. This sort of multi-column index is used for queries where two or more columns are always or frequently used together in the WHERE clause. During index creation, you know, you say CREATE INDEX. You give it a name ON table_name, and then in parentheses, you list the columns that you want indexed. You list those columns in the order of selectivity. So, if you had, for example, a table of people, or employees, or citizens, which would you put first: social security number or eye color? KYLE: Low cardinality first. BILL: Yeah, yeah. So, the thing that would return the least amount of matches first would be social security number, which is unique. So, yeah, higher selectivity goes first; lesser selectivity goes towards the right. MIKE: That's an interesting one because I think that those multi-column indexes don't get used as much as they could. A lot of the big, gnarly, slow-running queries do query against several, you know, they query against a number of things. 
How much benefit do you get from using a multi-column index rather than having several columns indexed independently? BILL: A lot. The trick is knowing when you should have it. If you look at some of our queries on our tables and you run EXPLAIN ANALYZE on them, and you see in the query plan that it's going to be doing a lot of bitmap AND operations. Bitmap ANDs are combining single-column indexes together in order to arrive at the answer quicker. If it's doing a bunch of bitmap ANDs and it's doing that over and over again, it's possible that you have a very common query that should have those two or three columns put together in a multi-column index. But if that same query has, you know, 50 flavors of queries that are being thrown at it, you wouldn't want 50 multi-column indexes to match each of those queries. So, it's that balance we were talking about at the start. You have to know which of those queries are the most important, which ones are being hit a million times a day, and which ones are being hit four times a month, and plan accordingly. And that's something...one of the 15 projects I'd love to do here is optimize that. MIKE: So, you're going looking through your slow queries, you know, using whatever tool you're using. It sounds like you'd, you know, have that in your toolkit at the ready if you see a number of...if you see queries that are, like, oh wow, that's checking against four columns in this table, you should probably have an index on those if it's doing those bitmap ANDs, or bitwise ANDs, that you're talking about. BILL: Yeah. And if that query is being hit many times per day, it's a good use case for them. MIKE: You know, most of the queries that tend to run really slow are doing joins, so I'm going a bit far afield here. So, what if the data's across six different tables, but you're running it all the time? That's a slightly different case. Do you have an approach for that specific situation? BILL: Well, you first try to optimize that query. 
By the way, I have a few cardinal rules about query performance. And the first rule is asking whether or not this query should even be done. You would not believe how many times something was really, truly awful, and we asked that question: do we even need this feature, or should we even be issuing this query? And how often the answer was no. The second cardinal rule of query performance is, if it can be done in SQL, do it in SQL, instead of, you know, dragging the data out of the database and trying to replicate a database in, you know, in the middle tier. And the third cardinal rule of performance tuning has to do with the indexing that we're talking about. If your data is well-designed...well, it's making sure that the application data model has been well designed. Usually, when I had a really terrible performance problem, it was because the data model was not good. So, I covered two of the things that most commonly fix massive performance issues, and that was something that doesn't need to be done at all, and the business requirements weren't well understood. Once those things have been accounted for and your data model's good and clean, and you've made sure that everything's indexed well so the joins can be efficient, well, you've got this 6, 8, 14-table join. You've done everything you could, but it's still not fast enough. That's when you start exploring denormalization. And that typically leads a relational database person to a materialized view. In Oracle, that was really beautiful because it had a built-in facility to keep that materialized view refreshed upon commit. Postgres is just getting to that now with an extension called pg_ivm, Incremental View something or other. I think it's coming standard with 17 or 18. But, anyway, that's when you've done everything you could and dotted all your i's and crossed all your t's, and it's still not fast enough; that's when you need to look into materialized views. 
And if that doesn't work, then you're probably on the wrong database engine for your use case, for your application. MIKE: That makes sense. WILL: Generally, it goes back to, like, sort of, like, database performance, like, in general. I'm not a database engineer. I know, like, an index and a join and, like, how all this stuff works, like, under the hood. But, like, I suppose, like, the biggest query that I've got from, like, a database, like, somebody who makes databases their trade is, like, if I'm looking at a database performance dashboard, like, what am I looking at to sort of, like, diagnose performance issues? Like, how are you looking at...when you look at, like, a database and, like, how it's running, right? I know if I have a server and it's like, oh, it's using too much memory, okay, there's a problem. My queue depths are starting to, like, get really big, okay, that's a problem, right? But, like, when you are looking at, like, sort of, like, the dashboard of a database, like, what are you looking for to say, like, oh, okay, this is a problem; this isn't a problem, you know what I mean? I'm just curious, like, how do you sleuth out these performance issues? BILL: Yeah, it's not too bad. A mature, well-instrumented database engine usually comes with some facility that allows you to peer into the resources being consumed by all the queries in the system, and it'll show you front and center what the hotspot is. I mean, if the database is really hurting, it's usually pretty obvious. Sometimes when it wasn't obvious, it was due to the network and something else. But yeah, usually when you peer into a dashboard, there's a big, old bar, a big, old spike, a flame, that shows you exactly where most of the runtime is being consumed. And you're able to click into that, and it will usually tell you which query it is. Now, there, a lot of the dashboards kind of let you down in that they only give you a piece of that SQL. 
And very often, you need to see the entire SQL in order to figure out what the culprit is. Once you have the entire SQL, then you're able to run it either through EXPLAIN, which gives you an estimate of what the database would do, or, if you are able, run an EXPLAIN ANALYZE, which will show you exactly what the database is doing when it's pursuing the data. And that is where it's really critical to know both the database engine's indexing and your data, in order to determine whether the query path that the planner is showing you in the explain plan is the plan it should be using. So, you look at all these steps, and you need to know how to read it. Okay, it's doing this one first, then this one, then this one. And if you know your data and you know what it should have been starting with and what it should have been doing next, and you look at that plan and it's not doing that, then you know you have an issue. You know you're missing statistics, or you're missing an index. Or some table got accidentally blown out with 5 million rows the other day. It was a bug. And they got rid of those 5 million rows, but they forgot to reduce the high watermark. But the database still thinks it's a massive table, and so it's making the wrong join choice. That's where the expertise comes in. That's why you get paid the big bucks, is being able to combine all those things and figure out, yeah, the database is not doing the right thing here, and here's what it should be doing. And how do we get it to do that? WILL: How can you tell, like, differentiate between just a hot query that's just a hot query? Like, a lot of people want the homepage, let's say, you know, like a [inaudible 38:49] example, right? How do you say, like, oh, this is just, like, everybody wants the homepage, versus, oh, the homepage, you know, is misconfigured, right? Like, how do you tell the difference? 
BILL: The vast majority of the systems that I've built have been well normalized, and designed, and indexed, and so forth. So, when we had an issue, it was because something changed, and it was more reactive. Someone noticed an issue, they called us. We looked, oh yeah, yeah, like that scenario I just described, where a table got blown completely out of proportion and shrunk the next day, and it changed the nature of the query path. Ideally, you would have a more heuristic system that learns from what is typically running on that database so that when something's out of the ordinary, it alerts you ahead of time. I've never lived in such a world; that would be lovely. I have not seen it. They probably exist. WILL: Oh, don't worry, don't worry. If the database starts going south, we'll call you. [laughter] BILL: I might be conflating this with my previous client, but there's a tool called...there are several, but one that I've used most recently was called SolarWinds. I don't know if any of you...Kyle if... KYLE: Yeah, that's the one we use here. BILL: Okay. And I haven't been using that, or I haven't had a need to use that heavily here. But I believe it has some facilities like that to tell you the difference between one that is frequently hot and heavily used, versus one that's not been seen before and is consuming all the resources [inaudible 40:19] MIKE: You know, Kyle, I've been meaning to ask you...because we're talking about the monitoring, because you get asked those questions. People come and say, "Hey, DevOps team. Everything's on fire. What do I do?" And you're like, "I don't know your system. I'll pull up a dashboard," and you usually manage to find something [chuckles]. Like, what's your tactic, Kyle, for finding database problems? KYLE: Database problems, I usually look for high I/O, disk depth, CPU, memory, and then connections. I'd say spiking connections would tell me quite often that there is a problem. 
And then that queue depth, if that queue depth gets very large, we know we've got a gnarly query in there somewhere. And then, at that point, that's going to trigger me to go look at a tool like New Relic, or, you know, something that can do the APM analysis from the service side and tell me, like, what that query might be. And then, from there, generally, we're able to say, oh, we're missing an index here. You guys should go add this index, and that'll increase performance again. BILL: It's when Kyle and DevOps reach that point that they usually involve me. So, that's why I wasn't able to answer your question [laughs] terribly well, because I'm usually getting skipped until that point. WILL: So, if you had, like, a lock or something that was deadlocking on a database, or, like, a, you know what I mean, like, some kind of table lock, like, how would that manifest itself? KYLE: So, that'll show in your performance insights tool. I did skip over that. That's another one that we commonly look for. We go in, and we see if there's a query that's got a lock on it. MIKE: [inaudible 42:01] WILL: How does that manifest, like, a bad lock where you're stuck, versus like a good lock, where it's just, like, business as usual? You got to lock a table; that'll happen. KYLE: Yeah. Most of the time, I throw that back on the engineers. But if it's been locked for, you know, I've got a query that'll look for any locks that are over five minutes. And if it shows up in that query, I think we've got an issue. MIKE: Makes sense, long-running locks. Good lock is a short lock, yeah. BILL: There are a few preventative parameters that we could be using in Postgres that we're not, that can prevent idle transactions from hanging around too long, statements that take too long, and can log and notify when some of these things happen. It's one of the things I'm going to be talking to the engineering managers about in the near future. 
Just to finish off the theme of the B-tree indexes, there are three other flavors. One of them is covering. It's kind of an interesting name. I prefer to call them payload indexes, but Postgres calls them covering. And that is where you index a column or columns that you want to match on or to quickly narrow down your matching data. But you also include a couple or more columns that are part of the select list. You're not necessarily matching on them, but they're part of the data that you're looking for. And by doing that, you can potentially get what is called index-only. You can get index-only queries, where they don't even have to touch the table. They're able to satisfy everything that the query wanted in its WHERE clause and everything the query wanted in its SELECT clause, just from the index columns and the payload in the INCLUDE portion of the index. So, those are called covering indexes. Another flavor of B-trees are called partial indexes or conditional indexes, and that is where you get to use a WHERE clause in your index creation. And that is where you only index a row if the row matches a certain condition that you have. And that can be valuable when you have a 700 million row table and only 5 million of them match a certain criteria, and those are the only rows of interest to you anyway. So, you'd only index those 5 million rows that match that criteria, that way, you're not indexing 700 million rows, and 695 million of them are a waste. Finally, we have function-based B-tree indexes, and these are used where you know that your access pattern needs to compare the column where the column has been manipulated by a function. Like, the more commonly used example is where you want to compare a given index search term that was obtained from a field in a web app or a mobile app, looking for a matching email, and you don't want to deal with all sorts of possible email variants that the customer might have fat-fingered into the database. 
And so, you want to normalize the data. Ideally, you'd normalize it before it gets written, but let's say you didn't. And so, you want to wrap the email column with a lower function. Well, now you've just excluded yourself from using the index on the email table or the email address column, I should say, because you wrapped it in a function. But you can index the application of the lower function on the email address column, and that's called a function-based B-tree index. And there's all sorts of functions. Eddy and I were exploring the use of full text search in merchant portal, and that requires a call to the to_tsvector function. And to make that quick, you would want to create an index on the to_tsvector of the textual columns that you're full-text searching or allowing a full-text search upon. There were a couple of things I wanted to cover, some dos and don'ts, some gotchas about indexing. Again, I mentioned this in 2024, but I want to mention again to anyone who's tuning in to the podcast. The first one is that you should index each key. Now, you don't have to worry about primary keys or unique keys. If you, in your data model, are declaring a certain column or a combination of columns to be your primary key or your unique key constraint, the database will automatically create an underlying index to support that uniqueness check. Now, the foreign key constraint, you should also index by default. Some of those don't end up getting used, so we can clean those out, but they should be indexed. And the database does not index a foreign key constraint automatically. And that's why that's one of the things I'm checking for when I'm doing data architecture reviews. Just a little anecdote to go along with this. One company I went to work for in 2019, when I walked in the door, they had multiple dumpster fires in their flagship Oracle database. They had had a data architect up until 2015. 
They had been doing without one since then and had lost sight of a couple of best practices, one of them being indexing your foreign keys. It turns out that they had two primary causes for all their performance issues, and the biggest issue was the lack of foreign key indexes. You know, the system had been evolving and growing; features had been added; columns had been added. And, over time, they had added 53 columns that were child columns logically related to parent tables, and none of those 53 columns had indexes on them. And that's normally not a huge problem if you're not querying on those columns; you're not joining on those columns. But if you try to delete a row from a parent table that, through a foreign key constraint, is related to data on a child table, when you go to delete that parent row, it has to scan through, ideally in an indexed manner, all the child tables related to it, to determine whether or not it can safely delete the parent row or whether it's going to create orphans. If it's going to create orphans, it says, "I can't. There's child rows that still pertain to this parent value." Well, at this company, they were trying to adhere to the GDPR regulations because they had customers who had employees in Europe. And when those employees would leave, GDPR says, "You should be able to request that all your user data be removed." Well, because all of these foreign keys had been added without supporting indexes, their attempts to remove user data had been getting slower and slower. The last time they'd been able to run it had been two or three years before I got there, and it had taken 24 hours to remove a single user, and then they just gave up. When I got in there, we added the missing foreign key indexes, and immediately, we were able to catch up on 2.4 million user deletion requests in two hours. From 1 user taking 24 hours to 2.4 million in 2 hours. Indexes can make a huge difference. So, what else should be indexed? 
Index each column used in filters, otherwise known as WHERE clauses or predicates. Index each column used in a join. And if the join is to a multi-column key, that's when you want to index the columns together, of course. If you have a multi-column key and this particular flavor of query that you're sending at this table doesn't use the first column in that key, but it does use the second column, that's not a problem in Oracle because they have a feature called skip scanning. It can...I'm not sure how exactly they implemented it, but it can skip over that first column, and it can index on the second column in the multi-column key or multi-column index. It turns out that many users of Postgres have wanted that for many years, and it is now a native feature as of Postgres 18. So, that was some good news I wanted to share with you. We're currently on 17.4, but I imagine that 18 is not too far off for AWS. What should be avoided? Over-indexing. Back when I thought this would be a presentation, I wanted to demonstrate that we have a number of tables in our systems that have well over 25 indexes. Did I say, "Indexes"? We have a number of tables in our systems that have well over 25 indexes. And one I was looking at the other day has 31 indexes on it. And of those 31 indexes, 15 of those indexes have never been used. MIKE: So 50% basically. BILL: Yep. There's a whole lot of cleanup that we could be doing. So, that's why it's important to monitor indexes over time to make sure that you're not leaving a bunch of crufty indexes around that aren't touched. Let's see. Avoid indexing a column more than once in the leading position of indexes on the same table, and we have a lot of that going on as well. Don't index columns with very low cardinality. So, if you have a table with a hundred million rows, you wouldn't want to index the active flag column, where 50 million are Y and 50 million [inaudible 50:54]. That's not going to do you any good to index that. 
Avoid indexing mostly null columns. We talked about that when we were talking about partial indexes, where you can use a WHERE clause to avoid indexing those columns that are mostly null. And avoid indexing columns that are heavily updated; that one involves some trade-offs and understanding of your system. WILL: So, what's the drawback to, like, if I have a column, let's say, you know, like, I don't know, date of birth or something, right? And I don't want to have an index off of date of birth twice. So, I don't want to have an index, like, date of birth and, like, zip code, and then also date of birth and phone number area code, I don't know, whatever, you know what I mean? Like, I don't want to have that. If I understood what you're saying, like, correctly, I don't want to do that. I don't want to have date of birth in X, and then date of birth in Y, and then date of birth in Z. What's the issue, or what's the correct way to approach that, right? Because I could think of scenarios where that'd be relevant. BILL: Yeah. So, let's just use some aliases for some columns in a table. If you looked at your indexes and you see an index on A, another index on A comma B, and another index on A comma B comma C, you would want to get rid of the first two and keep the third one because that satisfies all three. If you instead had looked at your indexes and you had index on A, index on B C A, index on D F G A [chuckles], that's where you really need to understand your system, your queries, which queries you use most frequently. Do you go ahead and, you know, allow all of them? I don't have any really astute advice there other than do your homework and understand that, you know, if A is being used as the leading column in 3, 4, 5 indexes, it's very likely that a few of them can be eliminated. Sometimes though, it can't, you know, like, in that one example, you're seeing it is B A C or C B A. You may need to keep all of those around to satisfy some query-specific indexes. 
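
A minimal sketch of the point about an index on A, A B, and A B C, written against Python's built-in sqlite3 module rather than Postgres. That swap is an assumption made so the example runs anywhere; the two planners differ, so treat the output as illustrative only. The table `t`, its columns, and the index name `idx_t_a_b_c` are hypothetical.

```python
import sqlite3


def plan(conn, sql):
    """Return the human-readable detail strings from EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]  # column 3 holds the plan text


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, "
    "c INTEGER, payload TEXT)"
)

# One three-column index, with no separate indexes on (a) or (a, b).
conn.execute("CREATE INDEX idx_t_a_b_c ON t (a, b, c)")

# Queries that filter on a leading prefix of the index can use it.
plan_a = plan(conn, "SELECT payload FROM t WHERE a = 1")
plan_ab = plan(conn, "SELECT payload FROM t WHERE a = 1 AND b = 2")

# A query that skips the leading column cannot seek into it here.
plan_c = plan(conn, "SELECT payload FROM t WHERE c = 3")

for label, p in (("a", plan_a), ("a, b", plan_ab), ("c", plan_c)):
    print(label, "->", p[0])
```

In SQLite the first two plans should report a SEARCH using `idx_t_a_b_c`, while the last falls back to a SCAN of the table; the exact wording varies by SQLite version. That is the reason separate indexes on A and on A B add little once A B C exists.
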
In our systems, we do have a lot of instances where we have an index on A, an index on A B, and an index on A B C, and those first two can be eliminated. We've got a lot of instances of that. A common mistake, and one that I frequently make as well, even when I'm doing the reviews, even though it's the...I think it's the second or third bullet point in the checklist. It says to make sure that your table has a natural key on it, which is a unique key constraint, unless duplicates are expected and welcomed. And even when I'm doing reviews, even though I wrote that list, even though I try to live by it, I still forget. When I'm looking at table designs, if I see a primary key, my mind says, yep, it's good. And I tend to forget to put a natural key on it to make sure that duplicates can't accidentally slip in there. So, that was something I wanted to get across. Another little tip in Postgres is to make sure you're using the keyword CONCURRENTLY on large index creation and rebuild, so that we aren't locking things up. And make sure that you test before and after index creation to make sure that you're getting your intended results. And that is the end of what I wanted to say. KYLE: So, my question...I feel like a lot of this, of course, comes from the viewpoint of a software engineer, right? And we've kind of discussed, you know, generally, indexes are good, with, you know, more wins than losses. But I'm also very aware, from a software engineer's standpoint, infrastructure is free. So, where I care about the non-existence of free infrastructure, at what point would somebody on the infrastructure team start getting nervous or start questioning the amount of indexes that we're adding? Because I assume this isn't going to be free. This is going to impact CPU, memory, I/O. And then the one that I'm thinking about the most, correct me if I'm wrong, but this will elongate the time that, like, a vacuum will run, right? 
And that's always a hidden cost under the hood when an auto vacuum kicks off during a querying issue. BILL: Yeah, unless you're going nuts with indexes like we are with some of our tables...Because, honestly, the most indexes I'd ever seen on any table before I came here was 17, and I thought that was crazy. And we have a number here that have 29, 30, 31. So, unless you're going nuts with index creation, you're generally not going to see a big drawback. The exceptions to that is when you start to get to massive scale, billion-row tables, lots of indexes on it. Now we've got to do a cleanup. For some reason, we need to do a VACUUM FULL, or we need to do a pg_repack. In both of those cases, it has to create a copy of that table and all of its indexes before it swaps them at the last second. And so, whatever space that that massive object is occupying, let's say the table is occupying two terabytes, you now need to have double that space in order to make that operation even work. That's where...massive scale is where things start to really show up and matter in cost. KYLE: Okay, so at large, large databases is when you're saying is when it'd become a problem, okay. [inaudible 56:26] BILL: I typically don't notice the blips until the table and its indexes are occupying more than, say, 200 gigabytes. That's when I start noticing. That's when I start feeling the pea underneath the mattresses. MIKE: I appreciate all the deep dive, you know, and the feedback, you know. You came prepared with this list of things, and we've been grilling you on specific use cases that we get down into the gritty details. I mean, there's going to be more, right? We could go on forever. But is it mostly just about following the rules that you've mentioned, and then you cover almost everything, and then the weird cases, well, they're going to be weird? BILL: For a relational database engine, yeah, I think I've covered most of the tips and tricks. 
So, if one can get good at the things I've talked about today, I think you can call yourself a full stack developer [laughter]. MIKE: People will call themselves a full stack developer. [laughter] BILL: The reason that I kind of chuckle at that is because since about 2010, most of the students that I've seen applying for positions that I've been hiring for have maybe done a hundred thousand row table on Mongo in a course in college, and they're calling themselves a full stack developer. I think they need to be hardened by database scars before they can call themselves a full stack developer. WILL: I think you should be able to build a mobile app, full stack developers. I see you all on your phones. MIKE: [laughs] WILL: Nobody knows anything about it though [laughter]. MIKE: Yeah. Then you've got to build the frontend and the backend. BILL: Well, thanks for having me on your podcast. MIKE: Yeah, thank you, Bill. I really appreciate it. You know, I started by talking about the importance of indexes and how they transform things before we, you know, transform our modern world, before we did the deep dive. Maybe I'll come back to that as we sign off. We got deep into technical details, and it's easy to think, oh yeah, you know, I'll worry about that sometime. But as Bill said, you know, you pay attention to these things. You go through your checklist, and then you don't have a table where you can't delete rows from it for years [laughs] because it's not possible. It's like hygiene and conscientiousness. It's brushing your teeth, and if you do that, your teeth are healthy. You end up having a much better life and much fewer calls at 3:00 a.m. Thank you, and until next time on the Acima Development Podcast.
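
A small sketch of two of the B-tree flavors Bill described, the function-based index on a lowercased email and the partial index with a WHERE clause. It uses Python's built-in sqlite3 module as a stand-in for Postgres (an assumption, chosen so the example is self-contained); SQLite has no INCLUDE clause, so covering indexes are not shown. The `customers` table, its data, and the index names are hypothetical.

```python
import sqlite3


def plan(conn, sql):
    """Return the human-readable detail strings from EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]  # column 3 holds the plan text


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, "
    "status TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO customers (email, status, name) VALUES (?, ?, ?)",
    [("Ann@Example.com", "active", "Ann"), ("bob@example.com", "closed", "Bob")],
)

# Wrapping the column in lower() means a plain index on email would not match.
lookup = "SELECT name FROM customers WHERE lower(email) = 'ann@example.com'"
before = plan(conn, lookup)

# Function-based (expression) index: index the result of lower(email).
conn.execute("CREATE INDEX idx_customers_lower_email ON customers (lower(email))")
after = plan(conn, lookup)
rows = conn.execute(lookup).fetchall()

# Partial index: only rows matching the WHERE condition are indexed.
conn.execute(
    "CREATE INDEX idx_customers_active_name ON customers (name) "
    "WHERE status = 'active'"
)
partial = plan(
    conn, "SELECT id FROM customers WHERE status = 'active' AND name = 'Ann'"
)

print("before:", before[0])
print("after: ", after[0])
print("partial:", partial[0])
```

The plan should change from a full SCAN to a SEARCH on `idx_customers_lower_email` once the expression index exists, and the last query should pick up the partial index because its filter repeats the index's own condition. In Postgres the same check would be done with EXPLAIN or EXPLAIN ANALYZE, as discussed in the episode.
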

Apr 29, 2026 · 59 min
