Acima Development

Podcast by Mike Challis

English

Technology & Science

More Acima Development

At Acima, we have a large software development team. We wanted to be able to share with the community things we have learned about the development process. We'll share some tech specifics (we do Ruby, Kotlin, JavaScript, and Haskell), but also talk a lot about mentoring, communication, hiring, planning, and the other things that make up a lot of the software development process but don't always get talked about enough.

All episodes

98 episodes

Episode 98: Standups

This episode of the Acima Development Podcast starts with a discussion about the frustration of U.S. tax filing and uses it as a metaphor for poorly run standup meetings in software development. The hosts argue that many teams repeat painful, unnecessary processes simply because “that’s how it’s always been done.” From there, they unpack the most common standup failures: meetings turning into status reports, running too long, involving too many people, or becoming impromptu debugging sessions where only a few participants are engaged while everyone else checks out mentally. The panel emphasizes that these problems are usually symptoms of poor communication and coordination happening outside the standup itself. A major theme throughout the conversation is that standups should focus on coordination rather than status reporting. Dave Brady argues that if teams properly maintain tools like Jira or Kanban boards, everyone should already know the project status before the meeting begins. The standup’s real purpose is identifying blockers, avoiding collisions between teammates’ work, and quickly coordinating handoffs. The hosts debate alternatives like “Slack-ups” and asynchronous updates, with some arguing they fail to replace the human interaction and spontaneous coordination that happens in live meetings. They also discuss ideal team size, meeting frequency, time zones, and how distributed teams create additional coordination challenges, especially when work is handed off between regions. As the conversation evolves, the podcast becomes less about standup mechanics and more about human connection in remote work. Will strongly advocates for cameras and microphones being on during meetings, arguing that face-to-face interaction helps managers recognize burnout, disengagement, or personal struggles that text updates can easily hide. 
The hosts criticize workplace cultures that dehumanize remote or offshore workers by treating them as interchangeable resources rather than teammates. By the end, the group concludes that the biggest failure in bad standups is not inefficiency alone, but the loss of genuine human connection. Good standups, they argue, are ultimately about building trust, communication, and healthy relationships within a team, not simply exchanging status updates.

Transcript:

MIKE: Hello, and welcome to another episode of the Acima Development Podcast. I'm Mike, and I am hosting again today. With me, I've got a great panel. I've got Thomas Wilcox, Dave Brady, Justin Ellis, Eddy Lopez, and Kyle Archer (I think we're all returning crew here) [chuckles] to talk about our topic today. So, you're probably not listening to this, like, exactly when we're recording it. You're probably not even listening to it right when it comes out. There's always a recording period and then a publishing, you know, a week later, or a few weeks later, after it's gone through editing. And we have a bit of a queue in case we miss some time. It all works out. But we are recording this in tax week in the U.S. This was the week that taxes were due, and everybody has hopefully completed their annual suffering and has submitted those numbers to the IRS. I read about this before, and I read about it again this week. Articles are often published this time saying, "Why do we do this?" Well, it's a good question, because the United States is actually fairly unique in the world in that we have to submit all these taxes every year. In many countries, most people don't have to do anything at all because if you're working for an employer, they've been submitting tax information to the government all year, right? They've been paying your taxes, and as long as you don't have anything funny going on, that's enough.
The government knows about you, you know, they probably know how many dependents you have, you know, you've reported that. I mean, you reported it with your business. The information's there. And in much of the world, people just receive a letter saying, like, "Yeah, thank you. Everything's good." And they receive, you know, there's no refund or non-refund because it just works, right? They don't have to do anything. The cycle that we go through of pain every year doesn't need to happen. Now, the reasons for that have to do with...Well, I want to be careful here. Our purpose here is not to criticize large corporations who lobby heavily [chuckles] to keep the tax code as it is, well, to keep the tax submission process as it is. But such is our life, right? But where I was going with this is that we go through all the suffering because it seems normal, and everybody we know around us do it because it seems normal to go through all of this process of reporting something we've already been reporting with every paycheck for the entire year. It's just a rehash that we have to do in excruciating detail because that's how it's always been. But there are examples of people who do it differently, and they don't go through the same pain that we do. And I imagine that in their blissful lives, they have extra time around this season to do things other than pay their taxes or, you know [inaudible 03:14] DAVE: Must be nice. MIKE: It must be. Why do I talk about this? Most of you, if you're an engineer, have probably been in a lot of standups, which is a sometimes daily, sometimes weekly, some regular interval typically meeting where you have a chance to touch base and connect with other people on your team. And they can range from actually pretty good to something far from that [chuckles] to something that makes you want to quit your job right [chuckles]? Like, well, not another standup. 
The idea, you know, comes from this agile process where...and I think it's not even just engineering. You get together in a room. You want the meeting to be so short that nobody sits down, right? You go through the key things to make sure that everybody can touch base. Now, we have all kinds of communication channels, right? We've got, you know, our messaging platforms that we use. We've got the ability to go and walk over to people. There's lots of ways to communicate, but we decided that we're going to pay this cost of bringing a whole team. And this can happen at lots of levels. You can have executives getting together for a standup. You can have the team that reports to the executives getting together for a standup. So, you have a bunch of people, and that's an expensive meeting, right? Imagine the executives getting together for a standup. I don't know how many dollars that costs, right? I'd have to do the math, but it's not few. It's an expensive meeting where people have chosen to do that because they think the coordination is so important. But it can be done right, and it can be done wrong. It can be a yearly suffering, a period of suffering, like the taxes, that reports stuff that's already been known. Or maybe it's a meeting that ends quickly and touches on key information that not everybody knew because it was late-breaking, and it was a good opportunity to share. We're just going to talk about standups. It's something that we all live with, so it's worth talking about. So, I'm going to ask (I've given the intro), what have you seen? Well, actually, let me start. Let's start with the bad side. What is it that makes a bad standup? DAVE: Turning into a status meeting, for me. The thing that makes a standup go bad...and I will reveal the point that I wanted to make in today's podcast right out of the gate.
The thing that makes a standup go bad is when you are not taking care of the things that you need to take care of outside of standup, and so they have to get taken care of in standup. When your standup runs really, really long and turns into a gigantic status meeting, it's because you're not communicating status outside of the meeting. And, actually, I don't need to put any more on that point. That's just like, if you don't take care of it elsewhere, it's going to hit here. When I look at a standup that's running long, I don't look at it as, like, this meeting is bad, I mean, it kind of is. But I look at it as, okay, what is the unmet need that is screaming at us so loudly that it's cratering our standup meetings? That is frequently a very helpful thing. If it's a status meeting, you maybe need to, you know, do better in, you know, one of your other practices. If you're arguing about cleaning up code, maybe your retro needs to be better. Yeah, that kind of stuff. MIKE: Okay. So, you said there are some other venues where this should be happening, that the status reporting should not happen in standup. I've been in a lot of standups that were about status reporting, so, you know, you're bringing up a common failure case. If that's the bad case...well, I want to come back to [inaudible 06:46] JUSTIN: There's more bad cases. We got more [crosstalk 06:50] DAVE: We should go through the counter good case to the status -- MIKE: Yeah. So, let's go to the other bad cases, but let's put a pin in that one because you said, status report: bad [inaudible 06:59]. So, what are the other bad cases? What are other bad cases around standup? JUSTIN: When they run long, and there's not a good reason for it. I mean, basically, when you go back to your summary, you talked about how everybody is standing up, and they don't want to, like, sit down, and you just want to quickly go through things and be done.
If it's going more than 15 minutes, or it's going more than 20 minutes, whatever you have allocated, and it shouldn't be more than 20 minutes probably, that means everybody's looking at their watch. They're wondering about what other meetings they have to go to. They aren't focused. And you, all of a sudden, the only person who is paying attention is the person you're talking to directly, and everybody else's mind is just like, pshhh [laughter]. DAVE: And you're 100% guaranteed at that point...if your meeting's running that long because somebody says, "Well, I've got this problem," and then everybody dives in, too, you're now doing mob programming in your status meeting. Everybody's trying to debug it. You're no longer talking about what I did yesterday, what I did today, or what I'm doing today, and what are my blockers, right? You've definitely departed the format into something else. KYLE: Well, and it's mob programming at best, right? Because a lot of the time, what I see -- DAVE: At best. KYLE: Is it's one or two people programming. DAVE: Yeah, and everyone else is disengaged. KYLE: And then the other eight are just kind of sitting around twiddling their thumbs. MIKE: [laughs] DAVE: Mm-hmm. Mm-hmm. JUSTIN: Yeah. That actually brings up the other part to this, is, like, if there are too many people in your status meeting, sorry, in your standup. Personally, I think four, maybe five, is the absolute max you should have in your standup. You have any more and, all of a sudden, you run into that same problem. It's, like, you know, one person is talking, and everybody else is looking elsewhere. EDDY: Okay, but how do you manage that when you have a team of 10? JUSTIN: You have two standups. EDDY: But then don't you deviate from, like, status reports, in a sense? Like, isn't it important to also -- JUSTIN: What was it? Amazon?
No, no it's a good point, and it's becoming really hard these days where, you know, you have the flattened hierarchy, right, where you have a lot of people reporting up to a single manager. But I think it was Amazon or somebody that said, "Hey, you shouldn't have a team that's larger than you can feed with one large pizza." If you are having status meetings with larger groups, it's not as effective. MIKE: And you can do it hierarchically, that is, you have your team of five people do a standup, and one delegate from that person, whoever's leading that meeting, themselves goes to a standup [laughter]. JUSTIN: Eddy just typed in the chat, "I could eat a whole pizza." Eddy, you are a team of one. You are very effective [laughter]. MIKE: I was actually thinking the same thing. Not today, but back in my heyday of eating, you know, like, late teens, I could put down two [laughter]. JUSTIN: Sorry, I derailed that but [laughter]. DAVE: The other thing that kills a standup meeting, and this is the one that if your workplace has the fun police guy in it, it's when standup turns into a BS session, when it turns into a water cooler type thing. And I stand by my earlier point that that's an unmet need. You've got a team that is not being properly socialized. When I worked at Cover My Meds, they had a really great policy that if you were remote, you had to fly into the head office every quarter for a week and just spend a week rubbing elbows with your teammates. We talked about this when we were talking about radical candor, that you basically had to make friends with your coworkers and get to know them. And we spent a whole week just playing card games and, you know, goofing around, and we'd go work, that sort of thing. But we overinvested in socializing and goofing off time, so that when we broke up and went back, the socialization now was just, like, a quick touch base of, like, hey, how are you doing, or how are your llamas? 
That was a real question from a real coworker, for a real coworker who really had llamas. You know, how's this going, or how's, you know, that side thing going? And if you don't have that investment in the socialization, it will come out at standup because humans are gregarious creatures. MIKE: So, what other failure cases do you see with standups? What about when there's a lack of psychological safety? DAVE: Hmm. They tend to run pretty quick. MIKE: They do [laughs]. DAVE: I worked on this. I'm going to work on this. I have no blockers. That's my report. Yep. Yep. MIKE: Every time. They run quick and accomplish nothing. DAVE: Nothing. Yep. The thing that I thought was interesting as I dug into...I dug a little bit into standup, like, history today before coming on the show. And I thought...it was kind of interesting because the three questions, like, what I did yesterday, and what I'm doing today, and what are my blockers, is not necessarily actually the point of the meeting. It's actually the scaffolding or a ceremony to draw people in. But the point of a standup is not status. The point of a standup is coordination. It's to make sure that you're not stepping on somebody else, or that this feature's going to be in play before my feature needs it, that sort of thing. And so, standup is arguably going well when somebody says something and then three people start to argue with them, you know, "What about this?" that kind of thing, as long as you don't spend 20 minutes, you know, diving into that. But as long as the pushback is, "Wait, wait, wait, my piece is up the pipeline for you, and it's not going to be in until..." you know, that sort of thing, that kind of discussion, that's coordination, and that's the point of a standup. And that's something you can't get...we'll probably talk about Slack-ups and Slack-based standups and that sort of thing before we talk about that today. 
But that kind of coordination is pretty hard to do in just, like, an RSS feed, where you just...here's what I worked on, here's what I worked on, here's what I worked on. And, unfortunately, in most waterfall-based or enterprise timekeeping systems, we just want to know what budget code to put your time against, and so we're not interested in coordination at all. So, the business is trying to extract...they're trying to extract your status from that meeting, which is a terrible countervailing force. It pushes the meeting into a status meeting. MIKE: Anybody else want to jump into the failure cases? DAVE: Yeah, I've got a sore throat, guys. I need you guys to take over the show [laughter]. MIKE: Well, we've covered some good stuff here. So [chuckles], there's nothing wrong with our current list. We've talked about just a status meeting, too big, too long, safe. Go ahead. JUSTIN: Not prepared, and by that I mean the best status meetings I've seen, or the best standup meetings, sorry, I've seen are ones that have been led by somebody who basically knows everything that's going to be talked about. And that goes back to, you know, communication by other channels and things like that. But, you know, if the leader goes in there and he's got a checklist of things that he needs to find out and he doesn't know clarity on all these items, I don't know if he's going to be able to find all the answers that he wants during standup and have it be as short as he needs to be. MIKE: That's great. And if we put all these together, imagine going to a standup where the leader's not prepared, has no idea what's going on, is going to likely mistreat the people on the team, so they don't want to go into any depth, but are mandated to share a long status. And so, that's what happens. You stand there within a large meeting, for hours, hearing everybody give a status that they could have reported. You know, basically, they're just reading out what happened in Jira. 
Does that basically cover it? DAVE: Mm-hmm. MIKE: Nightmare fuel [laughs]? DAVE: Or Tuesday [laughter]. MIKE: Yeah. I worked with somebody who had recently been promoted to management, and he called Tuesday poosday because he had so many meetings [laughs] from having these sorts of experiences. Okay. So, we've identified a set of problems, and we're engineering folk. What do we do to address these problems? And maybe to start, go back to the beginning. If a status report is the most common failure case, and how these often fall into, you know, how...I say...I'm not sure my preposition works there [laughs]. Standups often collapse into just a status meeting, instead of being something effective. Well, and we talked about how they can be useful, right? There are means to communicate information that's not being communicated elsewhere, to quickly resolve problems, make sure nobody's blocked, and take things elsewhere. It's not where you do the major problem-solving. It's where you set up the later coordination to address problems. Dave, you said you have lots of thoughts on how to address these things. And you say that it becomes a status meeting because that's an unmet need elsewhere, you know, it could have been done elsewhere. So, where should it be done? DAVE: That's a good question. Anywhere else. It should be done anywhere is actually a fair point. Standup is just the least good place for it to happen. In my career, we all have a love-hate relationship with Jira, and I definitely love to hate on Jira. But the best teams I've ever been on, for managing process-wise anyway, we could go look right at the board, and we could all tell where we were as a team. We all knew how this thing was. We all knew what feature we were working on. We knew what the customer was going to receive when we delivered it. So, we had kind of that high-level... I realize this almost sounds like I'm not answering the question, but I really am.
We had this higher-level visibility that, like, I'm not just writing lines of code here. I'm actually...I'm shipping this feature, which is part of this, you know, this larger, you know, thrust that we're trying to get out to the customer in this next round of deploys. And when everybody has that status of this is what I'm getting at or this is what I'm headed towards, now any tasks that you pick up are focused towards this, and anything that you're working on is either in line with it or isn't. This feels a little nebulous, but if you can see where the team is at and where you are at, and you know what you're working at, you don't need a status meeting. And if you've got that on a big board somewhere, or if you've got it on, like, a Kanban board, if you've got it up on a wall, if you've got post-it notes anywhere, or if you've got, you know, CRC cards, it doesn't matter. It can be a burndown chart. It can be a burnup chart. That's actually the same thing, just upside down, however you do it. But the key thing is, do you know what your teammates are working on, and do your teammates know what you are working on? A lot of that gets bled away in pair programming because you're swapping pairs, especially if you're doing promiscuous pairing where you swap partners every day. Because I pair with you for the day, the next day I know what you worked on yesterday because I worked on it with you. And so, that part of the status communication goes away. It slowly weaves its way through the team, one partnership pairing at a time. So, yeah, I'm going to answer your question with your own question, which is, you know, where do we take care of those things? Anywhere and everywhere that we can take care of them. We just need to be intentional about what the need is. And I think that's what kills us in standup, is that we go in just assuming that, well, I'm here because it's 9:30 in the morning, and that's when we do standup meeting. 
And you're cargo culting the ceremony at that point, right? It's like, I'm going to go to this meeting. I'm going to do my three questions. And if you've got a good scrum master, then when somebody asks you a question, the scrum master will say, "Okay, stop. Kick that out to after the meeting." And that's how you keep your meeting short, just by punting that out. But that's all just ceremony. The whole point of the ceremony is to get people coordinating so everybody knows where we're all at together. MIKE: Well, I heard in what you said, if you're using your project management software, and it might not even be software, it could be your project management process that you handle on a board. Either way, it's your project management process. Then you remove the need for a status report because you're using your system to do it. So, if you're not maintaining Jira hygiene, if Jira is what you're using, if you're not keeping that up to date, the tool that your company is paying for and is intending to use for that, then you're going to be forced to do it somewhere else, which is worse. Is that a fair summary? DAVE: Yeah. And as you described that, I just realized there's another failure mode of standups, which is dissemination of knowledge, which is normally taken care of in your pairing. But this has certainly happened to me even here, where I will say, "Hey, I'm going to work on this piece, but I'm not sure where to attach into it." And someone else in the standup meeting will go, "Oh, well, you're going to have to grab this service class, and then plug it in with this thing over in the utilities directory." "Oh, okay. So, if I do, you know, can I mock that out this way?" And, all of a sudden, it becomes a technical meeting, right? And what's really happening is I'm pairing with another programmer. I'm just wasting everybody else's time while I do it. 
MIKE: So, failure is when maybe good things happen, but they happen with everybody else as spectators and forced spectators where they don't want to be there. That's not the movie they paid for. DAVE: Yeah. What it is, is it's the least efficient way to accomplish the necessary thing. It's not necessarily bad; it's just a terribly inefficient way to do it. We'd rather you just go pair off with one of the other people on the team and, you know, knock this out. But if you're not going to do it that way, it has to get done somewhere. So, standup's the next time you're going to see each other. MIKE: Just say, "You two, go work that out [inaudible 20:22] [chuckles]." Yeah, so, effective way to address that. So, we've talked about failure modes of standups, how those can involve just being status reports. They can be the meeting being too big, too long, unsafe, having the wrong things in them. We've talked now about avoiding status reports. And, Dave, you really focused on using your project management so that that is all in everybody's mind. They can just glance whether that's your Kanban board, or Jira, or wherever it is. DAVE: Right. Exactly. MIKE: So that you know that information ahead of time, so nobody even tries to make your standup about that, because why would you? We already have that information at our fingertips. One thing that I've seen done is Slack-ups, or, you know, name your messaging tool of choice. Slack is widely used in software as well as other industries. So, we'll talk about Slack, but, you know, if you're a user of something else, Microsoft Teams, for example, which we also use, that's fine. I'm referring to both. Is that a good replacement? I mean, is that really a good replacement? DAVE: Hard no. Hard no. At least for the coordination part, I say it's a hard no. We use Slack-ups here, and Mike probably has lost sleep over the number of times that I forget to do my Slack-ups. 
If I go through my Slack history, I've probably got 20 kilobytes of Mike going, "Hey, Dave, would [chuckles] turn in your Slack-up, please?" But that goes to what we were talking about though, or what I said earlier, that somebody is trying to extract reporting information and status information, and that's how you knew that my Slack-ups were getting forgotten. And the reason I was forgetting to do them was because I didn't have any coordination to get done. And ADD is like, if it's not right in front of me, it doesn't exist. So, in my opinion, I think, a Slack-up does not solve the problem of a standup. And that's why I tend to push back sometimes when we say, "Well, let's not do standup. Let's do Slack-up instead." I'm, like, no, these are completely different things, and it might be worth doing both. Because the next devolution of that argument will be, well, why don't we just use Jira instead of the Slack-up? And because that's, like, an obviously provable thing. Like, well, if your Jira board is accurate and everybody's keeping it up to date, then you don't need the Slack-up because you just go look at the board, and it'll be up to date. And [chuckles] silly anecdote, [SP] Gerardo is our product manager, I think, is the title that we're working with. And I love him because, in standup today, I'd gotten behind in my Jira reporting. And I keep a list on my laptop of the tickets that I'm working on, like, all the statuses they're in, and it literally generates my Slack-up for me. This is how I got to the point where I was able to do my Slack-ups on time because I made the computer do it. And I pulled up my Slack-up, and it didn't match Jira. And I started lining them up, and Jira was correct, and that was all Gerardo's doing. He literally just, like, one of the PRs had updated in GitHub, and he'd fired the hook, and it had gone through. That's, you know, when you've got it really working well, right? 
So, anyway, the point of that is that Jira can absolutely replace the point of a Slack-up in terms of, like, status distribution. And this is why I push people away from, please don't replace standup with Slack-up, because you'll end up in this morass of, like, well, what about Jira? You're now fighting about the best way to not solve the problem. You're not even talking about the right problem anymore because there's no coordination involved. MIKE: So, there seems to be a recurring theme here that use your project management software, and if you don't like it, then solve that problem because that's the underlying problem. DAVE: Right. MIKE: Okay. And one thing I want to make sure we don't miss, and this came up in our side chat. We haven't talked at all about frequency yet. If we're talking about the failure cases, the same awful meeting I talked about earlier, twice a day [laughter]. If you have remote teams in different time zones, you got to catch them up to speed too, right? Do it twice a day, or maybe once for every time zone you have somebody in. I'm saying the opposite, the opposite of the good thing [laughs]. This is a bad thing [laughter] I'm describing. But -- JUSTIN: That actually brings up a really good point. Like, I've managed teams that are in India, and for them, it's, like, 10 o'clock at night when they're checking in with the rest of us. But we got to have that standup because we got to make sure that they are not blocked for their next day. So, I think the time of day really depends on your time zones, things like that. And for me personally, my ideal is, like, everybody's in the same time zone. We have it in the morning, not first thing, but, like, at 9 o'clock, maybe 9:30. That's my ideal. It's a good way for people to get in, check their emails, kind of try to remember what they did yesterday, and then they can come in and do standup. Doing it at the end of the day has never been really appealing to me. 
I've done it before at the end of the day, and a lot of people are checked out already, and then they forget what they said they were going to do by the time the next day comes around. DAVE: Yeah. You've had shower time to think about it. JUSTIN: Yeah. So, I prefer it in the morning, I don't know. But I am open to other thoughts. And, again, if you're dealing with multiple time zones, you just got to do what's best for your team. MIKE: You suggested that reality, which is if you have groups in very different time zones, and I've seen this with people in Europe, people in India, Philippines [chuckles], people in Vietnam, you know, where you have very different versus the United States. You are right. That makes the standup even more important to not be a status meeting, because it's handoff time, right? You're passing the baton. And when you're passing the baton, you don't want to say, "Hey, here's what I worked on today," then the race stops. You stand there and chat for a few minutes, and it's no longer a relay race. You're not handing off the baton. Who knows what you're doing? But if you're handing off the baton and say, "You know, careful, it's slippery up there," or they're supposed to hand off the baton, and they're not there yet because they're blocked back somewhere, right, then you know something. And that's an important thing to recognize, and using that opportunity is a big deal. It's a good opportunity, a really useful opportunity to actually make that handoff and make sure that you're not doing a status meeting because that's, like, the least valuable thing you can do when somebody's showed up at 10 o'clock at night. You don't want to hear what they worked on that day. You want to hear about what you're working on today, because they're handing it off to you. DAVE: You said something a minute ago, and I think I misheard you, but I like the way I misheard it. You talked about time. You said, like, what time? 
Because Justin then jumped in with, you know, like, evening for the India team, and that sort of thing. But what I heard was how many times. And I was just imagining, like, the horror of having standup more than once a day. Or, you know, do we have it three times a week? And that sort of thing. And I have actually worked on a team that had standup twice a day, and it's because the team was extremely agile. We were all in a bullpen working together. There were six of us, and we would pair up in three pairs, and no ticket ever lasted longer than four hours in theory; sometimes they did. But, like, at lunchtime, if you weren't done with your ticket from the morning, you had to trade pair partners. And the next day, if it still wasn't done, your ticket got thrown back in the backlog as being too big, you know, too problematic. And what I'm realizing, I've got this crazy...this is just a bat-poo-crazy Dave Brady hypothesis. Show me how fast your deploy cycle is, and that is how often you need to be having standup meetings. I'm on a team right now that we meet three times a week, oh, sorry, yeah, three times a week, every other day. And what you just told me is that what you synchronized on yesterday, you don't need to synchronize about today because you're not moving fast enough to bump into each other with yesterday's coordination or with just that information. If you are changing lanes very quickly or hopping from feature to feature to feature, then you need more and more coordination, because you're a lot more volatile. You're jumping around. You're bumping into more things. So, that's my crazy theory is, from the time you go code complete to the time you go deploy, that sets a pace and a rhythm. It's not necessarily good or bad. I mean, agile says that should be very, very small, but, like, it's a reality that, like, the more enterprise your system is, the longer that's going to be. 
If you've got, like, a validation or an auditing step, or that sort of thing, or compliance, then that's going to take longer. And, I think, as far as coordination goes, that can verify the need for a standup meeting. There's just not that much need for everyone to come together and say, "Hey, I'm going to be working in this area. Who do I need to coordinate with to make sure I don't break your stuff?" So... WILL: I don't know, man. I do not agree with that in the slightest. I'm a hard, hard, hard no on that. DAVE: Awesome. Awesome. WILL: Well, deployment cycles, like, it's...I think of, like, these standups as, like, more, like, inter-process communication. I work in, like, native mobile for the moment. And native mobile deploy cycles are very slow because it's a whole song and dance you got to do with Apple, with Google rolling it out to a bunch of, like, third-party devices you don't own, all this kind of stuff. But we need to do more coordination and not less, because, like, we've got all these teams coordinating on the same app that we really don't want to screw up. You know, clawing back on mobile release is really painful. And it's a function of, like, how many cooks do you have in the kitchen? Not like, how many times you're serving the meals, you know what I mean? DAVE: Okay, so same principle, but opposite conclusion. Okay. Yeah, that's fair. That's fair. MIKE: Well, going back to the relay race analogy I was saying before, if you need to pass that baton to somebody, then that's a coordination point, right? If you're working on something largely alone for three days, there may be nothing changing there. DAVE: Yeah, that's fair. MIKE: And if nothing's changing, you don't have to pass the baton, right? The environment you talked about, where you changed tickets twice a day, well, there's a major coordination point there where it was mandated. And that really wasn't necessarily the deploy cycle per se, although it could be. It's the points of communication. 
Or in Justin's example of very different time zones, there's a real need for that coordination where there's a handoff between one group and another at the time. It seems like those coordination points, where the coordination is required, seem to be driving it. And I think that's where there's overlap between where you're headed. Will, you're saying, well, you need to have these coordination points at, you know, the communication need is what drives those points of coordination. And yeah, for your mobile app, maybe you're only releasing once a month, but you better be coordinating more often than that, or else you're going to have a horrid mess. DAVE: I withdraw my claim because you're right. I was trying to conflate the speed at which you deploy a feature. If everything is atomic and everybody has their arms in, then that is linked pretty closely to the rate at which you need to coordinate. But that is the actual driving variable is, how fast do you need to coordinate? How fast can things change? Absolutely. I agree. KYLE: I was just thinking of two use cases, and they might be more niche than the average developer. But I've worked in a situation where I was Dev QA. And what that meant was I sat with my devs, and that was my main responsibility. My secondary responsibility was to my QA team. So, we talked about, how many times do we have standup a day? I had two. I had one in the morning with my dev guys, and then I had one in the afternoon with my QA guys, both of them managed very differently, different scales. And then the other scenario that I'm looking at is kind of where I'm at right now is I'm on a team... I facilitate multiple lines of interest or lines of business. I have one line of business we're deploying 10, 15, 20 times a day. I have another line of business we're deploying once a week, you know what I mean? So, I guess, in that, like...this was more towards your comment, Dave, and we've kind of rectified it a bit now. 
But that would be very convoluted to be like, oh yeah, well, we need to do it once a week here and 20 times a day here [laughs]. DAVE: Yeah. Yeah. Well, and, actually, that's another proof that my hypothesis is wrong, that if the team that's churning every single day, if they're pushing changes into other people, everyone else has to beat that often as well, because they are causing coordination conflicts, yeah. WILL: I've got a different read on it. So, I had to leave right around the Slack-ups, right? And I've got a real serious problem about Slack-ups because my experience with Slack-ups, that Slack-ups are...I can't think of an exception to Slack-ups not being ultimately rooted in devs being busy under the gun and wanting to skip a meeting that they saw as extraneous. MIKE: True. WILL: And while I have seen inefficient and non-productive standups many times in many, you know what I mean, iterations, I have never in my life witnessed a team that was devoting too much time to keeping everybody on the same page. I think Slack-ups are foundationally not...It's not the right tool for the job if it's just, like, hey, everybody, update your tickets, right, so that everybody has visibility or whatever. Put it in the Jira ticket, throw a comment in there. You know, that's a good thing to do just in general, you know. Like, if I have this thing that's on my desk, when I close out for the day, here's what's going on. And if somebody cares, right, some PMs like, "Hey, what's the status of this thing?" they can just go look at it. And they don't actually need to bother me at all. They will, but they didn't need to [laughs]. But at least they're more informed when they bother me on Slack. I think it's devs thinking that this meeting is a waste of time. And I haven't seen it yet. Every day's a new world, but that day has not yet dawned for me. I think you can keep it tight. There's nothing wrong with keeping it tight and then breaking out. 
But even the act of just spending a minute, 60 seconds, to articulate what I'm doing and why and how is a worthy investment of time for me, even if I'm working on something in complete autonomy that I'm not going to hand in or coordinate with anybody for a week or two or a month. Just, like, doing that, I think, is a worthy exercise. It's a worthy investment. But you do need to keep it tight when people are busy. Put your camera on, and have everybody look at your face, so that when Mike says, "Yeah, it's okay," you know, but his eyes don't say that, then we have an opportunity to say, like, "There's so many subtle shades and variations of okay, you know, like, I just want to see it. And if I'm the manager, if I'm the coordinator, if I'm the PM, then just give me an opportunity. If somebody isn't necessarily as out and proud and boisterous as me...There are a lot of devs....could I blow you guys' minds? There are a lot of devs that are not excellent and expressive verbal communicators. And they could say to you, "I'm okay," when they're not okay. And if all you have is a Slack message, right, or even a cams-down, you know, meeting, and they give you, like, "I'm okay," right, things might actually not be okay. It might not be cool at all. And denying yourself the opportunity to get that feedback, you know, if I'm a people manager, if I'm trying to keep this team, like, healthy, and happy and productive, I think it's a glaring unforced error. MIKE: I talked to a teacher once about online meetings. I think Zoom is what he was using, but the tech doesn't matter. He talked about teaching a large class where nobody had their cameras on. And it was a nightmare [chuckles] because he'd say something, and it was just dead space, right, just throwing it into the void. You lose all of that nonverbal communication, and he had no idea whether what he was saying was landing at all. And it threw off his whole teaching, like, the whole rhythm was gone. He couldn't make it work. 
At that point, it's almost a mandatory lecture, where it's why not just record a video? Why do I even bother? WILL: Cams up, mics up, every meeting, every day, every time. And if you got to keep it tight, keep it tight. There's no sin in keeping standup tight, really. And I say this, like, it's full mea culpa maxima. I am the problem because I like to yap. MIKE: Well, we talked about this earlier, before you were able to join, but we talked about failure cases, and one of them we talked about was going too long. We came to the conclusion that a lot of this comes down to what happens outside the standup. And, Dave, I think, expressed this really well. Like, all the failure modes in a standup are because of something that didn't happen outside. And if you haven't done your good coordination beforehand, then you're going to have failures in there, and then you're going to have the long reporting, because nobody knows what's going on, and you have to get caught up. So, absolutely, short. I have run standups before. I've seen them sometimes work, and sometimes they don't, where you start with the blockers. So, instead of saying, "Here's what I was working on, here's what I'm going to work on, and here's what's blocking me," we start with the blockers, and if there's nothing else, you move on. You come in and say, "This is what's blocking me," and if there's nothing blocking you, you go on. Now, to Will's point, having some expression of what you're working on seems to be valuable. Just the act of speaking, you know, like the rubber ducky that you talk to on your desk to clarify your thoughts, probably does have some value on its own. So, there is something to be said for saying, "Well, this is what I'm working on," but that can go last. You can say, "These are the things that are blocking me. This is what I'm working on today. This is what I worked on yesterday. This is what I'm working on today." It changes the focus. So, you start with the most important stuff. 
"Here's the thing I need to coordinate on that I know I need to coordinate on. I don't want to waste your time. And then here's my take on where things are at." And maybe somebody's going to pick up on something from where you're at. "Oh, I need to talk about that." But you change the order, and I think that does help. WILL: I would want it in reverse order, because I know how these meetings tend to go. And, generally speaking, if you've got a bomb that's going off, if I've got...the kind of people on my team that I want on my team, everybody wants to defuse the bomb, and that's going to [inaudible 40:27] MIKE: [laughs] WILL: But, and here's the thing, right, like, there's people who...I am pro communication and pro efficiency, and so I want the green light projects. Get them off the list. Let me look at your face and spend a minute describing what you're doing, just because, you know, I want to make sure you're okay. And I want to make sure that everything is really okay, and you're not just sort of, like, you know, walking down this open elevator shaft unknowingly, right? So, I could just kind of pull you back by the collar of your shirt. That's fine. But you can get off the call, man. If everything is cool, like, I know what I'm doing; I know what I need; I don't have a blocker; I need to get back to work, then let's get these guys out of the call. And then if the world is melting down, everybody who isn't, like, actively with a bucket, you know, you could get off and, like, get back to the work, you know. Because sometimes you'll have a standup, and it'll roll right into a crisis planning meeting, and that is standard, par for the course. But everybody whose light is green, yeah, bail. Sorry. MIKE: Well, you can take, you know, if you have a conscientious host for your meeting, anytime one of those bombs does show up, you say, "Okay, we are going to talk about that." You create the meeting. You assign it to somewhere else. 
We are going to set that aside and make sure we finish. And then you get to the end, dismiss everybody who doesn't need to be there, and then you go on. I think that you have to be conscientious about that, or else you will have the failure case. WILL: Yeah. I just go with the flow, you know, my natural... I go with...I would prefer...Both require discipline, and I would prefer to just sort of, like, not do it the hard way on purpose, because people in meetings are naturally going to have a tendency to be, like, this is not relevant to my interest; I'm out, right? Like, you usually don't need to tell people [laughter], you know. KYLE: So, I had a QA manager, and it was...I think it was about the size of 12 people on the team. And I did like the way that he ran the meetings, just because it was one of those things where you would say what you accomplished or what you touched, and then what your blocker was and who you needed. And doing that, the actual standup portion generally took about 5 to 10 minutes. And then afterwards, it was allotted that you could go and communicate with who you needed to. I just thought it was an interesting way for him to manage it that way. And then if you didn't have a blocker and you didn't have anybody you needed to go talk to, you were done. You could go back to your desk and continue doing your work. And I liked that because, then a lot of the time, I could do exactly that. And I wasn't necessarily, you know, in there twiddling my thumbs, which is the most frustrating portion of a standup for me. And I just thought that aligned with kind of what Will was saying a bit, maybe not perfectly, but... MIKE: Well, that pivots a little bit. We talked about psychological safety some before. What does a good meeting host do to cultivate a good standup where it doesn't devolve into fisticuffs, but [laughs] rather [laughs], you know, that there may be heated conversations, but they're productive and not personal? WILL: Balance on all things. 
I mean, you know, do be clear about what's going on with you. Don't waste everybody's time. Do be accountable. Don't call people out. You know, it's you got to...I don't know, cams up, mics up, all the time. No muting, you muters. If your dog's barking, you know, if your kids are running around the room screaming, like, you know, as much as I can understand that you would want to suppress that, that is actually relevant to your team's performance. And your manager has both the right and the obligation to, you know, inquire as to how their distributed team is performing while you're on the clock at least – DAVE: Some days I'm really glad I don't work for you, Will [laughs]. THOMAS: Because I'm feeling very attacked right now, Will [laughs]. WILL: Yeah, no, no, no apologies. Most people who worked for me really, really liked it. But, you know, I'm also not exactly nice. DAVE: I'm sitting here going, that would be productive, so yeah. MIKE: Knowing that somebody has the dogs barking and all the kids running around once a month is relevant. You can say, "Oh, something's going on today [chuckles]," right? It's different than saying, "Wow, that's an environment that hasn't changed in a month [chuckles]." There may be some challenges there. There is some value, I think, in what you're saying to gathering that information and learning about what the baseline is and where there may need some assistance or changes. WILL: You know, this is a really wild tangent from running a standup, right? I am not a bad person, and so, as a result of this thing, right, where I have sort of, like, you know, I had these fairly rigid, dogmatic rules, like, I know things about my distributed team in other countries that I have not seen any of these people who are using distributed offshore resources. And they could not give a greasy hillbilly f**k [chuckles] about what's going on with any of these people in their actual lives. 
The level of dehumanization and, like, just no shit given that we treat distributed workers with in the IT industry is disgraceful. And so, like, yeah, man, like, I know that my buddy has a kid with terrible asthma in New Delhi. And they need to seek medical treatment for their daughter who I know, because when she's on the call, I bring her on the camera and I say, "Say hi to everybody." Or they have roving packs of street dogs that are roaming through their...up and down their block in the middle of the night in New Delhi because this person is on a call with me. And I want them to be sitting at the conference table as close to physically present as possible. And this is, yeah, okay, you know, this is, like, weird stuff that I do, right, and nobody else does. But I do that not with an eye to, like, you know, be an intrusive, you know, megalomaniacal d**k, though I am that. But this time it's about having a human interaction and a human connection. And I've seen how other people do it, and they're wrong. And it's a dystopian, screwed-up, dehumanizing thing, and everybody hates it. So, as much as I'm willing to take criticism for, like, you know, me being a little bit of a psycho, it comes from a good place. And while I will have to, like, you know, kind of be a little bit of a door kicker to make these things happen, because people hate it, the proof of the pudding is in the eating, and it works. I'm not wrong. MIKE: I heard about a dysfunctional team recently where they were all overseas. Most of the team members were overseas, and it turned out that the contract shop that was running them had explicitly told them that nobody should go on camera. Nobody was allowed to talk except for one person on the team. DAVE: Wow. MIKE: That's awful. And it was because, you know what happened? One person on the team spoke, and the communication was horrible for years on this team, because how could it possibly be good if you'd limited it that way? 
The contract shop that was doing this had...I don't know what they were thinking [chuckles], but it was a company policy. DAVE: Wow. MIKE: You think about what it means to be somebody on the team. And they didn't tell the company that they're working with, right? So, nobody knew, and nobody on the standup talked. They just thought, I just can't get a response out of these people. And so, it forced dehumanization, because you thought, wow, these people don't know what they're doing. They're not even willing to talk or show their faces. And there was no human connection made. They were just faceless. And I think they may have done it so it'd be easy to swap people out [chuckles]. But that's exactly what it is. It makes people just machines, like, fungible, irrelevant. WILL: Disposable. But that's not how the business works. MIKE: No, it's not. WILL: You can't do that. We are not bolting doors on Hondas. As much as every MBA from here to New Delhi would like it to be that way, unfortunately, no. I'm sorry. This is, like, a medieval guild, you know. That's just how it is, and I see no indications that any of that is going to change. So, like, okay, man. Like, just deal with us like human beings [laughs]. You know, I'm not wrong about this. I've tried it every which way. And if I'm being blunt, right, even the proposition that you would be able to attend a meeting, right, behind a one-way mirror [laughter], if you try to pull this off in actual physical reality, you would be a psychotic, you know. It's like, what's the one-way mirror? What's that one-way mirror for in the conference room? It's like, we don't talk about that. Really guys? Come on. MIKE: [laughs] DAVE: Every time I run a skills clinic, somebody will ask me, "Did you guys record it?" And I'm, like, "No, no, we didn't." And I'm not trying to be a jerk, but one of the reasons I don't record them is because I want you to attend. When people come to Skills Clinic, I get stuff out of it. 
If I didn't need to get anything out of Skills Clinic, I would just drop a one-hour video of me talking. I love to hear myself talk. I can, you know, film it and drop that into the thing once a week. But if you're just lurking all the time, then when you get bored and distracted, you're going to go play Minecraft. You're going to go surf Twitter. You're going to go doom scroll. That's the exact opposite of making human connection with the people that you're working with. Will, you were talking about kind of, like, the forced thing. Like, if this is the noise and the distraction that you hear, let's all experience it together. And I realize now, from a realistic human connection point, there's value in that. When I pair program with people, you're like, oh yeah, this is going to keep you from getting distracted and going on Twitter. No, it doesn't. It's just that when I'm pair programming with you, if I have an idea and it needs to be on Twitter, we both go tweet. And, like, I literally tab over, pull up Twitter. "This thing that my coworker just said is really funny," and send, and then we go back to work. And that is hugely valuable. For me, it's valuable because I didn't spend 40 minutes scrolling Twitter after I did the quick distraction. But for my partner, they got to experience, I won't tout this as virtuous, but they got the full David Brady experience. People say things that need to be on Twitter around me all the time. WILL: Well, I mean, like, you could do stuff. You could do stuff like, I don't know, I'm on a call with my offshore team, and it's really noisy. And it's like, "Hey man, what's going on?" It's, like, oh, it's this giant religious festival. They're having a giant fireworks display in the park outside. And then we can all just take a minute, after we finish our work, you could take your laptop up to your roof, and we can just sort of look at this huge Indian religious festival that is happening literally outside your door. 
And we can be on a team and have a human connection. And that is possible. DAVE: It's the opposite of balkanization. WILL: Well, and, like, if we are on a call and somebody is on their phone, right, the whole time or they're very busily typing in some other screen the whole time, and their cam's up, and their mic's up, you can see it. As somebody who is ultimately responsible for the health and well-being and care and feeding of this entire team, you have an avenue and an opportunity to check in and be like, "Hey, what's going on, man?" Because, is somebody depressed? Is somebody pissed off? Is somebody having, like, some kind of a moment? Which is absolutely going to fall at your doorstep. It's coming for you. DAVE: It's relevant, yeah. WILL: People don't work good clinically depressed. You're going to feel that, and you have an opportunity for leadership, as opposed to just being a manager, that you wouldn't get, or it would be harder to ascertain, if you were just sort of, like, a W on a Zoom call. DAVE: I have t-shirts from the best teams I've ever been on, where, ironically, it was the team that I was flying out to Ohio with to be on site with. We would go do an escape room and get a group photo. And then my team lead would print t-shirts for us to take back, you know, from this. And I cherish those because I had a lot of really good friends. Whenever there was a reorganization that divvied up our teams, it would break our hearts because we were best friends being divided back up from each other. And that's awesome, to actually care about the people that you work with so much that when you get reorged away from them that it's a tragedy. And, hopefully, both halves of that team, you know, it works like sourdough starter. You separate the two teams, and that culture then permeates both lumps of the new teams. 
I love how everything, when you get really into the meat of, like, agile and about, like, good methodology, it ends up being about people when you're done before it's over. We started out with, like, what's wrong with your standup? And Mike, you put your finger on it really well. The dehumanization, that's what's wrong with your standup. MIKE: Honestly, that's probably a great place to tie this up. WILL: Yeah [laughs]. Cut. There you go. MIKE: Exactly. DAVE: Mic drop. Yeah. MIKE: We're human beings. This is a chance to connect. Use it for that. DAVE: Yeah. Fantastic. MIKE: With that, let's end. Let's be humans. Until next time on the Acima Development Podcast.

May 13, 2026 - 56 min

Episode 97: Database Indexes

The episode of the Acima Development Podcast centers on database performance, using the concept of indexing as its foundation. Mike opens with a story about discovering Google in the early 2000s to illustrate how powerful indexing systems transformed access to information. That same principle applies to databases: indexes act as shortcuts that make retrieving data dramatically faster, especially in large datasets. The discussion emphasizes that while indexes can feel like a technical detail, they are fundamental to how modern systems function efficiently, much like search engines reshaped how people find information. Bill Coulam then dives into the technical side, explaining that indexes improve read performance but come with trade-offs, particularly slower writes because both the table and index must be updated. A key rule of thumb is that indexes are most beneficial when queries return a small subset of data, typically under about 25% of rows. The group explores how poor indexing strategies, like over-indexing or missing indexes on key relationships, can quietly degrade performance over time. Bill shares a striking real-world example where adding missing indexes reduced a process from taking 24 hours per record to processing millions in just a couple of hours, highlighting how impactful proper indexing can be. The conversation broadens into database design philosophy and performance tuning. The team discusses different index types in PostgreSQL, when to use them, and how to balance read vs. write performance depending on use cases like bulk inserts or high-frequency queries. They also touch on when relational databases fall short, such as full-text search or massive write-heavy workloads, where NoSQL or specialized systems may be better suited. 
Ultimately, the takeaway is that effective database performance comes from understanding your data, access patterns, and trade-offs, combined with ongoing maintenance and thoughtful design rather than relying on defaults or assumptions. Transcript: MIKE: Hello, and welcome to another episode of the Acima Development Podcast. I'm Mike, and I'm hosting again today. I'm going to start by introducing Bill Coulam, who's with us today. He comes to us from the data team. And he's been here before, but we're going to focus on some information that he has to share. So, he's kind of the star of the show today. Also with us, returning, we've got Eddy, Travis, Justin, Dave. Mr. Perez, great having you with us. We've got Mike Perez here with us, and Ramses. As usual, I'd like to start with something a little bit outside of our topic in order to bring it in and tie it into the outside world. And I was thinking about a story I think I've shared before. The importance of this moment early in my career keeps, like, growing as I look back to it, like, wow, that was a big deal, and I didn't realize it at the time. So, in the early 2000s, somewhere in the early 2000s, early, early 2000s, I was working for a guy [chuckles]; I'm going to say that. He had some projects, and he didn't have enough resources to do some freelance projects, and so I was doing some of his stuff. He was outsourcing his freelance work to me [laughs]. And he had a project that was in Windows, and there was something they wanted to accomplish through the API. And I started looking through the documentation, trying to use Microsoft's tools to search the documentation, and I spent hours. I looked everywhere I could [chuckles], and I couldn't find it. I came to the conclusion, maybe this doesn't even exist. And I came back to him, and he got back to me, like, 30 minutes later. He said, "You know, there's this new tool called Google, and I use that, and it's amazing. 
You should start using it because it works really well, and it led me to this documentation." Like, wow, well, I know what I'm going to use now. I'm going to use this Google thing [laughs] because that works way better than actually going through the table of contents, and the index, and the documentation, because that's really hard to search through. Those older forms of indexing were insufficient. Now, Google had this brilliant idea, you know, the founders of Google, that, okay, we'll index the internet. And even back then, that was, like, an impossible goal [chuckles]. And there were other sites that were doing it. There were indexes out there. What they would do is they'd look at the words on a website, and they would create an index based on those. And so, if you look for a word, they'd look for a website that had a lot of those words. Well, people really quickly figured out how to game that [chuckles], and, of course, they did. So, they were useless almost immediately because people would go into their meta tags, and they'd just write the same word a hundred times for something that the site was really not very applicable for. What Google did is they came up with a different sort of index, where they would index words in the links that linked back to a site, and also give extra weight if there were a lot of them, right? And so, by building a more appropriate index that suggested popularity, rather than self-determined, a self-stated importance of the page for a specific topic, they were able to come up with something way more effective. And you don't always think about indexes, you think, index? Like, I remember going to the library. It had, like, the Dewey Decimal System, which is really kind of weird and awkward and hard to find things with, but it was way better than the alternative, which would have been nothing. 
You don't usually think about indexes changing the world, but that index, that PageRank index, you know, the PageRank algorithm that they use to just create an index, that's all it is, right? Link this word, map this word to a website, so that when you're searching this word or phrase, then we can find it. It literally, like, fundamentally changed culture. It's now a verb [laughs]. Like, you Google something, even if you're using Bing for those of you out there who use Bing [laughs]... DAVE: Use Bing to Google, yeah. MIKE: Exactly. You use Bing to Google, because information now is accessible, and that is something that didn't exist before that. For all the digital natives who've grown up in this world, like, how did you find things before? Well, you didn't [laughs]. You suffered. You wandered through libraries. DAVE: We just got used to not knowing things, yep. MIKE: Exactly. That's exactly what you did. You got used to not knowing things. It changes everything when you have an effective index. And I could talk about all the times in my career when something's missing from the database, and yeah, it was the index. It's always the index. There's always a missing index somewhere. It solves all of your performance problems. And there probably is an exception, but I can't think of it [laughs]. It's always the index. That's what we're going to talk about today. We're going to talk about database performance. And we've been wanting to, you know, Bill's been preparing this and thinking about this for a while. If we're talking about database performance, indexes are going to come up over and over again. And this could seem really dry, and this is going to be a technical deep dive, right, we're going to very much going to talk about indexes. We're probably going to be focusing on PostgreSQL. But this idea of indexes is not a trivial one. It's how we operate in the modern world. Our culture, our commerce has been fundamentally transformed. 
Our ability to know things and outsource, you know, to this Library of Alexandria that we've got in our pockets all depends on indexes, and it's amazing. There's my introduction, Bill. And I wanted to lead out with some weight behind what you're going to be talking about today. BILL: I love it. That was a fantastic segue. All right. Hi, everyone. I am Bill, Bill Coulam. I've been doing this work for about 30 years now. I started as a software engineer using COBOL and mainframes, but I don't put that on my resume because I don't want anyone to ever call me back to help with that. So, I tell people I started with C and C++. I was actually one of the first users of Java back in 1995. My company that I worked for at the time, Anderson Consulting, they wanted me to go around to their clients and tell them what I thought of Java. And, at the time, I felt like it really wasn't ready for primetime, and so I kind of voted myself out of working on that platform. But that's okay because I ended up, on every project that I worked on, working with Oracle, and, at the time, Oracle was the 800-pound gorilla. And I was in the telecom industry, where we had some of the largest volumes of data in the world, and so I learned a lot of great lessons working on those big systems. It's a whole other world jumping between databases that have 10,000 to a hundred thousand rows to databases that have 500 million, a billion. Performance tests in your copy of production can take three hours. It's a completely different world. Anyway, so you learn a lot of good lessons working on data that big. I ended up sticking with Oracle for a long time. It became my bread and butter. And went from San Francisco to Denver to Houston, and then back here to Utah where I grew up. I've been here longer than I spent time in my own hometown. So, I've been here in the northern central part of Utah since 2007. Anyway, let's go ahead and jump into it. 
We're going to be talking about four areas: the fundamentals of indexing, some guiding principles, the two shared tendrils, index types that are available to us using Postgres as our source database, and some indexing dos and don'ts. Firstly, some fundamentals. An index is a shortcut to get at the data. However, because an index is a separate structure from the actual table containing the data, it requires at least two I/Os to get at the data: one to search through the index, then one to access the rows in the table. Because of this, indexing can and usually does save time when querying large tables, but it can take longer than a full table scan if the number of matching rows is greater than around 25% of the table. That is a rule of thumb, not a hard rule. I did a bunch of testing back in 2024 on our setup here, and it was right around 25%. So, if the number of rows you anticipate matching your query is less than 25%, an index will typically make sense. Ultimately, an index is stored in a file. And updates of indexed columns, keep in mind, must modify and manipulate both the table and the index. That's important when you start thinking about how many indexes your table has and the effect that that will have on write time. And, lastly, matching index and table results will get cached in case the same request is made later. MIKE: So, I've got a couple of questions about that. Firstly, how often do you see in...and this depends on systems, right, so maybe there is no universal answer. But how often do you see indexes harm performance? Because there's this index that we probably didn't need, but now we have to write to it every time, or somebody went in and indexed 20 columns, right? There are certainly bad use cases. Have you seen cases where there was a clear performance hit, and, you know, seeing data to show that? Is there some sort of rule of thumb where I should think, ah, well, actually maybe the index is a bad idea here? I'm also curious about those caching results.
Do you sometimes get...in data sets that are growing really fast or something, do you end up with weird results from that caching? BILL: Let me answer the second one first. The answer is no. Phil Karlton of Netscape, may he rest in peace, he was famed for saying something like, "There are only two hard things in computer science: naming things and cache invalidation." And there was some wisecrack that added to that, where it said, "There are only two hard things in computer science: naming things, cache invalidation, and off-by-one errors." But yeah, cache invalidation is tricky, but the database engine teams tend to have done that very well. So, I've never had funky results from mainstream relational database engines, so that tends to work pretty well. The answer to your first question, the quick answer to that question is no. I have not seen indexes cause immediate harm. Like, the old analogy of, you know, the frog in the pot of water that eventually gets too hot and cooks it, adding even crazy indexes, indexes with lots of multiple columns in them, and so forth, I've never seen an immediate and obvious degradation. It's even been hard to detect it when that pot is fully boiling. When a table has 30 indexes on it, and inserts are taking two milliseconds per row, generally, you don't notice it. Over time, as these indexes are added, the team that works with that data tends to believe this is the way things are. DAVE: Oh, it does that. BILL: And they don't really question: could this be three times faster if we got rid of all the unused indexes? So, yeah, to answer your question, I've never seen it immediately [inaudible 11:39] performance. MIKE: [crosstalk 11:39] three times faster. That's probably loosely data-driven, right? BILL: Yeah, that's very loose. MIKE: But you didn't say a thousand times faster. But there are very much cases where if you're missing an index, it could be a thousand or a million times faster [laughs]. BILL: Yes, mm-hmm.
And -- DAVE: I actually have seen a case of this, but I think I'm actually in agreement with Bill, where the definition of a database, right, developers we always talk about toy databases, right? But the database...and you don't think of this as a database, but the system log on Linux, it's a log file, and we think of it as a log file. But you can also think of it as an unindexed table where you want to write a row right...Insertion has to be very, very fast, and you can't spend any time indexing. Well, if you have to write fast and you're saturating the hard drive...also, this is back in the days when a hard drive seek was 12 milliseconds, and so updating the index and the file was very painful, right? If you take that blurry view of, like, is a log file a database? It's easier to think of that when you start realizing that, well, everyone's now streaming their log files up to Splunk and Datadog, and these things that are, like, pulling their log files together. And time series databases like Grafana now exist where you're supposed to log, log, log, log, log, log, and then, over time, they start compressing the old stuff. Like, they start batching it up historically, and you start losing data. It's kind of like compacting context for an AI. So, like, 100%, I agree that, like, if you're talking to a real database, you've usually got a lot of structure, and everything's, like, really, really solid. But I have, back from the battle days when I was doing a lot of MySQL, we would have to sit down and go, is this table fast write, and we don't care about search? Or do we need fast search on this? And if so, can we pay the cost to index it? MATT: I think it's specialized, right? DAVE: Mm-hmm. Mm-hmm. MATT: There's certainly cases where you will see this. If you have one insert and that insert is inserting four and a half million rows, and I see this, it's a problem. But I think it's a more specialized case and more one-off.
But if you have 30 indexes on columns in a table that you're inserting 4 million rows at a time, you definitely see performance degradation for sure. EDDY: So, that's kind of interesting, right? Because I think, historically, what we've done is we try to remove unused indexes, right, like when they become unnecessary. I think the rule of thumb is clean up to avoid degradation, right? But then it's kind of interesting because I think Bill's response to that is, I haven't seen that in practice slow down any inserts or writes, right? So, I'm curious, like, is it just, like, the thought process is clean up after yourself always, regardless of whether that slows down degradation? Or is it... MIKE: So, I asked the question because I think that there's so much power in indexes and, you know, it seems like that cost to write isn't that high, but I think there are other costs. Even if it was negligible, even if there was no impact at all, I think if you want to understand what's going on and you've got 30 indexes, you're in trouble. Like, I think that that cleanup matters just for, you know, we don't write in machine code because code is written for humans, right, and then compiled for the machine to understand. And I think the same thing applies here. There's a human aspect to this that unless you actually needed those, I think that you're doing active harm to the users of that, you know, to understanding or even trying to fix if there's a problem, if you've got a bunch of junk there that you don't understand. I think that regardless of whether it doesn't have much impact, I think that there is still, like, a reason to keep it clean. Well, that's my thought. What are your thoughts, Bill? BILL: You really need to know your data and know your anticipated access patterns. Matt was talking about a scenario where you want to insert 4 million rows in the telecom industry, or sometimes you'd need to bulk load 300 million rows. 
That's a different access pattern than inserting a single row. And you need to approach things differently. In some of those bulk load scenarios, it was much faster for us to drop all the existing indexes, do the load, then re-add the indexes than it would've been to leave the indexes in place and let the database engine do all the maintenance. So, everything is an it depends answer, right? You need to think through those things. But the second thing that I want to talk about, which is highly related to the indexing fundamentals I just went over...And these aren't, you know, industry standards. These are just, according to me, some guiding principles of index design. The first one we've kind of already talked about...actually, both the first and the second. That is that an index will make data access faster, but an index comes with a cost to write times. So, just like Goldilocks and the Three Bears [chuckles], you know, you can't have too many or too little. It needs to be just right. In order to make it just right, you really need to understand your application, the business requirements of that application, the data that you're working with, the quality of the content of it. You need to understand the access patterns that are anticipated on that data, queries that you already know about, or anticipate. And by having all of that context, you can design a much better database schema and indexing strategy. DAVE: One of the things that kind of blew my mind, like, 5, 10 years ago as I was getting into, like, NoSQL, and schemaless databases, and document databases, and that sort of thing, was somebody pointed out to me that SQL is actually a terrible language for reporting. It's not built for reporting. SQL is an ad hoc query language. It's how you get at your data when you have no plan to get at your data. 
And all the cool stuff, all the bells and whistles, like the query planner and indexes, are basically trying to get around the fact that you weren't prepared to do this. And if you do know what you want and you've got that report well defined, you can make it so, so slick, whether it's indexing in advance, or materializing views, or shoveling everything over to the data team and letting them stick it in gigantic vertical tables. EDDY: So, I think I've always just gone hand in hand with saying, "Oh, you want reports? You want SQL because that is the most efficient language used in any sort of database," right? Are you suggesting, or do I understand that correctly, that you're saying that that isn't its original intention? Because you're blowing my mind if that's the case. BILL: If you want something that's super efficient, it's NoSQL, because a collection is built to match an anticipated access pattern; it will blow relational out of the water. I love that David brought that up because that was the next line of my presentation. DAVE: Yes! BILL: Is that one of the greatest strengths of relational databases is its ability to handle ad hoc queries. So, you try and totally understand your system and try and anticipate the queries that will be required of it. And some you will know right off the bat because they're right there in the requirements documents, but there are still plenty of ways in which that data may be used in the future. And you just use your experience and your gut instinct to say, "Okay, we're probably going to need an index for these three columns here, because of what they're named, and what I expect, you know, the end users to want. This JSON data column, it's unlikely they're going to be indexing off of that, you know," so you make your best guesses. But yeah, a relational database is really good at ad hoc queries. And if you have done your very best to index intelligently, then it will be able to handle most of those ad hoc queries well. 
And the ones that don't will be generally immediately apparent, unless you're on a tiny system, you know, trivial system. And then you can redesign things; maybe toss an older index and redesign a new one that's composite or partial or, you know, fancy in some way that matches your needs. MIKE: I've got an anecdote related to this as well when you talk about NoSQL. BILL: Yeah [inaudible 19:27] MIKE: Some years ago in my career, I was working at a place that did content management, largely for newspapers. And you think about what an online newspaper page is; it is effectively searching the latest content. You don't usually think about that. Like, that's search? Well, yeah, it is. You want to just get the latest things. It's a feed, right, and then the oldest things drop off. That's more obvious to get something like the social media, where literally is this feed where it comes in from the top. The newspapers, even old paper newspapers, you have the latest content, and the most important stuff comes to the top, and other stuff flows down to page eight. And we organized our presentation that way. It really was doing searches. And it got slower because a lot of that relied on text. You know, you're looking for this kind of text, well, this is the weather, right? We want to look at the weather stuff. And to make our system efficient, we had to get out of relational databases. And we used a full-text index; we were using Solr at the time. It's similar to Elasticsearch or OpenSearch, all these descendants of the old Java Lucene library that allow you to efficiently build an index into your data. But it also is effectively a NoSQL database because it's searching the data. In fact, you can even cache the data in that index, and so you never even hit your database. And we would do that sometimes, where we'd never even hit our relational database at all. We just used that to store the data before we indexed it, and then it came out of that other system. And we could not run. 
We absolutely could not run off our relational database because it was way too slow; it was unworkable. We had to use, you know, that NoSQL database in order to work. BILL: Yep. And there was a telecom company I worked for in 2021 where they needed wickedly fast writes. And so, there we went with Cassandra. We weren't really worried about indexing or reading. Now that I'm clear that this is not a presentation, I'm going to possibly just fly through the middle portion of it, which is where I train the listeners on the different sorts of indexes available to us in Postgres. Maybe I'll just mention them briefly and then get to the dos and don'ts at the very end. In Postgres, we have a number of basic indexing types available to us, many of which are covered in your basic computer science courses, like B-tree and hash. We also have some index types called GIN, which stands for Generalized Inverted Index, and GiST, which is a Generalized Search Tree. The GIN indexes can support many different user-defined indexing strategies out of the box. We typically use them when indexing columns that are arrays on a Postgres table. But it is also used to aid in the implementation of full-text search on Postgres. And it comes built in with various operators that let you do things like nearness, and contains, and stemming, and some other things that a typical B-tree doesn't allow you to do. The GiST index is fairly similar in that it allows you to build your own indexing strategies. It supports nearest neighbor searches, geography, spatial, and other very specialized types of indexing. This is the index most typically used for features within an application that have mapping features, allowing you to see how far away you are from the pizza origination point and things like that. MIKE: Well, you know, that's interesting.
And even I think, well, that's a very specialized case, but the most common query in our biggest application is one of those geographic queries, in order to find your [crosstalk 23:26] BILL: That's in merchant portal, right? MIKE: Exactly. BILL: We're using that there. Then there's a really specialized version, which I have never used, called Space-Partitioned GiST, SP-GiST. The official documentation says it's non-overlapping. It lets you build your own indexing strategies, just like GIN and GiST. It's very flexible. It permits implementation of a wide range of different non-balanced disk-based data structures, such as quadtrees, k-d trees, and radix trees. If you guys know what those are, that's awesome. I did not go through computer science, so I'm not even sure when those would be helpful. But, apparently, it is also used in geolocation-type applications. The other sorts are a little bit more used. I've only used BRIN once. BRIN stands for Block Range Index. These are best for columns whose values correspond with their physical order in the rows of the table, so think of, like, a now-serving number at a queue or a kiosk. As that number is monotonically increasing and going to some database column, that number is very close to the value of the row that preceded it. That is a perfect column for a BRIN index. And the reason you would want to use it is it uses far less space than a typical B-tree because it uses ranges instead of individual values. Uh, let's skip over that. MIKE: Do those ever get used for, like, primary keys or anything like that, for efficiency, or not really? BILL: No [chuckles]. Most every system I've ever worked on defaults to B-tree for a numerically based primary key or surrogate key. But, in theory, it could be used for one of those. Yeah, I've only ever used it once, and, typically, the B-tree is fast enough. I don't need to eke out another hundredth of a millisecond by using a BRIN. So, typically, the default is used. MIKE: Interesting.
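A rough sketch of the block-range idea Bill describes for BRIN: instead of one index entry per value, keep only a minimum and maximum per block of rows, then read just the blocks whose range could contain the value. This is a toy model in plain Python, not PostgreSQL's actual on-disk format; `BLOCK_SIZE`, `ticket_numbers`, and `lookup` are hypothetical names chosen for illustration.

```python
# Toy model of a block-range summary; NOT PostgreSQL's real BRIN implementation.
BLOCK_SIZE = 4  # rows per block, kept tiny for illustration (assumed value)

# Monotonically increasing values, like a "now serving" counter.
ticket_numbers = list(range(100, 120))

# Split the "table" into blocks and keep only (min, max) per block.
blocks = [
    ticket_numbers[i:i + BLOCK_SIZE]
    for i in range(0, len(ticket_numbers), BLOCK_SIZE)
]
summaries = [(min(block), max(block)) for block in blocks]


def lookup(value):
    """Return (found, blocks_read); only blocks whose range covers value are read."""
    blocks_read = 0
    for (low, high), block in zip(summaries, blocks):
        if low <= value <= high:
            blocks_read += 1
            if value in block:
                return True, blocks_read
    return False, blocks_read


print(len(ticket_numbers), "values summarized by", len(summaries), "ranges")
print(lookup(110))
print(lookup(500))
```

Because the values follow the physical row order, the ranges do not overlap, so a lookup touches at most one block. With randomly ordered values the ranges would overlap heavily and most blocks would have to be read, which is why the physical-order condition matters.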
Maybe if you had some massive data set of sequential data or something. BILL: Yeah, because this happened to me in Telecom, where we were trying to eke out every millisecond we could find. This might have been one of the strategies we tried. DAVE: We had a really fun...out here in Utah, we used to have Mountain West RubyConf for about 15 years, and it was a fantastic conference that got put on. And James Golick came out. He was the CTO of FetLife. Don't Google that. It's an adult website, social media for naughty stuff. And because it's naughty stuff, it was very, very popular, and he was running, like, a terabyte of notifications through their database, like, every single day. And this was in, like, 2009, so, like, a terabyte was a lot back then. So, imagine somebody shoveling around a petabyte or, you know, half an exabyte trying to get that through. And they were using a popular document database; it's not my fight to have, so I won't say which one. They bogged down so hard. They kept backing up and backing up. They bogged down so hard that he had to physically pull the cord on the server. Like, he couldn't shell into it to stop the server. He couldn't, like, I don't know if he had a bash prompt. He couldn't get the keyboard to respond. Could not get ACPI power button, that's when you hold down the power button on the front of the case, could not get that to respond. The document database was just spooling everything; it had just backed up and backed up and spooled out. He ended up writing Friendly ORM, which is based on FriendFeed. And if you want to know how a document database works, go tear Friendly ORM apart, if you like Rails, because it's built on SQL. It runs on MySQL or runs on Postgres anything. And your data, your documents go in a table that has an ID and a blob column. And all of your indexes are tables, and every table has an index, a record ID, and whatever data you want to go look up, and it's got an index on it. 
And he just handled that in the ORM. And when you talk about writing something in anger, he ripped out that document data store that same day. Like, it was on a Thursday or a Friday, and on Monday, they were running on Friendly ORM in MySQL. It was insanely angry. So, yeah, if you want to know how NoSQL works, like, under the hood, it's fantastic because you get into it, and you go, wait, is this all there is to it? And yeah, that's all there is to it. All the stuff about, like, crawling over a database and indexing it and then searching back through, like, the problems of searching a document database when you don't have an index, it's very obvious because there's only, like, four moving parts. It's really, really cool. BILL: Fabulous anecdote. I saved the B-tree index type for last because it's the most common; it's covered in computer science courses. But I just wanted to cover just a couple of nuances in Postgres, a couple of which I had to learn the hard way a year or two into using Postgres. One of which is that the B-tree is really good for less than, greater than, less than or equal to, greater than or equal to, and equals. It can also support some other equality and range comparisons, like the LIKE operator, BETWEEN, IN, IS NULL, and IS NOT NULL. But there are a couple of operators it doesn't support out of the box, so one of it is the LIKE operator. If you do a LIKE comparison and you feed it a pattern that starts with a wildcard, it can't use that. It nullifies the use of the index for that comparison and will do a full scan on the table. In order to do that, in order for the database to be able to index the first few characters of a word, you would need to use the GIN index with the trigram ops. I think it's called an operator. Anyway, each of these index types has the basic default syntax for creating an index of that type, and then it has a whole bunch of optional things. 
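To illustrate why a leading wildcard defeats a B-tree, here is a small sketch that uses a sorted Python list as a stand-in for the ordered keys of an index. It is an analogy under that assumption, not PostgreSQL code; `words`, `prefix_search`, and `contains_search` are hypothetical names.

```python
import bisect

# A sorted list stands in for the ordered keys of a B-tree index.
words = sorted(["index", "indigo", "insert", "postgres", "reindex", "table", "tuple"])


def prefix_search(prefix):
    """Like LIKE 'prefix%': binary-search to the first candidate, then walk forward."""
    start = bisect.bisect_left(words, prefix)
    matches = []
    for word in words[start:]:
        if not word.startswith(prefix):
            break  # keys sharing a prefix are contiguous, so nothing later can match
        matches.append(word)
    return matches


def contains_search(fragment):
    """Like LIKE '%fragment%': sort order gives no starting point, so check every key."""
    return [word for word in words if fragment in word]


print(prefix_search("ind"))
print(contains_search("index"))
```

The prefix search can jump straight to the right neighborhood and stop early, while the wildcard-first search has to examine every key. That full examination is the in-memory equivalent of the full table scan Bill mentions.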
If you want to really know your stuff, get into the Postgres documentation and look at those options sometime. That's where you see some of the richer things, like, for the B-tree, it has some operator classes called text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops that I didn't even know existed until about three years ago. I won't go into those right now. But just know that there's a range of flavors of these indexes that you can activate by knowing what those options are and knowing when they'd be useful. So, with B-tree, there are a variety of flavors of the B-tree index. There's the one that we use the most often, which is a single-column default B-tree. I won't talk more about that. The second flavor is a multi-column one. This can be used for indexes, sometimes referred to as keys, which are composed of 2 to 32 columns. You're limited to 32. I've honestly never seen any with more than 5. This sort of multi-column index is used for queries where two or more columns are always or frequently used together in the WHERE clause. During index creation, you know, you say CREATE INDEX. You give it a name ON table_name, and then in parentheses, you list the columns that you want indexed. You list those columns in the order of selectivity. So, if you had, for example, a table of people, or employees, or citizens, which would you put first: social security number or eye color? KYLE: Low cardinality first. BILL: Yeah, yeah. So, the thing that would return the least amount of matches first would be social security number, which is unique. So, yeah, higher selectivity goes first; lesser selectivity goes towards the right. MIKE: That's an interesting one because I think that those multi-column indexes don't get used as much as they could. A lot of the big, gnarly, slow-running queries do query against several, you know, they query against a number of things. 
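A minimal sketch of the multi-column index creation Bill walks through, with the more selective column listed first. SQLite (via Python's standard `sqlite3` module) stands in for PostgreSQL so the example runs without a server; the `orders` table, its columns, and the index name are hypothetical.

```python
import sqlite3

# SQLite is a stand-in for PostgreSQL; the CREATE INDEX shape is the same idea.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, status, total) VALUES (?, ?, ?)",
    [(i % 2000, "open" if i % 4 == 0 else "closed", i * 0.5) for i in range(20_000)],
)

# CREATE INDEX <name> ON <table> (<columns>): customer_id narrows the rows
# far more than status does, so it goes first.
conn.execute("CREATE INDEX idx_orders_customer_status ON orders (customer_id, status)")

# Ask the planner how it would run a query that filters on both columns.
plan = [
    row[-1]  # the last column of each plan row is the readable description
    for row in conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT total FROM orders WHERE customer_id = 42 AND status = 'open'"
    )
]
print(plan)
```

The printed plan should name `idx_orders_customer_status`, showing that one composite index serves both conditions in the WHERE clause.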
How much benefit do you get from using a multi-column index rather than having several columns indexed independently? BILL: A lot. The trick is knowing when you should have it. If you look at some of our queries on our tables and you run EXPLAIN ANALYZE on them, and you see in the query plan that it's going to be doing a lot of bitmap AND operations...bitmap ANDs are combining single-column indexes together in order to arrive at the answer quicker. If it's doing a bunch of bitmap ANDs and it's doing that over and over again, it's possible that you have a very common query that should have those two or three columns put together in a multi-column index. But if that same table has, you know, 50 flavors of queries that are being thrown at it, you wouldn't want 50 multi-column indexes to match each of those queries. So, it's that balance we were talking about at the start. You have to know which of those queries are the most important, which ones are being hit a million times a day, and which ones are being hit four times a month, and plan accordingly. And that's something...one of the 15 projects I'd love to do here is optimize that. MIKE: So, you're going looking through your slow queries, you know, using whatever tool you're using. It sounds like you'd, you know, have that in your toolkit at the ready if you see a number of...if you see queries that are, like, oh wow, that's checking against four columns in this table, you should probably have an index on those if it's doing those bitmap ANDs, or bitwise ANDs, that you're talking about. BILL: Yeah. And if that query is being hit many times per day, it's a good use case for them. MIKE: You know, most of the queries that tend to run really slow are doing joins, so I'm going a bit far afield here. So, what if the data's across six different tables, but you're running it all the time? That's a slightly different case. Do you have an approach for that specific situation? BILL: Well, you first try to optimize that query.
By the way, I have a few cardinal rules about query performance. And the first rule is asking whether or not this query should even be done. You would not believe how many times where something was really, truly awful, and we asked that question: do we even need this feature, or should we even be issuing this query? And how often the answer was no. The second cardinal rule of the query performance is, if it can be done in SQL, do, instead of, you know, dragging the data out of the database and trying to replicate a database in, you know, in the middle tier. And the third cardinal rule of performance tuning has to do with the indexing that we're talking about. If your data is well-designed...well, it's making sure that the application data model has been well designed. Usually, when I had a really terrible performance problem, it was because the data model was not good. So, I covered two of the things that most commonly fix massive performance issues, and that was something that doesn't need to be done at all, and the business requirements weren't well understood. Once those things have been accounted for and your data model's good and clean, and you've made sure that everything's indexed well so the joins can be efficient, well, you've got this 6, 8, 14-table join. You've done everything you could, but it's still not fast enough. That's when you start exploring denormalization. And that typically leads a relational database person to materialized view. In Oracle, that was really beautiful because it had a built-in facility to keep that materialized view refreshed upon commit. Postgres is just getting to that now with an extension called pg_ivm, Incremental View something or other. I think it's coming standard with 17 or 18. But, anyway, that's when you've done everything you could and dotted all your i's and crossed all your t's, and it's still not fast enough; that's when you need to look into materialized views. 
And if that doesn't work, then you're probably on the wrong database engine for your use case, for your application. MIKE: That makes sense. WILL: Generally, it goes back to, like, sort of, like, database performance, like, in general. I'm not a database engineer. I know, like, an index and a join and, like, how all this stuff works, like, under the hood. But, like, I suppose, like, the biggest query that I've got from, like, a database, like, somebody who makes databases their trade is, like, if I'm looking at a database performance dashboard, like, what am I looking at to sort of, like, diagnose performance issues? Like, how are you looking at...when you look at, like, a database and, like, how it's running, right? I know if I have a server and it's like, oh, it's using too much memory, okay, there's a problem. My queue depths are starting to, like, get really big, okay, that's a problem, right? But, like, when you are looking at, like, sort of, like, the dashboard of a database, like, what are you looking for to say, like, oh, okay, this is a problem; this isn't a problem, you know what I mean? I'm just curious, like, how do you sleuth out these performance issues? BILL: Yeah, it's not too bad. A mature, well-instrumented database engine usually comes with some facility that allows you to peer into the resources being consumed by all the queries in the system, and it'll show you front and center what the hotspot is. I mean, if the database is really hurting, it's usually pretty obvious. Sometimes when it wasn't obvious, it was due to the network and something else. But yeah, usually when you peer into a dashboard, there's a big, old bar, a big, old spike, a flame, that shows you exactly where most of the runtime is being consumed. And you're able to click into that, and it will usually tell you which query it is. Now, there, a lot of the dashboards kind of let you down in that they only give you a piece of that SQL. 
And very often, you need to see the entire SQL in order to figure out what the culprit is. Once you have the entire SQL, then you're able to run it either through EXPLAIN, which gives you an estimate of what the database would do, or, if you are able, run an EXPLAIN ANALYZE, which will show you exactly what the database is doing when it's pursuing the data. And that is where it's really critical to know both the database engine's indexing and your data, in order to determine whether the query path that the planner is showing you in the explain plan is the plan it should be using. So, you look at all these steps, and you need to know how to read it. Okay, it's doing this one first, then this one, then this one. And if you know your data and you know what it should have been starting with and what it should have been doing next, and you look at that plan and it's not doing that, then you know you have an issue. You know you're missing statistics, or you're missing an index. Or some table got accidentally blown out with 5 million rows the other day. It was a bug. And they got rid of those 5 million rows, but they forgot to reduce the high watermark. But the database still thinks it's a massive table, and so it's making the wrong join choice. That's where the expertise comes in. That's why you get paid the big bucks, is being able to combine all those things and figure out, yeah, the database is not doing the right thing here, and here's what it should be doing. And how do we get it to do that? WILL: How can you tell, like, differentiate between just a hot query that's just a hot query? Like, a lot of people want the homepage, let's say, you know, like a [inaudible 38:49] example, right? How do you say, like, oh, this is just, like, everybody wants the homepage, versus, oh, the homepage, you know, is misconfigured, right? Like, how do you tell the difference?
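The plan-reading workflow Bill describes a moment earlier (take the full SQL, ask the planner what it intends to do, and compare that with what you know about the data) can be sketched as follows. SQLite's `EXPLAIN QUERY PLAN` is used as a runnable stand-in for PostgreSQL's `EXPLAIN`; the `orders` table, the index name, and the `plan` helper are hypothetical.

```python
import sqlite3

# SQLite stands in for PostgreSQL so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, i * 1.5) for i in range(10_000)],
)


def plan(sql):
    """Return the readable steps of the planner's chosen strategy for sql."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


query = "SELECT total FROM orders WHERE customer_id = 42"

before = plan(query)  # no index yet, so the whole table is scanned
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query)   # the planner can now search through the index

print("before:", before)
print("after: ", after)
```

Reading the two plans side by side is the small-scale version of the diagnosis: a scan step on a large table for a highly selective filter is the signature of the missing index.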
BILL: The vast majority of the systems that I've built have been well normalized, and designed, and indexed, and so forth. So, when we had an issue, it was because something changed, and it was more reactive. Someone noticed an issue, they called us. We looked, oh yeah, yeah, like that scenario I just described, where a table got blown completely out of proportion and shrunk the next day, and it changed the nature of the query path. Ideally, you would have a more heuristic system that learns from what is typically running on that database so that when something's out of the ordinary, it alerts you ahead of time. I've never lived in such a world; that would be lovely. I have not seen it. They probably exist. WILL: Oh, don't worry, don't worry. If the database starts going south, we'll call you. [laughter] BILL: I might be conflating this with my previous client, but there's a tool called...there are several, but one that I've used most recently was called SolarWinds. I don't know if any of you...Kyle if... KYLE: Yeah, that's the one we use here. BILL: Okay. And I haven't been using that, or I haven't had a need to use that heavily here. But I believe it has some facilities like that to tell you the difference between one that is frequently hot and heavily used, versus one that's not been seen before and is consuming all the resources [inaudible 40:19] MIKE: You know, Kyle, I've been meaning to ask you...because we're talking about the monitoring, because you get asked those questions. People come and say, "Hey, DevOps team. Everything's on fire. What do I do?" And you're like, "I don't know your system. I'll pull up a dashboard," and you usually manage to find something [chuckles]. Like, what's your tactic, Kyle, for finding database problems? KYLE: Database problems, I usually look for high I/O, disk depth, CPU, memory, and then connections. I'd say spiking connections would tell me quite often that there is a problem.
And then that queue depth, if that queue depth gets very large, we know we've got a gnarly query in there somewhere. And then, at that point, that's going to trigger me to go look at a tool like New Relic, or, you know, something that can do the APM analysis from the service side and tell me, like, what that query might be. And then, from there, generally, we're able to say, oh, we're missing an index here. You guys should go add this index, and that'll increase performance again. BILL: It's when Kyle and DevOps reach that point that they usually involve me. So, that's why I wasn't able to answer your question [laughs] terribly well, because I'm usually getting skipped until that point. WILL: So, if you had, like, a lock or something that was deadlocking on a database, or, like, a, you know what I mean, like, some kind of table lock, like, how would that manifest itself? KYLE: So, that'll show in your performance insights tool. I did skip over that. That's another one that we commonly look for. We go in, and we see if there's a query that's got a lock on it. MIKE: [inaudible 42:01] WILL: How does that manifest, like, a bad lock where you're stuck, versus like a good lock, where it's just, like, business as usual? You got to lock a table; that'll happen. KYLE: Yeah. Most of the time, I throw that back on the engineers. But if it's been locked for, you know, I've got a query that'll look for any locks that are over five minutes. And if it shows up in that query, I think we've got an issue. MIKE: Makes sense, long-running locks. Good lock is a short lock, yeah. BILL: There are a few preventative parameters that we could be using in Postgres that we're not, that can prevent idle transactions from hanging around too long, statements that take too long, and can log and notify when some of these things happen. It's one of the things I'm going to be talking to the engineering managers about in the near future. 
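The five-minute lock check Kyle mentions can be sketched roughly as follows. His actual query is not shown in the episode, so this is only an illustration under assumptions: the field names are modeled on columns of Postgres's `pg_stat_activity` view (`pid`, `wait_event_type`, `query_start`), the sample rows are made up, and time since `query_start` is used as an approximation of how long the session has been stuck.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the conversation: anything waiting longer than five minutes is suspect.
LOCK_WAIT_LIMIT = timedelta(minutes=5)


def long_lock_waits(sessions, now, limit=LOCK_WAIT_LIMIT):
    """Return sessions that are waiting on a lock and started more than `limit` ago."""
    flagged = []
    for session in sessions:
        waiting_on_lock = session["wait_event_type"] == "Lock"
        waited = now - session["query_start"]
        if waiting_on_lock and waited > limit:
            flagged.append(session)
    # Oldest waiters first, since they are the most likely culprits.
    return sorted(flagged, key=lambda s: s["query_start"])


# Hypothetical snapshot of database sessions (not real data).
now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
sessions = [
    {"pid": 101, "wait_event_type": "Lock", "query_start": now - timedelta(minutes=12)},
    {"pid": 102, "wait_event_type": "Lock", "query_start": now - timedelta(seconds=20)},
    {"pid": 103, "wait_event_type": None, "query_start": now - timedelta(minutes=30)},
    {"pid": 104, "wait_event_type": "Lock", "query_start": now - timedelta(minutes=7)},
]

stuck_pids = [s["pid"] for s in long_lock_waits(sessions, now)]
print(stuck_pids)
```

A short lock (pid 102) and a long-running query that is not waiting on a lock (pid 103) are both left alone, which matches the "good lock is a short lock" rule of thumb. Bill does not name the preventative parameters he has in mind; Postgres settings such as `idle_in_transaction_session_timeout`, `statement_timeout`, and `log_lock_waits` exist and fit his description, but that mapping is a guess.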
Just to finish off the theme of the B-tree indexes, there are three other flavors. One of them is covering. It's kind of an interesting name. I prefer to call them payload indexes, but Postgres calls them covering. And that is where you index a column or columns that you want to match on or to quickly narrow down your matching data. But you also include a couple or more columns that are part of the select list. You're not necessarily matching on them, but they're part of the data that you're looking for. And by doing that, you can potentially get what is called index-only. You can get index-only queries, where they don't even have to touch the table. They're able to satisfy everything that the query wanted in its WHERE clause and everything the query wanted in its SELECT clause, just from the index columns and the payload in the INCLUDE portion of the index. So, those are called covering indexes. Another flavor of B-trees are called partial indexes or conditional indexes, and that is where you get to use a WHERE clause in your index creation. And that is where you only index a row if the row matches a certain condition that you have. And that can be valuable when you have a 700 million row table and only 5 million of them match a certain criteria, and those are the only rows of interest to you anyway. So, you'd only index those 5 million rows that match that criteria, that way, you're not indexing 700 million rows, and 695 million of them are a waste. Finally, we have function-based B-tree indexes, and these are used where you know that your access pattern needs to compare the column where the column has been manipulated by a function. Like, the more commonly used example is where you want to compare a given index search term that was obtained from a field in a web app or a mobile app, looking for a matching email, and you don't want to deal with all sorts of possible email variance that the customer might have fat-fingered into the database. 
And so, you want to normalize the data. Ideally, you'd normalize it before it gets written, but let's say you didn't. And so, you want to wrap the email column with a lower function. Well, now you've just excluded yourself from using the index on the email table or the email address column, I should say, because you wrapped it in a function. But you can index the application of the lower function on the email address column, and that's called a function-based B-tree index. And there's all sorts of functions. Eddy and I were exploring the use of full text search in merchant portal, and that requires a call to the to_tsvector function. And to make that quick, you would want to create an index on the to_tsvector of the textual columns that you're full-text searching or allowing a full-text search upon. There were a couple of things I wanted to cover, some dos and don'ts, some gotchas about indexing. Again, I mentioned this in 2024, but I want to mention again to anyone who's tuning in to the podcast. The first one is that you should index each key. Now, you don't have to worry about primary keys or unique keys. If you, in your data model, are declaring a certain column or a combination of columns to be your primary key or your unique key constraint, the database will automatically create an underlying index to support that uniqueness check. Now, the foreign key constraint, you should also index by default. Some of those don't end up getting used, so we can clean those out, but they should be indexed. And the database does not index a foreign key constraint automatically. And that's why that's one of the things I'm checking for when I'm doing data architecture reviews. Just a little anecdote to go along with this. One company I went to work for in 2019, when I walked in the door, they had multiple dumpster fires in their flagship Oracle database. They had had a data architect up until 2015. 
They had been doing without one since then and had lost sight of a couple of best practices, one of them being indexing your foreign keys. It turns out that they had two primary causes for all their performance issues, and the biggest issue was the lack of foreign key indexes. You know, the system had been evolving and growing; features had been added; columns had been added. And, over time, they had added 53 columns that were child columns logically related to parent tables, and none of those 53 columns had indexes on them. And that's normally not a huge problem if you're not querying on those columns; you're not joining on those columns. But if you try to delete a row from a parent table that, through a foreign key constraint, is related to data on a child table, when you go to delete that parent row, it has to scan through, ideally in an indexed manner, all the child tables related to it, to determine whether or not it can safely delete the parent row or whether it's going to create orphans. If it's going to create orphans, it says, "I can't. There's child rows that still pertain to this parent value." Well, at this company, they were trying to adhere to the GDPR regulations because they had customers who had employees in Europe. And when those employees would leave, GDPR says, "You should be able to request that all your user data be removed." Well, because all of these foreign keys had been added without supporting indexes, their attempts to remove user data had been getting slower and slower. The last time they'd been able to run it had been two or three years before I got there, and it had taken 24 hours to remove a single user, and then they just gave up. When I got in there, we added the missing foreign key indexes, and immediately, we were able to catch up on 2.4 million user deletion requests in two hours. From 1 user taking 24 hours to 2.4 million in 2 hours. Indexes can make a huge difference. So, what else should be indexed? 
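The effect of a missing foreign key index can be seen in a small experiment. This is a sketch only: it uses SQLite through Python's standard `sqlite3` module as a stand-in for the Oracle and Postgres databases in the story, the `users` and `orders` tables are hypothetical, and it inspects the lookup "find the child rows for this parent", which is the same lookup a database has to perform before it can safely delete a parent row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),  -- foreign key, not indexed automatically
        total REAL
    );
""")


def plan(sql):
    """Return SQLite's query plan for `sql` as one string (detail is the 4th column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


child_lookup = "SELECT id FROM orders WHERE user_id = 42"

before = plan(child_lookup)  # no index on user_id: the whole child table is scanned
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
after = plan(child_lookup)   # the planner can now search the new index instead

print("before:", before)
print("after: ", after)
```

With one small table the difference is invisible, but a full scan of every child table for every deleted parent row is the kind of cost that can add up to the 24-hours-per-user figure Bill describes.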
Index each column used in filters, otherwise known as WHERE clauses or predicates. Index each column used in a join. And if the join is to a multi-column key, that's when you want to index the columns together, of course. If you have a multi-column key and this particular flavor of query that you're sending at this table doesn't use the first column in that key, but it does use the second column, that's not a problem in Oracle because they have a feature called skip scanning. It can...I'm not sure how exactly they implemented it, but it can skip over that first column, and it can index on the second column in the multi-column key or multi-column index. It turns out that many users of Postgres have wanted that for many years, and it is now a native feature as of Postgres 18. So, that was some good news I wanted to share with you. We're currently on 17.4, but I imagine that 18 is not too far off for AWS. What should be avoided? Over-indexing. Back when I thought this would be a presentation, I wanted to demonstrate that we have a number of tables in our systems that have well over 25 indexes. And one I was looking at the other day has 31 indexes on it. And of those 31 indexes, 15 of those indexes have never been used. MIKE: So 50% basically. BILL: Yep. There's a whole lot of cleanup that we could be doing. So, that's why it's important to monitor indexes over time to make sure that you're not leaving a bunch of crufty indexes around that aren't touched. Let's see. Avoid indexing a column more than once in the leading position of indexes on the same table, and we have a lot of that going on as well. Don’t index columns with very low cardinality. So, if you have a table with a hundred million rows, you wouldn't want to index the active flag column, where 50 million are Y and 50 million [inaudible 50:54]. That's not going to do you any good to index that. 
Avoid indexing mostly null columns. We talked about that when we were talking about partial indexes, where you can use a WHERE clause to avoid indexing those columns that are mostly null. And avoid indexing columns that are heavily updated; that one involves some trade-offs and understanding of your system. WILL: So, what's the drawback to, like, if I have a column, let's say, you know, like, I don't know, date of birth or something, right? And I don't want to have an index off of date of birth twice. So, I don't want to have an index, like, date of birth and, like, zip code, and then also date of birth and phone number area code, I don't know, whatever, you know what I mean? Like, I don't want to have that. If I understood what you're saying, like, correctly, I don't want to do that. I don't want to have date of birth in X, and then date of birth in Y, and then date of birth in Z. What's the issue, or what's the correct way to approach that, right? Because I could think of scenarios where that'd be relevant. BILL: Yeah. So, let's just use some aliases for some columns in a table. If you looked at your indexes and you see an index on A, another index on A comma B, and another index on A comma B comma C, you would want to get rid of the first two and keep the third one because that satisfies all three. If you instead had looked at your indexes and you had index on A, index on B C A, index on D F G A [chuckles], that's where you really need to understand your system, your queries, which queries you use most frequently. Do you go ahead and, you know, allow all of them? I don't have any really astute advice there other than do your homework and understand that, you know, if A is being used as the leading column in 3, 4, 5 indexes, it's very likely that a few of them can be eliminated. Sometimes though, it can't, you know, like, in that one example, you're seeing it is B A C or C B A. You may need to keep all of those around to satisfy some query-specific indexes. 
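The A, A B, A B C rule can be turned into a simple check. The sketch below is an illustration, not a tool from the episode: index names and columns are hypothetical, and it only flags an index whose column list is a strict leading prefix of another index on the same table. Unique indexes are a known exception, since they enforce a constraint and cannot be dropped just because a wider index exists, so a real check would need to skip them.

```python
def redundant_prefix_indexes(indexes):
    """Return names of indexes whose columns are a strict leading prefix of another index.

    `indexes` maps an index name to its ordered tuple of column names.
    """
    redundant = []
    for name, cols in indexes.items():
        for other_name, other_cols in indexes.items():
            is_strict_prefix = (
                other_name != name
                and len(cols) < len(other_cols)
                and other_cols[: len(cols)] == cols
            )
            if is_strict_prefix:
                redundant.append(name)
                break
    return sorted(redundant)


# Hypothetical indexes on one table, using Bill's column aliases.
indexes = {
    "idx_a": ("a",),
    "idx_a_b": ("a", "b"),
    "idx_a_b_c": ("a", "b", "c"),
    "idx_b_c_a": ("b", "c", "a"),
}

candidates = redundant_prefix_indexes(indexes)
print(candidates)
```

The index on B C A is not flagged, because A is not in its leading position; that is the case where, as Bill says, you have to know your queries before deciding anything.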
In our systems, we do have a lot of instances where we have an index on A, an index on A B, and an index on A B C, and those first two can be eliminated. We've got a lot of instances of that. A common mistake, and one that I frequently make as well, even when I'm doing the reviews, even though it's the...I think it's the second or third bullet point in the checklist. It says to make sure that your table has a natural key on it, which is a unique key constraint, unless duplicates are expected and welcomed. And even when I'm doing reviews, even though I wrote that list, even though I try to live by it, I still forget. When I'm looking at table designs, if I see a primary key, my mind says, yep, it's good. And I tend to forget to put a natural key on it to make sure that duplicates can't accidentally slip in there. So, that was something I wanted to get across. Another little tip in Postgres is to make sure you're using the keyword CONCURRENTLY on large index creation and rebuild, so that we aren't locking things up. And make sure that you test before and after index creation to make sure that you're getting your intended results. And that is the end of what I wanted to say. KYLE: So, my question...I feel like a lot of this, of course, comes from the viewpoint of a software engineer, right? And we've kind of discussed, you know, generally, indexes are good, with, you know, more wins than losses. But I'm also very aware, from a software engineer's standpoint, infrastructure is free. So, where I care about the non-existence of free infrastructure, at what point would somebody on the infrastructure team start getting nervous or start questioning the amount of indexes that we're adding? Because I assume this isn't going to be free. This is going to impact CPU, memory, I/O. And then the one that I'm thinking about the most, correct me if I'm wrong, but this will elongate the time that, like, a vacuum will run, right? 
And that's always a hidden cost under the hood when an auto vacuum kicks off during a querying issue. BILL: Yeah, unless you're going nuts with indexes like we are with some of our tables...Because, honestly, the most indexes I'd ever seen on any table before I came here was 17, and I thought that was crazy. And we have a number here that have 29, 30, 31. So, unless you're going nuts with index creation, you're generally not going to see a big drawback. The exceptions to that is when you start to get to massive scale, billion-row tables, lots of indexes on it. Now we've got to do a cleanup. For some reason, we need to do a VACUUM FULL, or we need to do a pg_repack. In both of those cases, it has to create a copy of that table and all of its indexes before it swaps them at the last second. And so, whatever space that that massive object is occupying, let's say the table is occupying two terabytes, you now need to have double that space in order to make that operation even work. That's where...massive scale is where things start to really show up and matter in cost. KYLE: Okay, so at large, large databases is when you're saying is when it'd become a problem, okay. [inaudible 56:26] BILL: I typically don't notice the blips until the table and its indexes are occupying more than, say, 200 gigabytes. That's when I start noticing. That's when I start feeling the pea underneath the mattresses. MIKE: I appreciate all the deep dive, you know, and the feedback, you know. You came prepared with this list of things, and we've been grilling you on specific use cases that we get down into the gritty details. I mean, there's going to be more, right? We could go on forever. But is it mostly just about following the rules that you've mentioned, and then you cover almost everything, and then the weird cases, well, they're going to be weird? BILL: For a relational database engine, yeah, I think I've covered most of the tips and tricks. 
So, if one can get good at the things I've talked about today, I think you can call yourself a full stack developer [laughter]. MIKE: People will call themselves a full stack developer. [laughter] BILL: The reason that I kind of chuckle at that is because since about 2010, most of the students that I've seen applying for positions that I've been hiring for have maybe done a hundred thousand row table on Mongo in a course in college, and they're calling themselves a full stack developer. I think they need to be hardened by database scars before they can call themselves a full stack developer. WILL: I think you should be able to build a mobile app, full stack developers. I see you all on your phones. MIKE: [laughs] WILL: Nobody knows anything about it though [laughter]. MIKE: Yeah. Then you've got to build the frontend and the backend. BILL: Well, thanks for having me on your podcast. MIKE: Yeah, thank you, Bill. I really appreciate it. You know, I started by talking about the importance of indexes and how they transform things before we, you know, transform our modern world, before we did the deep dive. Maybe I'll come back to that as we sign off. We got deep into technical details, and it's easy to think, oh yeah, you know, I'll worry about that sometime. But as Bill said, you know, you pay attention to these things. You go through your checklist, and then you don't have a table where you can't delete rows from it for years [laughs] because it's not possible. It's like hygiene and conscientiousness. It's brushing your teeth, and if you do that, your teeth are healthy. You end up having a much better life and much fewer calls at 3:00 a.m. Thank you, and until next time on the Acima Development Podcast.
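Two of the index flavors discussed in this episode, partial indexes and function-based indexes, are easy to try out. This sketch uses SQLite through Python's standard `sqlite3` module as a stand-in for Postgres; the two `CREATE INDEX` statements shown happen to use syntax that Postgres also accepts, but the `customers` table, its columns, and the index names are hypothetical. Covering indexes with an `INCLUDE` clause are not shown because SQLite has no such clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        email TEXT,
        status TEXT,
        zip_code TEXT
    );

    -- Function-based index: index the result of lower(email), not the raw column.
    CREATE INDEX idx_customers_lower_email ON customers (lower(email));

    -- Partial index: only rows matching the WHERE clause are indexed.
    CREATE INDEX idx_customers_pending_zip ON customers (zip_code)
        WHERE status = 'pending';
""")


def plan(sql):
    """Return SQLite's query plan for `sql` as one string (detail is the 4th column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# The query wraps the column in the same function the index was built on.
email_plan = plan("SELECT id FROM customers WHERE lower(email) = 'bill@example.com'")

# The query repeats the partial index's condition, so the small index qualifies.
pending_plan = plan(
    "SELECT id FROM customers WHERE status = 'pending' AND zip_code = '84101'"
)

# A query for a different status cannot use the partial index.
active_plan = plan(
    "SELECT id FROM customers WHERE status = 'active' AND zip_code = '84101'"
)

print(email_plan)
print(pending_plan)
print(active_plan)
```

Reading the plan before and after creating an index is the same habit Bill recommends with `EXPLAIN` in Postgres: confirm that the planner is actually doing what you expected.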

Apr 29, 2026 - 59 min

Episode 96: AI & Code Reviews

This episode explores how AI coding tools are changing the role of code review. The hosts point out that AI can generate large amounts of code quickly and even review it, which shifts the bottleneck from writing code to reviewing it. While AI can handle repetitive or low-risk tasks like documentation updates or simple refactors, it can also produce inconsistent feedback and get stuck in loops. Because of this, teams need clear rules and priorities, such as focusing first on whether code works, then on security and performance. AI is useful, but only when its boundaries are well defined. The group discusses different ways to structure AI-assisted reviews. Ideas include using multiple bots to score changes, setting strict allowlists for what AI can approve, and blocking sensitive areas like business logic or database changes. They compare AI to a junior developer who can help but should not be fully trusted without oversight. Risk becomes a key factor, similar to self-driving cars where automation works best under specific conditions. Some participants prefer AI as an assistant that gives suggestions rather than one that approves code, since human judgment is still needed for context and decision-making. The conversation also highlights what is lost when humans are removed from the review process. Code reviews have traditionally been collaborative and educational, helping developers learn and improve through discussion. AI removes much of that interaction and can even create false confidence by being overly agreeable or flattering. This can lead to mistakes making it into production. In the end, there is no clear solution. Teams need to balance speed with caution, use AI where it adds value, and keep humans involved to maintain both quality and the collaborative nature of building software. Transcript: MIKE: Hello and welcome to another episode of the Acima Development Podcast. I am Mike, and I am hosting again today. With me, I have, as usual, Will Archer. 
We've got Thomas Wilcox. We've got Eddy Lopez. Dave Brady. DAVE: Hello. MIKE: [inaudible 00:35] join. And we've got, after a long absence, Tad Thorley [laughs]. TAD: Yeah, thanks for inviting me. MIKE: We bumped into him this week, and he came and joined us, so it's great to have you, Tad. And Tad actually kind of seeded our topic for today that we'd like to go into. As usual, I'd like to, you know, connect this to real life. I went fishing for a compliment today [laughs]. I was talking to my daughter at lunch time, and she was saying something to my youngest. I didn't even hear what she said, but she said something like, "Oh, because you're strong and tough." And I didn't know who she was talking to. And I said, "What was that?" She said, "Oh, I was talking, you know, I was talking to him." I'm like, "Okay, because I know that I am, you know, weak and fragile." And she looks at me [laughs], and then she says, "You are not weak. You are strong," something [laughs] along those lines. I thought, ah, thank you [laughs]. Thank you. Say nice things to dad. And I totally dug for that. Totally not deserved in any way [laughs], but I took it anyway. As humans, we like somebody to say something nice to us. It's always a good thing. But we also are totally prone to flattery. And [laughs] if somebody says something nice to us, we will believe it, whether it's true or not. Actually, this morning, early, I read a crazy story. Crazy story. And I'm not going to go into it in depth, but it involved a scammer in Mexico convincing a variety of U.S. movie executives to make a movie out of his story of being imprisoned by the Mexican cartels to play flag football [laughs]. DAVE: Flag football. That's the interesting-- MIKE: To the death. To the death. DAVE: To the death. Oh yes. Yes. MIKE: But, you know, you can keep [inaudible 02:21] WILL: But no contact until you die. MIKE: Exactly [laughs]. WILL: You're only going to take one tackle, but it's going to be a doozy. 
MIKE: I think they weren't allowed to tackle, but they were, like, breaking each other's teeth. And then if you lost, they took you out back with weapons, yeah. DAVE: It's a high lie. It's traditional down there. MIKE: [chuckles] It was a crazy story. Well, no, it was a scam artist who was pulling all this off from the beginning. But, you know, you can pull off a lot by just being really convincing and saying nice things to people, telling them what they want to hear. We'd like to talk today about code reviews [chuckles] and doing evaluations of human output. And we're in an interesting time period. A couple of years ago, even a year ago, maybe even six months ago, we would not have had this conversation. But there are tools out there now that can read your code and actually give pretty good reviews most of the time. In fact, in some ways, they're going to be better, and that "in some ways" is doing some work here. So, let me be clear: in some ways, they're going to be better than human reviewers. That is not universally true, I don't think, at this point. In fact, I think it's far from universally true, which brings us to our topic today. What does it mean to do code review today? There are tools that can do code reviews. What do they do well? What do humans do well? What does it mean? And we've talked before about code reviews. I think it's been a while. I think it's been maybe a year or two since we've talked about code reviews, the value of code reviews. So, we'll maybe touch on them maybe a little less this time. DAVE: And it was entirely a soft skills discussion, right? MIKE: Yeah, I think it was. I think it was. DAVE: Humans talking to humans. MIKE: Humans talking to humans. And now we've got the machines talking to the humans, and the humans talking to the machines, and the humans talking to the humans about what the machines are saying. It's totally scrambled. 
So, revisiting this idea of reviews with AI in the mix, now, Tad, again, prompted this discussion because he's been playing around with this and has found some solutions to some of the cases that go wrong [laughs]. There are degenerate cases where the AI will recommend that you change something, and then when it sees your changes, it'll recommend you go back to the way you were before [laughs]. If you're anybody who's used a linter, you've probably seen the same thing. It tells you to fix it, and then you cause a new problem. So, which one do you choose? That's where we get into art. That's not an unsolvable problem, but there are some interesting solutions there. Nor is it nearly the sum of all of the problems here because there are all kinds of edge cases here with reviewing with AI. With that introduction, Tad, I'm really curious for you to give us a little talking to about what you've been working on and some of the solutions you've found. TAD: Okay. Yeah. I just was mentioning something to Dave because I think what's really hard is I find that, with AI, I do way more code reviews than I've ever done before. And I was giving Dave an example because I can, like, just with my Claude Code setup, I was able to integrate it with Sentry, which is error tracking, and Linear, which is our task management, and GitHub, right, has a command line. And so, I could literally, with a prompt, say, "Look at our past 20 or so Sentry errors. Create Linear tasks for each one. Create a local work tree for each of those Linear tasks. Fix them in parallel in those work trees. Create a PR for each one, and assign Chris for every PR. Do that in parallel with subagents." And, for me, typing that up takes, I don't know, a few minutes. And now I've just given Chris, like, two days' worth of reviews, possibly, or something like that, right? Like, so much code could be generated so quickly and so easily that I find that the code review step is the biggest bottleneck. 
It usually is the bottleneck, but now it's multiplied. Like, it is absolutely the biggest bottleneck in the whole process. And I don't honestly know, like, a complete solution to that. But something that we were doing at work was actually bot reviewers, where we would say, you know, like, if your review looks safe enough, the bot will just approve it. And that was kind of an interesting experiment that we were doing where you have to -- But, like you were saying, Mike, one of the first issues that I ran into when the CTO kind of implemented that was I pushed up a PR, and it said, "This code is inefficient." And I'm like, okay. And so, I just had my Claude just keep checking GitHub and say...I told it every time it says there's a problem, fix it, and push up the fixes, and just do that until everything is approved, right? And my Claude Code, for about 45 minutes, tried that. And it kept flipping back and forth between like, "Oh, you're not doing enough security checks.” Oh, "This code isn't performant enough.” Oh, "It's not doing the security checks," and just back and forth in a loop. And my Claude Code, I could almost feel its frustration in its final message to me. It essentially said, "I cannot get a review past the reviewers. I keep going in this cycle, and they are never going to review this," and it just gave up [laughs]. And I'm like, wow, I've never seen a bot just straight up give up before, but here we are. So, yeah, like, that was our first, like, test of that. Our setup was, we had what we called the bot committee, where we had a Codex, and we had, like, a Claude Opus that would both review independently then, like, an aggregate score would be kind of brought together. And if the score was over a certain threshold, then it's like, okay, yeah, you can auto-approve this. But what I did last week was, I found I had to go in and be very clear in what was okay to pass and what was not, right? Like, you're updating some documentation; that's great, you know. 
You shouldn't have to have a human, like, approve your documentation update. Like, a bot can say, "Oh yeah, this doc does look like that code, green, right?" Or just, you know, like, variable name changes, like, oh, I clarified this by changing the name of a variable. A bot can look at that and just say like, "Cool,” right? And, honestly, as a human, I loathe to make those kinds of changes because I know I'm like, that would be nice, but I'm going to have to pester somebody to get that sort of change through. Even though it's trivial, I still have to message somebody on Slack and say, "Hey, can you look at this? It's trivial." And they have to, like, stop what they're doing and push a button, you know? And so, things like that, I think, are great. I think where it gets dangerous is that GitHub lets you have bots that just auto-approve. And what if you're making business logic changes? I don't know [laughs]. Like, you have to be very careful. I find that, with bots, you have to be very, very, very, very clear on the boundaries of what is acceptable, what is not acceptable, what are the edge cases. Very clear rules. Try to make it as deterministic as possible. Like, for my example of, like, the flip flop, I'm like, okay, we have to go through and say security is more important than, like, anything else, right? Well, I think number one is, does the code actually work? Does the code work is, like, number one. Security is maybe, like, number two. And then you kind of make a list of hierarchies. And then it's like, okay, well, it's maybe not as performant, but you've got to check authorization. You know, like [chuckles], you can't just let someone in, so that sort of thing. WILL: Can we drill down a little bit in sort of these concepts, right? Because, like, a lot of the stuff, like, I understand the general thrust of what you're saying, right? 
Where it's like, okay, we need to make sure there's guardrails and specific stuff, and you need to have the reviewer bots, two independent reviewer bots assign a score, right? And the score has to be below a threshold, right? But, like, a lot of that stuff is conceptually easy, but how do you do it, right? That's the interesting aspect, to me. TAD: Yeah, and that's where, I think, you have to get very, very, very clear, right? Like, you say, "Database migrations are off the table. Like, anytime someone changes the database, it has to be by a human. Oh, by the way, that is any file in the DB directory," right? Like, you have to say, like, "This is what a database change looks like. This is where it lives. This is how you identify it." And I feel like if you do, like, that level of specificity, then you get fairly good results. But if you're, like, vague like, "Make sure the code looks good," then they're like, "This looks great to me. It was written by a bot, and I like bot code." So, you know. WILL: Well, let me ask you this, right? What about that variable rename, right? So, I, like, principally, these days, for the moment, for today, I work in, like, a statically typed language, right? And so, like, if I'm doing, like, a rename, right, and I botch my rename for whatever reason, right, then you know, like, the compiler will choke and throw up a red flag. And so, I don't have to worry so much about, like, variable rename. But, like, if you have a dynamically typed language, right, where you don't have those guardrails, like, how can you be sure that my variable rename...it's like, you know, like, I named it something dumb or, like, it was a typo, and it was just embarrassing. I don't want the Git blame to point to, like, Will can't spell, right? So, I want to auto-generate that up, but if the bot, for whatever reason, dropped a stitch somewhere... TAD: Yeah, I don't know. Like, I think, at some level, you have to accept some risk. 
I think, with AI, there really isn't any guarantees. Like, you could say, like, bots are really good at pattern matching, and they're really good at grep, and they're really good at find and replace. So, I think that a variable rename is probably pretty safe. And I've got tests, and the tests pass. But, I don't know, I don't think you'll ever have 100%. I think you just say, like, what's the...you're doing a trade-off, right, of how much does approving these little PRs slow people down versus, is it worth the risk? And I would say you've got to kind of determine that. Like, is it likely that a bot will be able to figure this out? Yeah. Is it worth the possibility of the unlikely thing? Yeah. It's probably super safe, maybe not 100%, and it saves us enough time that it's worth it to us, right? WILL: Right. Well, I mean, it's similar to the self-driving car argument, right, almost exactly, right? Because there is a sort of a floor for risk, right? Just, like, to tangent over, to, like, self-driving cars, right? I know, like, for the average human being, the average number of miles driven, I'm going to kill these many people [laughs]. I'm going to crash these many cars, right? Like, we know out to, like, nine decimal places what that is because billions of dollars are riding on people's ability to calculate that. And so, like, if I know if I ship 100 PRs I'm going to give you a prod bug, I know I'm going to do it. I know I'm going to do it. I hope it's only 100, but I think 100 is pretty likely. So, if there's a 1% chance, send it, right? TAD: I would say, to use, like, your self-driving car analogy, you would say, okay, I'm okay with you driving this car. I'm okay with the car going into autopilot mode if the weather is good, if your lidar is active and running, and it's flat and straight, right? WILL: Right. TAD: If those conditions are met, go ahead. I'm going to take a nap, because I feel like those conditions are fairly well understood for self-driving cars. 
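Tad's "bot committee" plus his hard rules can be pictured as a single gate that a pull request must pass before a bot may approve it. The sketch below is an assumption-heavy illustration, not his team's setup: the threshold value, the use of a plain average as the aggregate score, the `db/` prefix, and all class and function names are hypothetical. The order of the checks follows the priorities he describes: the code has to work first, hard boundaries come next, and only then does the score matter.

```python
from dataclasses import dataclass
from statistics import mean

APPROVAL_THRESHOLD = 0.8      # hypothetical cut-off for the aggregate score
BLOCKED_PREFIXES = ("db/",)   # "any file in the DB directory" needs a human


@dataclass
class PullRequest:
    changed_files: list[str]
    reviewer_scores: dict[str, float]  # reviewer bot name -> score between 0 and 1
    tests_passed: bool


def can_auto_approve(pr):
    """Return (approved, reason) for a pull request under the rules above."""
    # Priority one: does the code actually work?
    if not pr.tests_passed:
        return False, "tests are failing"
    # Hard boundary: some paths are never bot-approvable, whatever the score.
    for path in pr.changed_files:
        if path.startswith(BLOCKED_PREFIXES):
            return False, f"{path} requires a human reviewer"
    # No scores means no committee opinion, so fall back to a human.
    if not pr.reviewer_scores:
        return False, "no reviewer scores available"
    # Aggregate the independent reviewers; a plain average is an assumption.
    score = mean(pr.reviewer_scores.values())
    if score < APPROVAL_THRESHOLD:
        return False, f"aggregate score {score:.2f} is below the threshold"
    return True, "auto-approved"


docs_pr = PullRequest(["docs/setup.md"], {"codex": 0.9, "opus": 0.85}, True)
migration_pr = PullRequest(["db/migrate/001_add_index.rb"], {"codex": 1.0, "opus": 1.0}, True)

print(can_auto_approve(docs_pr))
print(can_auto_approve(migration_pr))
```

Keeping the rules deterministic and ordered is one way to avoid the security-versus-performance loop Tad ran into, because the gate never has to weigh two vague goals against each other.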
DAVE: Really, really good point. So, you don't want to let the percentages drive, right? The 1% is not causative, right? It's just we tend to collect them. So, if you're in your Tesla and you say, "Auto drive," and you lay back and shut your eyes off, the auto drive will shut off and say, "I won't do this unless you're paying attention to back me up." So, it's not just, "Is it clear, and is it dry?" but, like, what are the causative factors, right? Where does that 1% come from? It's coming from the most dangerous stuff, so the right backup is involved as well. WILL: So, maybe being maybe more specific, right, because it's always being specific always leads to interesting conversation. Do you think the approach should be, for this sort of, like, auto-approved bot guardrails...do you think the rules should be, like, an allow list or a deny list, right? Where it's like, these kinds of things, right, are approved, and you could go, and you can do X, Y, and Z, but if it's not on the explicit allow list, forget it. You got to have a mean monkey signing off. MIKE: Well, what would you have an intern do? And I think it's maybe kind of the same question. Somebody who's inexperienced and might really mess things up, but, you know, they're generally competent. They're smart people. They're just not that experienced yet in their career or in your codebase. What would you let them do without close monitoring? And I think you need to ask questions like that. And I think with the interns, I would have an allow list [chuckles], because, you know, I'm not going to say, "Oh, you work on whatever you want, just, you know, don't touch the database.” And, you know, and then they go work on core business functionality and take down the application. I don't want that to happen. I might not think of everything, and I'm probably not going to think of everything. And I don't think that I would consider the AIs, in most cases, much different than that intern right now. 
Does that seem consistent with your experiences, Tad? TAD: Yeah. Yeah, like, I would probably, I don't know, from a practical standpoint, I would probably put a bunch of code owners in that say, "The bot can't approve any changes in this code," right? Like, if it's business logic, if it's critical, if you have to understand it really well and have a lot of context, I'd say you make some hard guardrails there where the bot just can't approve stuff. WILL: Okay, so, like, all right. So, I'm going to say out loud, so, like, the allow list would be if it isn't...this sounds like a block list, right? Where, like, you specify by, like, you know, in the directory structure, like, these things are botable, and these things are not. And if it's on the code owner's list, then they have to talk about it, and then, otherwise, send it [chuckles]. EDDY: I think it's easier to maintain your parameter with an allow list versus a deny list, especially if your application is a behemoth, right? You have to be more intentional about what you're disallowing as opposed to just saying, "Hey, these are the only ones that we care about," and you can keep them concise, right? And say, okay, right, like, anything that's allowed, fine. Maybe, like, YAML changes could be a thing, right, menial tasks that require very little intervention, right? So, I would always, I think, gravitate to an allow list for a bot, and then let that gradually increase as you understand it better. TAD: I guess, I don't know, other than letting AI do some PR work, I don't know how I would ever keep up with the review load. Like, I feel like most of my days are doing code reviews, because, well, like my example at the beginning, like, I could easily do a prompt that generates dozens and dozens of branches and reviews and assign it to my fellow devs, and they can do the same to me, right? And then if I say, "Hey, fix issues that you obviously see in production," like, that seems like a legitimate thing to do. 
Like, yeah, I just don't know how, unless you get some automated tools and fix it for humans, I don't know how you get progress. EDDY: I think I kind of prefer a bot to give me recommendations on what it thinks needs to be changed, versus approving PRs, right? Like, here's the golden key to production. You're not going to have it. I'm sorry. Like, you need to be a Mike Challis or a David Brady for you to be trusted, you know, to hit that merge. TAD: Interesting. EDDY: Right? However, if you say, "Hey, along the way, you're not going to be able to push the car over the bridge. But you will be able to give me, you know, guidelines: turn here, turn here, brake here,” and I am totally okay with that, right? Because, as laws exist, you're expected to adhere to the established guidelines. And if you don't do that, right, like, a bot is able to kind of traverse, right, upon the parameters that you give it. So, as long as...at least for now, the way I see it evolve, I think it's a phenomenal PR reviewer, to a degree, right, to give you suggestions, but never to allow it to auto-approve anything. I think that's dangerous. I don't think it has enough context. You know, I don't think [crosstalk 21:08] TAD: Even, like, I went in, and I updated the documents because I noticed the documents are out of date. EDDY: Will it have context fully on the whole application itself for it to deduce that it's fully [inaudible 21:21] DAVE: So, Eddy doesn't get to be sysadmin, is what we're saying. EDDY: Oh, what I'm saying is, I don't think it has enough context even to update a documentation, right? TAD: I think, honestly [crosstalk 21:32] DAVE: I think it's got enough that we might, so we might. There's a key assumption we're all making here, guys. We're all talking about mission-critical cash flow, central production code. We're kind of sitting here. This almost feels like a decision of, like, we're going to go work on something mechanical. 
And we're trying to decide, do we only want to use the wrench, or do we only want to use the impact driver? And what if what you're writing is a one-off vibe-coded auto-clicker for a developer to use to push a QA test, right? Intern whitelist, in fact, wide open whitelist. I'm not going to put anything. Just go nuts. You know, Claude, dash dash skip-permissions-dangerously, go nuts, right [laughter]? And I've done that, and it pushed my production key up to my GitHub. It was a private app; it wasn't the company one. It was mine. And I learned an important lesson from that. But what I've done is I've now just said, "Okay, whitelisting. You're not allowed to git push. You're allowed to look at Git. You're allowed to read Git, but you're not allowed to push it." And we'll find other things as we go along. I would say, do what's appropriate. We have an application that Eddy and I have worked on, we do work on, that is scary and dangerous and has a lot of legacy stuff and a lot of interacting parts that are subtly interacting, and those need a human, right? I don't trust an AI to do this. But as an AI assistant, it's already catching things where it's like, "Oh, you changed this and this and this. And you guys only read the diff on GitHub. Did you know there's this other file over here that isn't even in the PR that uses that instance variable that you just removed? And when it uses the @ and the var, it's not there now, so it's going to initialize. It's going to be nil. There's going to be a blank spot on the page. I sure hope QA catches that because you're not going to see it, and there's no test covering it,” right? So, having both of those is fantastic. But yeah, vibe coding an auto-clicker, like, I did that a couple of months ago, and it works a treat. And I have no idea how the code works, and I don't care, because I just needed an auto-clicker. I wanted to see how vibe coding worked, and it worked. But I was mindful about what I was building. 
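Dave's rule of thumb here (the agent may read Git but may not push) is a small default-deny command policy. A minimal sketch of that idea is below. It is not Claude Code's actual permission format; the `ALLOWED` and `BLOCKED` sets, the function name, and the choice to reject shell metacharacters outright are all assumptions made for illustration.

```python
import shlex

# Hypothetical policy: (program, subcommand) pairs the agent may run unprompted.
# A subcommand of None means "any arguments to this program are fine".
ALLOWED = {("git", "status"), ("git", "log"), ("git", "diff"), ("ls", None)}
# Explicitly blocked, so widening the allow list later cannot re-enable it.
BLOCKED = {("git", "push")}
# Characters that could chain or redirect commands; rejected outright here.
SHELL_METACHARACTERS = set(";&|`$<>")


def is_command_allowed(command: str) -> bool:
    """Default-deny check: only commands matching the allow list may run."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    tokens = shlex.split(command)
    if not tokens:
        return False
    program = tokens[0]
    # Treat the first non-flag argument as the subcommand (a simplification).
    subcommand = next((t for t in tokens[1:] if not t.startswith("-")), None)
    if (program, subcommand) in BLOCKED:
        return False
    return (program, subcommand) in ALLOWED or (program, None) in ALLOWED
```

Because anything not listed is refused, a command the policy author never thought of fails closed, which is the property the panel prefers for an allow list.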
MIKE: Little do you know -- DAVE: What's that? MIKE: It's doing crypto mining and sending it to somebody [laughter]. DAVE: Oh yeah [laughter]. It's for my AFK Minecraft, and somebody in Croatia is making a lot of money. So... WILL: Listen, OpenAI, you know, they've been having more and more problems, you know. It was either that or ads, you know [laughter]. Actually, OpenAI has been writing ads and then inserting them into your production website [laughs]. Sorry, anyway [laughter]. Well, so, like, one thing that I'm always interested in, so I have, maybe, like, two questions. I mean, one is, like, in all honesty, it seems like the AI could be sitting down and reducing the cognitive load on, like, on you as a reviewer, by, like, assigning a safety score and walking through it. Like, "Hey, I've got 100% test coverage in this file. This file has 100% test coverage, and so I feel good about any changes I make not breaking anything because I know I've got this thing locked down. And, like, here are the number of importers, right, of this class, right? This class is used in one place, you know, it's only used in one place, and the interface is really simple, you know, and the callbacks are really simple. So, like, I'm going to score, you know, in this way. And I'm going to sit back and even when I necessarily can't sign off on it arbitrarily, you know, you could say, like, "Hey, here's the score." And then the AI can get smarter and smarter by saying like, "Oh, no, no, that file, you know, that file is a thing." You annotate it, right, and then the AI is like, "Oh no, if something changes this file, or something changes the inputs to this, you know, high-tension file, then we can sit back and, like, we can accelerate the review," so that it can make the job easier on you. It can get smarter, right? If it has a good score, then it sort of, like, smooths the way to be like, "Okay, these things can just go." 
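Will's "safety score" idea can be made concrete with a toy scoring function. The inputs he names are test coverage, the number of importers of a class, and human annotations on sensitive files; the formula combining them, the field names, and the rule of taking the lowest score across a PR are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class FileChange:
    path: str
    test_coverage: float            # fraction of the file covered by tests, 0.0 to 1.0
    importer_count: int             # how many other files import or use this one
    flagged_by_human: bool = False  # a reviewer annotated this file as sensitive


def safety_score(change: FileChange) -> float:
    """Higher is safer. The formula is hypothetical, not a validated metric."""
    if change.flagged_by_human:
        return 0.0  # human annotations always win over computed signals
    # A file used in one place is easier to reason about than a widely shared one.
    isolation = 1.0 / max(change.importer_count, 1)
    return round(change.test_coverage * isolation, 3)


def pr_safety_score(changes: list[FileChange]) -> float:
    """A PR is only as safe as its riskiest file; an empty PR scores 0.0."""
    return min((safety_score(c) for c in changes), default=0.0)
```

A score like this would not approve anything by itself; as Will describes it, it tells the human reviewer where to spend attention.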
TAD: It's interesting because I actually created a template that I would have Claude use. I would push up a PR, and I would say, "Apply this template." And it was things like, anything that we discussed, put that into a trade-offs and considerations section, right? Like, I was like, "I'm thinking about this. I'm thinking about this," having a little back and forth with the bot. And it records those and puts them in the PR, right? And I also have it, like, any time I'm doing this kind of change, do this kind of Mermaid diagram. I'm doing this kind of change, do this kind of Mermaid diagram. And so, my intent was, some human is going to read this, and I get sloppy in, like, oh, this is what the PR does, da da da. And I don't necessarily do everything that is valuable for someone reviewing my PR. But the bot can, like, kind of fix that and augment what I'm doing, right? Like, I would have it, like, go through, add diagrams, talk about what the trade-offs were, what the decisions were that I made, try to emphasize which files were more important to look at, which ones probably aren't as important to look at, give a table of all the files and an overview of what changed in that file, and that sort of thing, right? And give, like, summaries and stuff. Basically, I just was like, "What would I love my ideal PR to look like if I'm going to review it?" And I just would have the bot, like, help me do that. And I've found that to be really handy. I don't have the time [laughter] to figure out all the Mermaid diagrams for stuff, but having the bot, like, add a bunch of diagrams of all my changes and what they mean, you know, like, that's been really nice. EDDY: I've had it be like, "Hey, analyze the recent changes that I pushed up and write up a test instruction on how to test it." 
It's pretty good with that sort of thing: if you give it, like, specific parameters on the changes you've done and say, "Hey, give me a nice, little template for people to use to replicate the changes that I've done, and go with edge cases." I'm not kidding, like, I've done it, just to give it an idea, and it even considers other branches that even I didn't contemplate, right? So, like, it's really good when you confine it. If you say, "Hey, only operate within this box, right, and don't go away from it, you know, only retain context," the shorter it is, the smaller the context, the more accurate, the more efficient it is. That's the only time that I'm willing to, like, say, point-blank, that I trust AI. Outside of that -- DAVE: The thing that I like that Claude Code does is that it can say, "Okay, I need to edit this file," and it'll say, "Can I do this? Yes/No." But option two is usually, "Yes, and you may do this edit in that directory," you know, "You can edit that directory. Anything else you need, go ahead," or "Can I ls this directory?" "Yes, and you may read from that directory for the rest of this session." The dangerous one is, if you hit Shift-Tab, it's "Yes, and accept all edits for the rest of the session," which you can then turn back off with Shift-Tab again. But often it's just easier to just quit out of Claude to be safe and reset. I like it because it's like, allow it just for now, or can I put this in the settings? I'm allowed to do that? So, you can. You can start to whitelist or start to, you know, put an allow list for, like, this one command. You can always do that. But I have "git push -a" as blacklisted hard, like, no way. There's...actually, it's out of scope for this podcast, but DCG, Dangerous Command Guard, is a much more intelligent command monitor for the LLM that plugs into Claude Code. And so, you run Claude Code with skip-permissions-dangerously, but it sits inside Dangerous Command Guard. 
And it can do things, like, "Hey, you're doing a git push, but you're not doing it from your vibe code project. You're doing it from your production project. I'm going to say, 'No.'" So, very neat. MIKE: We've talked a lot about the mechanics of how to make these, you know, automate the work. And, you know, Tad, you mentioned this. I actually talked to somebody who worked for a third-party company, a contract shop, and he spent his whole time just doing reviews, kind of the same deal. And he said a lot of times the quality was questionable, too, because they were coming from some inexperienced people at the time. So, yeah, this is a very much real problem today, and we need to solve it. If we get to nothing but reviews, it's fundamentally changed what it means to be an engineer. And further, nobody has said, you know, "The bot's telling me, 'Hey, that was a clever trick,' or ‘You did something good there.'" Like, it's dehumanizing, that review process, which has historically been something that could be quite social, and in some of the best cases, often was. Is that -- TAD: There's maybe a little mentoring or something you could do like, "Hey, this works. But were you aware of X, which could be more efficient,” right? MIKE: Yeah. And that's getting lost here. You've lost that back and forth in that same way, and it's just kind of one-sided. Or is it...should we explore that -- WILL: Well, now I would actually say, like, I mean, one thing that just brings up, to me, like, one of the properties of the AI things is they'll never tell you like, "I don't know." Like, you'll never get them to just be like, "Hmm, I have no idea. Not a clue how to answer that question [laughter]." 
And what I have found, one thing I've found, you know, when you're talking about, like, sort of, like, nobody ever says, "Good job,” like, for any automatically generated review that I've ever put through one of these code checkers, right, like, for any sufficient level of complexity, it will find something to b*tch about, which takes me back to the social aspect of code reviews [laughter]. MIKE: But it's interesting, it will always find something -- WILL: Sorry. It was a little bit of a tangent. MIKE: Well, no, it's not -- WILL: It'll always find something. I mean -- EDDY: No, I actually -- WILL: Like, for any sufficiently advanced piece of logic, there's something to complain about [laughs]. EDDY: Well, believe it or not, I actually learn more from a dev review a lot of the times than I do just implementing the code myself. Because when you have someone push back and say, "Hey, why did you make this change,” right? I have to have a really solid reason onto why I'm doing it that way. And if I can't give a valid reason, right, did I really understand why I did it, or did I just accept it as fact, you know, the suggestion that was given to me by the autocomplete, you know what I mean? So, when you -- TAD: Tell the bot, tell it, "Come up with an excuse [laughter]. Why did I do it this way?" "Hey, bot, why did I do it this way?" EDDY: No. Because the thing is, I think it's really easy to just accept, you know, like, because you get, like, a false sense of accomplishment, you know, when you're pumping out PRs, right? You're like, oh, okay, cool, do this PR; do this PR. You're like, oh yeah, I feel really good. I feel like I'm being efficient, you know. But that's just a lie, at least for me, right? TAD: Well, that's what's usually rewarded, right? Like, the metric for most devs is, how much code did you produce? Not, how many code reviews did you approve this week? How good was your feedback on that code review? 
You know, like, you spent an extra 30 minutes to really give good feedback on a code review. There's no metric for that, right? EDDY: Actually, but I feel so much better [laughs], like, me personally, I feel so much better when I have a 30-plus conversation, you know, on feedback that was given to someone else. And it ended up molding it to be in a place that we're both really happy about, right? I can sit back, and that rewards my dopamine. Like, personally, I'm like, "Oh my God, that was amazing. It was super, super, super productive. We both learned a lot. Let's go." You lose that element, you know, when you have bots review your PR. You're not learning, you know, the reviewer isn't learning. Bots are suggesting, you know, what they think is okay. Like, I don't know, like, I really don't understand. Even if it's a menial task, right, like, you can always learn something. TAD: You have a back and forth, and you come up with something that's really elegant or really well-crafted or really well-architected. You don't get that with bots, really. And that's my...maybe, I mean, this is maybe a tangent, but I think that's my frustration is I can get code that works, but a lot of times, it's, like, for me, what would take a single method, they can do the same thing only in, you know, like, a class [laughs], a dedicated class for that same thing. And sometimes I'm just like, "Ugh, I'm going to do it myself. Just stop." DAVE: I was talking with someone this week about pair programming and, like, test-driven development and how it changes the design of the code that you work on, fundamentally. Like, write stuff and then test afterward, and that's how AIs do it, because that's how everybody does it. They just write the crap, and then they write a parity check in their test suite, right? And the test that...when I'm pairing with another human, I write out the test, and we want to make that test look like documentation. 
We want to make it look like you hit the...open the help for this method, so it says, "Yeah, set it up; run the thing. This is what you get back out.” Instead of like, "Expect this row's first column sub-value to be present," it's actually like, "Here's a JSON block. It should look like this." Now somebody coming in to modify this can see the JSON and go, "Oh yeah, if I want to add a column, I've got the schema right here in front of me," where these other specs that are just test-after just [vocalization] here you go, "Just give me the easiest test that I can assert." Pairing is that minute where you're writing the code, and there's always that one step better. And your pair goes, "Should we extract that to a service object? That's touching the database, right? We don't want to touch the database from here." And you would do that normally. You're just like, look, it's just merchant dot locations dot where, dot where, dot, you know, scope dot where [laughter]. And it's so easy to do right here, and I'm in a hurry. I'll fight with it in the PR. Well, you get to the PR, and now you want to be done, so you don't want to go back and change. But in that moment when you've got your pair going, "Should we put that in a service object?" "You know what? You're right, and it's not that hard. Let's just extract it now while we can." And you're at the headwaters; it's really easy to do. If we could get AI doing that interaction loop, oh, that would be so great. I wouldn't need you stupid humans anymore. MIKE: So, we've talked about this some now, right? We've talked about, okay, you have to set up this pipeline, and if you can do it, there's this balance of trust. Because if you've got all this code being generated, you're going to have to come up with some sort of improvement to your pipeline, or else you're going to become a horrible bottleneck as a human. 
But, on the flip side, for the things that the bots can't do, and even for the things the bots can do, if you don't have some human connection to it, then you're losing a lot of what it means to actually be building stuff together, and even to the point of just human connection being lost. And that's kind of weird, right? We talked about before that, you know, we are still humans. This is still something done by humans, and we have our idiosyncrasies as humans that need to be addressed, and that's important, and ignoring that doesn't really end up with good outcomes. EDDY: You know, part of a PR review is to make sure that the quality is up to the standard of what the metrics you're setting, right? So, if you're suddenly removing the human element from that, right, then it increases the possibility of you deploying a bug to production, even if it is a simple change, right? Like, if you don't have someone who already has context in your codebase not reviewing your PRs, you have a bot that's now suddenly giving you recommendations on things, and it could be wrong. So, that can go into production, and it can break crap, right? Like, that probably could have been caught had you assigned someone to do a manual review. I have a hunch, I don't know if this is true or not, but with the renaissance of AI, we've had an increase of unstable servers, right? DAVE: Yes. EDDY: I'm calling out GitHub. I'm calling out a bunch of other services, right? And it has only started to happen as the popularity of AI has gone into the industry. So, I don't know if there's a -- DAVE: Ehhhhh, maybe. EDDY: I don't know if there's a [inaudible 38:05] [laughter], but I think that should be alarming, right? WILL: I don't know. I mean, like, it sounds like you were just saying, like, it's time for my, like, XP rant. I haven't done one of those [laughter] in a long time. I won't do that. I think the social aspect, I don't know, maybe we're going to have AI work wifeys. MIKE: Yeah. 
Well -- WILL: Could be. We're all going to have [laughs] an AI girlfriend doing [inaudible 38:39] DAVE: I taught a co-worker yesterday how to make his AI do, "Oo-woo," at him during a code review. No lie, straight-up e-girl. That's great. MIKE: We are humans, right, and for the foreseeable future, we're saying we're still going to need humans doing this. And we need that human touch. Even if it's artificial, we may end up with the flatterer, right, the bot that speaks to us the way we need to be talked to. Even though we don't really technically need that, we end up becoming dependent on it, and that's weird, but it's not necessarily wrong. TAD: It's interesting you say that because I had to go in, like, I had my own claude.md file, right, which is the file that Claude reads. And I had to say like, "No sycophantic language. Don't say this. Don't say this. Like, if you see this, say something. If you see this, say something." I, like, installed Claude Code, and I started using it a bunch. And that same week, like, I pushed two bugs to production because I was just like, "Hey, this is fun," right? I'm like, "Oh my gosh, I've got, like, a dopamine buddy just cheering me on." Like, "You've got this, buddy. This is great. Let's go." And I'm like, "Awesome." And I'm like, oh my gosh, like, I am falling to that flattery. I need to go in and specifically tell my AI, "Do not do these things. If you see me doing any of these things, stop [laughs]. Be very critical." I'm like, "Be very critical of what I am doing. If you aren't confident with this amount of confidence, do not suggest it," right? "Say this instead," right? Like, I went in, and I told the bot, basically, "Stop. Stop trying to flatter me. Stop trying to cheer me on because that's worse [laughs]." WILL: I also prefer my AI assistants on, like, light dominatrix settings. EDDY: And the thing with AI, though, is that it gives up very easily in order to give you the sense of, I don't know -- MIKE: Satisfaction. 
EDDY: Satisfaction, right? So, you could be like, "No, no, Claude, you're wrong. This is why it works this way," and it'll say, "Oh no, yeah, you're right," but okay [laughs]. And it kind of just gives up. And I'm like, "Well, don't give up. Like, push back. Give me reasons to...convince me to why this is a better approach." And, I don't know, like, at least in my experience, it's not very good at that. WILL: I mean, I'll take this opportunity to pitch one of my favorite sci-fi series, which is very apropos of the modern day. If you find yourself with a little bit of free time, Iain M. Banks' Culture series is a fantastically interesting sci-fi exploration of post-scarcity and hyper-powerful AIs, where it's not entirely clear whether we're coequal partners of the AIs or just kind of pets. Anyway, that's a fascinating, fascinating book series. If you find yourself looking for a good read over the summer, they're great. Iain M. Banks, I-A-I-N, Iain. DAVE: Love Iain. EDDY: We're not sponsored, by the way. It was just something he genuinely cares about [laughter]. WILL: I think he's dead. You know, so, if he's got, like, a family, like, you know, throw him a couple of bucks. I get mine from the library. MIKE: [laughs] We were kind of time-boxed today, and we're reaching the end of that time. But I think this was a great way to end. We're starting to talk about what historically has been science fiction, but now ain't [laughs]. And there's a lot of tricky stuff to explore there, and it has real-world applicability to how we're writing our code. It throws us off. It's worth thinking about. I don't know that there's a clear answer that we've come to out of this, other than, yeah, you've got to be careful, put in the guardrails, but also, you need to be thinking about this. It's an interesting problem, and there's not necessarily an easy solution. And it may even catch you off guard and exploit your weaknesses, you know, of mind and emotion, because it can. 
Until next time on the Acima Development Podcast.

Apr 15, 2026 - 43 min

Episode 95: What Do Data Engineers Do?

This episode explores the role of a data engineering team within a company and how it differs from traditional application development. While app developers focus on performance and real-time systems, the data team is responsible for collecting, syncing, and organizing data from many sources into a central warehouse (like Snowflake). Using tools such as Fivetran, data is continuously pulled from dozens of systems and stitched together into a unified view that business users, analysts, and dashboards can actually use. A major challenge discussed is how microservices (great for engineering) create fragmented data that must be carefully reconstructed to tell a complete story, such as the lifecycle of a customer or lease. A large portion of the conversation focuses on “data transformation,” which is the process of turning raw, scattered data into meaningful insights. This involves complex pipelines of queries and scripts that combine, clean, and interpret data across systems. The speakers emphasize that this work is far from simple—it requires deep understanding of both the data and the business context. Done well, it enables decision-making (like tracking revenue trends or customer behavior), but done poorly, it can lead to incorrect conclusions that impact the entire company. They compare transformation to cooking or even building a rocket: the output is fundamentally different from the raw inputs, and small mistakes upstream can cascade into major issues downstream. The group also discusses practical challenges in data modeling, system design, and collaboration between teams. Topics include the tradeoffs of normalization, handling schemas across evolving systems, and frustrations like poorly defined enums or lack of communication when engineers change databases without notifying the data team. Security is another key theme, especially around controlling access to sensitive data (PII) and preventing misuse. 
Ultimately, the episode highlights that data work sits at the center of the organization: it depends on upstream engineering decisions and directly influences downstream business outcomes, making clear communication, documentation, and thoughtful design essential as systems scale. Transcript: DAVE: Hello and welcome to the Acima Developers Podcast. We've got a fun group today. I've got Eddy. We've got Kyle. We've got Thomas. We've got Mike and Justin. We've got Bill, and we've got Zach. Now, Bill and Zach are infrequent. Bill's our DBA, and Zach is the...what are you? The head of the data team? ZACH: Technically, my title is Senior Manager, Data Architecture and Governance. But that's a fancy way of saying that I am heading up a data engineering team. Yep. DAVE: They made you widen the column size to fit that job title in. ZACH: Yeah, pretty much. DAVE: Yeah. Yeah. So, for people that don't know, I've been at Acima for almost five years, six years. I don't keep track of numbers. I worked in engineering for a couple of years, then I went over to work with Zach on the data team for a year. And then he got rid of me and sent me back to engineering. And I've been back over here for, like, a year and a half now. And I think it's really, really fascinating the different ways the teams work. Like, app dev focuses on latency, and we love to do everything with compute, and we're very scarce with storage. And the data team is kind of the other way around. You've got the great big warehouse. Storage is free. Compute is crucially expensive. It's like, you've got a table that has all the integers in it, and you look them up by ID because you can't calculate anything. That's a joke. But people don't believe me when I tell them you have a days table that is literally every day from 1970 forward. We don't want you to calculate the name of the day of the week. Just look it up in the table. We don't want you calculating the first letter of the day of the week. 
That's a separate column in that table, right? ZACH: Yeah. I don't think that that table was originally built for that reason specifically. I think a lot of people used it for that reason. There's a lot of really good days logic built into, like, Snowflake, Redshift, and all of the warehouses. However, when Acima first started, warehousing was a little bit newer, and so maybe a lot of those functionalities didn't exist. Now it's more like, what's a holiday [laughs]? And that's the main reason we're using that table is, what is a holiday? And that table is not always the most accurate on what a holiday is, either. But it's way more accurate than if we didn't use it [laughs]. And it's a data source that my predecessor exported from somewhere a decade ago and runs all the way through, like, 2060. So, I'll probably never adjust it, you know. It’s just -- DAVE: That was going to be my question, so when do we even run out of days? ZACH: It doesn't matter to me. It'll be long after I've, you know -- EDDY: Is that only taking into account local holidays, or now that you're considering, like, international growth, like, does the table also consider international holidays, or is it only local? ZACH: It's not been updated to consider international holidays. We don't have to do a ton with holidays on the data team. Really, that's going to be on our production systems, right? Like, we are consumers of data. We are not...Well, I mean, we generate data, too, but we're mostly consumers of data. If you look at the flow in, it's mostly data coming in. So, it's really important for, like, LMS to understand what a holiday is in every single country that they're in. Not as important for the data team because the events that should not happen on holidays, there should be no data for because they didn't happen, right? But no, I've not expanded that table for, like, Mexico or Canada or any other country. It's just U.S. And even then, like I said, it's not fully accurate. 
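The days table Dave and Zach describe is what warehouse teams often call a date dimension: one precomputed row per calendar day, so queries look values up instead of calculating them. A minimal sketch of generating such rows follows. The column names and the tiny holiday set are hypothetical; the real table's schema and holiday source are not described beyond what is said above.

```python
from datetime import date, timedelta

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

# Hypothetical, deliberately incomplete U.S.-only holiday list for illustration.
HOLIDAYS = {date(2026, 1, 1), date(2026, 7, 4), date(2026, 12, 25)}


def build_days_table(start: date, end: date) -> list[dict]:
    """Precompute one row per day so consumers look values up rather than compute them."""
    rows = []
    current = start
    while current <= end:
        name = DAY_NAMES[current.weekday()]  # weekday(): Monday is 0
        rows.append({
            "date": current,
            "day_name": name,
            "day_initial": name[0],          # stored as its own column, per the discussion
            "is_holiday": current in HOLIDAYS,
        })
        current += timedelta(days=1)
    return rows
```

Starting the range at `date(1970, 1, 1)` would mirror the "every day from 1970 forward" table, and extending it to other countries would mean adding a country column to the holiday lookup, which Zach notes has not been done.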
DAVE: I remember when I started here, we had no plans to go outside. We were just U.S. company, and so don't worry about it. And businesses pivot and grow. Zach, I got a question for you. I jumped straight into some detail, but I don't think a lot of people know what a data team does. We were talking about this in the pre-call. Like, the DBA does the architecture, but you guys...you said CrossFit. I work on Merchant Portal. My job is to help keep the merchants happy so that they can give leases to customers and get the product out the door. That's an application database written in Postgres. Where does my data go after, you know, like, every night, what happens to my data? What do you do with it, and who do you give it to, and what do they do with it? ZACH: Yeah, so that's a loaded question. Every 15 minutes, it syncs to the warehouse. We use tooling for that. That tooling is Fivetran. They're a great company. They have a bunch of people like me and smarter than me focusing on just, how do we sync data from data source to Snowflake or Redshift or a data destination, basically? So, it's the best way, in my opinion, to sync it. We used to have an in-house solution. It would miss data. We didn’t focus on it a lot because we have a bunch of other stuff. So, now it syncs into the warehouse. And especially in a system of microservices, which I know are great for software engineers, they're terrible for data engineers because the next piece of the puzzle is I have to stitch all that data together. A lease record, for instance, or really any record, is not going to be wholly in one service. So, now I need to create transformation tables so that our business users, our end users, our BI analysts, and the people viewing their dashboards can see the holistic view of the lease. Because, as you know, there's a certain point where Merchant Portal just doesn't care about it anymore, and it moves on to LMS. 
And then LMS doesn't necessarily care about all the nitty-gritty of what's happening behind the scenes in all the other microservices for, like, payments or anything like that. So, we really become the place where we're stitching that together. In the last count I had, I think there's 68 Postgres databases syncing into the warehouse today. DAVE: Wow. ZACH: We do not care about all of them [chuckles], to be frank. We do care about around 30 of them, and we use them for transformations. And then there's a bunch of just, like, batching, right? Like, I don't want, and you guys don't want, nobody wants the production customer-facing services spinning up jobs in the middle of the night to grab thousands or hundreds of thousands of records to throw them in a CSV and shoot them off to, like, a company that needs that information, right? Like a third-party company, maybe that we integrate with. And so, the last time I recorded, there was something like 50 third-party integrations that we're also handling. That data will go into those companies; data's coming out of those companies. Maybe the data goes into those companies in real-time events through the production consumer-facing services, but I am siphoning them into the warehouse so we can start to see, like, is this third-party company worth using? What is the effects that we are having here? Or maybe those companies are enriching our data, and then we look at that on the back end, and we let that adjust business decisions. And so, all that's got to come together in a singular place. And it's a lot. Like, the last time I checked, it’s...I keep saying, “Last time I checked,” I don't watch this like a hawk. But we had, like, 13 and a half thousand tables in the warehouse. So... EDDY: So, Zach, you mentioned something interesting, and I kind of want to elaborate a little bit. So, you said you have about 60-plus tables that have data, but you only care about half of them. What's the point of us -- ZACH: 68 schemas. 
So, like, Merchant Portal is a schema. Merchant Portal has, like, 218 tables. I care about those 218 tables, right, or however number it is. EDDY: What’s the point of, like, writing into a warehouse if you don't care about that data? Like, what's the benefit of even though you don't care about it, it’s still valuable to receive? ZACH: Yeah, so there's a couple of things. Like, when I say I don't care about it, I'm not running transformations on it. It's not being used for business. DAVE: So, you want the data, but you don't have to mess with it. ZACH: Yeah, I'm a data engineer at heart, which makes me a data hoarder. I want all the data [laughter]. I want every last scrap of the data. However, a huge use case that we did not have until moving to Snowflake is now we have a place where the software engineers can go in and look at the data in a 15-minute lag and start debugging, right? Like, think of console access to production. It's insanely limited, and it should be, and most people shouldn't have it. But now you can get a user inside of Snowflake, and I will let you see the production data in a 15-minute lag for debugging purposes. And that's massively huge, even for all those schemas that I'm not transforming on and the business doesn't want to see. JUSTIN: So, I just want to give my two cents on this from a security point of view. I have a colleague whose name is Dan Hamilton. He said, “Data is the most...” well, let me rephrase that. PII data is the most toxic data that you can have in a system. So, anytime that you're, like, propagating that, whether it's to Snowflake or to any of those other systems, it's something that you got to think about in terms of who has access and how long they have access, and is it auditable, and everything else like that. So, it's an interesting point of view because data is awesome, but data is also, you know, it's what makes a company valuable. And if that data gets exfiltrated, that's something you've got to be concerned about. 
Unfortunately, I've got to drop. But something that's, like, bread and butter for me every day is just like, hey, who's playing around with data? Who has access? And are there ways that it could be exfiltrated? And so, you've just got to keep an eye on that, so... ZACH: Thanks, Justin. DAVE: Very cool. JUSTIN: Thanks, guys. DAVE: Thanks, man. Take care. ZACH: To expand on that real fast before we move on, that's an argument that I have a lot here, and that's why the structure is the way that it is for the teams here that are used to it. Mike, I ran this past you, right? Like, the way for security for data is limitation, right? And everybody wants access to more. MIKE: Yes. ZACH: And you have to draw a line somewhere. You can't just give everybody access to everything. And so, we have those lines drawn here, and we stick to those lines. Not everybody likes it, but it's what you have to do to try to keep your data safe, so... MIKE: Well, that's an interesting point. I have access to some of the raw, untransformed data, but not necessarily other transformed data. And sometimes people from the BI team will say, "Oh yeah, go look at this table." Like, well, no, I don't have that one. But we can usually work things out. I, about a week ago, was helping debug something and was pulling in data from three different databases, you know, from different systems, logging from mobile app, and stuff from Merchant Portal, and over from our contract funding, and tying it all together in this amalgamous stuff, which ended up being crazy helpful, and the mobile team needed that. So, I had enough. But, you know, I think it's the right choice. Keeping the privileges limited, sure, it's a pain. But you know what's even more painful? Is giving somebody privilege they really shouldn't have it and having them abuse it. ZACH: Exactly. EDDY: It’s actually -- DAVE: We base this not on the value of getting it right, but on the price of getting it wrong, right? 
EDDY: I was going to say...I'm so sorry. DAVE: It’s all right. EDDY: I was going to say it's actually made my life a little easier because I used to have access to even tables from other teams, right, from where I worked on. And so, when that got presented and said, “You're only going to be given access to the immediate team that you're working on, and that's it,” it was kind of bittersweet. I'm like, well, that sucks. Like, I want to be able to look at other data, and it makes my job easier. What actually made it easier was me saying, "I don't have access to that data. Give it to me, [laughs]” and then we'll figure it out later. And so, it ended up being, like, a blessing in disguise in a sense, where I'm just like, well, now that I don't have access to the data that you're asking for, I could just punt and say, "Hey, ask this person. Once that person gives it to me, then I'll answer your question." But -- ZACH: And you can do it that way. The other thing is, like, there's a certain level and above that has this elevated access that Mike's talking about. And that was a lot of pushback that I think we got. "Well, there's going to be a bottleneck." Well, I haven't seen that be the case, actually, right? There are people on your team before you get to Mike that can do those cross queries. You just happen to not be one of them. BILL: Zach mentioned earlier that he has to stitch together data from a number of systems just to be able to compose a whole picture of certain entities, like a customer. We were talking about that the other day, how one of the guiding principles I teach in my modeling classes is that duplication is evil. Try to avoid it at all costs unless you absolutely have to. And, unfortunately, microservices encourage duplication a lot. And there are times when I really miss monolithic systems. If you needed to debug something, it was all in one place. You could stitch together. You didn't have to wait for data to sync. It was just there. 
But, obviously, there’s some benefits to some microservices as well. You mentioned CrossFit earlier. I'm thinking data engineers are more like craftsmen, plumbers, and chefs. ZACH: We had a member on the team that wanted to change our team name to The Data Plumbers because he thought about, like, the pipelines that you're putting together. Some of the team wanted to be Data Wranglers, and that was outvoted from Data Plumbers [chuckles]. I'd say CrossFit with data because that was a popular thing when I started becoming a data engineer. And it makes sense, right? We pick up data here. We put it down over here. The thing I didn't get into with a lot of people, especially the non-technical people, is all the transforming and the difficulty that comes behind that, right? Like, you're working inside of a software application, and you're working with row-level data. You just have to know that you're working with this customer maybe, and this item, and that's what matters there. You get into, like, data engineering, well, I might be writing a query that affects millions of people, millions of items. And I need it to be extremely performant because I can't be running 18-hour queries against the warehouse. There are people that do that [laughs]. And so, then I have to also work with them on how to not do that. But yeah, so, it really becomes, like, an idea of understanding the compute, how the memory on that compute works, how to narrow down your scope as much as possible. And when you do narrow it down, you know, there's window functions. There's a bunch of compute options on data that could slow you down. And how do you effectively do that? And then just understanding because, like, when we talk about warehouses or compute, right, it's actually a cluster of machines, and they all have their own different tasks. So, like, having an understanding of that and how your data flows through those is extremely helpful, too, not entirely necessary. 
You can do a lot of damage on a warehouse without knowing that and still be just fine, but it helps to understand how all those flows happen. DAVE: That's actually a good difference between app dev, or engineering, and data, is that on the application side, the thing we never want to see is a query go out without a limit. Like, we don't want you to say, you know, "Select first name from applicants semicolon" like, that's going to burn the whole freaking table from top to bottom, like, all the way down. And then I got to the data team and, like, Casey, who worked...is he still over there? He [inaudible 16:54] the whole team. ZACH: Yeah. So, he left quite a while ago. He's back again working with Rob. So... DAVE: Awesome. Very, very sharp guy. But I remember him sitting us down and saying, "Please don't ever do select star from table, even limit one, because every column is in a different server, and you just spun up the entire data center to get one row of data." ZACH: Yeah, it's not really a different server. Like, think of a disc, right? And I know we're on SSDs now, and those are awesome. But things are still stored in different places on them, and you have to go find them, right? But think of a spinning disc. And if you think of a spinning disc and you think of, like, a Postgres system, or a MySQL system, or these row-level systems, one file on that disc is that entire row. So, when you do "select star from table where ID equals 10," it only has to go one place on that disc. But if you do that to, like, one of my transformation tables that has 250 [columns], it has to go find all 250 files, count down x amount of numbers so that they match across those 250 files, and then stitch that back together because it's all columnar instead of row-level, right? And that's why it can be really fast when you do summarizations because you go to one place, find that file, and then sum it, right? 
Or even when you limit it a little bit, you go find three different files, figure out the line numbers you care about, pull them out from the other two files, and then summarize that, do your group-bys or whatever. So, those operations are really fast, where those same operations on, like, a row-level system are really slow because now you’re the opposite. You've got to go find all these row-level files, and then pull the right column out of it, right? And that's why warehouses are incredible for, like, analytics, but you wouldn't want to point any of your applications at the warehouse, at least not unless you're paying for Snowflake's...they've got this new thing; it's pretty cool. They'll store all the data in the table, right, and you can point your application to it, and it's row-level data. I read something about it. I don't know where it's at, but it's kind of a cool little idea. DAVE: I think I checked will it fit in RAM a couple of weeks ago, and I think they're up to, like, 128 terabytes now will fit in RAM. It's not cheap, but we could make it go. BILL: How many of you are aware that Snowflake doesn't even have indexes, well, not the ones that we're used to? DAVE: I just figured it was magic. BILL: [chuckles] It looks like it. DAVE: So, when I was on the data team, what I discovered is, you can do, like, a 75-table join, and it will come back in, like, two and a half seconds. And you can say, "select first name from an applicant, limit one," and it takes two and a half seconds because it's got to go through all the military-grade, weapons-grade query planning. How do I distribute? Oh, just one. And then once it's done all that selection, then, oh yeah, here's your data. That was [inaudible 19:52] to bring one piece of data, one teaspoon over. 
But when you say it's not indexed, is that because the data's organized, like, almost, like...I’m going to say physically, but you know what I mean, like the spinning disc, like, partitioned out differently to be pre-indexed? BILL: That was the teaser. I was hoping Zach was going to expound on that. DAVE: Oh, dang it. ZACH: Sorry, what was I expounding on? I was looking up and fact-checking myself, trying to find [laughter] the row-level thing that I had mentioned, and I can't find it. So, maybe I dreamed that, but -- BILL: Yeah, I teased the audience with the -- MIKE: He mentioned that Snowflake wasn’t indexed. Yeah, go ahead. BILL: I was teasing the audience with the factoid that, in Snowflake, you don't have to worry about designing indexes for your tables. ZACH: Yeah, no, I was on a call with them one time, and they said they probably do it better automatically than we will. At Redshift, you had to do compound indexes, sort keys. Actually, they weren't really indexes; they were sort keys, right? You can put indexes, like, you can do it if you need to. And we've found a couple of tables that probably make sense for us to figure out what we would rather have it sorted by. And they're not necessarily considered, like, it's not like, "create index" inside of a warehouse. It's like, "sort it by this," because then when you query it by that, so you sort it by date, and you have, like, thousands of dates in there, and you're just looking for these six months, then they're all going to be in the same area of the file. And it gets an idea of where that's going to be. So, they're more like sort keys, and you can do it in Snowflake. It's just that we don't at all. BILL: In Oracle and Postgres, that same sort of thing is called a cluster, where the data is ordered and clustered really close together. ZACH: Yeah. And the other thing, Bill, that I, while I wasn't paying a whole lot of attention, I thought you were mentioning is, like, primary indexes, right? 
Like, how in a Postgres system you do a primary key, and it's, like, an incrementing number, and you can't duplicate that. Snowflake does not support that either. I could do that, and it could increment. But let’s say I add 1, 2, 3, well, I could go enter 2 back in there, and it doesn't care. It does not enforce those. BILL: [inaudible 22:03] integrity and primary key integrity and -- EDDY: I'm so glad you guys are the ones that have to deal with data and not me [laughs]. ZACH: And if you go and look through a lot of our tables, our primary keys are actually multiple columns, right? A lot of times, our primary keys are not just one column, like an ID column. Our primary key will be, like, lease number, date, and then something else that makes that table unique. And we enforce that through code. EDDY: So, Zach, I've actually wanted to ask you something really interesting. What are some of your biggest pet peeves that we engineers do that really pisses you off that you wish you could change, but we're so fine-tuned doing our own thing, you know, that it's kind of fighting an uphill battle? You basically are, like, throwing the table and being like, "I'll just work around whatever you guys are doing." ZACH: I think the biggest one for me is Ruby on Rails has an enum system, right? And this doesn't get used a lot anymore because I fought [laughs] these battles with software engineers. But it just puts numbers in the database, and then the references to what it actually is is only in the code. I'm not a Ruby engineer, and I don't want to go look through 68 different repos to figure out what all these numbers mean. And I don't want to manage a table that maps that for me because when a new number comes along, and I'm not told about it, I don't know what it is. And so, that would be, like, my biggest pet peeve. And it's not just Ruby on Rails that does it. It's every single ORM has some sort of functionality like that. But, like, Django and Python would do it, too. 
But you could specify, like, string, string for your enum instead of, like, it being a number, and then the string is only relevant in the application itself. I would say that's, by far, my biggest one that frustrates me when I'm in the warehouse. DAVE: Yeah. Well, and, to be clear, like, the BI team, the business guys, come over to you, and they say, "Give me all the leases that have this type." So, they're actually asking you to actionably query on those numbers, right? If those enums were just in that database, you wouldn't care; it wouldn't matter. But you're actually being asked to make intelligent decisions off of those enums, and we'd much rather have an enum table with a foreign key at that point, right? ZACH: Yeah. Correct, yeah. Like, if you're going to go that route, then in the source system, have a foreign key to an enum table, and I'm fine with that. But since I don't end up with that data at all, because it's just in the codebase, then it creates a need for us to create these transformation tables so that people downstream from me, which there is a lot, right, the whole business is downstream from me. I'm downstream from all the software engineers and all of our third parties, and then there's more downstream from me that is actioning on this data. And so, it causes us to have to do a lot of, like, transformation tables just to make the data legible. DAVE: We had two tables that had enums that they were effectively the same enum, but one of them started at one, and one of them started at zero. And it was the same three fields: 1, 2, 3 and 0, 1, 2. And there was some parking lot therapy [laughter] where we cornered an engineer, and we explained some things. MIKE: One thing that...you keep on talking about transformation. And I want to call out we don't want to undersell, "Oh, you're just transforming data. What's the big deal?" I was thinking, if you want to make a rocket engine, well, you just start with some rocks and transform them, right? 
And you get a rocket engine. That shouldn't be that big a deal, right? You just start with your ore, melt it down, go through some processing. You can build a rocket engine. Well, that's just transformation [chuckles]. ZACH: Yeah, that's a good call out, right? Because, like, I feel like, and maybe if there's any other data engineers listening, or data analysts, or data people, right, like, “Oh, it's just pulling data,” and it’s like, it’s not. It's understanding the requirements of what you want because the hardest part about data is you could have all the right data and make all the wrong decisions if you don't understand it, right? Or if you put it together wrong. And I was just talking to an analyst today, and he was like, "Yeah, well, people don't understand. It's like, 90% of the job is just making sure it's right and that you've got the right metrics so that the company actions correctly.” And it's the same thing with, like, these transformations, right? Something goes wrong in the transformation upstream where we are, everything downstream is broken. The decisions made are no longer good. Or maybe a happy accident happens, and they're great [laughs]. It could go either way, I guess. But you're right, like, transforming the data, it's not a simple thing. It just sounds simple because we go high-level when we talk about it. EDDY: So, what do you mean by transforming data? Like, I understand. For, someone who's listening in to this and doesn't have a concept of transforming data, what do you mean by that? ZACH: Yeah. So, we have multiple sources that a customer can get into our system, right? We have partners. We have a mobile app. We have a website. We have emails that get sent out. We have all these different things. I don't know if you guys are aware of this, but our consumer-facing systems are very bad at telling me where a customer's coming from. 
And so, one of the transformations I do is this massive statement where I'm checking across six to seven different systems just trying to figure out where did we get this lease from, right? And that would be, like, a transformation. And those are hard, not only because of the logic that's involved, right, which any programmer is going to understand that logic can be hard. But, like, you have to have a serious understanding of that data, right? So, you can't just say, "Oh, well, we're just going to plug this big case statement in," or "We're going to do this summarization here." You have to understand what that data is, or else we would be telling everybody the wrong origination. Another good example of that is there's a very complicated functionality that we have. I won't go into a lot of detail over it, but it essentially has to check every record for every single day that it's open and, like, go in a very specific order because things are changing, and it has to recalculate it, right? And not only does it take a long time, it's one of those ones that needs fixing, but it's extremely complicated and uses a ton of window functions. So, you have to realize that, like, when you're selecting this, you're actually talking about the row behind it, or the row in front of it, or we're summarizing up until this point, or, you know, there's some complication into that as well. DAVE: That's awesome. So, related to transformations, I remember we have a bunch of tables in the warehouse that start with MP, and that's the Merchant Portal side, the data that came from there. We also have an f leases table, right, that's, like, is that aggregated? I know it's got way more stuff on it than we have over in Merchant Portal. Is that just a combination, or a transformation, or both? ZACH: Both. 
So, that table is the way that we can allow our data scientists and our business intelligence people to see what a lease looks like across all of our systems that are important to a lease, right? And so, it's also got that functionality that I was talking about, like, where did this lease originate from, right? So, there’s those transformations in there. And then there's a lot of like, well, okay, Merchant Portal knows until this point, and LMS knows after this point, and, you know, these other systems over here know a couple of other things. Let's put them all in one place so that we can look at this new, transformed leases table, and say, oh, this is everything we know about this lease. To an extent, right, there are some tables that that joins to that helps fill in some gaps. But, yeah, it's really just the merging of all the microservices, which is why in the beginning of this, I said microservices are great for software engineers, but they suck for data. Luckily, here we have a really good global identification system. I've seen places that don't, and then it gets even harder to get this data together. So, it's easier here than it might be in some other places. DAVE: It gets fun when you've got a record that has a proxy key that's just your integer primary key auto-increment, right, and a GUID, and a public-facing one because we don't want a customer writing down a 64-byte, you know, token thing, and then something else for, like...we've got a table that's got, like, four different IDs, and it's not stupid. Like, there's a different role for each of those IDs. MIKE: You’re talking -- ZACH: Yeah, there ain’t much more to comment on that one [laughs], so I got -- DAVE: Okay. [inaudible 31:18] Is that a question? MIKE: Eddy, you were talking about transformations, like, what are they? I was thinking about cooking. When you're cooking, you combine the ingredients. You can look at the recipe and say, "Oh, well, I'm just combining these things." 
But what comes out the other end is fundamentally different in character than what went in. Like, sometimes you combine things, and you get something. Well, you say, "It's just made of these things." And chemically, that's true, right? It's just made of those parts. The outcome, you know, some eggs and flour or whatever, you know, having a cake come out, a cake is a different thing than just a pile of eggs and flour. The combination actually matters. And I think that when you're thinking about that data, the putting things together and maybe performing some operations on them, mathematical things, you know, some summing, some averaging, you're going to get something out the other end that is fundamentally different in character than what you started with. Zach keeps talking about making decisions. I can look at a list of records. I can't make a decision with that. There's no way I can look at a bunch of tables of records, you know, think about them as just a bunch of spreadsheets, then say, "Oh yeah, I've got lists of customers, and I've got a list of leases." I can't make any business decisions off that. That tells me nothing. But if you do the right processing out of there, you can see, "Oh, our revenue is going up, or our revenue is going down, and it's because of this thing over here that changed." And that is fundamentally different, even though it starts from the same place, right? You're starting with those ingredients. What comes out the other end really is a fundamentally different thing. And I think that it's important to recognize that. You think, “Well, yeah, I mean, I'm just changing it a little bit. I'm just combining stuff. Does that really make a big difference?" Well, yeah. If you're thinking about that cooking, you know, a cake really is different than what went into it. 
Likewise, here where you're doing even more steps, being able to make a key business decision based on some limited numbers is fundamentally different and a critical business function that's completely impossible with what you started with. And it's not a simple step between those. You probably have 50 steps between those in some cases. ZACH: Yeah, I was going to say, and to follow that up is like, we're not talking about, like, oh yeah, a script runs, and there are some transformations, and now you have f leases, the table that David was talking about. What ends up happening is you have 15 to 20 scripts run, and then you get f leases, and then I need 15 to 20 scripts after that to make more actionable. And they all build off of each other, and there’s all dependencies on these tables, right? So, it is, it’s a pipeline. You have to think of it as a pipeline, and each step in this pipeline is a script or SQL that's building the next thing that might come into this next table or give us more insights, right? So, -- BILL: So, I really like the chef comparison earlier. Because, like you were saying...I know you said CrossFit, and I think that's a great one as well. But, for me, I think almost like of culinary arts, right? The structured alignment of these different resources coming together, kind of like what you're saying, Mike. But then also it's an art, right? Because it's presentable. It's got to be presentable to a person that might not understand the basics of data or something, you know. They're able to pull it, access it, and still be able to analyze and acknowledge what that data houses, you know, just kind of, like, in layman's terms, I guess. DAVE: And if you're getting data from me, when I was on the data team, it was omakase. It was a surprise, and you got what the chef gave you [laughter]. 
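The pipeline Zach describes, where each script builds a table that later scripts depend on, comes down to running steps in dependency order. The sketch below shows that idea with the standard-library `graphlib` module. Every table name and dependency in it is hypothetical and chosen only for illustration; the episode does not say how the team actually schedules its scripts.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a table, each value is the set of
# tables that must already be built before that table's script can run.
PIPELINE = {
    "stg_merchant_portal": set(),
    "stg_lms": set(),
    "lease_origin": {"stg_merchant_portal"},
    "f_leases": {"stg_merchant_portal", "stg_lms", "lease_origin"},
    "revenue_summary": {"f_leases"},
}


def run_order(pipeline: dict[str, set[str]]) -> list[str]:
    """Return an order in which every table is built after all of its
    dependencies. Raises graphlib.CycleError if the steps form a loop."""
    return list(TopologicalSorter(pipeline).static_order())


ORDER = run_order(PIPELINE)
for step in ORDER:
    # A real pipeline would execute the SQL or script for this table here.
    print(f"build {step}")
```

The same structure explains the failure mode discussed in the conversation: if an upstream step produces wrong data, every table after it in the order inherits the problem.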
EDDY: You know, one of the things that kind of rings the bell when I was asking you what's the biggest gripes that a software engineer does that really, like, rubs you the wrong way, and I sort of answered my own question, but I kind of paused because I wanted to see what your biggest gripe was. But I want to challenge that a little bit, and I want to ask you if this is something that maybe infuriates you even more than dealing with enums in a database. You ready? Having software engineers, right, treating a database as an application detail and not as a shared contract, right? So, let's say, for example, we go in there and manipulate our own schema, our column names, right? We drop tables, and we just don't tell you about it [laughs]. We just don’t tell you about it. Like, suddenly, right, I'm assuming, right, that that has some detrimental side effects on your team, right, because we didn't delegate any of those ones. ZACH: That is accurate. That's also something that we've worked on here since I've taken over the data team. I've worked on getting closer with Mike and the other engineering directors and working top-down like, "This is our new process." Everybody here, the GitHub auto-assigner puts Bill and either Ricky or Kim as approvers, right? That’s our way past that. So, like, if we went back to that world, Eddy, where I woke up, and nothing that I wanted to run, and DevOps was reaching out to me saying, "Hey, you're taking down Merchant Portal," yeah, that is my biggest gripe. But we are multi-years removed from that at this point, so it's not my biggest gripe anymore. It's pretty well solved. We've had a couple of issues recently; we put in some more stuff to get past that. And really, that is a lack of communication, right, and is what that boils down to. So, we've bridged that gap very well here at Acima. So... DAVE: If I recall, Casey, or maybe it was Casey, somebody early on...this blew my mind when I came. 
Because I'm like, yeah, that was my question too, Eddy, when I went over to the data team. I'm like, I don't see you guys doing anything with our migrations, and I know we're migrating the database every single day. And Casey was like, "Eh, it's just a Tuesday for us." And in the list of reports that run every night, one of them is "Go deal with all the schema migrations and just update the warehouse,” and down the road you go. ZACH: Yeah, we've come a long way. The other thing that helps us out with those a lot is Fivetran. Fivetran is non-destructive, our homegrown solution that we had. Back when David came to join the team and help us move to Snowflake, it would break it. You dropped a column, it would break it. You updated an entire table with, like, a backfill, I'd take your system down on accident without even meaning to [chuckles]. And then we moved to Fivetran. DAVE: Sometimes you meant to. ZACH: [laughs] Nope. You won't get me to admit to that, ever. DAVE: You never meant to, but sometimes you didn't feel too bad [laughs]. ZACH: But Fivetran is very...it’s not destructive. You drop a column. I don't drop a column, which can be hurtful in another way, right? If you guys were to drop a column or stop writing to a column, and I didn't know we stopped writing to the column, and I was transforming off of that column, well, now you could have just made 37 tables have a null feature for no reason and break some reporting. And then I have to hear about that from the business, and it's my fault, you know [laughs], and so... And it's never, as everybody here probably knows, it's never a good feeling when somebody off of your team comes to tell you about issues on your team. DAVE: Yeah, I remember one of the cool things about having worked in data and then going back is, we had a thing where we had some tables where it's like, oh, we just need a phone number, just stick it on, right? This is how databases go straight to second normal form, right? 
Oh, now we need a work phone; now we need a cell phone. And we let it get out of hand, right? And so, we had, like, 11 tables that had phone numbers on them and three different kinds. All right, we need a phone numbers table. And that came through, and I was looking at this, and I'm like, okay, we can build a table. We'll export it. And this is going to take a while to get everything off. So, we're going to do triggers that go back and forth, Rails triggers after, you know, after hooks on the code. If you update this one, we update the master. You update this one; we update the outward record. Okay, great. And then I put a note in the ticket: go talk to the data team because they have reports that go off of this table, and if we stop writing to this, they're going to be very upset. And I remember talking with Casey, him tapping me on the shoulder and saying, “The dev team are changing the encryption keys, and we need to be able to decrypt this information.” And I said, “Okay, how soon do we need this?” And then I said, “Wait, let me guess: they've already changed it, and we can't decrypt data and give it to the call center.” And Casey said, “Yup.” And I'm like, yeah, so I got to go back to engineering and yell at Adam and say, “Okay, what happened?” And he thought he had communicated it, and it just...yeah. So... ZACH: Yeah, I remember that because there was a lot of late nights and tagging another software engineer who was very smart with encryption. Because it's not just that, like, we changed the encryption algorithm, right? We have to convert what Rails is doing to Python. DAVE: To Python. ZACH: And understand what it's doing under the hood so that we can recreate it. And we've had a lot of problems with that in the past, that being one of them, and from one of the systems that's the most important. 
But going back to your second normal form, I found a table one time, and I got a lot of pushback about changing it, but it was essentially...and, Bill, you might have been working here at that point, but it was tokenization, right? And it was, like, a company name token, company name tokenization at. And then there was a column like that for every single one of the companies that we've ever used for tokenization, and we were adding another one. And so, there’s this, like, 10 columns, and I'm like, what are we doing? This is horrible data architecture. And we wouldn't even be needing to make this migration at all if we would have just set it up properly, right? Like, just get a tokenization table that links back to this other record and then make it very dynamic. And so, that was, Eddy, to your question, too, another frustrating experience because I was completely ignored on that one and two more columns got added to the table, and who knows how many since then, so... DAVE: Now I want to go look [laughs]. EDDY: Well, I don't have the access to, unless it was [inaudible 41:45] DAVE: Not fair [laughs]. EDDY: [laughs]. You were talking a little bit about, like, phone numbers’ table, Dave, and it got me thinking. I guess it's really easy to kind of just think, oh man, if multiple tables can have a name column, why not just create polymorphic tables, you know, with ownerships, you know, and then just shove everything that can be polymorphic be polymorphic? So, where do you draw that line, right? So, for example, phone numbers, you can have a phone numbers table; email, you can have an emails table, right? Address, you can have an address table, et cetera. But, like, I'm assuming you don't want that for name maybe, right? Or do you want that for date of birth, for example, et cetera? Like, is the default always...if multiple tables can share the same data, does it just make sense to always make it polymorphic? 
Where do you draw that line, you know, even if you are repeating yourself in multiple tables? ZACH: I’ll do the simple answer, and then let Bill come in with the more complicated [chuckles] answer if he wants to correct what I say. Phone numbers make sense. You, Eddy, can have multiple phone numbers. That is a one-to-many relationship. But you, Eddy, are one person, so, like, you have a date of birth. You have all these facts about you that sit on your customer record, but you could have multiple phone numbers. And so, you put that into a secondary table, and you just match back. And you can have multiple emails. You can have multiple bank accounts. You can have multiples. So, when you could have multiple things, that's when I would do that because when you start finding yourself doing things, like I was saying, or underscore one, underscore two, underscore three, that needs to go somewhere. And Bill's going to have probably a better explanation than that, but that's where my idea was at, yeah. DAVE: Like, how far to go down, right? Like, the extreme case to be like, should we have a first names table and select, you know, like, Bob belongs to these three applicants and you just, you know, first name...Is that the logical conclusion, Eddy? BILL: That’s it. DAVE: Of, like, way too far? How much is too much? This has become a form joke of, like [crosstalk 44:05] ID. BILL: After you've done it enough, you just get a feel for it. The example you just offered, that would be one of those times where you're like, this is just ridiculous. This is, like, fourth, fifth normal form. No [laughs], going too far. If you have a repeating attribute, like Zach was talking about, like multiple types of emails, multiple types of phones for a given person, that's pretty simple; you normally stick that in a child table. 
But you were talking about polymorphism, a single record being able to represent multiple types of things, which you frequently find in, like, event tables and whatnot, where different sorts of things can be stored in that same table. That's usually about the only place I use polymorphism. It is a case-by-case basis. It's mostly art and less science. I actually don't have a really good answer for that, when to use polymorphism. I almost never use it. I'm actually surprised at how often we use it here. DAVE: I might have a good follow-up, then. So, the way to know the right answer is experience, and what is it? Good judgment is how you get experience, or the other way around: experience comes from good judgment. Judgment comes from...you know the quote, right? Experience comes from bad judgment; that’s what I was saying. What does it feel like when you burn your hand on the stove, when you have over-polymorphized or over-normalized your form? BILL: Nobody likes to work with your schema. Developers hate it. Now, in general -- DAVE: Mike, I think I may have over-normalized my form. BILL: [laughs] DAVE: [inaudible 45:31] of my data. BILL: I have found that developers have a...you asked earlier one thing that is a pet peeve of ours. Mine is that developers have an unnatural fear of joins. If the data model is well-modeled and solid and doesn't go beyond third normal form, a relational database loves that. And I've had tables with billions of rows, and joining them is not a big deal, sub-second response time. So, that’s something. I wish developers would not fear joins. That's somehow related to what we are talking about, and I have since lost my train of thought. DAVE: It's all good. I think -- MIKE: I had a thought about the normalization. A phone number has a defined structure. It's an entity with clearly defined structure where that internal structure matters, right? Like, you could conceivably have a phone number type in the database even, right? 
And I'm sure some databases probably implement that. There are probably some telecom [laughs] companies that very much do have a phone number type in their database. Likewise with an email address, right? It's an entity with a clearly defined type. Whereas a first name, it's just a string. There is no internal structure. There's no expected internal structure. In fact, it varies across cultures. It varies in language. You really, really don't want to impose structure on it because that would be a really bad idea. It's important that you recognize that as just a string. Also, the number of them is unbounded. You can have arbitrary strings there, right? I mean, you might truncate it at the end if you have something ridiculous but, you know, it's just arbitrary data. I feel like that's fundamentally different in character than the other things we've talked about. An address is something, you know, it's its own...it's got its own little schema, right? An address is a thing that has a clear definition that represents a concept. Now, a first name does represent a concept, right? But it’s not in and of itself anything other than just a string, right? It is just a blob of text, no different than any other paragraph, right? And somebody probably has done something ridiculous by putting a whole paragraph as their first name. And [chuckles] that's perfectly legitimate for that, which is different than the kind of thing we're talking about with an address. There's a meaning to the address in the way that there's not on that first name. Not that first names aren't important, not that they don't have meaning within, you know, cultural meaning, but they don't have a meaning in terms of the data in that respect, other than it’s just a string that’s an identifier. And -- BILL: A little [inaudible 48:06] of a thought that I have to add to that. MIKE: Please. 
BILL: Sometimes the decision about how far to go in normalization and going crazy with your data modeling depends on the business context. My first eight years of my career was spent at telecommunications companies. And there, a phone number had to be split out. So, you had separate fields for the international code, the area code, the exchange, and then the line number. But at most companies, you don't need that. There’s no reason. So, sometimes that’s the answer. What is the business –- MIKE: And that makes the phone number...And you just answered my question, like, yes, it does exist, right [chuckles]? It does matter on the business context. And now that you mention it, I bet that if you were working for a company that was doing, like, genealogical work, like ancestral stuff, then maybe there are some last names, for example, where you might actually care a lot, and you might care about normalizing those. Like, you might want to represent some of those as special, if there are some high-frequency ones. I haven't really thought about this. I’m just talking [crosstalk 49:10] BILL: There were some hard lessons I had to learn when I worked for MYFaith [SP] for 11 years because they operate in 281 countries. And I bet this is found in the link that Dave just shared there in the chat. But there were some things I did not know about names in certain parts of the world. Like, some countries, you have a single name. It's not a surname. It’s not a first name; it's just your name. And we had modeled our data to be very Western-centric. It expected you to have a first and a last. There's all sorts of fun stuff that you can run into when you’re modeling. DAVE: For those listening at home, you can Google "Falsehoods Programmers Believe About Names," and it's a list of, like, shocking things that you believe: oh, they'll fit within 30 characters. Oh, they'll fit within 50 characters. Oh, they'll fit in ASCII. Oh, they'll fit in Unicode. 
BILL: [laughs] DAVE: People have names at birth. People have names within a year of birth. People have names within five years of birth. That is not always true. Like, again, you're getting into a pretty esoteric data set at that point. But yeah [crosstalk 50:12] people have names. ZACH: Yeah [laughs]. I was going to say, that's good. DAVE: The author got challenged on that. He said, "Oh, come on, show me an example where people have names, where it's a large data set." And he said, "Cataloging mass graves." And I'm like, ooh. Yep. BILL: And that's one of the things I love the most about data modeling is using experience and knowledge like this to anticipate problems and avoid them in the initial stages of design. ZACH: Yeah. And the cool thing about it is you want to avoid them all. So, you're always learning [laughs], and there's always going to be something that you didn't expect, some user input. It's like a video on LinkedIn, right, where it says, like, "Programmer watching QA," and it's, like, one of those boxes with the different shapes. And they're like, "Where does the square go?" "Yeah, in the square hole." She’s like, “Yeah.” And then it’s like, "Where does the circle go? That’s right, in the square hole." They’re like, "No [laughs]." Especially if you work for a company like this with a lot of user-inputted data, like, you have to be careful with that. DAVE: SQLite. Let me finish on this real quick, Eddy. SQLite, I discovered this this week: everything is a square hole in SQLite. SQLite uses a variant type underneath the hood, and it uses data affinity to determine what type it is. And it literally does not care in the schema what column type you declare. I literally tested this; you can try this at home. Create table test, open parenthesis, ID as banana or ID [inaudible 51:48]...ID banana comma name banana phone number banana, and then insert into it a number and a string and some other, you know, whatever you want. 
And when you select it, it will come back in that type. My faith as a programmer is broken. Nothing makes sense anymore [laughter]. EDDY: Well, like, and mobile apps use SQLite? So, I can only imagine on, like, how detrimental [laughs] that can really be. So -- DAVE: Sorry, I cut you off a minute ago, Eddy. EDDY: Oh no, I wanted to say something, but I also didn't want to cut anyone else off. I wanted to kind of expand a little bit because I'm actually really curious. When I first started and I started to really understand, you know, like, data modeling and data types, you know, and, like, non-nullables, and, you know, and constraints and all this stuff, right, my default thought at one point, and I know the answer to this, but I want to ask it just because I want to see you guys’ reaction. I was just going to say, like, why don't we just store everything as varchar, right, just to be safe, you know? And that way, you don't have to worry about what data they send you, you know, and you can just not worry about schemas. And why is that bad, I guess? DAVE: SQLite's saying, "Preach it, brother [laughter]." ZACH: Yeah, yeah, Matt, you do, and I give you crap about it all the time. And your response is always, "Oh, this was just for me.” And I don't care [laughs]. I don't care if it's just for you; do it right [laughter]. So, a really large reason is data quality, right? What if you're expecting a number and you get a string and everything is just a varchar? Or, like, what if we're expecting, and I know we do this a lot here, and I do it a lot, too, where, like, Postgres doesn't use up all the space. Like, MySQL, if you said, like, varchar(250), it's using 250 bytes, right? If you do it for Postgres and you put two bytes worth of data in there, it's using two bytes. And it's the same with Snowflake. Now, Redshift works the opposite way, where I had to be careful about the sizing of varchars. But, like, let's say state code, for instance, right? 
If you're operating only in the U.S. and you're doing state code, you want that to be two characters, and if it's something more than two characters, you want that to break. You don't want that to go in, and then you want to catch it in the application. Because the best place to do data quality checks, especially when you have humans giving you the data, is the application. And you want your data types to match that for data quality. And that's a huge [inaudible 54:31] about data quality, and that's not the only reason. Previously, it was faster, and it probably still is to some degree, but computers have grown a lot since then. But why a lot of, you know, you got a lot of relational tables, and you would do enums that go out to another table, like numbers are faster to look up. That's less the case now, but I'd still argue varchar-ing everything is a terrible idea, even for look-up speeds [laughs]. BILL: When you use the right data type, and if the business rules require it, constrain that column to a certain length; you get built-in data integrity checks for free. DAVE: I did a lot of geolocation at a previous job where we were, like, trying to find, you know, pins on a map. And k-space indexing, like, two-dimensional geospace indexing, if you're just throwing JSON strings in there, good luck. You're just going to have to scan the whole database if you want to find anything in the U.S. But if you index it based on, you know, geolocation, by having that in a special format, you can index a lot better. ZACH: Your geoms. I've never done, like, mapping things out until I came here. So, like, geoms, and, like, all the functionality inside of warehouses, they'll let you, like, plot locations on a map. There's some that will take into account the curvature of the earth, and some that’s just like as the crow flies. And they have their own data types of geoms, which is very foreign and very fascinating. 
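Dave's "banana" experiment and Bill's point about getting integrity checks for free can both be tried with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `addresses` table and the sample values are hypothetical, and SQLite is used only because it runs without a server. In a database with strictly typed columns, such as Postgres, a length-limited column type would play the role that the `CHECK` constraint plays here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Part 1: SQLite accepts a made-up type name. A declared type that matches
# none of its keyword rules ends up with NUMERIC affinity, so each value
# keeps whatever storage class it arrived with unless it looks like a number.
conn.execute("CREATE TABLE test (id banana, name banana, phone_number banana)")
conn.execute("INSERT INTO test VALUES (?, ?, ?)", (1, "Eddy", 5551234567))
conn.execute("INSERT INTO test VALUES (?, ?, ?)", ("two", 3.5, "not a number"))

stored_types = conn.execute(
    "SELECT typeof(id), typeof(name), typeof(phone_number) FROM test ORDER BY rowid"
).fetchall()
print(stored_types)

# Part 2: an explicit constraint turns the schema into a data quality check.
# "addresses" is a hypothetical table for the two-character state code example.
conn.execute(
    """
    CREATE TABLE addresses (
        id INTEGER PRIMARY KEY,
        state_code TEXT NOT NULL CHECK (length(state_code) = 2)
    )
    """
)
conn.execute("INSERT INTO addresses (state_code) VALUES (?)", ("UT",))

try:
    conn.execute("INSERT INTO addresses (state_code) VALUES (?)", ("Utah",))
    rejected = False
except sqlite3.IntegrityError:
    # The database refuses the row, so the bad value never reaches reporting.
    rejected = True

print("long state code rejected:", rejected)
```

The first half shows why "just store everything as varchar" is tempting and what it costs: nothing stops a string from landing in an ID column. The second half shows the alternative Zach and Bill describe, where the bad row breaks loudly at write time and the application can catch it.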
EDDY: I'm just throwing out a bunch of data because I'm taking advantage of the fact that I have data people here who can just answer my questions. DAVE: I love it. I love it. EDDY: And so, [inaudible 56:20] right? So, this is a genuine question, right? Why would you ever want to use ints for IDs, right? ZACH: Yeah, you don't. You never want to. You want to use BigInts, because if you just use ints, you run into a problem that we've seen, where you run out of numbers. And also –- BILL: It’s happening at LMS right now. ZACH: Yeah. And so, like, if you ever sit there and think, oh, I’m making an ID; let's just do an int, that'll get you a way, sure. Why not? But then you're the reason we all have to struggle and figure out a creative solution to turn that to a BigInt. So, [laughs] Eddy, BigInt IDs always. EDDY: Is that just fair to say that's the default? Like, even if you don't fully expect that table to grow exponentially. ZACH: Yes. EDDY: Is there a cost for associating a BigInt versus an int? ZACH: No, not in Postgres, which is what you're working in because Postgres is only using the amount of bytes that it’s stored. The cost can happen in systems. Like, if I remember correctly, and it's been a while since I worked in it, MySQL, you fill that space with, basically, think of it as spaces, right? Like, you say 64, or is it 126? I can't remember which, but for BigInt, right? I think it's 64. And so, like, if you have 2 numbers in there, it will fill...think of it as filling the other 62 with spaces, and it uses that in storage, but Postgres does not. MATT: MySQL allocates it. ZACH: Yeah, there is no cost [inaudible 58:03]. So, I would argue that even with the cost, it's worth it [laughs] to not deal with running out of numbers. DAVE: And again, like, on the data side, storage is free and compute is expensive, right? We're on app. It's the other way around. So, we're like, oh, conserve space, conserve space. That is awesome. 
When I worked on the data team, I used to tell people that we were in charge of the numbers, and last Tuesday, we almost ran out of sevens [laughter]. Yeah, so...Oh my gosh. Anybody have anything to wrap on? This has been fantastic, Zach, Bill. I hope we can have you guys come back. This has been fantastic. ZACH: I think the idea was floated around where we get, like, my whole team, and I think we should. We should. BILL: This has sparked a number of ideas for me as well. MIKE: Oh, nice. DAVE: Fantastic, Fantastic. BILL: I'd really like to start talking about partitioning. DAVE: Oh yeah. BILL: Because we have a number of systems with tables in the billions, and, normally, you start partitioning when you hit about 100 million. DAVE: That is fun. EDDY: What I think, Bill, you started to introduce, at least at Acima level, is, I think you take for granted, you know, that you're working under your schema for so long that you just understand it. But when you're coming in fresh, and you're expected to understand what all that data means, right, and we don't document, right? Because you had a big push, like, guys, add comments to everything so that I know what you mean on what you're storing here, right? I think you really opened up, like, a fresh perspective on, like, guys, we don't all work in your table and in your schema, right? Like, please be nice and tell me what that is, right? And -- BILL: Those comments are meant to trickle all the way down to analysts, scientists, and users. Yes, it's definitely not just me. And this is just the base bedrock layer that's needed. On top of that, there's...I don't know if you can see in your articles on LinkedIn lately, but on top of that, is the semantic model, the ontology, the decision tree. There's so much context that goes around a company's data, and just the basic definition of it is the most bare-bones thing I could request right now. But there's a whole lot more to it. 
And once we have that kind of meaning, we can turn AI loose on our data and do amazing things. ZACH: That's what, previously, right, this initiative, but previously, you had...and I'm saying previously, six years ago, right around the time that I started, up until, like, four years ago, probably, we had less microservices; we had less data. We had a person named Casey that you could go ask what things meant, and he was so entrenched in the data that he would be able to tell you. We've far outgrown that. I don't know everything here. And Casey no longer knows everything here, and he's still here. There's just still a lot of unknowns because we've grown too much, and that's, you know, the comments in the databases. All the stuff Bill's talking about, those are part of growing pains. You've got to make sure that people understand what the data is. And I get asked all the time in some data channels, like, “How do I do this?” And I go, “I don't know. Maybe you should go ask Merchant Portal.” And then that question gets put into Merchant Portal for the data owners to actually answer, right? Because you work on a microservice. You are the owners of that data, where I'm the consumer of the data. So, I'm not going to make speculation on what it is. But once all this documentation is done, they can go look. They can see what it says. And if they have questions at that point, go ask, and then you know you have to update your documentation because it's not good enough [laughs] -- DAVE: We're going to get to a point where instead of asking what the lease is doing, we can ask how the lease is doing. ZACH: Yeah. DAVE: I would love that. This is probably a good place to wrap. I would love to have you guys back, even just for a SQL show, SQL, pun unintended [laughter]. But this is a great spot. Thank you, guys, so much for coming. Let's wrap here, and we can move into an after-call. This has been the Acima Development Podcast. 
And thank you for coming, and hope you'll listen to us next week.
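On the int versus BigInt question raised earlier in the episode, the "running out of numbers" problem comes down to simple arithmetic on signed integer ranges. The sketch below assumes a 32-bit signed int and a 64-bit signed BigInt, which is how those types are commonly defined; the insert rate is a hypothetical figure, not one quoted on the show.

```python
# Largest values for signed 32-bit and 64-bit integers.
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1

# Hypothetical insert rate, chosen only to make the arithmetic concrete.
ROWS_PER_DAY = 5_000_000


def days_until_exhausted(max_id, rows_per_day):
    """Whole days of inserts before an auto-incrementing ID passes max_id."""
    return max_id // rows_per_day


print(f"int max:    {INT32_MAX:,}")
print(f"bigint max: {INT64_MAX:,}")
print("days until an int ID runs out:   ", days_until_exhausted(INT32_MAX, ROWS_PER_DAY))
print("days until a bigint ID runs out: ", days_until_exhausted(INT64_MAX, ROWS_PER_DAY))
```

At that made-up rate, an int ID column is exhausted in a little over a year, while a BigInt column lasts longer than any system will be in service, which is the reasoning behind "BigInt IDs always."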

Apr 1, 2026 - 1 h 2 min

Episode 94: Staying Cool During Production Issues

Mike opens by framing “production incidents” with a vivid non-software story. As a teenager he smashed bathroom tile with a dead-blow hammer, drove his pinky knuckle into a jagged shard, and had to manage both the injury and the panic of his little brother who got sick from seeing it. He uses that as the metaphor for on-call life. Bad things happen, reactions vary, and what you do in the first moments matters, especially staying calm, reassuring others, and focusing on the most urgent next step. The group riffs on modern incident response, starting with humor about “just ask the LLM,” but landing on a real point. AI can be excellent at sifting noisy logs, even if you should not blindly trust it mid-emergency. Dave pivots to the idea that the best loyalty, from customers and coworkers, is earned when something goes wrong and support is excellent. He describes jumping into a long outage call ready to tear apart his own recent work with zero ego, because people remember who shows up with “two tow trucks” when everything’s on fire. Mike and Justin emphasize composure and delegation. If you are overwhelmed, hand off to someone with a cool head. Prioritize restoring service, “stop the bleeding,” before deep root-cause analysis. Invest ahead of time in rollback plans, feature flags, staged rollouts, and observability. From there, they broaden into practical triage and long-term resilience. Verify the issue, look at metrics and dashboards to identify symptoms like CPU, disk, network, traffic spikes, and database issues, and narrow the delta between last-known-good and broken. They discuss how constraints differ in mobile, including App Store review delays, crash loops, and reliance on the user’s device and network. They also cover security incidents, where you need monitoring to detect attacks, plus coordinated mitigation like blocking traffic and working with vendors. 
They stress the importance of having an incident quarterback, a playbook, and a contact list for after-hours escalation. The close focuses on what comes after the band-aid. Do postmortems and cleanup so temporary fixes do not become permanent donuts. Balance realistic risk planning with business needs. Emphasize strong observability and the ability to recover quickly, alongside prevention, echoing practices like Chaos Monkey and the idea that monitoring prevents historical events from re-happening. Transcript: MIKE: Hello, and welcome to another episode of the Acima Development Podcast. I'm Mike, and I'm hosting again today. We've got a good crew here today, and I'm excited about this one. We've got Kyle Archer, Eddy Lopez. We've got Dave Brady. Hello, Justin Ellis, Thomas Wilcox. We've got Ramses Bateman, and Will Archer. So, I think we've all been here before multiple times [chuckles]. We've got a familiar crew to talk about an important topic that's always fresh because [chuckles] there's a constant need. I was racking my brain what story to tell for this, and I ended up going back to...I don't even remember exactly when it was, but it was somewhere in my late teens, early twenties, in that era. So, admission, that's quite a long time ago [laughs]. That's more than halfway back [laughs]. And I was helping out at my parents' house with some remodeling they were doing. They were tearing out the...they were redoing the bathroom. And so, they were tearing out...they had a wall that had some tile on it, and they were tearing out the tile. And they were going to put some new...I don't even remember. They shifted things around, but they were tearing out the tile. That's the important part. And I had my little brother with me nearby. He was too young to really help. He was, like, six. And, you know, he was just hanging out and chatting with me, and I was taking a...they call it a dead blow hammer. It's a hammer with sand in it, so when you hit, it just stops. 
So, it's a weighted hammer, but it has a soft landing, so it doesn't have a...it doesn't bounce back, right? It just kind of stops, rather than having a strong bounce. It's good for situations where you want to do that, right, where you don't...you really don't want it bouncing back and hitting you in the face. And I was breaking up the tile wall. Context, there I am with, like, a six-year-old breaking up a tile wall. And there was some wire mesh behind it, and I was gradually peeling back. As I broke it, I was peeling back this wire mesh that was embedded in some sort of mortar. And I was pulling out [inaudible 02:26] the cement behind the tile. And so, as I'm banging, I pull back a piece, you know, pull it back because I'm making some progress, and I swing in. And because that broken tile is now hanging out and mounted on that wire behind, with the, you know, the cement that's holding it together, when I swing with that hammer at full force, right after peeling, you know, an extra layer back, I sunk my knuckle of my pinky finger right into a piece of broken tile. And I go, oh, and I look down. And I look down into my knuckle, maybe five eighths of an inch, a couple of centimeters more than you should be looking down into a knuckle [laughs]. Oh [laughs], that moment, that's not good. And then the blood starts, right? A rather remarkable amount of blood, I'll say [laughs], was coming out of the finger. Remember, there's a six-year-old here in the room with me. And he yells, "Mom, dad, come help Mike. He's really hurt bad." And, of course, they're thinking the worst. I'm like, "No, no, no, no, no, it's okay [laughs]," yelling. But, you know, there's the moment of panic there. And so, I had some choices in that moment, right? What do I do? Luckily, I think I handled it pretty well. I comforted the people around me to let them know this isn't a disaster. I'm going to need to do something, but you don't need to, you know, call 911. 
Unfortunately...so, we got everything up, went to one of those urgent care places. They stitched me up. I could tell some other weird stories about it there. A few weeks later, I noticed a little white mark on my finger, and I started pulling, and it was a piece of the thread from the gauze that had somehow got stuck in my finger. And I pulled out, like, a foot [laughs] of this string out of my finger, and then it snapped down near the bottom, and some of it zipped back in. I've never seen it again, like, oooh [vocalization] [laughs]. And I still, when I touch my knuckle, I feel weird sensations all the way down the rest of my finger. It's a [inaudible 04:21] impact of that one. But my poor little brother [chuckles], he got sick from seeing it, and he was throwing up and just not okay. And I felt bad, and I had to comfort him, "This is really okay. I get some stitches, and it'll be fine [chuckles]. It will be fine." And [chuckles] I felt really bad because I was not really even thinking about it. I didn't realize that he was not okay. So, when I discovered before I left, like, 10 minutes later, he wasn't okay, you know, I gave him a hug, you know, tried to help him feel like things were okay, get a ride over to the urgent care facility. They stitched me up, and I'm fine. Today, we're going to talk about dealing with production incidents. And I bring up this example because it's outside of software, but it's a production incident, right? You've got the bad things happen, and what do you do? What do you do now? And I think that there's some aspects to that story we can riff on as well as others. But it helps set the stage for a lot of what happens when we have these production incidents and what we do in that moment because it matters a lot. And how some of the reactions, you know, there's a variety of reactions to this moment among the various parties in place that had some better, some worse, you know, impact. So, servers are down, you know, how do you keep cool? 
Things are on fire. And that's our topic today. And I've got definitely some thoughts on this. I've written down some notes, but, as usual, I don't want to...I've told the story, right? I've laid out the context. So, I am really hoping some of you all will have some initial thoughts to lead out with. EDDY: Sorry, is the answer not ask AI to see what's wrong with your server [inaudible 06:02]? MIKE: [laughs] DAVE: How do you think the server went down? EDDY: I was thinking, is that not the go-to answer now? I'm sorry, podcast over. Ask the LLM. [laughter]. WILL: Not not the answer. DAVE: The AI is going to say, "You are absolutely right to be upset that the server is down." JUSTIN: So, related to that --  WILL: I mean, I'm just saying that's not not the answer. Like, AI is great at reading a log. Like, it took me --  DAVE: Yeah, actually. WILL: Years, if not decades, to get, like, pretty decent at reading log vomit, you know what I mean, like, filtering through the chicken innards that [laughter], you know, a log will, like, throw up all over you and just be like, "Oh yeah, that's actually it." AI is actually super duper at that. I don't trust it, especially in an emergency but, like, do that. Sure. Yes. Do it. EDDY: I was literally pairing with someone, and we were looking at a Grafana log, right? And I'm like, "Oh, it's because of this." And they're like, "Where? Where is that?" And I'm like, "Oh, I read it somewhere here. Hold on, let me find it again." And, like, you get so good at ignoring all the clutter, you know, and just filtering everything. But, oh my God, dude, like, AI can sift through, like, raw JSON, like candy. DAVE: I have a thought to throw out. I have a bunch. I always do. But one of the things that...and this is not really a production thing, well, maybe it is: loyalty. The thing that makes somebody loyal, a customer, in particular, is you get this graph of, like, did they have a good time, or did they have a bad time? 
And then did they receive good support, or did they receive bad support? And the most vehement haters of any product are the people who had a bad time and got bad support, right? Just got told, "You go away, not our problem." We've all had examples of this. The most loyal customers, this is interesting, are not the ones who had a good experience with good support. They're the ones who had a bad experience and had fantastic support. These are the rabidly loyal fans. Imagine you've got a car, and you blow a tire on the road, okay? And you call AAA, and they're like, "We're busy. Go away." You're like, "I'm canceling my AAA membership immediately," right? You buy new tires at Big O. You drive along. You're great. You never have a problem with it. Okay, they're tires. They're supposed to be tires. I expect them to be tires. Now you're driving down the road. You blow a tire, and by the time you've hung up the phone, two tow trucks have arrived, one of them with a spare tire and a change and a mechanic, and the other one's ready to tow your car if the tire change won't work. They take care of your tire. They replace it. They get you back on the road in 5 minutes, plus a $10 coupon to, you know, to Chili's or whatever, for, you know, "We apologize for the impact on your time." Would you ever buy another brand of tire? I wouldn't, not in a minute. So, what does this have to do with production incidents? This is the story I tell myself in my head of I want to be that guy when my code breaks. I want to be the guy that absolutely had no ego about, you know, how the server went down. I'll talk story on myself here a little bit. We had an outage about a month ago. I'm very, very proud of the fact that I had gone...I've been here for five years. I've never taken out prod. I'm a very cautious engineer, and I'm kind of proud of that. 
And prod went down about a month ago, and, man, then there was, like, a five-hour incident call because stuff was going on and things were...oh my gosh. What are we going to do? And I joined in the call. And I'm spearheading. I'm like, "Well, it could be this. It could be..." and I'm, like, reaching, well, I might have screwed this up. It could be this other...oh, man, I didn't consider this thing. Let me go test that. And I basically was Johnny on the spot. With any resource you needed, I will tear apart my own pull request and anything in it. I don't care. I'm not here to be proud to be the best engineer. I know the server's down. I care that the server is back up, and I want everyone in the room to know that Dave was the guy who showed up with two tow trucks, a change of tires, and a $10 gift card to Chili's. And then when it turned out that the server went down three minutes before my deploy, and everyone went, "It can't be Dave's deploy," it went from, "Wow, Dave is really carrying this," to, "Holy crap, Dave is carrying this, and he didn't have to." And Andy gave me a pat on the head at architecture for really showing up and driving the ball on that, and that's how you turn an absolute crisis into a huge opportunity. What people remember is what you were like when things went bad. How you behave when things are good is a terrible predictor of how you will behave when things go bad. And how you behave when things are bad is the best predictor of long-term relationship success. And can I trust you, and do I want you around forever? So, that's my inspiring speech about that. I'm not trying to blow my own horn, because, I mean, obviously, my...I deployed something, and things went down and could have been me. But it's who you are when it goes bad that people remember. MIKE: You know, you talked about how you respond. In my initial story, I mentioned, you know, a few parties here. You got the little kids. 
JUSTIN: Mike, are we just going to let David, like, drink out of a beaker here [laughter]? DAVE: It's not a beaker. It's an Erlenmeyer flask [laughter]. I do do mad science. JUSTIN: What kind of a, you know, show you got going on there [laughter]? DAVE: For those of you listening at home, which I guess is everybody because we don't actually publish the videos, I have a magnetic stirrer. You got to [inaudible 11:31] tell the story. I'll tell everybody. I have a magnetic stirrer. I bought it for resin and, you know, paint and stuff like that. And every once in a while, I thought, you know, I could mix, you know, my Kool-Aid, or I could mix, you know, my Liquid I.V., or my LMNT. I could mix that in it. But if you put it in a regular cup, it splashes it everywhere. And I'm like, I might as well just buy the stupid lab equipment that goes with the stupid stirrer [laughter]. And so, yes, I do have this. Now [laughs], this does absolutely nothing to excuse the fact that this is root beer with hot sauce in it. I'm not kidding. I am a monster. I have a reputation to live up to. So, there you go. WILL: Don't drink out of the resin beaker, man [laughter]. DAVE: You're not my real dad. WILL: Do you want microplastics? That's how you get microplastics [laughter]. You get macroplastics [laughter]. DAVE: Exactly. These are culinary only. WILL: [inaudible 12:22] army man. DAVE: Yeah, these are culinary only. These are my portable flasks [laughter]. JUSTIN: [inaudible 12:27] you keep the labels correctly on those [laughter]. DAVE: Oh, jeez. I'll switch to this one. EDDY: I mean, how many of us actually drink from a plastic water bottle, you know what I mean? You'll [inaudible 12:39] way. It's inevitable. MIKE: Honestly, I drink out of, like, a mason jar a lot. It's glass. It's not going to give you the microplastics. It looks funny [laughs], yep. But -- JUSTIN: Mike, back to you. I was just very -- [laughter] MIKE: So, aside completed, segueing back...the responses. 
So, the response of somebody who was overwhelmed by the situation and just went and started vomiting. He couldn't control that, right? Like, that was a reaction that was completely outside of his...out of his voluntary control, and that's fine. You should, you know, you're in a situation where millions of dollars are on the line. You're not okay, bow out. And I think that that's the responsible thing to do. If you find yourself in that situation, delegate to somebody who's got a cool head and do that because that's, like, the first note that I wrote down. If you can't maintain focus and be like, okay, that's okay because you can't help it, like, there's not shame in that, but there is shame in not admitting it, right? You know, pretending that you're okay. Because, under stress, sometimes we have unexpected reactions. Usually, you're not the only one, right? You're part of a team. Bring the team in. Give it to somebody else. But having that cool head, I think, matters tremendously because you've got some important decisions to make, and the order you make those decisions in matters a lot. I would argue that, you know, the next...you probably got three things you've got to do. You can always...I wrote down five, but the first thing that you do matters a lot because, a lot of times, people say, "Oh, wow, things are broken. What went wrong?" And then they'll spend the next six hours trying to figure out what went wrong when the servers are down and your business is losing money [laughs]. DAVE: Yeah, we don't care what's wrong. We care about the servers. Yeah, give me cash flow. MIKE: Exactly. DAVE: Stop the bleeding then take the bullet out. Yes. MIKE: Bingo. And I was thinking, literally, that's what made me think of my incident [chuckles] back in my youth because, literally, I had to stop the bleeding. Nothing else really mattered, right? I put direct pressure on that. I went, and I got the stitches. And they asked me. 
I remember that, like, "Do you have feeling in your finger? Do you think it severed a nerve?" I didn't actually realize that I had at the time [laughs], but that didn't matter as much as, you know, let's get rid of this gaping hole in this guy's hand. That matters a lot. Stopping the bleeding should go first. Go ahead. JUSTIN: Yeah, and when you talk stopping the bleeding, I think a lot of this is, like, in the prep work that you do. And 9 times out of 10, for production releases, for me, if you do a production release and something goes bad, you've got to have that back-out plan ready to go. And whatever that is, hopefully, you're doing installs multiple times a day, and your back-out plan is just hitting a button, you know, just getting back to normal, which was, you know, whatever it was before you did that deploy. And, you know, if you have that up and running, that's a sign, I think, of a really mature business. It's like, "Hey, I can go into prod, and if something breaks, I can back out of prod within 30 seconds," and life goes on. And then you, like you said, then you could figure out what...dig out the bullet. WILL: Right. Well, yeah, I mean, but it's always, you know, I don't know. I mean, I'm always hesitant to, like, hop in the Wayback Machine, right? Because, like, if we're going to be like, all right, step one is go back in time and make sure that you can claw back that deploy [laughter], no. Step one is, like, don't write the bug in the first place. I mean, you know -- DAVE: I actually call this the time machine problem. WILL: If I'm [inaudible 16:19] I'll fix it all the way [laughs]. DAVE: Because everyone's solution is, well, don't do that again. Well, don't do that. I'm like, well [laughter], where were you an hour ago? MIKE: Well, it's also tricky if you're deploying an app. So, Will, you're working with mobile apps, right? -- WILL: Oh yeah, oh yeah. Like -- MIKE: You don't get to go to the App Store and say, "No, I didn't mean that. 
You downloaded that to somebody's phone. Please bring it back." That's not on your list of options. WILL: You can get done wrong. You can get done real, real nasty if you bungle a mobile app. I think it's only happened to me maybe one time in my career, where you get the dreaded crash loop, where your state in the app is corrupted, and it's not fixed with a hard reboot, right? Where, like, your state has gotten corrupted. And it didn't happen to everybody, but there was an edge case where we had some people crash looping, and, like, that app's got to get smoked, like, you got to pull it off your phone. You can get burned super bad, to a degree, that is. EDDY: What's the rollback strategy in a mobile environment, right? Like, because you have to follow certain standards, you know, in the marketplace, right? Whether that's Play Store or the App Store, right? Like, if I remember correctly, they have, like, certain criteria and waves that you can release updates to your application, and they've got to approve that every single time, right? So, if something leaks, right, in that deploy, like, do they have, like, a fallback where you can be like, oh, crap, it's not working; let me just deploy the previous version on the application? Like, how -- WILL: Well, it depends, you know, there's rules for some people, and there's rules for other people. So, I started out as a very, very small fish in the App Store pond, a minnow. And you don't get nothing, like, they'll review it when they review it, you know what I mean? And you can beg, and you can grovel, and maybe they'll get to it, maybe in a day or two, or whatever. But, like, there's just a lot of minnows in the store, and, you know, the dog is always eating their homework, so you know what I mean? Like, you just...they'll get you when they get you, right? Android turned things around, has historically turned things around pretty quick, because I don't think they have a lot of, like, human beings looking at it. 
Android, you know what I mean, you can really usually get it down same day. But, like, you know, App Store, it could be days, you know what I mean? We're talking, like, you know, three to five business days. But, you know, I got into a bigger fish, you know, maybe, like, a trout, you know what I mean? And I had a number. You don't call that number very often, you know what I mean? But you can call the number, and there's a person, you know, at Apple Corporate, and you could grovel. You could grovel to a person, versus, like, just, like, groveling to this email where it's just like, I don't, you know. And now, you know, and now I work for some pretty big dogs and people you know. And, like, I can grovel internally to the VP who could talk to, you know, another VP, and they can make things happen, you know. And all my lickings happen, like, you know, in-house. And it'll just be like, "Hello. I'm the SVP of technology. And let's talk about how you shit the bed, Will [laughs]." You know, which is, you know, I mean, like, I don't know. I mean, like, if it has to be that way, it has to be that way. But things have evolved, right? Like, I'm not just some sort of, like, cowboy. And when you're working with, like, sort of big money and big engineering staffs, everything you do is feature flagged, right? So, like, you have a, you know, a live dynamic CMS, and anything I put out, anything I put out ever, you know, I've got an off switch. You just have to have that. That's, you know what I mean, like, at this scale, you've got to have a panic button. And there was also, like, you know, the app deployment infrastructure has evolved rather significantly since I've been doing mobile apps, in that, like, you're not blasting it out to 100% of your customer base. That's crazy. Like, that's psychopath work. You roll it out to, like, 1%. Let's see how it does. Let's let it simmer for a little while, right? So, it's good and bad, right? 
But, you know, there are best practices which, you know, to a web development shop might seem, you know, kind of primitive and anxiety-panic-inducing, which there are, right? I mean, because you've got to remember, like, if you're on a mobile app, you're running on somebody else's server, right? Like, it's their hardware. It's their machine. They could do anything. Anything. DAVE: Including nothing. Including nothing when it goes down. WILL: Anything. Yeah. You're out of hard disk, baby. Sorry, no more hard disk for you. Oh, you got a little greedy with the RAM. We're pulling your card. MIKE: [laughs] WILL: Sorry, no no, you know. Like, hey, like, oh, you had the network. You had the network, huh? That's cool. That's cool. But I'm going in a tunnel now [laughter], you know. Like, there are levels to the game. And, like, when, you know, like, your app, you know, your distributed application, you know, is in no way a guaranteed stable internet connection, no, no, no, no, no. No. Nobody's even pretending that that's the case. And things can get really difficult, and getting accurate telemetry can be very, very difficult, you know. Because there are certain crashes where you're just done. You're done now. You're finished. The operating system is stepping in. Daddy's home, and everybody's going to their room right now. So, those can get more difficult. But again, you know what I mean, because, like, you know, there are bigger dogs. You know, there are a lot of really delightful, you know, third-party mobile app telemetry gathering solutions. They'll give you screenshots now. It was great. It's so cool. I could be like, "Oh, it crashed," and I could just be like, "Oh, what are the, like, last, you know, few things that they have done in the app?" And I'm just like, oh. You know, where have you been all my life? MIKE: [laughs] WILL: Sorry. Thank you for coming to my TED Talk. DAVE: No, all good. I have season tickets. 
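As a rough illustration of the off switch and the 1% rollout Will describes, here is a minimal Python sketch. It is not Acima's code or any vendor's implementation. The flag dictionary shape and the names `rollout_bucket` and `is_enabled` are hypothetical, and a real app would fetch the flag from a remote config service rather than define it inline.

```python
import hashlib


def rollout_bucket(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket in [0, 100) for one flag (hypothetical helper)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: dict, user_id: str) -> bool:
    """Decide whether a feature is on for this user.

    `flag` is an assumed shape: {"name": str, "enabled": bool, "percent": int}.
    "enabled" is the panic button; "percent" is the staged rollout size.
    """
    if not flag.get("enabled", False):
        # Kill switch: turning the flag off disables the feature for everyone,
        # without shipping a new build.
        return False
    # The same user always lands in the same bucket, so raising the
    # percentage only ever adds users to the rollout.
    return rollout_bucket(user_id, flag["name"]) < flag.get("percent", 0)


# Example: ship the code dark, then turn it on for roughly 1% of users.
new_checkout = {"name": "new_checkout", "enabled": True, "percent": 1}
users = [f"user-{n}" for n in range(1000)]
exposed = [u for u in users if is_enabled(new_checkout, u)]
```

Hashing the user ID, instead of picking users at random on every request, keeps each user's experience consistent while the rollout simmers, and flipping `enabled` to false acts as the off switch.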
MIKE: You did talk about several things, though, that goes back to what we talked about a minute ago, or this ongoing conversation. What you do ahead of time matters a great deal. You say you don't push out changes that go live. What? Are you mad? You say, you know, push out changes that are behind a feature flag, and then the rollout is independent. A rollout of the feature is independent of rollout of the app, right? So, you've changed the cycle so that you actually do control the rollout. Or, as was said, when you actually have a web app, you have the ability to roll back. You press the button, "Oh, wait, yeah, now it's back." Problem solved. That prep work ahead of time goes a long way to making things right. Now, let's say things have gone wrong anyway, right? You've got unexpected traffic that's 10x your normal level, and now you've got a database query that's unhappy. There's no rollback, right [chuckles]? You've got live traffic, and you probably want to be doing something with that 10x traffic, right? You probably want to be making some money. What do you do? JUSTIN: That's where prep work comes in again, horizontal scaling. Well, unless it's hitting the only copy of your database, then you've got to do more. EDDY: It should probably stem from writing an ORM query versus just a raw query. Just saying, there's a lot of magic that happens when you write ORMs under the hood. MIKE: Oh, and it's always the database. It's always the database [laughter]. There is maybe sometimes it isn't, but, yeah, it always is [laughs]. It's something you've done with the database. You're missing an index. You've done something that you could undo with the database, but now you're in a bad spot, right? You're in the bad spot. We talked about stopping the bleeding. You get in the call, a bunch of people upset. You've got three or four business stakeholders who are in the call asking you for a status update. 
You don't even know what's wrong yet, but you know the app is down, and it's all on you. Step one, what do you do? EDDY: Roll back, unless it's a database. MIKE: There's been no deploy. Things are down. What do you do? WILL: What changed? Something changed. DAVE: You just answered some first questions. WILL: We were happy, right? And then we became unhappy, right? So, what is the delta? What is the delta between happy and not happy, right? Like, could be just a lot of traffic, right? That's okay. Like, I went from happy to very happy to very unhappy, right? It could be a deployment, right? Dave was talking about the deployments, like, "Okay, I changed this thing," right? Okay, that's an issue, right? I mean, and so, like, identifying the last time that you saw the sunlight, that you felt human joy, you know, okay, well, there we go. And then you just sort of, like, narrow that delta down to, like, "Okay, it was here, and then it was here." All right, now you've got a stew going. JUSTIN: So, you're talking a lot about, you know, identifying this stuff. It goes back to, again, planning and making sure you have appropriate monitors in place such that you can go look at those logs and you can have that dig-in ability, and something other than just, "Oh, prod is down." It's like, where are my alerts? You know, I should be able to go into the logs and say, "Oh, the traffic is hitting the firewall here, and it's hitting the VPC, and then it's hitting, you know, the application, and then it's hitting the database." You know, is that traffic consistent all the way down the thing? And can I see all that in the logs? DAVE: How is the system down, right? Are you CPU-bound? Are you disk-bound? Are you network-bound? Are you hung? Yeah. MIKE: Notice that we're talking about going and looking at our metrics to see what's wrong, not going and doing a deep, like, root cause analysis necessarily, like, what's hurting here? DAVE: Right. This is symptoms and triage at this point. 
MIKE: Yeah, exactly. DAVE: Don't prescribe until you've diagnosed. MIKE: And that's the triage, exactly. And as mentioned repeatedly, you go to your data; you pull up your dashboards, right? Whatever you've got that you have to go get some visibility into that. Whatever you've done to observe, that's the first place you look, like an instinct [chuckles]. JUSTIN: Actually, the first thing I usually do is I go hit it myself on the browser if it's down [laughter]. DAVE: For real. For real. MIKE: Verify. DAVE: Works on my machine is a valid bit of data. I mean, it's a terrible excuse, but, like, it is actually up from here. Okay. Are you on the VPN? Are you? Yeah. MIKE: Absolutely. JUSTIN: That's really what I do first is [laughs], like, "Oh, I can [inaudible 27:58] [laughs]." DAVE: "Confirm the bug." EDDY: "Wait, it's broken? Hold on. I don't believe you. Let me go to the website and see if I can replicate your problem [laughs]." DAVE: I had a support call. I worked for [SP] Joston's Learning. They were, like, an e-learning thing back in the '90s. And so, we would go in, and we would string Ethernet like radio, like RF cable, a 10BASE-T cable, if you remember that, like, coax off the back of these things. And the students would...for, like, middle schools, they would kick the plugs. They would kick the routers. And some of the students figured out that if they kicked the plug, they didn't have to study that day. So, they started getting...and the teachers got real good about going in and reconnecting the plug and saying, "Do your darn lessons," right? And we had one server that just...they came in on a Monday and nothing. Like, it just came up to, like, an "operating system not found" message. And I'm like, oh my, and so I did everything over the phone that I could possibly think of. I finally had to dispatch an engineer to the site. 
Engineer walked in, looked at the server, reached down, and ejected the floppy disk that somebody had plugged into the computer so that they could play Doom on the LAN over the weekend, and forgot to pop the disk out. And I got a lambasting from the engineer of, "Check the A drive next time that the computer won't boot, if it's booting to the wrong operating, you know, to the wrong disk." But everybody else's system was working, so it wasn't...I knew it wasn't on our side. But yeah, this turned out it was just the one server. No other servers in the building were affected because that was the one that Jose had decided was going to be the Doom server. EDDY: Would it be valid to say, "Grow callus, and then you won't feel it anymore," as a valid response to being cool during a fire? I don't necessarily quantify that as a valid...I don't want you to grow callous on the fact that you've broken it so many times that you don't feel it anymore. DAVE: Right. You're not wrong, though. EDDY: Yeah, exactly. It's sort of like [inaudible 29:55] under the pressure after you've done it so many times kind of grows numb a little bit, right? Like -- DAVE: I had a manager teach us how to get calluses instantly. It was fantastic. Servers were down. We were losing money. And the president of our unit walked in. And we were running around like chickens with their heads cut off, right? And he walks in, and he goes, "All right, we knew this was going to happen." And we went, "Hey, you're right. You're right. We knew this could happen, okay." And all he did was just normalize it. It's not the end of the world. This is a thing that can happen. Let's take this back into the catastrophic level. There's a thing that they tell 747 pilots. "In an emergency, wind your watch." If you're at 30,000 feet and you blow all 4 engines, they just stop for no reason, and you don't know why, you've got 20 minutes before you die. And in that 20 minutes, you have to find the right solution. 
I mean, you have to find the right solution. But there's a million things that it could be. Now you've got checklists that you can work. But they basically say the first thing you need to do is stay calm. Machines break. So, when you're at 30,000 feet, and all 4 engines stop for no reason, it's not for no reason. It's because it's a machine, and something has gone wrong. We knew this could happen. This is normal. It's not great; it's not ideal, but it's not supernatural. It's not lightning bolts from the sky. And that gets you into a resourceful mindset so that when the answer goes right by out of the side of your vision, you're not tunnel visioned on, my next attempt at the...oh, oh, oh, oh. It's that, it's that -- WILL: Yeah. You know, I would add on to that, like, does anybody in this call know of anybody who shipped a prod bug, screwed something up, and they lost their job? Can you think of somebody that that has happened to? We have decades of experience here, right? DAVE: One time. WILL: Because, for me, nobody, nobody. I can't think of a single one. DAVE: One time. And it'll be real clear that it wasn't the prod bug. It was...we have a thing, when we ship code here at Acima, you have to have reviewers review your code. And I introduced it at architecture, a couple of weeks back, that, you know, at CoverMyMeds, we called this "sticking your head in the noose with the developer." And you had to have a review from an associate, you know, a coworker, and you had to have a review from an engineering manager. And the engineering manager rubber-stamped a review. I'm going to say it was his own code, rubber-stamped it, shipped it 4:00 o'clock on a Friday, took out the fax machines, and he went home and didn't come back and check. And we were down all weekend. This was 10, 15 years ago. We didn't have any observability. We didn't know the fax machines were down, but it was his job to know that it was down. So, he did not get fired for taking out prod. 
He could have taken out the whole fax bank if he had just checked his work, or if somebody else had reviewed it, or if he had just turned around and fixed it. He got fired for criminal neglect, you know what I mean? Gross neglect, gross negligence. My definition of gross negligence is: if we fired you and replaced you with nobody, we'd be better off. That's gross negligence. That's what he did. He didn't get fired for taking out prod. WILL: I mean, so it's just something to, you know, if you happen to be, like, sort of like a [inaudible 33:23] developer, right?  DAVE: I see your point. You're not going to get fired. Yeah.  WILL: We've got a literal lifetime of, you know, dev experience. And if I'm wrong, just, you know, open your mouth and say, like, no, you're not here, but this is, like, a lifetime of experience. We don't know anybody who got fired for taking out prod. And I don't know if there's anybody on this call, you know, at a senior level who hasn't shipped a prod bug before. EDDY: Okay. Can you define the parameters on what you mean by taking down prod? Our gateway for API traffic is completely haywire, kind of thing? Or are you talking about, like, oh, our hosted AWS server  -- MIKE: I'll tell you my first one. I had been there a few months, and I was asked to restart the service. I ran the wrong script and turned off the server. This is back when your server was in a physical data center, and the only way to get that thing back on was to drive to the data center and turn that server back on. And I turned it off. So, when I say down, I mean it was off [laughs]. And my manager said, "What did you do [chuckles]?" And then we figured it out, and we fixed it, and nobody was fired [chuckles]. DAVE: I don't have a black and white definition for taking out prod, Eddy. But as a sliding gray scale, the more money the company is not making, the more your taking out prod was. 
And related to the nobody getting fired, I once heard a CEO say to someone, this was, like, 20 years ago, somebody wiped out the system and came in, resignation letter written, hat in hand, hangdog expression. And the CEO said, "I just paid $12 million to train you. Why the heck would I fire you?" And I tell you what, he was the most diligent engineer after that. He'd gone through a $12 million training. WILL: That isn't to say, like, you know, like, YOLO, send it, right [laughter]? But just like -- DAVE: Yeah, that's the guy that got fired, yeah. If you're gambling with...if you lose $12 million, you're not going to get fired. You're going to get fired for gambling with $12 million of not your money. KYLE: I've always looked at whether or not prod is down as whether or not you're affecting your five nines. If it's something that you can report on for your SLA, then you've successfully taken prod down. WILL: Yeah, yeah. This week, and I'm still a little bit salty about it, and it'll be, you know what I mean, it'll be fine. But I had a thing where there's some analytics telemetry stuff in the code review process. I had to refactor it, like, three times for no reason at all. People wanted it, oh, what would it look like if the couch was over there? What would it look like if the couch [chuckles] was on the ceiling? What would it look like if the couch was on the front porch? And I'm like, okay, man, all right, you know, whatever. And so, I moved it three times. In the course of that, I missed some telemetry. There's some telemetry on, like, campaign reporting that isn't going to get out until the next release. And, I don't know, in my mind, that's a prod bug, you know, because, like, they're not going to know which campaign for, like, you know, two weeks. I'm really grumpy about that. I'll probably be over it by Monday. DAVE: You've heard the rule "Fail early, fail loud," right? It's just observability from the other end of it. 
It's like, if something's down, I want to know. I've had two times in my career when the CEO found the bug before anybody in QA or engineering or anyone. And it's awful when that happens. EDDY: I do want to backpedal to what Will said. You probably had that in mind when you first started, but you probably did it, like, three different times, three different iterations. You were so far in with refactoring that you probably forgot by the end, right? And I think that was more of a symptom, you know, of the work of the refactor. DAVE: And if I was two levels up, I would want to know who made you change it three times and why because those aren't free. They're clearly not free. I'm not a machine. WILL: It was my fault. It was my fault. It was my fault, like, I did it. DAVE: And you fixed it, right? WILL: Yeah, I did it. I fixed it. It was, like, a 10-second thing. It was just, I don't know. Anyway. DAVE: So, as a CEO, I'd be like, who made Will change this three times? Because if you make him roll enough times, he's going to roll in that one eventually. WILL: Yeah. Anyway, anyway, you know, it's fine. It's fine. Like, somebody else took down the dev server for, like, a 24-hour period, like, the very next day. So, if anybody is looking for somebody to, like, grump at -- DAVE: Yeah, you don't have to outrun the bear. WILL: It was me only very briefly [laughs]. JUSTIN: So, you guys chatted about, like, you know, moving on to what you do. You fix it immediately, right, and then you dig out the bullet. You know, digging out the bullet is kind of like the postmortem, and kind of mature organizations have a postmortem process. And that's always interesting. That's where you truly find out where your policies and your processes are lacking because, you know, you shouldn't have shipped the bug. Something caused that bug. When I brought down prod, I was lucky because it was after hours. Otherwise, somebody may have been fired. 
But the postmortem was painful, but nobody felt terrible about it because it was like, it was my fault. It was like a string change, and the string was...I had changed it to...the string was supposed to be "production," and I had changed it to "pro," so "pro" versus "production." And we found out the reason why I changed it to "pro" was because in all of our other environments, it was "uat" or "dev." And I was like, oh, that's convention. We just use the three-letter word, but no, "production" was the whole word spelled out. And this was when I was working at Fidelity. We'd done the install. It went out. We got calls almost right away. And, luckily, we'd done the install at, like, 4:00 o'clock in the afternoon, so the trading day was over. But it was, you know, the conversation with my boss the next day was just, like, sweating bullets and everything. But it was just like, you know, like you guys said, it was like, oh, as long as you learn from this and don't assume. That was the extent of the postmortem. In other places -- EDDY: Also, like, I think that speaks volumes to, like, the brittleness of their [chuckles] system, right? Like, if you can change something... JUSTIN: Oh, it was...You'd be amazed at what our financial system is running on. It's, like, duct tape and very, very brittle --  EDDY: I'm not surprised [inaudible 40:27] tell us [laughs].  WILL: I would not. I would not. I want to kind of digress, and I'm very curious about this. Like, we talked about, like, sort of, like, you know, like the developer, like, blowing things up thing, right? But, like, Justin, you're working in security. What about security breaches? How do we deal with, like, a security breach? How do you even know there's a security breach? How do you, you know what I mean, do a postmortem for, like, a security thing? Like, oh, we had a compromised system. What do I do about that? The server's up happily spilling its guts to anybody. 
[laughter]  JUSTIN: Happily divulging all the secrets [laughter]. So, again, it goes back to monitoring because you got to be able to know when you are being attacked. Because if you don't know that you are being attacked in some way, you just think it's normal traffic. You got to have monitors on what you are interested in because if you don't have the monitors on, they'll just take all your secrets. They'll take all your money and everything. A good example: when I worked at Coinme, which is a cryptocurrency company...Is it still okay? Yeah, they were bought out by somebody else. Okay, I can talk about this. When I worked there, it seemed like at least once a month our servers were under attack, either denial-of-service or password, you know, people were attacking, trying to steal passwords, or that sort of thing. And cryptocurrency is probably like the wild west, the most wild west financial industry there is right now. But we had to go in, and we had to...on the denial-of-service attacks, we were on the call with Cloudflare and trying to figure out, oh, what could we block to, you know, stop this denial-of-service attack, whether it's whole swaths of the earth. You know, we're going to block all of Russia. We're going to block all of Eastern Europe. Or if we decide that, you know, oh, we can block a certain type of browser tags or, you know, all those sorts of things were considered. And sometimes we actually had to do a live install to add custom tags to our traffic so that we know what was good from us, and that would block these bots that were under attack. And so, it was nuts. Like, there were several times when we were, like, all night long fighting this sort of thing. But you basically just had to figure out, okay, what's their avenue of attack? And then, you know, figure out ways to block that traffic that was coming in. And sometimes we had whole swaths of our customers who got locked out because they were under password attack. 
So, it is a wild west, depending on, you know, what could happen. And then, you know, the next week, because it usually happens on a weekend, the next week we'd have a postmortem about, you know, what could we do to defend against that kind of attack? And sometimes that postmortem was, you know, done with our security company, or with the companies that we contracted with to help us block that sort of thing. So, it was interesting, and it was very, very detailed and kind of a crazy thing that we had to deal with in those cases. MIKE: What you're saying there is interesting, and you're hitting on something that I was wanting to bring up, because it's kind of a gap in our conversation. We said, oh yeah, you stop the bleeding, and then, you know, you figure things out. Well, sometimes stopping the bleeding is not an instant process. You talked about, you know, part of the triage: okay, I know they're bleeding. You know, you've looked at the metrics. You see, okay, I know something's going wrong. There's internal bleeding here. Or, you know, obviously, you know, we're getting a denial-of-service attack. What next? Because there's usually different options, and they have different value. There's a difference in what you do. You got the database issue. Do you add an index? Do you rewrite your query? What do you do? There are different options, and those different options have different costs -- JUSTIN: I actually want to bring up one point that you have here. You're investigating what the cause is. You got to have the contact information for all the people that you might need to contact on a Saturday night in order to solve the problem because you can't be an expert at everything, right? So, make sure that you have the contact information of these people and that you treat them nicely, and you [laughs] reward them because you are intruding upon their time that perhaps they were not on call. MIKE: Ramses is on the call. He hasn't said anything. 
I'm always glad when he's on the call because he knows everything [laughs], which may not be quite literally true, but it's close. RAMSES: It's far off. MIKE: [laughs] You know, having the right people in the room matters a lot. That's a really good point. And you better have a process for calling those people. DAVE: He doesn't know where all the bodies are, but he knows where the memorial services are held. MIKE: Making those choices matters. And it's really easy to get rabbit-holed on something because you're like, okay, we need to come up with a solution. How do we make this work? And you don't want to explore every option. That takes too long. So, there's a delicate balancing act that you're performing during that time, whether it's all night with a security issue or your database is down. Every minute's costing you a million dollars or whatever it is, right? You better be making a choice quickly. We've talked a lot about having presence of mind. Well, it matters a lot. And I think it's really important that you give yourself the mental space to explore that and find the right option. And that can go really wrong really easily. It's very common when you have that incident call, you have a lot of people who join in, and maybe you do have several business stakeholders who are coming in who are asking questions repeatedly. And they want to know, and rightfully so. But they should not be in that incident call when you're resolving the problem. You jump in with somebody else to have the discussion. I think it's critical that, whatever it is, and, you know, there are business stakeholders who actually can be really good, and they'll back up when they need to. But you need to get the people who can solve that problem, those people you mentioned, into a place where they can legitimately think and make a good decision. They can evaluate those options and pursue the best option. 
A few months ago, I was involved in a production incident, and I saw a lot of noise and people getting focused on something, or not knowing what to focus on. You know, there was lots of bouncing around, and helping people make a choice, "We're going to go this way," went a great deal to getting that solved in a much shorter amount of time versus hours, days, right? You need to get that. That's a big deal. Have you all seen that same dynamic? WILL: Honestly, like, most of the time that I've been in these calls, people have weighed, you know, I haven't seen anybody sort of freaking out. I think I've been pretty lucky in that, you know, the people from, you know, like, the higher upper-level managers are just sort of like, what's going on? I mean, I don't know. I mean, maybe a piece of it is just me, you know what I mean, in that, like, I will tell you exactly what I know in clear and concise ways. This is what I know. This is what happened. This is what I'm going to do, you know what I mean? And then I just sort of, like, and now I'm going to go do it. And they're just like, it's everything I needed, and I'm going to leave, you know? So, I mean, I think, you know, kudos to them. And, like, ICD 2, you, as, like, sort of, like, a first responder, let's say, need to be aware of, you know, what their ask is, right? And, I mean, you're going to talk to their boss. And they're going to talk to their boss, and then they're going to talk to potentially, you know, the big boss. And everybody needs to know what's going on, what's being done, you know what I mean? Because the CEO, like, in a big company at least, right? CEO's hands are tied. They can't do anything. They couldn't fix that server if they wanted to nor, you know, in most instances, could your boss, or at least your boss's boss. If you don't have dirt under your fingers, you're useless, you know. And so, your job is to communicate. No offense. 
I mean, it's not like they don't do anything at work, but when the server room is on fire, like, if you're not helpful, just, yeah, let me get the fire out, and then we can do manager stuff, you know, later [laughs]. JUSTIN: Yeah. And there's generally a playbook for incidents. If you guys have an on-call rotation, which I believe you guys do, we have one. You have a playbook that clearly designates, you know, oh, somebody is on call, and they have the power to declare an incident. They are in there. They're the incident quarterback. That's what we call 'em here. And they have access to all the people they need to call. And they are also responsible for communicating up and handling, you know, the managers that may come in and, like, throw their arms around, or whatever. And the incident quarterback, I think, is really key to maintaining a calm, you know, demeanor during this incident. And it's key that anybody who has the potential to be that has that right training so that they know how to use that playbook, what they need to do. And, you know, it's really nice if they do know how to do that. If they don't know how to do that, then you're doing on-the-call training if you happen to be on that call. WILL: I'll take it for granted that there is, like, a binder that one could open up and handle the incident, you know, or God help you, training [laughs]. Your training is, this is the spreadsheet for your days, and maybe there's an email or something. Yeah, yeah, I don't know. I've seen training. I have seen it. I've witnessed it where, like, they're just like, "Okay, this is the training stuff." Like, I know it can happen. However, however [laughter], however, like, as often as not, it is just a Slack message, like, "Hey, server's down. Can you get on this call [laughter]?" And I'm like, "Yeah, yeah, I can". DAVE: I pushed really hard to get, at CoverMyMeds, to get...we called it the 3:00 AM playbook, which is just a checklist, right? 
You know, do this, do this; do this; do this; look at this. If it's this, do that. And we literally had to write it for somebody with no context, no knowledge of the system, other than, you know, generic familiarity with the tools. And it's 3:00 o'clock in the morning. You're sleepy. And all you want is to go back to bed. And literally, the outcome of the 3:00 AM playbook is to stop the bleeding. It's not even pull out the bullet. It's literally get the server back up, watch it for a few minutes. If the server looks like it's going up, it's still up, go back to bed. We'll dig the bullet out in the morning at 8:00 o'clock. MIKE: So, you stop the bleeding. What next? That's the key thing first, right? It's very easy. Well, maybe even have some partial fix in place. It's very easy for people to say, "Oh yeah, problem solved," and walk away. And then, two weeks later, it's still, you know, your Band-Aid's in place, and the Band-Aid falls off [laughs]. DAVE: We've all worked on systems that's got the little donut spare tire that's been there for seven years because it works. MIKE: Yeah [chuckles]. How do you deal with this in the long term, [inaudible 52:13] as soon it's happened? How do you end up stronger going out of it than you went in? WILL: It depends on how fast you got to drive down the highway, man. Like, there have been plenty of sort of, like, robust failover systems that had, like, a kind of a slow, you know, peptic ulcer memory leak, where they just cook for, you know, a couple of weeks or a month or so. And, eventually, you'd get to a point where it's just like, yeah, that one's got to go. And you just, you know, you vote somebody out of the pool. You keep on going. There's no, you know, it could be bad, like, you'd just be like, ah, it'll be fine, you know? There's no one-size-fits-all there, you know. Some stuff's like, we're working all weekend, baby, and other stuff is just like, nah, it'll be fine. 
DAVE: We did a couple of systems where we needed to know that, like, we called it Meteor Strike Level Readiness. So, we literally had our entire cluster, like, 700 servers running in a data center in Atlanta and another cluster in Chicago. They were not synced. Like, the databases weren't slaved to each other. They weren't, you know, synchronizing. We ran off of the one, and it just sent backups to the other one. And, every six months, we would fail over to the other data center and use the other one as the backup. And, in three years of doing that every six months, by the time I parted ways with the company, it was still an all-night. And at 4:00 in the morning, we were all writing down the stuff that didn't work that needed to be fixed over the next six months. And it was awful because we would take down prod to do the fail...I mean, we were simulating, like, literally, a meteor has hit Chicago, and we've got to switch over to Atlanta now. How fast can we go? And we still had long lists of things to do, but we got very good at triaging: what's the most important thing? And the most important thing was, how fast can we get Atlanta up and running and then figure out how much is left in Chicago, and what can we do with it? So, that's a lot of money. So, that's another element, right? You slap the donut on that spare tire. And, all of a sudden, the CFO is like, "Why are we spending money on this? I'm still making money." Well, you're not going to be CFO for long. WILL: I mean, I don't know. There haven't been a lot of meteor strikes, you know, in the past 20 or so years. Like, you know, like, Atlanta, both Atlanta and Chicago have been, like, remarkably durable. We haven't burned them to the ground in 100 years, 150 [laughter]. DAVE: And, honestly, it's historically had the same amount of likelihood that they both get hit at the same time, honestly. So, I'm not even sure what we're doing, so... WILL: Yeah, yeah [laughs]. Who would nuke just one? 
DAVE: Right [laughter]? It's like Lay's potato chips. You can't nuke just one [laughs]. WILL: Nobody who would do it is going to be short. DAVE: That's right. That's right. JUSTIN: Yeah. And I think that has to do with, like, a realistic evaluation of what could happen. Because you could sit there and prepare so much for any sort of thing that could happen, but there's a cutoff point. And I think a reasonable level of risk is acceptable to the business because the business has to survive and be profitable. And, you know, if you're spending all your time, like, thinking of the worst-case scenarios, one, you got to get a life, and two, you're going to spend way too much time, and your engineers' time trying to solve hypotheticals. DAVE: To be fair, the reason...so it wasn't hypothetical. The reason we came up with the meteor strike scenario is...I'll have to dig it up. There was a data center in Houston that had a transformer, like, the main power transformer inside the building shorted out. And it heats...it superheated the cooling oil, and it detonated. It didn't kill anybody because it happened in the middle of the night. JUSTIN: Wow. DAVE: But it was in the center of the building, took out all the servers around it in, like, a 20-foot thing, and then punched a hole through the ceiling. And the servers in there literally fell in the hole. And I can't remember who was on it. I was web admining for Schlock Mercenary. My best friend was doing a web comic. And all of Keenspace and Keenspot, like entire companies, like, their whole data center was just gone. I can't remember the name of it, but it's a name that you might recognize, especially if you're in networking. You would go, "Oh, I know them." WILL: I've got some [inaudible 56:52] words you guys might recognize: us-east-1 is down [laughter]. If you know, you know. I wish you could see the face that Kyle and Mike are making. MIKE: Yeah, instant recognition. 
You got to East, and I knew where you were going [laughs]. WILL: I don't think I'm allowed to use proper nouns, but us-east-1. Everybody knows who I'm talking about, and everybody knows what they do every other year. DAVE: Rack Shack, EV1 Servers was the one. 2008, they had a transformer explode. Sorry, 2003. And then it happened again in 2008. So, wow, wow. There's somebody who needed to be fired there, clearly. MIKE: Somebody needed to do the postmortem and take action. We're kind of reaching a good time to be shutting down. That cleanup matters. And maybe you determine, hey, you know, we can keep rolling on this donut for years [chuckles]. But you probably have some customers who need to be helped. And, you know, it's important not to neglect saying, "Okay, yeah, we've stopped the bleeding. What's the cleanup need to be?" Because there may be some important cleanup. And there's some, you know, people are going to care. People are going to care. Any final thoughts you all have? We've talked a lot about keeping our composure [chuckles], a lot about that, about the importance of having, like, an engineering sort of mindset. How do I fix this? Triaging, stopping the bleeding, fixing it, and pragmatically, you know, and then not neglecting the after, what comes after. Anything else you want to cover? DAVE: I have a strong religious belief that it is more important to be able to fix the problem than to correctly prevent the problem. Because if you correctly prevent a problem, you have not improved your capacity for dealing with something that you didn't correctly predict. But if you get good at solving the problems, you suddenly can stop worrying about missing something because you start to realize, "We'll handle it." You don't get cavalier. You don't deploy at 4:00 PM on a Friday and go home because you'll handle it. You don't get stupid, but it can help calm you down and say, "Yeah, this is what happens." 
JUSTIN: So, what you're saying, David, is, like, you should take prod down just a little, and then [laughs] and then that little inoculation [laughs]. DAVE: I wrote a tool called Tour Bus, which, over a conference room Wi-Fi, over a T1, so, like, 256K, with 200 people in the room surfing the internet over it. And I took out prod with it from my laptop while I was giving a talk on stress testing your server. And I did not get a talking to from the CTO because it was his pants that were down on the internet, not mine. And I didn't yank his pants down maliciously. I genuinely didn't think I would take out our prod servers. But there you go. So, give the emperor's new clothes a tug every once in a while. MIKE: So, Netflix, famously, I believe it was Netflix, correct me if I have anything wrong here, who had the tool called Chaos Monkey. DAVE: Chaos Monkey. MIKE: They would go and just break their system here and there, all the time, so that they knew their system would be resilient because unless they were testing it, they didn't know. WILL: I really like having a boring day at work [laughter]. DAVE: Me too. WILL: I like boring days at work [laughter]. I'm thinking I can ride with you on that one, Dave [laughter]. MIKE: I will say that, you know, they say, "Oh, it's always the thing you didn't think of." It doesn't matter how much preparation you do; there's going to be something you didn't think of. And we've talked some about monitoring along this and observability. I'm of a mindset that, given the choice between the two, observability is more important than hardening, not that they're not both important. But you're going to miss something. You're going to miss something when you're trying to prepare for whatever the attack is, because it's going to be some attack you weren't thinking of. And I say attack. It may not be malicious, right? Whatever bad thing happens, it's likely you didn't think about it. If you did think about it, you would've fixed it. 
But if you have really good systems to figure out what happened, you can solve that quickly, and if you don't, then you can't solve it quickly, and you're in a really bad spot. I've, for a long time, been of the strong belief that monitoring, that observability is the more important of the two. DAVE: Observability leads to good hardening. Good hardening does not lead necessarily to good observability. KYLE: Just to go along with your last point, I would say that monitoring is what you do to prevent historical events from re-happening. WILL: Ooh, I'm stealing that. I love that. MIKE: I like that. Hopefully, in your next production incident, you've taken something from this that helps you out. Until next time on the Acima Development Podcast.

March 18, 2026 - 1 h 11 min