Stories on Facilitating Software Architecture & Design

When Fixing an Outage Means Staying Out of the Way

24 min · 31 Mar 2026

Description

We often assume that resolving a major outage requires centralised command and control: getting the right experts in a room, coordinating their efforts, and directing the recovery. But what if the most important thing an incident commander can do is resist that impulse entirely, and simply create space for the right person to surface?

That's the situation Liz Fong-Jones found herself in during a July 2018 Google Cloud outage that took down nearly every service: not just Google's own, but every customer running on Google Cloud. As incident commander, Liz had the war room assembled, the escalation path triggered, and the right teams on the call. What broke the incident open was none of that. It was an engineer nobody had thought to page, who called in unprompted, said "I think this was my change," and had already started rolling it back.

That moment was only possible because of something built long before the outage: a culture where people don't hide under their desks when things break. Liz traces how psychological safety gets constructed, not in crises, but in how organisations respond to smaller failures every day. She shares the quiet signals that reveal when it's missing (the call that goes silent after an acronym nobody understands, the junior engineer who never speaks), and the heuristics she uses to build it deliberately as a senior engineer.

This conversation goes beyond incident response to explore what it actually means to build resilient systems and resilient people, and why those two things are inseparable.
Key Discussion Points

* [00:01] The July 2018 Google Cloud Outage: Liz introduces her role as a volunteer incident commander and the scale of the incident, with nearly every Google Cloud service down simultaneously
* [06:00] The Fix That Came From Outside the War Room: An engineer nobody had thought to page calls in, identifies their change, and has already started the rollback before the room knows what's happening
* [12:00] Why a Safety Feature Caused a System-Wide Failure: How a canary deployment designed to limit blast radius instead pushed metadata globally and triggered a bug in every front end
* [17:00] Distributed Debugging and the Limits of Centralisation: Why the person holding the critical piece of information is rarely in the escalation room, and how you design for that
* [22:00] Psychological Safety Built Before the Crisis: Why the engineer's willingness to raise their hand depended entirely on how the organisation handles smaller failures day-to-day
* [28:00] The Quiet Signals That Reveal Fear: Silence after acronyms, juniors who never speak, decisions nobody will revisit: how Liz reads the room for safety
* [34:00] Design Ownership and Haunted Graveyards: Why accountability for running a system long-term requires input into its design, and what happens when it doesn't exist
* [40:00] Building Resilient People, Not Just Systems: If an organisation crushes someone when they make a mistake, they won't be resilient the next time something breaks, and something always breaks

Guest: Liz Fong-Jones

Hosts: Andrea Magnorsky, Kenny Schwegler
