Failure as a management scenario: who decides in the first 30 minutes
On why a technical incident is not only an engineering problem, and how to define roles, escalation paths, and a single source of truth before something breaks.
When something breaks seriously, the first 30 minutes determine more than the recovery time. They also determine how many people get pulled into the panic, how clearly the situation reads from the outside, and whether anyone makes a management decision at the right moment.
Most companies I have worked with are technically prepared for incidents. There is monitoring, there is an on-call engineer, there is some kind of process. But the management scenario is not written down. As I noted in "Public cloud SLA: what it says and what it does not", the gap between a provider's contractual guarantees and what a business actually needs is exactly where the management scenario matters most. Who decides to stop the deploy? Who talks to customers? Who makes the call to switch traffic to the backup?
Why the first 30 minutes are managerial, not just technical
In the first half hour, engineers do not yet know what is broken. That is normal. Diagnosis takes time. But during that same window, other things need to happen in parallel:
- someone needs to assess whether immediate communication with customers is needed;
- someone needs to decide whether to escalate or not;
- someone needs to make sure the team is not working in separate directions without a shared picture.
If none of this is defined in advance, everyone does what seems logical to them. The result is usually predictable: several people call support simultaneously, someone messages customers before a diagnosis is made, and engineers spend time explaining the situation instead of fixing it.
Three roles an incident needs
I find a simple three-role structure useful:
Incident commander - one person who holds the overall picture. They do not fix anything themselves. They coordinate: who is doing what, what has already been checked, what the next step is. This can be an engineer, but one acting as coordinator, not executor.
Technical executor - the engineer or engineers who actually diagnose and fix. They do not handle communication and do not make escalation decisions.
Communicator - the person who informs the relevant parties at the right moment: leadership, customers, partners. They take information from the incident commander, not from the engineers directly.
In a small company, one person may cover all three roles. In a larger one, they should be separate people. The important thing is that this is settled before an outage, not during one.
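One way to make "settled before an outage" checkable is to keep the role assignments as data and validate them ahead of time. The sketch below is only an illustration: the `IncidentRoster` type, the `roster_problems` function, and the names in the example are hypothetical, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentRoster:
    # Hypothetical structure: one name per role, several executors allowed.
    incident_commander: str
    technical_executors: tuple[str, ...]
    communicator: str


def roster_problems(roster: IncidentRoster, require_separate_people: bool) -> list[str]:
    """Return a list of gaps in the roster; an empty list means it is usable."""
    problems = []
    if not roster.incident_commander:
        problems.append("no incident commander assigned")
    if not roster.technical_executors:
        problems.append("no technical executor assigned")
    if not roster.communicator:
        problems.append("no communicator assigned")
    if require_separate_people:
        # Larger companies: the same name must not appear in two roles.
        people = [roster.incident_commander, roster.communicator, *roster.technical_executors]
        named = [p for p in people if p]
        if len(set(named)) < len(named):
            problems.append("one person holds more than one role")
    return problems


# Example with made-up names: one person covering all three roles.
small_company = IncidentRoster("dana", ("dana",), "dana")
print(roster_problems(small_company, require_separate_people=False))
print(roster_problems(small_company, require_separate_people=True))
```

The same roster passes for a small company and fails when separate people are required, which mirrors the distinction above.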
A single source of truth during an incident
The most common problem during incidents is information spreading across different channels. Someone posts in a messenger, someone files a ticket, someone calls. After 20 minutes, nobody knows the current status.
The fix is simple, but it has to be agreed in advance: one channel is the source of truth during an incident. Everything else is derived from it. The incident commander updates that channel every N minutes, even when the update reads "still diagnosing, no new data".
That sounds like bureaucracy. In practice it is what lets a manager understand the situation in 60 seconds at 3 a.m. without calling the on-call engineer.
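The "every N minutes" rule can be checked mechanically. The following is a minimal sketch, assuming you can obtain the timestamp of the last post in the status channel; the function name `update_overdue` and the 15-minute interval are illustrative assumptions, not values from any specific process.

```python
from datetime import datetime, timedelta, timezone


def update_overdue(last_update: datetime, now: datetime, interval_minutes: int) -> bool:
    """True when the status channel has gone longer than the agreed interval without an update."""
    return now - last_update > timedelta(minutes=interval_minutes)


# Illustrative values only: 15 minutes stands in for N.
last_post = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
print(update_overdue(last_post, last_post + timedelta(minutes=10), interval_minutes=15))  # False
print(update_overdue(last_post, last_post + timedelta(minutes=20), interval_minutes=15))  # True
```

A reminder built on a check like this nudges the incident commander to post even when the honest update is "still diagnosing, no new data".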
Escalation: who decides and by what criteria
Escalation is passing a decision up a level. It is not always needed, but the criteria must be defined before the incident:
- the incident has lasted longer than X minutes without progress;
- customer data is involved;
- potential financial damage exceeds Y;
- a decision is required that goes beyond the incident commander's authority.
Without those criteria written down, every engineer will decide differently when to call the manager. Some will call too early, some too late.
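Written down, the four criteria above are simple enough to express as a small check. This is a sketch under stated assumptions: the type and field names are hypothetical, and the thresholds in the example (30 minutes, 10,000 in whatever currency you use) are placeholders for your own X and Y.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationPolicy:
    # Hypothetical thresholds standing in for X and Y.
    max_minutes_without_progress: int
    max_financial_damage: float


@dataclass(frozen=True)
class IncidentState:
    minutes_without_progress: int
    customer_data_involved: bool
    estimated_financial_damage: float
    decision_exceeds_commander_authority: bool


def escalation_reasons(state: IncidentState, policy: EscalationPolicy) -> list[str]:
    """Return the criteria that are met; an empty list means no escalation is required."""
    reasons = []
    if state.minutes_without_progress > policy.max_minutes_without_progress:
        reasons.append("no progress for too long")
    if state.customer_data_involved:
        reasons.append("customer data is involved")
    if state.estimated_financial_damage > policy.max_financial_damage:
        reasons.append("financial damage above threshold")
    if state.decision_exceeds_commander_authority:
        reasons.append("decision beyond incident commander's authority")
    return reasons


# Placeholder numbers for illustration only.
policy = EscalationPolicy(max_minutes_without_progress=30, max_financial_damage=10_000.0)
state = IncidentState(
    minutes_without_progress=45,
    customer_data_involved=False,
    estimated_financial_damage=2_000.0,
    decision_exceeds_commander_authority=False,
)
print(escalation_reasons(state, policy))
```

The code itself matters less than what it forces: someone has to pick actual numbers for X and Y, and every engineer then applies the same ones.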
A quick readiness check
Ask your team three questions right now:
- If something serious fails tonight at 11 p.m. - who is the incident commander?
- Which channel holds the current status so everyone sees the same thing?
- By what criteria do you call the manager or owner?
If the answers are fast and consistent across the whole team - you are in good shape. If there are pauses or different versions - that is where the work begins.
An incident is stressful. Stress goes badly with ambiguity. A management scenario exists precisely to remove, in advance, whatever ambiguity can be removed.