When Production Breaks: Incidents, Disasters, and How to Respond Calmly
An outage is not a test of how fast you can type. It is a test of how well your team turns uncertainty into a shared picture—who leads, what is broken, what customers feel, and what change is safe next. Calm is not personality. It is procedure practiced before the pager fires.
In short
Treat incidents as time-boxed engineering work with clear roles and one change at a time. Reserve “disaster” for scenarios that exceed normal runbooks—region loss, ransomware, credential compromise at scale—and rehearse those paths before you need them. Stay calm by slowing the room down: state impact, assign roles, document timestamps, and separate mitigation from root-cause work.
Incident vs disaster: words that change the playbook
Teams blur these terms. Precision helps.
- Incident — an unplanned event that degrades or threatens a service: elevated errors, partial outage, security concern, bad deploy. You respond with on-call, runbooks, and your normal toolchain. Most pages are incidents.
- Disaster — loss or unavailability at a scale that normal mitigation cannot fix quickly: entire region unavailable, datacenter fire, widespread ransomware, loss of primary identity provider, catastrophic data corruption. You invoke disaster recovery (DR) and business continuity (BCP) plans—failover regions, restore from backups, alternate work sites, executive comms.
The boundary is organizational: if your runbook says “fail over to secondary region” and you have tested it, failover is still a procedure—but the event that triggers it is a disaster in impact. What matters is that everyone knows which playbook applies before adrenaline peaks.
Severity: agree on pain before you argue about fixes
Without a shared severity model, engineers optimize for “make the graph green” while product hears “catastrophe” and leadership hears “minor blip.” Define severities in advance—usually SEV-1 through SEV-4 or P0–P3—with customer impact, not internal inconvenience, as the primary axis.
| Level | Typical signal | Response shape |
|---|---|---|
| SEV-1 / P0 | Major revenue or safety impact; large user population down; data loss risk | Incident commander, war room, executive updates, all-hands pause on unrelated deploys |
| SEV-2 / P1 | Significant degradation; workaround exists; key customers affected | Dedicated IC, frequent status updates, focused engineering |
| SEV-3 / P2 | Limited blast radius; workaround acceptable short-term | On-call owns; next-business-day fix may be OK |
| SEV-4 / P3 | Cosmetic, internal-only, or low-risk drift | Ticket queue; no wake-up unless policy says otherwise |
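Re-evaluate severity as facts change. Downgrading too early erodes trust; upgrading too late means the structure a higher severity brings (an incident commander, a war room, regular updates) shows up after the room has already lost its calm.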
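A severity model is easier to apply under pressure when the criteria are mechanical. The sketch below is one possible encoding of the table above. The field names and both percentage thresholds are illustrative assumptions, not values from any standard, and your own definitions should replace them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    """Customer-facing facts gathered during triage (hypothetical fields)."""
    users_affected_pct: float      # share of users seeing the problem, 0-100
    data_loss_risk: bool           # could customer data be lost or corrupted?
    key_customers_affected: bool   # are named or contractual accounts hit?
    customer_visible: bool         # False for internal-only or cosmetic issues


def classify_severity(impact: Impact,
                      major_pct: float = 25.0,
                      significant_pct: float = 5.0) -> str:
    """Map impact to a SEV level; both thresholds are assumptions to tune."""
    # Data loss risk or a large affected population is always the top level.
    if impact.data_loss_risk or impact.users_affected_pct >= major_pct:
        return "SEV-1"
    # Nothing customers can see goes to the ticket queue.
    if not impact.customer_visible:
        return "SEV-4"
    if impact.key_customers_affected or impact.users_affected_pct >= significant_pct:
        return "SEV-2"
    return "SEV-3"
```

Calling `classify_severity` again whenever the impact facts change is a cheap way to make re-evaluation a habit rather than an argument.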
The incident lifecycle (what “handling it” actually means)
Think in phases, not heroics. Google’s SRE book popularized a similar flow; adapt names to your org.
- Detect — monitoring, synthetics, customer reports, or security tools raise a signal. Good detection is specific (“checkout success rate < 95% in eu-west-1”) not noisy (“CPU high”).
- Triage — confirm real impact, assign severity, open a single incident channel or ticket, page the right people.
- Mitigate — restore service or stop bleeding: rollback, scale, disable feature flag, block malicious IP, failover read path. Mitigation can precede root cause.
- Resolve — stable state sustained; monitors green for an agreed window; customers informed.
- Learn — blameless postmortem, action items, runbook updates, error-budget conversation if SLOs were breached.
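The Detect step asks for signals tied to what users feel. As a minimal sketch of the "checkout success rate < 95%" example, the function below evaluates one measurement window. The function name, the default threshold, and the minimum-sample guard are assumptions made for illustration.

```python
def success_rate_breached(successes: int, attempts: int,
                          threshold: float = 0.95,
                          min_attempts: int = 100) -> bool:
    """Return True when the success ratio for a window is below threshold.

    Windows with fewer than min_attempts requests never fire, so a handful
    of failures in a quiet hour does not page anyone (an assumed noise guard).
    """
    if attempts < min_attempts:
        return False
    return successes / attempts < threshold
```

A check like this would typically run per region, so the alert text can name where the symptom is instead of reporting a global average.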
During mitigation, resist the urge to “fully understand” before acting when users are waiting. During learning, resist closing the ticket without understanding—otherwise the same incident returns with a new timestamp.
Roles: one brain for coordination, many brains for fixes
Chaos often comes from ten smart people doing ten different things. Assign roles early—even on a small team.
- Incident Commander (IC) — owns the timeline, priorities, and “one change at a time.” Does not have to be the deepest technical expert; must be decisive and calm.
- Technical lead(s) — dig into logs, infra, code, data; propose mitigations to the IC.
- Scribe — timestamps, actions, hypotheses, command outputs. Future you and auditors will thank them.
- Communications — status page, support macros, executive summary, legal/comms if breach suspected.
- Subject-matter experts — database, network, security, vendor TAM—pulled in briefly, then released.
Say out loud: “I am IC,” “You are scribe,” “Everyone else routes changes through IC.” That single sentence removes overlapping SSH sessions and mystery deploys.
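The scribe role needs very little tooling. One possible shape, assuming nothing more than an append-only list of timestamped lines, is sketched below; the class name and the entry format are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ScribeLog:
    """Append-only incident timeline; every entry carries a UTC timestamp."""
    entries: List[str] = field(default_factory=list)

    def note(self, kind: str, text: str, at: Optional[datetime] = None) -> str:
        # `at` must be timezone-aware; it defaults to the current UTC time.
        moment = (at or datetime.now(timezone.utc)).astimezone(timezone.utc)
        line = f"{moment.strftime('%Y-%m-%dT%H:%M:%SZ')} [{kind}] {text}"
        self.entries.append(line)
        return line
```

Tagging each entry with a kind such as `action`, `hypothesis`, or `decision` makes it easier to turn the log into a postmortem timeline later.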
How to stay calm (the part that looks like magic but is trainable)
Calm under pressure is not suppressing fear. It is narrowing the problem faster than your nervous system widens it. The companion essay on DevOps psychology explains why brains race during outages; here is the operational counterweight.
1. Slow the room with language
Phrases that sound simple but change group behavior:
- “We don’t know the root cause yet.”
- “What is customer impact right now?”
- “What is the smallest safe change we can make in the next ten minutes?”
- “Let’s capture that hypothesis; scribe, note it.”
These sentences buy seconds. Seconds buy structured thinking.
2. Breathe and body-check (thirty seconds, not thirty minutes)
Before you type the destructive command: one slow exhale, shoulders down, read the command twice. Fatigue and adrenaline narrow attention; physical reset widens it slightly—enough to catch “wrong region” or “prod instead of staging.”
3. Use a personal checklist
Many responders keep the same mental list every time:
- Am I on the right account / cluster / environment?
- What changed recently (deploy, config, traffic, certificate)?
- What does the dashboard say vs what customers report?
- What is the rollback or kill switch?
- Who else needs to know—now vs after mitigation?
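The first item on that list can be partly mechanized. The sketch below assumes your tooling can report the current target as a string (for example the output of `kubectl config current-context`) and that the operator retypes the expected target by hand. The function and parameter names are hypothetical.

```python
def safe_to_run(expected_context: str, current_context: str,
                typed_confirmation: str) -> bool:
    """Allow a destructive step only when both checks pass.

    current_context is whatever your tooling reports as the active target;
    typed_confirmation is what the operator retyped by hand.
    """
    context_matches = current_context == expected_context
    operator_confirmed = typed_confirmation == expected_context
    return context_matches and operator_confirmed
```

Retyping the target is deliberately slow: it forces the same second look as reading the command twice.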
4. One change at a time
Parallel fixes make graphs ambiguous. IC approves one mitigation, waits for signal, then next. If you must experiment, label experiments in the scribe log so you can unwind them.
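As a rough illustration of that rule, the gate below refuses a second mitigation until the outcome of the first has been recorded. It is a toy in-memory sketch with hypothetical names; a real team would more likely enforce this through the IC and the incident channel than through code.

```python
from typing import List, Optional, Tuple


class ChangeGate:
    """Permit one in-flight mitigation at a time, as approved by the IC."""

    def __init__(self) -> None:
        self.in_flight: Optional[str] = None
        self.history: List[Tuple[str, str]] = []   # (change, observed outcome)

    def approve(self, change: str) -> bool:
        # Refuse while an earlier change is still waiting for its signal.
        if self.in_flight is not None:
            return False
        self.in_flight = change
        return True

    def record_signal(self, outcome: str) -> None:
        # Closing out a change is what frees the gate for the next one.
        if self.in_flight is None:
            raise ValueError("no change is in flight")
        self.history.append((self.in_flight, outcome))
        self.in_flight = None
```

The `history` list doubles as the record of which change produced which signal, which is the information parallel fixes destroy.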
5. Separate war mode from diagnosis mode
In war mode you roll back, scale, or fail over. In diagnosis mode you read stack traces and follow trace IDs. Mixing them produces “we restarted everything and still don’t know why.” Schedule diagnosis for after impact is bounded, unless the fix itself requires understanding the fault (data corruption, for example).
6. Escalate early, not angrily
Escalation is a resource request, not admission of failure. “I need a DBA for connection pool behavior” at minute fifteen beats heroic solo debugging at hour three.
| Stress-led reaction | Calm, trained response |
|---|---|
| Restart everything at once | Roll back the last deploy; measure; then test the next hypothesis |
| Blame whoever shipped last | Timeline of changes; neutral facts in scribe doc |
| Hide bad news from leadership | Short factual update: impact, ETA unknown, next step |
| Skip documentation “until later” | Live scribe; postmortem draft starts during incident |
Disaster recovery and business continuity in depth
DR/BCP is the engineering and organizational answer to “what if the primary way we run is gone?”
Core concepts
- RTO (Recovery Time Objective) — how long the business can tolerate the service being down before unacceptable harm.
- RPO (Recovery Point Objective) — how much data loss is acceptable (last backup / replication lag).
- Backup — point-in-time copies; useless without tested restore.
- Replication — continuous or frequent copy to another site; watch for corruption replicating too.
- Failover — traffic or workload shifts to standby (DNS, load balancer, database promotion, Kubernetes cluster in another region).
- Failback — return to primary after disaster ends; often harder than failover.
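RTO and RPO only mean something when compared against measured numbers. The sketch below checks one drill or outage against both objectives. It assumes you know when the last good backup or replica state was taken and how long the restore took; the function name and dictionary keys are illustrative.

```python
from datetime import datetime, timedelta


def check_objectives(last_good_copy: datetime, failure_time: datetime,
                     restore_duration: timedelta,
                     rpo: timedelta, rto: timedelta) -> dict:
    """Compare one drill or outage against the agreed RPO and RTO."""
    # Everything written after the last good copy is lost in the worst case.
    data_loss_window = failure_time - last_good_copy
    return {
        "data_loss_window": data_loss_window,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": restore_duration <= rto,
    }
```

For example, a backup taken at 13:00 and a failure at 14:30 gives a 90-minute data loss window, which misses a one-hour RPO even if the restore itself finishes well inside the RTO.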
Patterns by layer
- Application — multi-region active-active or active-passive; feature flags to shed load; idempotent workers so retries are safe.
- Data — cross-region replicas, backup vaults with immutability (ransomware-aware), regular restore drills. Know who is allowed to promote a replica to primary.
- Infrastructure — infrastructure as code in Git (GitOps) so a new region is “apply known state,” not artisanal clicking. Landing zones and network design matter—see AWS network architecture for multi-AZ and multi-region thinking.
- People — call trees, alternate comms if Slack is down, who declares a disaster and who can authorize spend on emergency capacity.
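The application-layer note about idempotent workers is worth a concrete picture, because failover almost always causes retries. Below is a minimal sketch that keys each job on an idempotency key. The in-memory dictionary stands in for a durable store, and the class and method names are assumptions.

```python
from typing import Dict


class ChargeWorker:
    """Toy worker that charges once per idempotency key, however often retried."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}   # in-memory stand-in for a durable store
        self.charges_made = 0

    def handle(self, idempotency_key: str, amount_cents: int) -> str:
        # A retry with a known key returns the stored result and does no new work.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1               # the side effect we must not repeat
        result = f"charged {amount_cents} cents"
        self._results[idempotency_key] = result
        return result
```

In a real system the lookup and the side effect would need to be atomic against a shared store, which this single-process sketch does not attempt.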
Security disasters
Credential compromise, supply-chain poisoned images, or ransomware are disasters even if CPUs are idle. Playbooks differ: isolate, preserve evidence, rotate secrets, involve security and legal, communicate under regulatory obligations. Foundations from cloud security and ISO 27001 thinking apply—contain first, investigate in parallel where policy allows.
Game days and drills
Calm in a real disaster comes from muscle memory in fake ones. Schedule table-top exercises (“region X is gone”) and technical failovers quarterly. Measure time-to-detect, time-to-mitigate, and time-to-restore. Fix the boring blockers—missing runbook, expired break-glass role, backup that never finished.
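Those three measurements fall out of four timestamps the scribe should already be recording. The sketch below assumes every duration is measured from the moment the fault was injected; some teams measure time-to-mitigate from detection instead, so treat the definitions as adjustable.

```python
from datetime import datetime, timedelta
from typing import Dict


def drill_metrics(fault_injected: datetime, detected: datetime,
                  mitigated: datetime, restored: datetime) -> Dict[str, timedelta]:
    """Durations for one drill, all measured from the moment of fault injection."""
    if not fault_injected <= detected <= mitigated <= restored:
        raise ValueError("timestamps must be in chronological order")
    return {
        "time_to_detect": detected - fault_injected,
        "time_to_mitigate": mitigated - fault_injected,
        "time_to_restore": restored - fault_injected,
    }
```

Tracking these per drill gives you a trend line, which is more useful than any single number.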
Before the pager: design that makes calm possible
Most incident stress is borrowed from yesterday’s shortcuts.
- Observability — metrics, logs, traces tied to SLOs; ownership of dashboards per service.
- Runbooks — linked from alerts; include rollback, escalation, and “known weird.”
- Safe deploys — canaries, automated rollback, change windows for high-risk paths.
- Blast-radius limits — namespaces, accounts, cell-based architecture, rate limits.
- On-call that sleeps — fair rotations, sensible alert thresholds, follow-the-sun where you can.
- Declarative prod — when live matches Git, you spend fewer incidents hunting hand-edited drift.
People who run Kubernetes benefit from the same discipline: clear RBAC boundaries (cluster RBAC), documented storage and CSI behavior (PV/PVC/StorageClass), and debugging habits built through hands-on practice (debugging next steps).
During the incident: a fifteen-minute script
Run through this mentally when you join a bridge call:
- 0–2 min — Confirm impact and severity; open incident record; assign IC and scribe.
- 2–5 min — Recent changes, dashboard review, customer-facing status draft.
- 5–15 min — One mitigation (rollback / scale / failover / flag off); measure.
- Ongoing — IC runs timeline; comms every N minutes for SEV-1/2; no unapproved changes.
- Stabilized — Declare mitigated; hand off to root-cause work; schedule postmortem within 48–72 hours.
After: learning without blame theater
A blameless postmortem asks how the system allowed this, not who clicked wrong. Strong postmortems include:
- Timeline with UTC timestamps
- Impact (users, revenue, duration, SLO burn)
- Contributing factors—not a single “root cause” fairy tale unless truly singular
- What went well (fast rollback, clear IC)
- Action items with owners and dates—automation, alerts, docs, architecture
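The "SLO burn" figure in the impact section is simple arithmetic. The sketch below assumes a time-based availability SLO over a rolling window, and it scales the outage by the fraction of users affected, which is one common convention rather than the only one. Function names are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an SLO such as 0.999."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


def budget_burned_pct(outage_minutes: float, slo: float,
                      window_days: int = 30,
                      affected_fraction: float = 1.0) -> float:
    """Share of the window's error budget consumed by one incident."""
    budget = error_budget_minutes(slo, window_days)
    return 100 * outage_minutes * affected_fraction / budget
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime, so a full outage of that length spends the entire budget for the window.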
Close the open loops your mind would otherwise carry home, which the psychology post describes as the Zeigarnik effect. A finished postmortem helps people actually log off.
Communication templates (calm is contagious)
Internal update (Slack / bridge):
SEV-2 — Checkout latency
Impact: ~15% of EU users; payments succeeding but slow.
Status: Mitigating. Rolled back deploy v2.4.1 at 14:32 UTC; watching p95.
Next: IC: Ana. Scribe: doc link. Next update 14:45 UTC.
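If you post updates on a fixed cadence, generating them from a few fields keeps the format stable when people are tired. The helper below is a hypothetical sketch that mirrors the internal update; the parameter names and the 15-minute default cadence are assumptions.

```python
from datetime import datetime, timedelta, timezone


def internal_update(severity: str, title: str, impact: str, status: str,
                    ic: str, scribe: str, now: datetime,
                    cadence_minutes: int = 15) -> str:
    """Render a short internal update; `now` must be timezone-aware."""
    # Promise the next update time explicitly so nobody has to ask for it.
    next_update = now.astimezone(timezone.utc) + timedelta(minutes=cadence_minutes)
    return "\n".join([
        f"{severity}: {title}",
        f"Impact: {impact}",
        f"Status: {status}",
        f"IC: {ic}. Scribe: {scribe}. Next update {next_update.strftime('%H:%M')} UTC.",
    ])
```

Stating the next update time in every message is the part worth keeping even if you never automate the rest.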
External status (short, honest, no jargon soup):
We are investigating elevated errors affecting sign-in in Europe. Some users may need to retry. We will update within 30 minutes.
Common pitfalls
- Hero culture — one person holds all context; calm looks like martyrdom until they burn out.
- Alert fatigue — ignored pages mean slow detection; tune alerts to symptoms users feel.
- Untested backups — discovery during ransomware is a disaster on top of a disaster.
- DR on paper only — DNS TTL, secret replication, and license limits bite during real failover.
- Fixing in prod without recording — you “recover” and lose the audit trail for learning.
How this fits your DevOps and SRE practice
Incidents are where culture meets architecture. DevOps history taught shared ownership; SRE added error budgets and sustainable on-call; platform engineering adds paved roads so fewer incidents start as “nobody knows this cluster.” Calm response is the human interface to all of that—train it like you train kubectl.
Further reading
- Google — Site Reliability Engineering (incident response, postmortems, on-call)
- PagerDuty — incident response guides and severity definitions
- Atlassian — incident handbook (roles and comms)
- NIST — contingency planning (SP 800-34) for formal DR/BCP vocabulary
- Will Larson — An Elegant Puzzle (organizational load during crises)