# Incident Response

Incident discipline for when production is broken.

## Rules

1. **Stabilize before you diagnose.** Stop the bleeding first: roll back, disable the feature flag, scale up, block the bad traffic. Root cause analysis comes once users are no longer being harmed.

2. **Communicate status.** State what's broken, what's been tried, and what's next. Silence during an incident breeds panic and duplicate effort.

3. **One change at a time.** During an active incident, don't stack three fixes and hope one works. Change one thing, observe, repeat. You need to know what fixed it.

4. **Preserve evidence.** Logs, metrics, stack traces, deploy timestamps: capture them before they rotate out. You'll need them for the postmortem, and once they're gone you can't get them back.

5. **Postmortem within 48 hours.** What happened, why, what fixed it, what prevents recurrence. Blameless, factual, actionable. An incident you don't learn from will happen again.

6. **Fix the system, not the symptom.** A manual restart that "fixes" a memory leak is not a fix. The incident isn't over until the underlying cause has a durable solution or a tracked plan to get one.
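
As an illustration of rules 2 through 4, the "change one thing, observe, repeat" loop can be sketched in a few lines of Python. This is a minimal sketch, not a prescribed tool: `Mitigation`, `IncidentLog`, `mitigate_one_at_a_time`, and the `is_healthy` callback are hypothetical names invented for this example. In a real incident, `is_healthy` would stand in for watching dashboards or alerts for long enough to trust the result.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List, Optional


@dataclass
class Mitigation:
    """One candidate change, e.g. a rollback or a feature-flag flip (hypothetical)."""
    name: str
    apply: Callable[[], None]


@dataclass
class IncidentLog:
    """Timestamped record of what was tried; reusable for status updates and the postmortem."""
    entries: List[str] = field(default_factory=list)

    def note(self, message: str) -> None:
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.entries.append(f"{stamp} {message}")


def mitigate_one_at_a_time(
    mitigations: List[Mitigation],
    is_healthy: Callable[[], bool],
    log: IncidentLog,
) -> Optional[str]:
    """Apply mitigations in order, observing health after each single change.

    Returns the name of the mitigation after which the system was healthy,
    or None if none of them helped.
    """
    for mitigation in mitigations:
        log.note(f"applying: {mitigation.name}")
        mitigation.apply()
        healthy = is_healthy()  # assumption: a real check would wait and poll metrics
        outcome = "healthy" if healthy else "still failing"
        log.note(f"observed after {mitigation.name}: {outcome}")
        if healthy:
            return mitigation.name  # stop here so we know which change worked
    return None


# Usage with a simulated outage caused by a bad feature flag.
state = {"flag_on": True}
log = IncidentLog()
fixed_by = mitigate_one_at_a_time(
    [
        Mitigation("scale up", lambda: None),
        Mitigation("disable feature flag", lambda: state.update(flag_on=False)),
    ],
    is_healthy=lambda: not state["flag_on"],
    log=log,
)
print(fixed_by)  # prints: disable feature flag
```

Because the loop stops at the first change that restores health, the log records exactly which change worked and when, so the same entries can feed a status update during the incident and the timeline in the postmortem.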

## What This Replaces

Panic debugging in production, stacking untested fixes under pressure, and closing incidents without understanding why they happened.