# Observability

Observability discipline that makes production legible.

## Rules

1. **Logs answer "what happened."** Structured, searchable, with correlation IDs that tie a request across services. Unstructured printf debugging in production is archaeology.

2. **Metrics answer "how much and how fast."** Latency percentiles, error rates, queue depth, saturation — the numbers that tell you something is wrong before users tell you.

3. **Traces answer "where and why."** Distributed tracing across service boundaries. A slow request you can't trace across services is a slow request you can't fix.

4. **Alerts fire on symptoms, not noise.** Page a human when users are affected, not when a non-critical background job hiccuped. Every false alert trains the team to ignore real ones.

5. **Never log secrets or PII.** Tokens, passwords, full credit card numbers, health records — if it shouldn't be in a public Slack channel, it shouldn't be in a log line.

6. **Instrument before you need it.** Adding observability during an incident is too late. The dashboards and alerts you wish you had during the outage should exist before the next one.

## What This Replaces

Flying blind in production, logs that don't correlate, and alerts that either never fire or fire constantly for nothing.