Incident Playbooks

Playbooks provide step-by-step procedures for common incident types so responders don’t have to work out a plan from scratch under pressure.

Why playbooks matter

During an incident, stress is high, time is short, and context-switching is constant. A playbook turns “figure out what to do” into “follow these steps”:

Faster response — no time wasted deciding where to start
Consistent quality — every responder follows the same process
Reduced errors — stress-induced mistakes are minimized
Knowledge sharing — new team members can respond effectively

Playbook structure

A good playbook covers:

1. Detection signals

What triggered this playbook? Which alerts, symptoms, or user reports indicate this type of incident?
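One way to make the trigger explicit is to keep a small lookup from alert names to playbooks, so the alert itself points the responder at the right document. The sketch below is illustrative only: the second alert name, the file paths, and the `find_playbook` helper are assumptions, not features of any particular monitoring tool.

```python
from typing import Optional

# Hypothetical mapping from alert name to playbook location.
# The paths and the second alert name are made up for illustration.
PLAYBOOKS_BY_ALERT = {
    "API p95 response time > 2s for 5 minutes": "playbooks/api-latency-spike.md",
    "Database connections > 90% of pool": "playbooks/db-connection-saturation.md",
}


def find_playbook(alert_name: str) -> Optional[str]:
    """Return the playbook path for an alert, or None if no playbook covers it."""
    return PLAYBOOKS_BY_ALERT.get(alert_name)
```

An alert that returns `None` here is a useful signal in itself: it marks an incident type that does not yet have a playbook.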

2. Impact assessment

Quick checks to determine scope:

Which services are affected?
How many users are impacted?
Is the issue regional or global?

3. Immediate actions

Step-by-step mitigation:

Check recent deployments and roll back any that coincide with the start of the incident
Check dependency status — is an upstream provider down?
Check resource utilization — CPU, memory, disk, connections
Check error logs for patterns

4. Communication template

Pre-written messages for stakeholders:

Status page update template
Internal Slack message template
Customer communication template
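A pre-written message can be as simple as a string with named placeholders that the responder fills in. The following is a minimal sketch using Python's standard `string.Template`; the wording of the template, the field names, and the `render_status_update` helper are assumptions chosen for illustration.

```python
from string import Template

# Hypothetical status page template; adapt the wording to your own tone.
STATUS_PAGE_TEMPLATE = Template(
    "[$status] $service: $summary. Next update in $next_update_minutes minutes."
)


def render_status_update(status, service, summary, next_update_minutes):
    """Fill in the template; raises KeyError if a placeholder is left unset."""
    return STATUS_PAGE_TEMPLATE.substitute(
        status=status,
        service=service,
        summary=summary,
        next_update_minutes=next_update_minutes,
    )


message = render_status_update(
    "Investigating", "API", "Elevated response times", 30
)
```

Using `substitute` rather than `safe_substitute` is deliberate here: a missing field fails loudly instead of publishing a message with a raw `$placeholder` in it.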

5. Escalation criteria

When to bring in more people:

No progress after 15 minutes → escalate to senior engineer
Customer-facing impact confirmed → notify incident commander
Data integrity concern → notify security team
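The three criteria above can be written down as explicit rules, which makes them easy to review and to test. This is a rough sketch, not a definitive implementation: the `IncidentState` type and the `escalation_targets` function are hypothetical names, and "no progress after 15 minutes" is interpreted as 15 or more minutes without progress.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentState:
    """Hypothetical snapshot of what the responder currently knows."""
    minutes_without_progress: int
    customer_facing_impact: bool
    data_integrity_concern: bool


def escalation_targets(state: IncidentState) -> List[str]:
    """Return who should be brought in, following the criteria listed above."""
    targets = []
    if state.minutes_without_progress >= 15:
        targets.append("senior engineer")
    if state.customer_facing_impact:
        targets.append("incident commander")
    if state.data_integrity_concern:
        targets.append("security team")
    return targets
```

The rules are independent, so a single incident can trigger several escalations at once.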

6. Resolution verification

How to confirm the incident is resolved:

Monitoring shows all checks passing for 10+ minutes
Error rates return to baseline
No new user reports
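These conditions can also be checked mechanically before an incident is closed. The sketch below assumes one monitoring result per minute, supplied oldest first, and treats "return to baseline" as the current error rate being at or below a known baseline value; the `is_resolved` function and its parameters are hypothetical.

```python
def is_resolved(check_results, error_rate, baseline_error_rate,
                new_user_reports, required_minutes=10):
    """Return True only if all three resolution criteria hold.

    check_results: list of booleans, one per minute, oldest first
                   (True means every monitoring check passed that minute).
    """
    recent = check_results[-required_minutes:]
    # Require a full window of passing checks, not just "no failures seen yet".
    checks_ok = len(recent) >= required_minutes and all(recent)
    errors_ok = error_rate <= baseline_error_rate
    reports_ok = new_user_reports == 0
    return checks_ok and errors_ok and reports_ok
```

Requiring a full window guards against declaring resolution a minute or two after a fix, when monitoring has not yet had time to confirm it.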

Example: API latency spike

## API Latency Spike Playbook

### Detection
- Alert: "API p95 response time > 2s for 5 minutes"
- Monitor: API Health (HTTP)

### Immediate actions
1. Check dashboard for affected endpoints
2. Check recent deployments (last 2 hours)
   - If found: roll back → verify latency drops
3. Check database connection pool utilization
   - If saturated: restart connection pool or increase max connections
4. Check for unusual traffic patterns (DDoS, scraping)
   - If found: enable rate limiting

### Escalation
- No improvement after 15 minutes → page database team
- User-facing impact confirmed → update status page

### Resolution
- p95 latency below 500ms for 10 minutes
- No elevated error rates
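To make the detection rule in this example concrete, here is a minimal sketch of how "p95 response time > 2s for 5 minutes" could be evaluated. It assumes latencies are collected in seconds and grouped per minute, and it uses the nearest-rank definition of a percentile; real monitoring systems may compute percentiles differently. The function names are hypothetical.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies (seconds)."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, giving a 1-based rank without float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_alert_firing(per_minute_samples, threshold_s=2.0, window_minutes=5):
    """Return True if p95 exceeded the threshold in each of the last N minutes.

    per_minute_samples: list of per-minute latency lists, oldest first.
    """
    window = per_minute_samples[-window_minutes:]
    if len(window) < window_minutes:
        return False  # not enough data to cover the full window
    # A minute with no samples does not count as breaching the threshold.
    return all(minute and p95(minute) > threshold_s for minute in window)
```

The same helper with a 0.5 second threshold and a 10 minute window, inverted, would express the resolution condition in the playbook.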

Tips for effective playbooks

Keep them short — responders won’t read a 10-page document during an incident
Use checklists — not paragraphs. Each step should be a concrete action
Include links — to dashboards, runbooks, log queries, and escalation contacts
Update after incidents — every postmortem should review and update relevant playbooks
Test them — run game days to validate that playbooks work in practice

Incident Response 101

The full incident lifecycle.

Postmortems

Turn incidents into playbook improvements.
