Playbooks provide step-by-step procedures for common incident types so responders don’t have to improvise under pressure.

Why playbooks matter

During an incident, stress is high, time is short, and context-switching is constant. A playbook turns “figure out what to do” into “follow these steps”, which brings:
  • Faster response — no time wasted deciding where to start
  • Consistent quality — every responder follows the same process
  • Reduced errors — stress-induced mistakes are minimized
  • Knowledge sharing — new team members can respond effectively

Playbook structure

A good playbook covers:

1. Detection signals

What triggered this playbook? Which alerts, symptoms, or user reports indicate this type of incident?
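
One way to make detection signals concrete is to keep them as data and look up the matching playbook when an event arrives. The sketch below is only an illustration: the registry `PLAYBOOK_SIGNALS`, the playbook names, and the substring matching are all assumptions, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionSignal:
    """One alert, symptom, or report pattern that should open a playbook."""
    source: str   # hypothetical categories, e.g. "alert" or "user_report"
    pattern: str  # lowercase substring to look for in the incoming text


# Hypothetical registry: playbook name -> signals that trigger it.
PLAYBOOK_SIGNALS = {
    "api-latency-spike": [
        DetectionSignal("alert", "p95 response time"),
        DetectionSignal("user_report", "slow"),
    ],
    "database-outage": [
        DetectionSignal("alert", "connection refused"),
    ],
}


def match_playbooks(source: str, text: str) -> list[str]:
    """Return the names of playbooks whose signals match the incoming event."""
    lowered = text.lower()
    return [
        name
        for name, signals in PLAYBOOK_SIGNALS.items()
        if any(s.source == source and s.pattern in lowered for s in signals)
    ]
```

Calling `match_playbooks("alert", "API p95 response time > 2s for 5 minutes")` returns `["api-latency-spike"]`. An event that matches nothing returns an empty list, which is itself a useful signal that a playbook may be missing.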

2. Impact assessment

Quick checks to determine scope:
  • Which services are affected?
  • How many users are impacted?
  • Is the issue regional or global?

3. Immediate actions

Step-by-step mitigation:
  1. Check recent deployments and roll back any that coincide with the start of the incident
  2. Check dependency status — is an upstream provider down?
  3. Check resource utilization — CPU, memory, disk, connections
  4. Check error logs for patterns
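
The ordering of these steps matters: the responder works down the list and acts on the first check that finds something. A minimal sketch of that flow follows, assuming each check can be expressed as a function returning `True` when it finds a problem. The `Step` type, the lambdas, and the mitigation strings are placeholders for illustration.

```python
from typing import Callable, NamedTuple, Optional


class Step(NamedTuple):
    description: str
    check: Callable[[], bool]  # returns True when the check finds a problem
    mitigation: str            # what the responder should do if it does


def first_actionable_step(steps: list[Step]) -> Optional[Step]:
    """Run checks in playbook order and return the first one that finds a problem."""
    for step in steps:
        if step.check():
            return step
    return None


# Hypothetical checks; real ones would query a deploy log, a provider
# status page, a metrics API, or a log search.
steps = [
    Step("Recent deployment?", lambda: False, "Roll back the deployment"),
    Step("Upstream provider down?", lambda: True, "Follow the dependency-outage procedure"),
    Step("Resources exhausted?", lambda: True, "Free or add capacity"),
    Step("Error log patterns?", lambda: False, "Investigate the dominant error"),
]

hit = first_actionable_step(steps)
print(hit.mitigation if hit else "No check fired: escalate")
# prints "Follow the dependency-outage procedure"
```

Later checks are not run once an earlier one fires, which mirrors how a responder would stop and mitigate before continuing down the list.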

4. Communication template

Pre-written messages for stakeholders:
  • Status page update template
  • Internal Slack message template
  • Customer communication template
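
A pre-written template can be as simple as a string with named placeholders that the responder fills in. The sketch below uses Python's standard `string.Template`; the wording of the message and the field names (`status`, `service`, `summary`, `next_update_minutes`) are assumptions chosen for illustration.

```python
from string import Template

# Hypothetical status page template; adapt the wording to your own voice.
STATUS_PAGE_TEMPLATE = Template(
    "[$status] $service: $summary. "
    "Next update in $next_update_minutes minutes."
)


def render_status_update(status, service, summary, next_update_minutes):
    """Fill in the template; substitute() raises KeyError if a field is missing."""
    return STATUS_PAGE_TEMPLATE.substitute(
        status=status,
        service=service,
        summary=summary,
        next_update_minutes=next_update_minutes,
    )


message = render_status_update("Investigating", "API", "Elevated response times", 30)
print(message)
# prints "[Investigating] API: Elevated response times. Next update in 30 minutes."
```

Using `substitute` rather than `safe_substitute` is deliberate here: a missing field fails loudly instead of publishing a message with a raw placeholder in it.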

5. Escalation criteria

When to bring in more people:
  • No progress after 15 minutes → escalate to senior engineer
  • Customer-facing impact confirmed → notify incident commander
  • Data integrity concern → notify security team
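
Escalation criteria like these are explicit enough to encode directly, which helps keep them unambiguous. The following is a sketch under the assumption that the incident state is tracked in a small record; `IncidentState` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    minutes_without_progress: int
    customer_facing_impact: bool
    data_integrity_concern: bool


def escalation_targets(state: IncidentState) -> list[str]:
    """Return who should be brought in, following the three criteria above."""
    targets = []
    if state.minutes_without_progress >= 15:
        targets.append("senior engineer")
    if state.customer_facing_impact:
        targets.append("incident commander")
    if state.data_integrity_concern:
        targets.append("security team")
    return targets
```

For example, `escalation_targets(IncidentState(20, True, False))` returns `["senior engineer", "incident commander"]`. The criteria are independent, so several can apply at once.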

6. Resolution verification

How to confirm the incident is resolved:
  • Monitoring shows all checks passing for 10+ minutes
  • Error rates return to baseline
  • No new user reports
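
These three conditions can be checked together over a time window. The sketch below assumes monitoring produces one sample per minute and that a baseline error rate is known; the `HealthSample` record and its fields are illustrative, not a real monitoring API.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    minute: int            # minutes since mitigation was applied
    checks_passing: bool
    error_rate: float      # fraction of requests failing
    new_user_reports: int


def is_resolved(samples, baseline_error_rate, window_minutes=10):
    """True when the most recent `window_minutes` of samples are all healthy."""
    if not samples:
        return False
    latest = max(s.minute for s in samples)
    window = [s for s in samples if s.minute > latest - window_minutes]
    # Require a sample for every minute so a monitoring gap is not read as healthy.
    if len({s.minute for s in window}) < window_minutes:
        return False
    return all(
        s.checks_passing
        and s.error_rate <= baseline_error_rate
        and s.new_user_reports == 0
        for s in window
    )
```

A single unhealthy sample inside the window returns `False`, so the incident stays open until a later window is healthy throughout.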

Example: API latency spike

## API Latency Spike Playbook

### Detection
- Alert: "API p95 response time > 2s for 5 minutes"
- Monitor: API Health (HTTP)

### Immediate actions
1. Check dashboard for affected endpoints
2. Check recent deployments (last 2 hours)
   - If found: roll back → verify latency drops
3. Check database connection pool utilization
   - If saturated: restart connection pool or increase max connections
4. Check for unusual traffic patterns (DDoS, scraping)
   - If found: enable rate limiting

### Escalation
- No improvement after 15 minutes → page database team
- User-facing impact confirmed → update status page

### Resolution
- p95 latency below 500ms for 10 minutes
- No elevated error rates
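
The detection condition in the example, "API p95 response time > 2s for 5 minutes", could be evaluated roughly as follows. This is a sketch, not how any specific monitoring product computes it: it assumes latencies are grouped into one-minute buckets and uses the nearest-rank definition of a percentile.

```python
def p95(latencies_s):
    """Nearest-rank 95th percentile of a non-empty list of latencies (seconds)."""
    ordered = sorted(latencies_s)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n, 1-based
    return ordered[rank - 1]


def alert_fires(per_minute_latencies, threshold_s=2.0, minutes=5):
    """True if each of the last `minutes` one-minute buckets has p95 above the threshold."""
    if len(per_minute_latencies) < minutes:
        return False
    recent = per_minute_latencies[-minutes:]
    # An empty bucket (no traffic) never counts as a breach.
    return all(bool(bucket) and p95(bucket) > threshold_s for bucket in recent)


slow_minute = [0.3] * 18 + [2.5, 3.0]  # 20 requests, 2 of them slow
fast_minute = [0.3] * 20

print(alert_fires([slow_minute] * 5))                  # prints True
print(alert_fires([slow_minute] * 4 + [fast_minute]))  # prints False
```

Requiring every bucket in the window to breach the threshold is what keeps a single slow minute from paging anyone.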

Tips for effective playbooks

  • Keep them short — responders won’t read a 10-page document during an incident
  • Use checklists — not paragraphs. Each step should be a concrete action
  • Include links — to dashboards, runbooks, log queries, and escalation contacts
  • Update after incidents — every postmortem should review and update relevant playbooks
  • Test them — run game days to validate that playbooks work in practice
