Postmortems turn incidents into learning opportunities. Focus on systems, not people, and commit to concrete follow-up actions.

Blameless culture

The most important principle: blame systems, not people. If a human made an error, ask why the system allowed that error to have impact:
  • “Why didn’t the deployment pipeline catch the config error?”
  • “Why wasn’t there a validation check before that API call?”
  • “Why could a single change bring down the entire service?”
When people fear blame, they hide information. When they trust the process, they share freely — and the whole team learns.

Postmortem structure

Summary

2–3 sentences covering what happened, the impact, and how long it lasted.

Timeline

Chronological events from first signal to full resolution:
14:02 — Monitoring alert: API p95 latency > 5s
14:04 — On-call engineer acknowledges
14:08 — Investigation: database connection pool exhausted
14:12 — Mitigation: increased max connections from 50 to 200
14:15 — Latency returning to normal
14:25 — Confirmed stable for 10 minutes, incident resolved
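
A timeline like this one also yields response metrics such as time to acknowledge and time to resolve. The following is a minimal sketch, assuming the timestamps all fall on the same day; the short event labels and the helper names (`minutes_between`, `is_chronological`) are illustrative and not part of any tool.

```python
from datetime import datetime

# Times copied from the sample timeline; the labels are shorthand
# invented for this sketch.
TIMELINE = [
    ("14:02", "alert"),
    ("14:04", "acknowledged"),
    ("14:08", "diagnosed"),
    ("14:12", "mitigated"),
    ("14:15", "recovering"),
    ("14:25", "resolved"),
]


def parse(hhmm: str) -> datetime:
    """Parse an HH:MM string; assumes every event is on the same day."""
    return datetime.strptime(hhmm, "%H:%M")


def minutes_between(start: str, end: str, timeline=TIMELINE) -> int:
    """Whole minutes elapsed between two labelled events."""
    times = {label: parse(stamp) for stamp, label in timeline}
    return int((times[end] - times[start]).total_seconds() // 60)


def is_chronological(timeline=TIMELINE) -> bool:
    """True when no entry is earlier than the entry before it."""
    stamps = [parse(stamp) for stamp, _ in timeline]
    return all(a <= b for a, b in zip(stamps, stamps[1:]))


if __name__ == "__main__":
    print("time to acknowledge:", minutes_between("alert", "acknowledged"), "min")
    print("time to mitigate:", minutes_between("alert", "mitigated"), "min")
    print("time to resolve:", minutes_between("alert", "resolved"), "min")
```

Running the ordering check while the timeline is being drafted is a cheap way to catch transposed or mistyped timestamps before the review meeting.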

Impact

Quantify the damage:
  • Duration of user impact
  • Number of affected users or requests
  • Revenue impact (if measurable)
  • SLA/SLO impact
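
SLO impact is often expressed as the share of the error budget an incident consumed. The sketch below assumes a 99.9% availability SLO over a 30-day window and 23 minutes of user impact (alert to resolution in the sample timeline). All three figures are illustrative assumptions, and real user impact may have started before the first alert.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unavailability in the window for the given SLO."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_consumed(incident_minutes: float, slo_target: float, window_days: int) -> float:
    """Fraction of the window's error budget used by one incident."""
    return incident_minutes / error_budget_minutes(slo_target, window_days)


if __name__ == "__main__":
    # Assumed values, not taken from a real SLO document.
    slo, window, impact = 0.999, 30, 23
    print(f"budget: {error_budget_minutes(slo, window):.1f} min")
    print(f"consumed: {budget_consumed(impact, slo, window):.0%}")
```

Under these assumptions the budget is roughly 43 minutes, so a single 23-minute incident uses a little over half of it.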

Root cause

The underlying technical reason, not “human error”:
A deployment at 13:55 introduced a new query that opened a connection per request instead of using the connection pool. Under normal traffic, this exhausted the 50-connection limit within 7 minutes.
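
To make this failure mode concrete, here is a small simulation of it. This is a sketch, not the actual service code: `FakeDatabase`, `FakeConnection`, `Pool`, and both handlers are hypothetical stand-ins, and the only figure taken from the example is the 50-connection limit.

```python
from contextlib import contextmanager
from queue import Queue


class FakeDatabase:
    """Stand-in for a database server that enforces a connection limit."""

    def __init__(self, max_connections: int = 50):
        self.max_connections = max_connections
        self.open_connections = 0

    def connect(self) -> "FakeConnection":
        if self.open_connections >= self.max_connections:
            raise ConnectionError("too many connections")
        self.open_connections += 1
        return FakeConnection(self)


class FakeConnection:
    def __init__(self, db: FakeDatabase):
        self.db = db
        self.closed = False

    def close(self) -> None:
        if not self.closed:
            self.closed = True
            self.db.open_connections -= 1


class Pool:
    """Tiny fixed-size pool: connections are borrowed and always returned."""

    def __init__(self, db: FakeDatabase, size: int):
        self._idle: Queue = Queue()
        for _ in range(size):
            self._idle.put(db.connect())

    @contextmanager
    def connection(self):
        conn = self._idle.get()
        try:
            yield conn
        finally:
            self._idle.put(conn)  # returned even if the request fails


def leaky_handler(db: FakeDatabase) -> None:
    # Anti-pattern: a new connection per request that is never released.
    db.connect()


def pooled_handler(pool: Pool) -> None:
    # Borrow a pooled connection and hand it back when the request ends.
    with pool.connection():
        pass


def requests_until_failure(db: FakeDatabase, limit: int = 1000) -> int:
    """Count how many leaky requests succeed before the database refuses one."""
    for served in range(limit):
        try:
            leaky_handler(db)
        except ConnectionError:
            return served
    return limit
```

With the leaky handler the simulated database refuses the request after the 50th, while the pooled handler can serve any number of requests without the open-connection count growing beyond the pool size.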

Contributing factors

What else made the incident worse or slower to resolve:
  • No connection pool monitoring alert existed
  • The deployment happened on Friday afternoon with reduced staffing
  • The runbook for database issues was outdated
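
The first factor, the missing pool alert, can be addressed with a simple threshold rule. The sketch below shows one possible rule in plain Python rather than in any particular monitoring system's syntax; the 80% threshold, the three-sample window, and the names `PoolSample` and `should_alert` are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PoolSample:
    in_use: int    # connections currently checked out
    max_size: int  # configured pool limit, assumed to be greater than zero

    @property
    def utilization(self) -> float:
        return self.in_use / self.max_size


def should_alert(
    samples: Sequence[PoolSample],
    threshold: float = 0.8,
    sustained: int = 3,
) -> bool:
    """Fire when the latest `sustained` samples are all at or above threshold."""
    if len(samples) < sustained:
        return False
    return all(s.utilization >= threshold for s in samples[-sustained:])
```

Requiring several consecutive samples above the threshold is a design choice that trades a slightly later alert for fewer pages on brief spikes.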

Action items

Concrete, assignable tasks with owners and deadlines:
Action                                   Owner    Due
Add connection pool utilization alert    @alice   Next sprint
Fix connection leak in new query         @bob     This week
Update database runbook                  @carol   Next sprint
Add connection pool limit to load test   @dave    Next month
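
If action items are tracked as structured records, missing owners and slipped deadlines can be flagged automatically. The following is a minimal sketch, assuming relative deadlines such as "Next sprint" have been converted to calendar dates; the `ActionItem` fields and the `problems` helper are hypothetical and not tied to any issue tracker.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    action: str
    owner: Optional[str]
    due: Optional[date]
    done: bool = False


def problems(item: ActionItem, today: date) -> List[str]:
    """List the reasons an action item is at risk of never being finished."""
    issues = []
    if not item.owner:
        issues.append("no owner")
    if item.due is None:
        issues.append("no due date")
    elif not item.done and item.due < today:
        issues.append("overdue")
    return issues


if __name__ == "__main__":
    # The dates below are placeholders invented for the example.
    items = [
        ActionItem("Fix connection leak in new query", "@bob", date(2024, 1, 5)),
        ActionItem("Update database runbook", None, None),
    ]
    for item in items:
        print(item.action, "->", problems(item, today=date(2024, 1, 10)) or "ok")
```

A check like this could run before each team sync so that unowned or overdue items are raised in the meeting instead of being discovered later.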

Running a postmortem meeting

  1. Schedule within 48 hours of the incident, while memories are fresh
  2. Include all responders plus relevant stakeholders
  3. Walk through the timeline together — fill gaps, correct errors
  4. Focus on systems — redirect any blame to systemic improvements
  5. Assign action items with clear owners
  6. Publish the document where the whole team can read it

Common pitfalls

  • Skipping postmortems for “small” incidents — small incidents reveal systemic issues
  • Action items without owners — unassigned items never get done
  • Blame disguised as process — “Engineer should have tested more carefully” is still blame
  • No follow-through — action items stall unless someone reviews their completion, for example at the next team sync
