Blameless culture
The most important principle: blame systems, not people. If a human made an error, ask why the system allowed that error to have impact:
- “Why didn’t the deployment pipeline catch the config error?”
- “Why wasn’t there a validation check before that API call?”
- “Why could a single change bring down the entire service?”
Postmortem structure
Summary
2–3 sentences covering what happened, the impact, and how long it lasted.
Timeline
Chronological events from first signal to full resolution.
Impact
Quantify the damage:
- Duration of user impact
- Number of affected users or requests
- Revenue impact (if measurable)
- SLA/SLO impact
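The figures in this list can be derived from a few raw numbers. The following is a minimal sketch, not a prescribed method: the function name `impact_summary`, the request counts, the 99.9% SLO target, and the monthly request volume are all hypothetical values chosen for illustration.

```python
from datetime import datetime


def impact_summary(start, end, total_requests, failed_requests,
                   slo_target=0.999, monthly_requests=50_000_000):
    """Summarize user impact for a postmortem.

    slo_target and monthly_requests are assumed defaults, not values
    taken from any real service.
    """
    duration_minutes = (end - start).total_seconds() / 60
    failure_rate = failed_requests / total_requests
    # Error budget: how many requests may fail per month under the SLO.
    error_budget = (1 - slo_target) * monthly_requests
    return {
        "duration_minutes": duration_minutes,
        "failure_rate": failure_rate,
        "error_budget_used": failed_requests / error_budget,
    }


if __name__ == "__main__":
    # Hypothetical incident window and request counts.
    summary = impact_summary(
        start=datetime(2024, 1, 5, 14, 2),
        end=datetime(2024, 1, 5, 14, 47),
        total_requests=120_000,
        failed_requests=18_000,
    )
    for name, value in summary.items():
        print(f"{name}: {value:.2f}")
```

Expressing the failed requests as a share of the monthly error budget makes the SLO impact comparable across incidents of different sizes.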
Root cause
The underlying technical reason, not “human error”:
A deployment at 13:55 introduced a new query that opened a connection per request instead of using the connection pool. Under normal traffic, this exhausted the 50-connection limit within 7 minutes.
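To make this failure mode concrete, here is a small simulation contrasting a connection-per-request handler with a pooled one. It is a sketch under stated assumptions: `FakeDatabase`, `ConnectionPool`, `leaky_query`, and `requests_until_failure` are hypothetical names, the pool size of 10 is arbitrary, and the sketch assumes the per-request connections were never released.

```python
import queue

MAX_CONNECTIONS = 50  # the connection limit named in the root cause


class FakeDatabase:
    """Stand-in for a database server that rejects connections past its limit."""

    def __init__(self, limit):
        self.limit = limit
        self.open_connections = 0

    def connect(self):
        if self.open_connections >= self.limit:
            raise RuntimeError("connection limit reached")
        self.open_connections += 1
        return object()  # opaque connection handle


class ConnectionPool:
    """Opens a fixed number of connections once and lends them out."""

    def __init__(self, db, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(db.connect())

    def run_query(self):
        conn = self._idle.get()  # borrow a connection; blocks if none is idle
        try:
            return "ok"  # a real query would run on conn here
        finally:
            self._idle.put(conn)  # always hand the connection back


def leaky_query(db):
    """Assumed failure mode: a new connection per request, never closed."""
    db.connect()
    return "ok"


def requests_until_failure(handler, attempts):
    """Return the index of the first failing request, or None if all succeed."""
    for i in range(attempts):
        try:
            handler()
        except RuntimeError:
            return i
    return None
```

In this simulation the leaky handler fails once the limit is reached, while the pooled handler keeps the number of open connections constant no matter how many requests it serves.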
Contributing factors
What else made the incident worse or slower to resolve:
- No connection pool monitoring alert existed
- The deployment happened on Friday afternoon with reduced staffing
- The runbook for database issues was outdated
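The first factor in this list, the missing connection pool alert, could be covered by a check along these lines. This is only a sketch: the function name `pool_alert_level` and the 80% and 95% thresholds are assumptions, and a real setup would normally express the rule in the team's monitoring system rather than in application code.

```python
def pool_alert_level(in_use, limit, warn_at=0.80, page_at=0.95):
    """Classify connection pool utilization.

    warn_at and page_at are hypothetical thresholds; tune them to the
    service's normal traffic.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    utilization = in_use / limit
    if utilization >= page_at:
        return "page"  # close to exhaustion: wake someone up
    if utilization >= warn_at:
        return "warn"  # unusual but not yet user-facing
    return "ok"


if __name__ == "__main__":
    # With a 50-connection limit, sample a few utilization levels.
    for in_use in (30, 40, 48):
        print(in_use, pool_alert_level(in_use, 50))
```

A warning tier below the paging tier gives responders time to act before the limit is actually hit.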
Action items
Concrete, assignable tasks with owners and deadlines:

| Action | Owner | Due |
|---|---|---|
| Add connection pool utilization alert | @alice | Next sprint |
| Fix connection leak in new query | @bob | This week |
| Update database runbook | @carol | Next sprint |
| Add connection pool limit to load test | @dave | Next month |
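If action items are tracked as data rather than only in a document, a small check can flag entries that lack an owner or a due date before the postmortem is published. The sketch below assumes a hypothetical `ActionItem` record and `incomplete_items` helper. The first two entries mirror rows from the table above, and the third is an invented example of an incomplete item.

```python
from dataclasses import dataclass


@dataclass
class ActionItem:
    action: str
    owner: str
    due: str


def incomplete_items(items):
    """Return the actions that are missing an owner or a due date."""
    return [
        item.action
        for item in items
        if not item.owner.strip() or not item.due.strip()
    ]


if __name__ == "__main__":
    items = [
        ActionItem("Add connection pool utilization alert", "@alice", "Next sprint"),
        ActionItem("Fix connection leak in new query", "@bob", "This week"),
        # Invented example of an item that would be flagged.
        ActionItem("Review deployment freeze policy", "", "Next month"),
    ]
    for action in incomplete_items(items):
        print("Missing owner or due date:", action)
```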
Running a postmortem meeting
- Schedule within 48 hours while memories are fresh
- Include all responders plus relevant stakeholders
- Walk through the timeline together — fill gaps, correct errors
- Focus on systems — redirect any blame to systemic improvements
- Assign action items with clear owners
- Publish the document where the whole team can read it
Common pitfalls
- Skipping postmortems for “small” incidents — small incidents reveal systemic issues
- Action items without owners — unassigned items never get done
- Blame disguised as process — “Engineer should have tested more carefully” is still blame
- No follow-through — review action item completion in the next team sync