Postmortems - DevHelm

Postmortems turn incidents into learning opportunities. Focus on systems, not people, and commit to concrete follow-up actions.

Blameless culture

The most important principle: blame systems, not people. If a human made an error, ask why the system allowed that error to have impact:

“Why didn’t the deployment pipeline catch the config error?”
“Why wasn’t there a validation check before that API call?”
“Why could a single change bring down the entire service?”

When people fear blame, they hide information. When they trust the process, they share freely — and the whole team learns.

Postmortem structure

Summary

2–3 sentences covering what happened, the impact, and how long it lasted.

Timeline

Chronological events from first signal to full resolution:

14:02 - Monitoring alert: API p95 latency > 5s
14:04 - On-call engineer acknowledges
14:08 - Investigation: database connection pool exhausted
14:12 - Mitigation: increased max connections from 50 to 200
14:15 - Latency returning to normal
14:25 - Confirmed stable for 10 minutes, incident resolved
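
A timeline like the one above can also be turned into response metrics. The following is a minimal sketch, not a DevHelm feature: it assumes the entries fall in the 14:00 hour (consistent with the 13:55 deployment in this example), and the event labels and the `minutes_between` helper are hypothetical names chosen for illustration.

```python
from datetime import datetime

# Timeline entries from the example; the hour (14) and the short labels
# are assumptions made for this sketch.
TIMELINE = [
    ("14:02", "alert"),
    ("14:04", "acknowledged"),
    ("14:08", "diagnosed"),
    ("14:12", "mitigated"),
    ("14:15", "recovering"),
    ("14:25", "resolved"),
]


def minutes_between(timeline, start_label, end_label):
    """Return whole minutes elapsed between two labelled timeline events."""
    times = {label: datetime.strptime(ts, "%H:%M") for ts, label in timeline}
    delta = times[end_label] - times[start_label]
    return int(delta.total_seconds() // 60)


time_to_acknowledge = minutes_between(TIMELINE, "alert", "acknowledged")  # 2
time_to_mitigate = minutes_between(TIMELINE, "alert", "mitigated")        # 10
time_to_resolve = minutes_between(TIMELINE, "alert", "resolved")          # 23
```

Tracking these numbers across postmortems shows whether detection and response are improving over time.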

Impact

Quantify the damage:

Duration of user impact
Number of affected users or requests
Revenue impact (if measurable)
SLA/SLO impact
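
SLO impact is often expressed as the share of the error budget an incident consumed. The sketch below assumes a 99.9% availability target over a 30-day window and 23 minutes of user impact. All three numbers are illustrative assumptions, so substitute your own targets and measured duration.

```python
def error_budget_minutes(slo_target, window_days):
    """Total allowed downtime, in minutes, for an availability SLO."""
    return window_days * 24 * 60 * (1 - slo_target)


def budget_consumed(downtime_minutes, slo_target, window_days):
    """Fraction of the window's error budget used by one incident."""
    return downtime_minutes / error_budget_minutes(slo_target, window_days)


# Assumed values: 99.9% target, 30-day window, 23 minutes of impact.
budget = error_budget_minutes(0.999, 30)   # about 43.2 minutes
consumed = budget_consumed(23, 0.999, 30)  # about 0.53, i.e. roughly 53%
print(f"Budget: {budget:.1f} min, consumed: {consumed:.0%}")
```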

Root cause

The underlying technical reason, not “human error”:

A deployment at 13:55 introduced a new query that opened a connection per request instead of using the connection pool. Under normal traffic, this exhausted the 50-connection limit within 7 minutes.
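
To make this kind of root cause concrete, here is a hedged sketch contrasting a connection opened per request with a connection borrowed from a pool. It is not the code from the incident: `ConnectionPool`, `handle_request_leaky`, and `handle_request_pooled` are hypothetical names, and an in-memory SQLite database stands in for the real database.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Tiny fixed-size pool; a stand-in for a real driver's pool."""

    def __init__(self, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(sqlite3.connect(":memory:", check_same_thread=False))

    @contextmanager
    def connection(self, timeout=1.0):
        # Raises queue.Empty if no connection frees up within the timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back

    def idle_count(self):
        return self._idle.qsize()


def handle_request_leaky(opened):
    # Anti-pattern: a fresh connection per request that is never closed.
    conn = sqlite3.connect(":memory:")
    opened.append(conn)  # models a server-side connection slot still held
    return conn.execute("SELECT 1").fetchone()[0]


def handle_request_pooled(pool):
    # Borrow a connection and return it when the request finishes.
    with pool.connection() as conn:
        return conn.execute("SELECT 1").fetchone()[0]


pool = ConnectionPool(size=2)
leaked = []
for _ in range(5):
    handle_request_leaky(leaked)
    handle_request_pooled(pool)

print(len(leaked))        # 5 connections still open after 5 requests
print(pool.idle_count())  # 2, the pool is back to full capacity
```

The leaky handler holds one more connection with every request, which is how a fixed server-side limit gets exhausted under ordinary traffic.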

Contributing factors

What else made the incident worse or slower to resolve:

No connection pool monitoring alert existed
The deployment happened on Friday afternoon with reduced staffing
The runbook for database issues was outdated
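
The first factor above, a missing pool alert, could be closed with a simple utilization check. This is a sketch only: the 80% threshold and the function names are assumptions, and in practice the check would live in your monitoring system's alert rules rather than in application code.

```python
def pool_utilization(active_connections, max_connections):
    """Fraction of the connection pool currently in use."""
    return active_connections / max_connections


def should_alert(active_connections, max_connections, threshold=0.8):
    """True when utilization reaches the (assumed) alert threshold."""
    return pool_utilization(active_connections, max_connections) >= threshold


# With the 50-connection limit from the example:
print(should_alert(10, 50))  # False, 20% utilization
print(should_alert(40, 50))  # True, 80% utilization
```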

Action items

Concrete, assignable tasks with owners and deadlines:

Action	Owner	Due
Add connection pool utilization alert	@alice	Next sprint
Fix connection leak in new query	@bob	This week
Update database runbook	@carol	Next sprint
Exercise the connection pool limit in load tests	@dave	Next month
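
If action items are tracked as structured data, it is easy to flag any that lack an owner or a due date before the postmortem is published. The sketch below is one possible shape, not a DevHelm API: the `ActionItem` type and `needs_attention` helper are hypothetical, and the last item is an invented example of an incomplete entry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionItem:
    action: str
    owner: Optional[str] = None
    due: Optional[str] = None


def needs_attention(items):
    """Return the items that are missing an owner or a due date."""
    return [item for item in items if not item.owner or not item.due]


items = [
    ActionItem("Add connection pool utilization alert", "@alice", "Next sprint"),
    ActionItem("Fix connection leak in new query", "@bob", "This week"),
    ActionItem("Update database runbook", "@carol", "Next sprint"),
    # Hypothetical incomplete item, added only to show the check firing.
    ActionItem("Review deploy freeze policy"),
]

for item in needs_attention(items):
    print(f"Missing owner or due date: {item.action}")
# prints "Missing owner or due date: Review deploy freeze policy"
```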

Running a postmortem meeting

Schedule within 48 hours while memories are fresh
Include all responders plus relevant stakeholders
Walk through the timeline together — fill gaps, correct errors
Focus on systems — redirect any blame to systemic improvements
Assign action items with clear owners
Publish the document where the whole team can read it

Common pitfalls

Skipping postmortems for “small” incidents — small incidents reveal systemic issues
Action items without owners — unassigned items never get done
Blame disguised as process — “Engineer should have tested more carefully” is still blame
No follow-through — review action item completion in the next team sync

Incident Response 101

The full incident lifecycle including follow-up.

Communication during incidents

Keep stakeholders informed throughout.
