The incident lifecycle
1. Detection
Something breaks. Detection can come from:
- Automated monitoring — uptime checks, synthetic probes, alerting rules
- User reports — support tickets, social media, direct complaints
- Internal observation — a team member notices something off
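As a rough illustration of an alerting rule for automated monitoring, the sketch below fires only after several probes fail in a row, so a single blip does not page anyone. The class name `ConsecutiveFailureRule` and the threshold of 3 are assumptions made for this example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureRule:
    """Hypothetical alerting rule: fire after `threshold` failed checks in a row."""

    threshold: int = 3  # assumed value; tune per service
    failures: int = 0

    def observe(self, check_passed: bool) -> bool:
        """Record one probe result and return True while the alert should fire."""
        if check_passed:
            self.failures = 0  # any success resets the streak
        else:
            self.failures += 1
        return self.failures >= self.threshold


rule = ConsecutiveFailureRule(threshold=3)
probe_results = [True, False, False, True, False, False, False]
fired = [rule.observe(ok) for ok in probe_results]
```

Here the alert stays quiet through the first two failures (a success resets the streak) and fires only on the final probe, the third failure in a row.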
2. Triage
Assess the situation quickly:
- What’s affected? — which services, which users, what’s the blast radius?
- How severe is it? — total outage, degradation, or cosmetic issue?
- Who needs to know? — which teams, which stakeholders?
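The first two triage questions can be turned into a small, repeatable rule. The sketch below is one possible mapping from impact type and blast radius to a severity label. The `SEV1` to `SEV4` names and the 50% cutoff are assumptions for illustration; a real team would define its own.

```python
from enum import Enum


class Impact(Enum):
    OUTAGE = "total outage"
    DEGRADATION = "degradation"
    COSMETIC = "cosmetic issue"


def classify(impact: Impact, affected_fraction: float) -> str:
    """Map impact type and share of affected users to a hypothetical severity label."""
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0 and 1")
    wide = affected_fraction >= 0.5  # assumed cutoff for a large blast radius
    if impact is Impact.OUTAGE:
        return "SEV1" if wide else "SEV2"
    if impact is Impact.DEGRADATION:
        return "SEV2" if wide else "SEV3"
    return "SEV4"  # cosmetic issues never page anyone in this sketch
```

For example, `classify(Impact.DEGRADATION, 0.1)` returns `"SEV3"`, while `classify(Impact.OUTAGE, 0.9)` returns `"SEV1"`.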
3. Mitigation
Stop the bleeding. Mitigation is about reducing impact, not fixing the root cause:
- Roll back a bad deployment
- Redirect traffic away from a failing region
- Flip a feature flag to turn off the broken feature
- Scale up resources if the issue is capacity-related
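To make the feature-flag option concrete, here is a minimal sketch of a kill switch. It assumes a hypothetical in-memory `FeatureFlags` store and a made-up `new_checkout` flag; a production system would normally read flags from a shared flag service so the change takes effect without a deploy.

```python
class FeatureFlags:
    """Hypothetical in-memory flag store, for illustration only."""

    def __init__(self, defaults: dict[str, bool]) -> None:
        self._flags = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, the safer choice during an incident.
        return self._flags.get(name, False)

    def disable(self, name: str) -> None:
        self._flags[name] = False


flags = FeatureFlags({"new_checkout": True})


def checkout(cart_total: float) -> str:
    """Route to the new code path only while its flag is on."""
    if flags.is_enabled("new_checkout"):
        return f"new flow: {cart_total:.2f}"
    return f"legacy flow: {cart_total:.2f}"


before = checkout(10.0)        # served by the new (broken) path
flags.disable("new_checkout")  # mitigation: turn the feature off
after = checkout(10.0)         # traffic falls back to the legacy path
```

The root cause in the new flow is still there. The flag only removes user impact while the investigation continues.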
4. Resolution
The service is stable and functioning correctly. Resolution means:
- The immediate problem is fixed (or worked around)
- Monitoring confirms the service is healthy
- No further user impact
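These three criteria can be checked together before an incident is declared resolved. The sketch below is one way to express that. The function name, the "five consecutive healthy checks" threshold, and the use of open user reports as a stand-in for user impact are all assumptions.

```python
def is_resolved(
    fix_applied: bool,
    recent_checks: list[bool],
    open_user_reports: int,
    required_healthy: int = 5,
) -> bool:
    """Return True only when all three resolution criteria hold.

    `recent_checks` holds monitoring results, oldest first; the last
    `required_healthy` of them must all have passed.
    """
    if required_healthy < 1:
        raise ValueError("required_healthy must be at least 1")
    tail = recent_checks[-required_healthy:]
    monitoring_healthy = len(tail) == required_healthy and all(tail)
    no_user_impact = open_user_reports == 0
    return fix_applied and monitoring_healthy and no_user_impact
```

Requiring a run of healthy checks, rather than a single passing one, guards against closing the incident during a brief recovery.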
5. Follow-up
After the incident is resolved:
- Write a postmortem
- Identify action items to prevent recurrence
- Update playbooks with lessons learned
- Communicate resolution to stakeholders
Roles during an incident
For anything beyond a trivial incident, define roles:

| Role | Responsibility |
|---|---|
| Incident commander | Coordinates response, makes decisions, manages communication |
| Technical lead | Investigates root cause and implements fixes |
| Communicator | Updates stakeholders, status page, and internal channels |
| Scribe | Documents timeline, actions taken, and decisions made |
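
The roles in the table can be tracked on the incident record itself, alongside the timeline the scribe maintains. The following is a minimal sketch; the `Incident` class, its field names, and the example people are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

ROLES = ("incident_commander", "technical_lead", "communicator", "scribe")


@dataclass
class Incident:
    """Hypothetical incident record with role assignments and a timeline."""

    title: str
    roles: dict[str, str] = field(default_factory=dict)
    timeline: list[tuple[datetime, str]] = field(default_factory=list)

    def log(self, entry: str) -> None:
        """Append a timestamped entry, as the scribe would."""
        self.timeline.append((datetime.now(timezone.utc), entry))

    def assign(self, role: str, person: str) -> None:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.roles[role] = person
        self.log(f"{person} assigned as {role}")

    def unfilled_roles(self) -> list[str]:
        return [role for role in ROLES if role not in self.roles]


incident = Incident("Elevated API latency")
incident.assign("incident_commander", "Ana")
incident.assign("technical_lead", "Ben")
```

After these two assignments, `incident.unfilled_roles()` returns `["communicator", "scribe"]`, which makes it easy to spot gaps early in the response.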
Building an incident response process
- Define severity levels — so triage is fast and consistent
- Set up alerting — monitors, notification policies, escalation chains
- Create playbooks — step-by-step guides for common failure modes
- Practice — run game days or tabletop exercises
- Review — postmortems after every significant incident
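As an illustration of the escalation chains mentioned above, the sketch below models a chain as a list of delays and recipients. The delays and recipient names are assumptions, and this is not DevHelm's configuration format.

```python
# Hypothetical escalation chain: (minutes without acknowledgement, who gets notified).
ESCALATION_CHAIN = [
    (0, "on-call engineer"),
    (15, "secondary on-call"),
    (30, "engineering manager"),
]


def notified_so_far(minutes_unacknowledged: int) -> list[str]:
    """Return everyone who should have been notified by this point."""
    return [
        recipient
        for delay, recipient in ESCALATION_CHAIN
        if minutes_unacknowledged >= delay
    ]
```

With this chain, `notified_so_far(20)` returns `["on-call engineer", "secondary on-call"]`. The engineering manager is only pulled in if the alert is still unacknowledged after 30 minutes.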
DevHelm incident management
Incidents overview
Automated incident lifecycle in DevHelm.
Notification policies
Route alerts to the right people.