Incident response is the structured process for detecting, triaging, mitigating, and resolving service disruptions. A clear process means faster recovery and less chaos.

The incident lifecycle

1. Detection

Something breaks. Detection can come from:
  • Automated monitoring — uptime checks, synthetic probes, alerting rules
  • User reports — support tickets, social media, direct complaints
  • Internal observation — a team member notices something off
Automated detection is usually faster than waiting for someone to report a problem. The goal is to detect problems before users notice them.
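As a rough illustration of an automated uptime check, the sketch below probes a URL and raises an alert only after several consecutive failures, so a single network blip does not page anyone. The function names, the 5-second timeout, and the three-failure threshold are assumptions chosen for the example, not part of any particular monitoring product.

```python
import urllib.request


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError, and timeouts;
        # ValueError covers malformed URLs.
        return False


def should_alert(recent_results: list[bool], failures_required: int = 3) -> bool:
    """Alert only when the last N probes all failed (hypothetical threshold)."""
    if len(recent_results) < failures_required:
        return False
    return not any(recent_results[-failures_required:])
```

A scheduler would call `probe` on an interval, append each result to a short history, and pass that history to `should_alert`.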

2. Triage

Assess the situation quickly:
  • What’s affected? — which services, which users, what’s the blast radius?
  • How severe is it? — total outage, degradation, or cosmetic issue?
  • Who needs to know? — which teams, which stakeholders?
Triage should take minutes, not hours. Use your severity classification to standardize decisions.
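One way to standardize triage decisions is to encode the severity classification as a small function. The sketch below is illustrative only: the `SEV1` to `SEV3` labels, the 50% blast-radius threshold, and the notification lists are assumptions, and a real team would substitute its own definitions.

```python
from dataclasses import dataclass
from enum import Enum


class Impact(Enum):
    TOTAL_OUTAGE = "total outage"
    DEGRADATION = "degradation"
    COSMETIC = "cosmetic"


@dataclass(frozen=True)
class TriageResult:
    severity: str
    notify: tuple[str, ...]


def triage(impact: Impact, affected_user_fraction: float) -> TriageResult:
    """Map impact and blast radius to a severity level (hypothetical rules)."""
    if not 0.0 <= affected_user_fraction <= 1.0:
        raise ValueError("affected_user_fraction must be between 0 and 1")

    # A total outage, or a degradation hitting at least half of users, is SEV1.
    widespread = impact is Impact.DEGRADATION and affected_user_fraction >= 0.5
    if impact is Impact.TOTAL_OUTAGE or widespread:
        return TriageResult("SEV1", ("on-call", "engineering lead", "stakeholders"))
    if impact is Impact.DEGRADATION:
        return TriageResult("SEV2", ("on-call", "engineering lead"))
    # Cosmetic issues stay at the lowest severity regardless of reach.
    return TriageResult("SEV3", ("on-call",))
```

Keeping the rules in one place means two responders looking at the same incident should arrive at the same severity.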

3. Mitigation

Stop the bleeding. Mitigation is about reducing impact, not fixing the root cause:
  • Roll back a bad deployment
  • Redirect traffic away from a failing region
  • Use a feature flag to turn off the broken feature
  • Scale up resources if the issue is capacity-related
The priority is restoring service for users. Investigation comes later.
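The feature-flag option can be sketched as a kill switch: the broken code path sits behind a flag, and turning the flag off restores the old behavior without a deployment. The `FeatureFlags` class, the `new_checkout` flag name, and `render_checkout` are hypothetical; a production system would typically read flags from a shared store rather than process memory.

```python
class FeatureFlags:
    """Minimal in-memory flag store (an assumption for this example)."""

    def __init__(self, defaults: dict[str, bool]) -> None:
        self._flags = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, which is the safe choice for new features.
        return self._flags.get(name, False)

    def disable(self, name: str) -> None:
        self._flags[name] = False


def render_checkout(flags: FeatureFlags) -> str:
    """Serve the new flow only while its flag is on; otherwise fall back."""
    if flags.is_enabled("new_checkout"):
        return "new checkout flow"
    return "legacy checkout flow"


flags = FeatureFlags({"new_checkout": True})
before = render_checkout(flags)   # new flow is live
flags.disable("new_checkout")     # mitigation: flip the kill switch
after = render_checkout(flags)    # users are back on the legacy flow
```

The root cause in the new flow is still there after the flag is flipped; the switch only removes user impact while the investigation continues.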

4. Resolution

The service is stable and functioning correctly. Resolution means:
  • The immediate problem is fixed (or worked around)
  • Monitoring confirms the service is healthy
  • No further user impact
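The three resolution criteria above can be expressed as a single check. This is a minimal sketch: the parameter names and the requirement of five consecutive healthy checks are assumptions, picked to show that "monitoring confirms the service is healthy" usually means a sustained healthy window rather than one green result.

```python
def is_resolved(
    fix_or_workaround_applied: bool,
    recent_health_checks: list[bool],
    new_user_reports: int,
    healthy_checks_required: int = 5,
) -> bool:
    """Return True only when all three resolution criteria hold."""
    # Require a full window of consecutive healthy checks, newest last.
    window = recent_health_checks[-healthy_checks_required:]
    monitoring_healthy = len(window) == healthy_checks_required and all(window)
    return fix_or_workaround_applied and monitoring_healthy and new_user_reports == 0
```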

5. Follow-up

After the incident is resolved:
  • Write a postmortem
  • Identify action items to prevent recurrence
  • Update playbooks with lessons learned
  • Communicate resolution to stakeholders

Roles during an incident

For anything beyond a trivial incident, define roles:
Role | Responsibility
Incident commander | Coordinates response, makes decisions, manages communication
Technical lead | Investigates root cause and implements fixes
Communicator | Updates stakeholders, status page, and internal channels
Scribe | Documents timeline, actions taken, and decisions made
Small teams may combine roles. The key is that someone is explicitly coordinating.
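A small sketch of how role assignment might be checked at the start of an incident. It allows one person to hold several roles, as small teams often do, but reports any role left unfilled. The role keys and the `missing_roles` helper are hypothetical names for this example.

```python
ROLES = ("incident commander", "technical lead", "communicator", "scribe")


def missing_roles(assignments: dict[str, str]) -> list[str]:
    """Return the roles that have no person assigned, in table order."""
    return [role for role in ROLES if not assignments.get(role)]


# Two people covering four roles is fine; an empty slot is not.
small_team = {
    "incident commander": "alice",
    "technical lead": "bob",
    "communicator": "alice",
    "scribe": "bob",
}
unfilled = missing_roles(small_team)
```

If `incident commander` appears in the returned list, nobody is explicitly coordinating, which is the gap to close first.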

Building an incident response process

  1. Define severity levels — so triage is fast and consistent
  2. Set up alerting — monitors, notification policies, escalation chains
  3. Create playbooks — step-by-step guides for common failure modes
  4. Practice — run game days or tabletop exercises
  5. Review — postmortems after every significant incident
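To make step 2 concrete, here is a hedged sketch of an escalation chain: each step names a contact and how long to wait for an acknowledgement before paging the next one. The `EscalationStep` type, the contact names, and the wait times are assumptions for illustration and do not reflect DevHelm's or any other tool's configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    contact: str
    wait_minutes: int  # time to wait for an acknowledgement before escalating


def contacts_to_notify(
    chain: list[EscalationStep], minutes_unacknowledged: int
) -> list[str]:
    """Return everyone who should have been paged by now for an unacknowledged alert."""
    notified: list[str] = []
    elapsed = 0
    for step in chain:
        if minutes_unacknowledged < elapsed:
            break  # this step's turn has not come yet
        notified.append(step.contact)
        elapsed += step.wait_minutes
    return notified


chain = [
    EscalationStep("primary on-call", wait_minutes=10),
    EscalationStep("secondary on-call", wait_minutes=15),
    EscalationStep("engineering manager", wait_minutes=0),
]
```

With this chain, the primary on-call is paged immediately, the secondary after 10 unacknowledged minutes, and the engineering manager after 25.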

DevHelm incident management

Incidents overview

Automated incident lifecycle in DevHelm.

Notification policies

Route alerts to the right people.