Good on-call practices balance rapid response with engineer well-being. Structure rotations, set expectations, and avoid burnout.

Rotation structure

Weekly rotations

The most common pattern. One engineer is primary on-call for a week, then hands off to the next person.
  • Pros: Simple, predictable schedule. Engineers can plan their week.
  • Cons: A bad week can be exhausting if incident volume is high.
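As a minimal sketch of how a weekly rotation resolves to a person, the function below picks the primary by counting whole weeks since the rotation started. The team names, the start date, and the function name `primary_on_call` are hypothetical, chosen only for illustration.

```python
from datetime import date


def primary_on_call(engineers: list[str], rotation_start: date, day: date) -> str:
    """Return the engineer who is primary on-call on `day`.

    Assumes each engineer covers exactly one 7-day block, in list order,
    starting on `rotation_start`, and that the list then repeats.
    """
    if not engineers:
        raise ValueError("rotation needs at least one engineer")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    weeks_elapsed = (day - rotation_start).days // 7
    return engineers[weeks_elapsed % len(engineers)]


team = ["alice", "bob", "carol"]  # hypothetical team
start = date(2024, 1, 1)          # hypothetical rotation start (a Monday)
print(primary_on_call(team, start, date(2024, 1, 10)))  # falls in the second week
```

With three engineers, each person is primary one week in three; adding a fourth name to the list stretches that to one week in four without changing the code.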

Follow-the-sun

Engineers in different time zones cover their business hours only. No one gets woken up at 3 AM.
  • Pros: No overnight pages. Better quality of life.
  • Cons: Requires team members in multiple time zones and adds handoff complexity.
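A follow-the-sun schedule can be sketched as a set of shifts expressed in UTC hours. The region names and shift boundaries below are assumptions made for the example, not a recommended split; `coverage_gaps` is a hypothetical helper that shows which hours nobody covers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shift:
    region: str
    start_hour_utc: int  # inclusive
    end_hour_utc: int    # exclusive; a shift may wrap past midnight

    def covers(self, hour_utc: int) -> bool:
        if self.start_hour_utc < self.end_hour_utc:
            return self.start_hour_utc <= hour_utc < self.end_hour_utc
        # Wrapping shift, e.g. 22:00 to 06:00 UTC.
        return hour_utc >= self.start_hour_utc or hour_utc < self.end_hour_utc


# Hypothetical three-region split of the day.
SHIFTS = [
    Shift("APAC", 0, 8),
    Shift("EMEA", 8, 16),
    Shift("AMER", 16, 24),
]


def region_on_call(hour_utc: int, shifts: list[Shift] = SHIFTS) -> str:
    """Return the first region whose shift covers the given UTC hour."""
    for shift in shifts:
        if shift.covers(hour_utc):
            return shift.region
    raise LookupError(f"no region covers {hour_utc:02d}:00 UTC")


def coverage_gaps(shifts: list[Shift]) -> list[int]:
    """Return the UTC hours that no shift covers."""
    return [h for h in range(24) if not any(s.covers(h) for s in shifts)]
```

Running `coverage_gaps` whenever the schedule changes is one way to catch an uncovered hour before it turns into a missed page.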

Primary + secondary

Two engineers are on-call: primary handles alerts first, secondary is backup if primary doesn’t respond within a time window.
  • Pros: Safety net prevents missed alerts.
  • Cons: More people tied up in on-call duty.
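The primary and secondary handoff comes down to one timing rule. The sketch below assumes a 15-minute acknowledgement window; that value, the name `ACK_WINDOW`, and the function `who_to_page` are illustrative choices, not settings from any particular tool.

```python
from datetime import datetime, timedelta
from typing import Optional

ACK_WINDOW = timedelta(minutes=15)  # assumed window; set this to your own policy


def who_to_page(fired_at: datetime, acked_at: Optional[datetime], now: datetime) -> Optional[str]:
    """Decide who should be paged for an alert at time `now`.

    Returns None once the alert has been acknowledged, "primary" while the
    acknowledgement window is still open, and "secondary" after it has elapsed.
    """
    if acked_at is not None:
        return None  # someone already owns the alert
    if now - fired_at < ACK_WINDOW:
        return "primary"
    return "secondary"


fired = datetime(2024, 1, 1, 3, 0)
print(who_to_page(fired, None, fired + timedelta(minutes=5)))
print(who_to_page(fired, None, fired + timedelta(minutes=20)))
```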

Setting expectations

Document what on-call means at your organization:
| Expectation | Guideline |
| --- | --- |
| Response time | Acknowledge alerts within 5–15 minutes |
| Availability | Reachable by phone/laptop during on-call hours |
| Escalation | Escalate to secondary after 15 minutes with no progress |
| Handoff | End-of-rotation handoff with open issues summary |
| Compensation | On-call pay, comp time, or other recognition |

Reducing on-call burden

Fix the alerts, not the people

If on-call is painful, the problem is usually the alerts, not the rotation:
  • High false positive rate → tune monitoring thresholds so alerts fire only on real problems
  • Too many alerts → consolidate related monitors into resource groups
  • Alerts with no action → remove or downgrade them to warnings
  • Same issue recurring → fix the root cause, don’t just respond again

Actionable alerts only

Every alert should have a clear action. If the on-call engineer can’t do anything about an alert, it shouldn’t page them.

Runbooks for every alert

Each alert should link to a runbook or playbook that explains:
  • What the alert means
  • How to investigate
  • Common fixes
  • When to escalate
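One way to enforce the runbook rule is a small check that flags alerts whose runbook is missing one of the four parts listed above. The data layout (a dict of alert name to runbook sections) and the section keys are assumptions for this sketch; adapt them to wherever your runbooks actually live.

```python
# Section keys are hypothetical; they mirror the four points a runbook should cover.
REQUIRED_SECTIONS = ("meaning", "investigation", "common_fixes", "escalation")


def missing_runbook_sections(alerts: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each alert name to the runbook sections that are absent or empty."""
    problems: dict[str, list[str]] = {}
    for name, runbook in alerts.items():
        missing = [s for s in REQUIRED_SECTIONS if not runbook.get(s, "").strip()]
        if missing:
            problems[name] = missing
    return problems


alerts = {
    "api-latency-high": {  # hypothetical alert with a complete runbook
        "meaning": "p95 latency is above the agreed limit",
        "investigation": "Check recent deploys and database load",
        "common_fixes": "Roll back the last deploy",
        "escalation": "Escalate if not resolved within 30 minutes",
    },
    "disk-usage-warning": {  # hypothetical alert with an incomplete runbook
        "meaning": "Disk usage is above 80%",
        "investigation": "",
    },
}
print(missing_runbook_sections(alerts))
```

A check like this could run in CI so that a new alert cannot be merged without a complete runbook.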

On-call review meetings

Hold a weekly review of the on-call experience:
  • How many alerts fired?
  • How many were actionable?
  • What was the false positive rate?
  • What improvements can be made?
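The first three review questions are simple counts, so they can be computed from an alert log before the meeting. The sketch below assumes each alert has been labeled by the responder as actionable or as a false positive; the `AlertRecord` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlertRecord:
    name: str
    actionable: bool      # the responder could do something about it
    false_positive: bool  # nothing was actually wrong


def review_summary(alerts: list[AlertRecord]) -> dict[str, float]:
    """Summarize one week of alerts for the on-call review."""
    total = len(alerts)
    actionable = sum(a.actionable for a in alerts)
    false_positives = sum(a.false_positive for a in alerts)
    return {
        "total": total,
        "actionable": actionable,
        # Guard against an empty week to avoid dividing by zero.
        "false_positive_rate": false_positives / total if total else 0.0,
    }


week = [  # hypothetical week of alerts
    AlertRecord("api-latency-high", actionable=True, false_positive=False),
    AlertRecord("disk-usage-warning", actionable=False, false_positive=False),
    AlertRecord("health-check-flap", actionable=False, false_positive=True),
    AlertRecord("db-connections-high", actionable=True, false_positive=False),
]
print(review_summary(week))
```

The fourth question, what to improve, stays a discussion, but the non-actionable and false-positive alerts in the log are the natural starting list.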

Avoiding burnout

  • Limit on-call frequency — no more than one rotation every 3–4 weeks per person
  • Respect off-hours — don’t page for non-critical issues overnight
  • Compensate fairly — on-call is real work that deserves recognition
  • Track alert volume — if it’s trending up, prioritize reducing it
  • Post-incident rest — after a long incident, give the responder recovery time
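Two of these guidelines, on-call frequency and alert volume, can be checked mechanically. The sketch below assumes a schedule stored as one primary name per week and alert counts stored as one number per week; the function names, the 3-week minimum gap, and the 4-week comparison window are all illustrative assumptions.

```python
def overloaded_engineers(schedule: list[str], min_gap_weeks: int = 3) -> list[str]:
    """Return engineers scheduled again fewer than `min_gap_weeks` weeks after
    the start of their previous shift. `schedule[i]` is the primary in week i."""
    last_seen: dict[str, int] = {}
    flagged: set[str] = set()
    for week, engineer in enumerate(schedule):
        if engineer in last_seen and week - last_seen[engineer] < min_gap_weeks:
            flagged.add(engineer)
        last_seen[engineer] = week
    return sorted(flagged)


def volume_trending_up(weekly_counts: list[int], window: int = 4) -> bool:
    """Compare total alerts in the last `window` weeks with the window before it.

    Returns False when there is not enough history for two full windows.
    """
    if len(weekly_counts) < 2 * window:
        return False
    recent = weekly_counts[-window:]
    previous = weekly_counts[-2 * window:-window]
    return sum(recent) > sum(previous)


print(overloaded_engineers(["a", "b", "a", "c", "d", "a"]))
print(volume_trending_up([10, 12, 11, 9, 15, 14, 16, 13]))
```

Comparing two windows is a deliberately crude trend test; it is enough to prompt a conversation, which is the point of tracking the number.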

DevHelm alerting

Escalation chains

Multi-step escalation with delays.

Tiered escalation guide

Set up primary → secondary → management escalation.