Rotation structure
Weekly rotations
The most common pattern. One engineer is primary on-call for a week, then hands off to the next person.
Pros: Simple, predictable schedule. Engineers can plan their week.
Cons: A bad week can be exhausting if incident volume is high.
Follow-the-sun
Engineers in different time zones cover their business hours only. No one gets woken up at 3 AM.
Pros: No overnight pages. Better quality of life.
Cons: Requires team members in multiple time zones. Handoff complexity.
Primary + secondary
Two engineers are on-call: primary handles alerts first, secondary is backup if primary doesn’t respond within a time window.
Pros: Safety net prevents missed alerts.
Cons: More people tied up in on-call duty.
Setting expectations
Document what on-call means at your organization:
| Expectation | Guideline |
|---|---|
| Response time | Acknowledge alerts within 5–15 minutes |
| Availability | Reachable by phone/laptop during on-call hours |
| Escalation | Escalate to secondary after 15 minutes with no progress |
| Handoff | End-of-rotation handoff with open issues summary |
| Compensation | On-call pay, comp time, or other recognition |
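The acknowledgement and escalation rows above can be read as a simple timing rule. The following is a minimal sketch of that rule, not a description of how any particular alerting tool implements it; the function name `current_responder` and the 15-minute default are assumptions chosen to match the table.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed window, taken from the upper bound in the table above.
ACK_WINDOW = timedelta(minutes=15)


def current_responder(
    fired_at: datetime,
    acked_at: Optional[datetime],
    now: datetime,
    window: timedelta = ACK_WINDOW,
) -> str:
    """Return who should be holding the page at time `now`."""
    deadline = fired_at + window
    # Primary acknowledged inside the window, so primary keeps the page.
    if acked_at is not None and acked_at <= deadline:
        return "primary"
    # Window elapsed without a timely acknowledgement: escalate.
    if now >= deadline:
        return "secondary"
    # Still inside the window, keep waiting on primary.
    return "primary"
```

Writing the rule down this way forces the team to agree on one number for the window rather than leaving "5–15 minutes" open to interpretation.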
Reducing on-call burden
Fix the alerts, not the people
If on-call is painful, the problem is usually the alerts, not the rotation:
- High false positive rate → tighten monitoring thresholds
- Too many alerts → consolidate related monitors into resource groups
- Alerts with no action → remove or downgrade them to warnings
- Same issue recurring → fix the root cause, don’t just respond again
Actionable alerts only
Every alert should have a clear action. If the on-call engineer can’t do anything about an alert, it shouldn’t page them.
Runbooks for every alert
Each alert should link to a runbook or playbook that explains:
- What the alert means
- How to investigate
- Common fixes
- When to escalate
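One way to enforce the runbook rule is to check alert definitions before they ship. The sketch below assumes a hypothetical `AlertRule` shape with a `runbook_url` field; it is not a DevHelm schema, and the example URL is a placeholder.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertRule:
    # Hypothetical alert definition, for illustration only.
    name: str
    pages_on_call: bool
    runbook_url: Optional[str] = None


def rules_missing_runbooks(rules: List[AlertRule]) -> List[str]:
    """Return the names of paging rules that have no runbook link."""
    return [r.name for r in rules if r.pages_on_call and not r.runbook_url]


rules = [
    AlertRule("api-5xx", True, "https://wiki.example.com/runbooks/api-5xx"),
    AlertRule("disk-full", True),          # pages, but has no runbook
    AlertRule("cert-expiry-30d", False),   # warning only, not checked
]
print(rules_missing_runbooks(rules))  # ['disk-full']
```

A check like this could run in CI so that a paging alert without a runbook never reaches the rotation.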
On-call review meetings
Hold a weekly review of on-call experience:
- How many alerts fired?
- How many were actionable?
- What was the false positive rate?
- What improvements can be made?
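The first three review questions reduce to simple counts. This sketch assumes the on-call engineer tags each alert after the fact; the `AlertRecord` fields are hypothetical and would need to match whatever your team actually records.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AlertRecord:
    # Hypothetical per-alert notes captured during the rotation.
    name: str
    actionable: bool       # a human had to do something
    false_positive: bool   # it fired with nothing actually wrong


def weekly_summary(records: List[AlertRecord]) -> Dict[str, float]:
    """Summarize one rotation's alerts for the review meeting."""
    total = len(records)
    actionable = sum(r.actionable for r in records)
    false_positives = sum(r.false_positive for r in records)
    return {
        "total": total,
        "actionable": actionable,
        # Guard against division by zero in a quiet week.
        "false_positive_rate": false_positives / total if total else 0.0,
    }
```

Comparing these numbers week over week gives the review a concrete basis for deciding which alerts to tune or remove.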
Avoiding burnout
- Limit on-call frequency: each person should be on call no more often than once every 3–4 weeks
- Respect off-hours: don’t page for non-critical issues overnight
- Compensate fairly: on-call is real work that deserves recognition
- Track alert volume: if it’s trending up, prioritize reducing it
- Post-incident rest: after a long incident, give the responder recovery time
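The frequency limit above has a direct consequence for weekly rotations: in a plain round-robin, each person is on call once every N weeks for a team of N, so the team needs at least three or four people. The sketch below illustrates this under that round-robin assumption; the function names and the default threshold of 3 are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Tuple


def weekly_schedule(engineers: List[str], start: date, weeks: int) -> List[Tuple[date, str]]:
    """Assign one primary per week, cycling through the team in order."""
    if not engineers:
        raise ValueError("need at least one engineer")
    return [
        (start + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]


def meets_frequency_limit(team_size: int, min_weeks_between_shifts: int = 3) -> bool:
    """In a round-robin, a person's shifts are `team_size` weeks apart."""
    return team_size >= min_weeks_between_shifts


team = ["ana", "ben", "chi"]
for week_start, engineer in weekly_schedule(team, date(2024, 1, 1), 4):
    print(week_start.isoformat(), engineer)
print(meets_frequency_limit(len(team)))
```

If the check fails, the options are to grow the rotation, shorten shifts, or share coverage with another team, rather than asking the same few people to absorb more weeks.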
DevHelm alerting
Escalation chains
Multi-step escalation with delays.
Tiered escalation guide
Set up primary → secondary → management escalation.