On-Call Best Practices

Good on-call practices balance rapid response with engineer well-being. This guide covers how to structure rotations, set expectations, and avoid burnout.

Rotation structure

Weekly rotations

The most common pattern. One engineer is primary on-call for a week, then hands off to the next person.

Pros: Simple, predictable schedule. Engineers can plan their week.
Cons: A bad week can be exhausting if incident volume is high.
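A weekly rotation can be computed from a roster and a start date rather than maintained by hand. The following is a minimal sketch, assuming a fixed roster order and a rotation that begins on a known date; the names `weekly_on_call`, `ROSTER`, and `START` are hypothetical and not part of any DevHelm API.

```python
from datetime import date

def weekly_on_call(engineers, rotation_start, day):
    """Return the engineer who is primary on-call on `day`.

    Assumes each engineer covers exactly 7 days, in roster order,
    starting on `rotation_start`.
    """
    if not engineers:
        raise ValueError("roster is empty")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    weeks_elapsed = (day - rotation_start).days // 7
    return engineers[weeks_elapsed % len(engineers)]

# Hypothetical roster and start date (a Monday).
ROSTER = ["alice", "bob", "carol"]
START = date(2024, 1, 1)

print(weekly_on_call(ROSTER, START, date(2024, 1, 10)))
```

Because the schedule is derived from the date, every engineer can see their future weeks in advance, which is the predictability this pattern is chosen for.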

Follow-the-sun

Engineers in different time zones cover their own business hours only, so no one gets woken up at 3 AM.

Pros: No overnight pages. Better quality of life.
Cons: Requires team members in multiple time zones. Handoffs between regions add complexity.
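One way to reason about follow-the-sun coverage is to express each region's shift as a window in UTC and check that the windows leave no gaps. This is a sketch under stated assumptions: the three regions and their hours in `SHIFTS` are invented for illustration, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical shift windows as (region, start_hour, end_hour) in UTC,
# with the end hour exclusive.
SHIFTS = [
    ("apac", 0, 8),
    ("emea", 8, 16),
    ("amer", 16, 24),
]

def region_on_call(now, shifts=SHIFTS):
    """Return the region covering the timezone-aware datetime `now`."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    hour = now.astimezone(timezone.utc).hour
    for region, start, end in shifts:
        if start <= hour < end:
            return region
    raise LookupError(f"no shift covers {hour}:00 UTC")

def uncovered_hours(shifts=SHIFTS):
    """Return the UTC hours (0-23) that no shift covers."""
    covered = {h for _, start, end in shifts for h in range(start, end)}
    return sorted(set(range(24)) - covered)

print(region_on_call(datetime(2024, 1, 10, 3, 0, tzinfo=timezone.utc)))
print(uncovered_hours())
```

Running `uncovered_hours` whenever the shift table changes is a cheap guard against the handoff gaps this pattern is prone to.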

Primary + secondary

Two engineers are on-call at the same time: the primary handles alerts first, and the secondary is the backup if the primary doesn’t respond within a set time window.

Pros: Safety net prevents missed alerts.
Cons: More people tied up in on-call duty.
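The primary/secondary handoff comes down to a single timing rule. The sketch below assumes a 15-minute acknowledgement window as an illustrative default; the function name `page_target` and its parameters are hypothetical and do not describe how any particular alerting tool implements escalation.

```python
def page_target(minutes_since_alert, acknowledged, ack_window_minutes=15):
    """Decide who should be paged for an alert right now.

    Returns "primary", "secondary", or None if the alert is already
    acknowledged and nobody else needs to be paged.
    """
    if minutes_since_alert < 0:
        raise ValueError("minutes_since_alert must be non-negative")
    if acknowledged:
        return None
    if minutes_since_alert < ack_window_minutes:
        return "primary"
    # The primary had the full window and did not acknowledge.
    return "secondary"

print(page_target(5, acknowledged=False))
print(page_target(20, acknowledged=False))
```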

Setting expectations

Document what on-call means at your organization:

Expectation	Guideline
Response time	Acknowledge alerts within 5–15 minutes
Availability	Reachable by phone/laptop during on-call hours
Escalation	Escalate to secondary after 15 minutes with no progress
Handoff	End-of-rotation handoff with open issues summary
Compensation	On-call pay, comp time, or other recognition

Reducing on-call burden

Fix the alerts, not the people

If on-call is painful, the problem is usually the alerts, not the rotation:

High false positive rate → tune monitoring thresholds so alerts fire only on real problems
Too many alerts → consolidate related monitors into resource groups
Alerts with no action → remove or downgrade them to warnings
Same issue recurring → fix the root cause, don’t just respond again

Actionable alerts only

Every alert should have a clear action. If the on-call engineer can’t do anything about an alert, it shouldn’t page them.

Runbooks for every alert

Each alert should link to a runbook or playbook that explains:

What the alert means
How to investigate
Common fixes
When to escalate
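A checklist like the one above can be enforced automatically, for example in a CI job that inspects alert definitions. This is a rough sketch that assumes alerts are available as dictionaries with an optional `runbook` mapping; the field names and `runbook_gaps` are hypothetical, not a real alert schema.

```python
# Hypothetical keys for the four runbook sections listed above.
REQUIRED_SECTIONS = ("meaning", "investigation", "common_fixes", "escalation")

def runbook_gaps(alerts):
    """Map each alert name to the runbook parts it is missing.

    Alerts whose runbook has every required section are omitted.
    """
    gaps = {}
    for alert in alerts:
        runbook = alert.get("runbook")
        if not runbook:
            gaps[alert["name"]] = ["runbook"]
            continue
        missing = [s for s in REQUIRED_SECTIONS if not runbook.get(s)]
        if missing:
            gaps[alert["name"]] = missing
    return gaps

# Illustrative data only.
ALERTS = [
    {"name": "api-latency", "runbook": {
        "meaning": "p95 latency above target",
        "investigation": "Check recent deploys and upstream health",
        "common_fixes": "Roll back the last deploy",
        "escalation": "Escalate if not resolved in 30 minutes",
    }},
    {"name": "disk-usage", "runbook": {"meaning": "Disk above 90%"}},
    {"name": "cert-expiry"},
]

print(runbook_gaps(ALERTS))
```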

On-call review meetings

Hold a weekly review of the on-call experience:

How many alerts fired?
How many were actionable?
What was the false positive rate?
What improvements can be made?
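The first three review questions are simple counts and ratios, so they can be prepared before the meeting. The sketch below assumes each fired alert has been labeled by the responder as actionable or as a false positive; `AlertRecord` and `weekly_summary` are hypothetical names, and the records are made up.

```python
from dataclasses import dataclass

@dataclass
class AlertRecord:
    name: str
    actionable: bool
    false_positive: bool

def weekly_summary(records):
    """Summarize one week of alerts for the on-call review."""
    total = len(records)
    actionable = sum(r.actionable for r in records)
    false_positives = sum(r.false_positive for r in records)
    return {
        "total": total,
        "actionable": actionable,
        # Guard against a quiet week with no alerts at all.
        "false_positive_rate": false_positives / total if total else 0.0,
    }

# Illustrative records only.
WEEK = [
    AlertRecord("api-latency", actionable=True, false_positive=False),
    AlertRecord("disk-usage", actionable=True, false_positive=False),
    AlertRecord("cpu-spike", actionable=False, false_positive=True),
    AlertRecord("queue-depth", actionable=False, false_positive=False),
]

print(weekly_summary(WEEK))
```

Alerts that are neither actionable nor false positives, like the last record here, are the candidates for removal or downgrade to warnings.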

Avoiding burnout

Limit on-call frequency: no more than one shift every 3–4 weeks per person
Respect off-hours: don’t page for non-critical issues overnight
Compensate fairly: on-call is real work that deserves recognition
Track alert volume: if it’s trending up, prioritize reducing it
Post-incident rest: after a long incident, give the responder recovery time
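Two of these guidelines, shift frequency and alert volume, are easy to check from data a team already has. The following is a minimal sketch, assuming one list of shift start dates per person and one alert count per week; the function names, the 3-week minimum gap, and the 4-week comparison window are illustrative choices, not fixed rules.

```python
from datetime import date, timedelta

def too_frequent(shift_starts, min_gap_weeks=3):
    """Return pairs of consecutive shifts that start less than
    `min_gap_weeks` apart for a single person."""
    ordered = sorted(shift_starts)
    min_gap = timedelta(weeks=min_gap_weeks)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a < min_gap]

def volume_trending_up(weekly_counts, window=4):
    """Compare the mean of the last `window` weeks with the mean of
    the `window` weeks before them."""
    if len(weekly_counts) < 2 * window:
        return False  # not enough history to compare
    recent = weekly_counts[-window:]
    previous = weekly_counts[-2 * window:-window]
    return sum(recent) / window > sum(previous) / window

# Illustrative data only.
shifts = [date(2024, 1, 1), date(2024, 1, 15), date(2024, 2, 12)]
print(too_frequent(shifts))
print(volume_trending_up([10, 12, 11, 9, 14, 15, 13, 16]))
```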

DevHelm alerting

Escalation chains

Multi-step escalation with delays.

Tiered escalation guide

Set up primary → secondary → management escalation.

Incident Response 101

Severity Classification
