Every monitor has an incident policy that controls when incidents open, how they’re confirmed, and when they auto-resolve. A policy has three components: trigger rules, a confirmation policy, and a recovery policy.
Define this in code. Manage incident policies as part of your monitoring-as-code workflow: YAML format · Terraform · CI/CD patterns

Trigger rules

Trigger rules define the conditions that open an incident from check results. Each monitor can have multiple rules at different severities.

Rule types

| Type | Behavior | Required fields |
| --- | --- | --- |
| consecutive_failures | Opens an incident after N consecutive failed checks | count |
| failures_in_window | Opens an incident after N failures within a time window | count, windowMinutes |
| response_time | Opens an incident when response time exceeds a threshold | thresholdMs, aggregationType |
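
To make the difference between the two failure-count rules concrete, here is a minimal Python sketch. It is an illustration only, not DevHelm's evaluator: `CheckResult` and both function names are hypothetical, and it assumes results are ordered oldest to newest.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CheckResult:
    """Hypothetical shape of one check result (not a DevHelm type)."""
    timestamp: datetime
    passed: bool


def consecutive_failures_fired(results: list[CheckResult], count: int) -> bool:
    """True if the most recent `count` checks all failed."""
    if count <= 0 or len(results) < count:
        return False
    return all(not r.passed for r in results[-count:])


def failures_in_window_fired(
    results: list[CheckResult], count: int, window_minutes: int, now: datetime
) -> bool:
    """True if at least `count` checks failed within the last `window_minutes`."""
    cutoff = now - timedelta(minutes=window_minutes)
    failures = sum(1 for r in results if not r.passed and r.timestamp >= cutoff)
    return failures >= count
```

A consecutive_failures rule resets as soon as one check passes, while a failures_in_window rule keeps counting failures that are interleaved with successes, which makes it better suited to intermittent errors.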

Scope

Each rule has a scope that determines how regions are evaluated:
| Scope | Behavior |
| --- | --- |
| per_region | Each region is evaluated independently; the rule must be satisfied in a single region |
| any_region | Failures are aggregated across all regions |
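
The sketch below shows one way the two scopes could differ for a consecutive-failures count. It rests on an assumption the text does not spell out: that any_region merges results from every region into a single timeline before counting. The function names and the `(timestamp, passed)` tuple layout are hypothetical.

```python
def trailing_failures(outcomes: list[bool]) -> int:
    """Count failed checks at the end of a list ordered oldest to newest (True = passed)."""
    streak = 0
    for passed in reversed(outcomes):
        if passed:
            break
        streak += 1
    return streak


def rule_fired(
    results_by_region: dict[str, list[tuple[int, bool]]], count: int, scope: str
) -> bool:
    """Evaluate a consecutive-failures count under the given scope.

    Each region maps to (timestamp, passed) tuples.
    """
    if scope == "per_region":
        # The rule must be satisfied within a single region's own results.
        return any(
            trailing_failures([passed for _, passed in sorted(checks)]) >= count
            for checks in results_by_region.values()
        )
    if scope == "any_region":
        # Assumption: results from all regions are merged into one timeline.
        merged = sorted(c for checks in results_by_region.values() for c in checks)
        return trailing_failures([passed for _, passed in merged]) >= count
    raise ValueError(f"unknown scope: {scope!r}")
```

Under this reading, failures spread thinly across several regions can fire an any_region rule even though no single region has reached the count on its own.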

Severity

Each rule targets a severity level. When multiple rules fire, the highest severity wins:
| Severity | Priority |
| --- | --- |
| down | Highest: complete failure |
| degraded | Lower: partial failure or performance issue |
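
As a small illustration of "highest severity wins", the following sketch picks the effective severity from the rules that fired. The numeric priorities and the function name are assumptions made for the example; only the ordering (down above degraded) comes from the table.

```python
from typing import Optional

# Hypothetical numeric ranking; only the relative order is taken from the docs.
SEVERITY_PRIORITY = {"degraded": 1, "down": 2}


def effective_severity(fired_rules: list[dict]) -> Optional[str]:
    """Return the highest severity among fired rules, or None if none fired."""
    if not fired_rules:
        return None
    return max(
        (rule["severity"] for rule in fired_rules),
        key=SEVERITY_PRIORITY.__getitem__,
    )
```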

Response time aggregation

For response_time rules, the aggregationType field controls how latency is evaluated across checks:
| Aggregation | Behavior |
| --- | --- |
| all_exceed | Every check in the evaluation window must exceed the threshold |
| average | The average response time exceeds the threshold |
| p95 | The 95th percentile exceeds the threshold |
| max | The maximum response time exceeds the threshold |
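
The four aggregation types can be sketched as a single function over the response times in the evaluation window. This is an illustration under stated assumptions: the function name is hypothetical, and the p95 branch uses the nearest-rank method, which may differ from how DevHelm computes percentiles.

```python
import math
from statistics import mean


def exceeds_threshold(
    samples_ms: list[float], threshold_ms: float, aggregation_type: str
) -> bool:
    """Apply one aggregation type to the response times in the evaluation window."""
    if not samples_ms:
        return False
    if aggregation_type == "all_exceed":
        return all(s > threshold_ms for s in samples_ms)
    if aggregation_type == "average":
        return mean(samples_ms) > threshold_ms
    if aggregation_type == "p95":
        # Assumption: nearest-rank percentile.
        ordered = sorted(samples_ms)
        rank = math.ceil(0.95 * len(ordered))
        return ordered[rank - 1] > threshold_ms
    if aggregation_type == "max":
        return max(samples_ms) > threshold_ms
    raise ValueError(f"unknown aggregationType: {aggregation_type!r}")
```

With samples of 100, 200, 300, and 6000 ms and a 5000 ms threshold, all_exceed and average do not fire, while p95 and max do. In general, all_exceed is the strictest choice and max the most sensitive to a single slow check.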

Default policy

When you create a monitor without specifying a policy, DevHelm applies the following default:
  • Trigger: 2 consecutive failures per region → severity down
  • Confirmation: Multi-region, 1 region failing, wait up to max(60, frequency × 2) seconds
  • Recovery: 2 consecutive successes, 2 regions passing, 5-minute cooldown
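
The default confirmation wait, max(60, frequency × 2), works out as follows. The sketch assumes the monitor frequency is expressed in seconds, which the text implies but does not state; the function name is hypothetical.

```python
def default_max_wait_seconds(frequency_seconds: int) -> int:
    """Default confirmation wait: at least 60 s, otherwise twice the check frequency.

    Assumption: the monitor frequency is given in seconds.
    """
    return max(60, frequency_seconds * 2)
```

A monitor that checks every 30 seconds therefore waits up to 60 seconds, while one that checks every 5 minutes (300 seconds) waits up to 600 seconds, giving every region time to run at least one more check.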

Example

A policy with two trigger rules — one for complete failures and one for performance degradation:
{
  "triggerRules": [
    {
      "type": "consecutive_failures",
      "count": 3,
      "scope": "per_region",
      "severity": "down"
    },
    {
      "type": "response_time",
      "thresholdMs": 5000,
      "aggregationType": "p95",
      "scope": "any_region",
      "severity": "degraded"
    }
  ],
  "confirmation": {
    "type": "multi_region",
    "minRegionsFailing": 2,
    "maxWaitSeconds": 120
  },
  "recovery": {
    "consecutiveSuccesses": 3,
    "minRegionsPassing": 2,
    "cooldownMinutes": 10
  }
}
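
Before submitting a policy like the one above, it can help to check each trigger rule against the required fields listed under Rule types. The following is a client-side sketch only; the helper name is hypothetical and the API may perform its own, stricter validation.

```python
# Required fields per rule type, as listed in the Rule types table.
REQUIRED_FIELDS = {
    "consecutive_failures": {"count"},
    "failures_in_window": {"count", "windowMinutes"},
    "response_time": {"thresholdMs", "aggregationType"},
}


def missing_rule_fields(rule: dict) -> set[str]:
    """Return the required fields that a trigger rule does not define."""
    rule_type = rule.get("type")
    required = REQUIRED_FIELDS.get(rule_type)
    if required is None:
        raise ValueError(f"unknown rule type: {rule_type!r}")
    return required - set(rule)
```

Calling `missing_rule_fields` on each entry of `triggerRules` returns an empty set for both rules in the example policy.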

Confirmation

Confirmation prevents false positives by requiring failures from multiple probe regions before promoting an incident to CONFIRMED status.
| Field | Type | Description |
| --- | --- | --- |
| type | string | Confirmation strategy (currently multi_region) |
| minRegionsFailing | integer | Minimum regions that must be failing to confirm |
| maxWaitSeconds | integer | Maximum seconds to wait for enough regions to report failures |
When a trigger rule fires in one region, the confirmation policy waits up to maxWaitSeconds for at least minRegionsFailing regions to also report failures. If enough regions confirm within the window, the incident moves to CONFIRMED and alerts fire. If the window expires without enough regions failing, the incident is discarded.
Set minRegionsFailing to 1 to confirm on the first region that reports a failure. This is useful for monitors running from a single region.
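
The confirmation decision described above can be sketched as a small function. Only CONFIRMED is a status named in this page; the "PENDING" and "DISCARDED" labels, like the function name, are hypothetical and used here just to mark the other two outcomes.

```python
def confirmation_outcome(
    failing_regions: set[str],
    elapsed_seconds: float,
    min_regions_failing: int,
    max_wait_seconds: int,
) -> str:
    """Decide what happens to a triggered incident during the confirmation window.

    Intended to be called on each new check result while the window is open.
    """
    if len(failing_regions) >= min_regions_failing:
        return "CONFIRMED"  # enough regions agree; alerts fire
    if elapsed_seconds >= max_wait_seconds:
        return "DISCARDED"  # hypothetical label: window expired without agreement
    return "PENDING"  # hypothetical label: still waiting for more regions
```

With minRegionsFailing set to 1, the first branch is satisfied as soon as the triggering region reports, so the incident confirms without waiting.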

Recovery

Recovery controls when a confirmed incident auto-resolves.
| Field | Type | Description |
| --- | --- | --- |
| consecutiveSuccesses | integer | Number of consecutive passing checks required before resolving |
| minRegionsPassing | integer | Minimum regions that must be healthy before recovery completes |
| cooldownMinutes | integer | Minutes after resolution before a new incident can open (0–60) |
The recovery policy ensures stability before closing an incident. After the required consecutive successes are observed across enough regions, the incident moves to RESOLVED. The cooldown period then suppresses new incidents for the same monitor, preventing flapping.
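
A minimal sketch of the recovery check and the cooldown follows. It assumes that consecutive successes are counted separately in each region, which is one reading of "across enough regions"; both function names are hypothetical.

```python
from datetime import datetime, timedelta


def can_resolve(
    success_streak_by_region: dict[str, int],
    consecutive_successes: int,
    min_regions_passing: int,
) -> bool:
    """True once enough regions each show the required run of passing checks.

    Assumption: success streaks are tracked per region.
    """
    healthy = sum(
        1
        for streak in success_streak_by_region.values()
        if streak >= consecutive_successes
    )
    return healthy >= min_regions_passing


def in_cooldown(resolved_at: datetime, now: datetime, cooldown_minutes: int) -> bool:
    """True while new incidents for the monitor are still suppressed."""
    return now < resolved_at + timedelta(minutes=cooldown_minutes)
```

Setting cooldownMinutes to 0 makes `in_cooldown` false immediately after resolution, so a new incident can open on the very next trigger.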

Managing policies

View a monitor’s policy

devhelm monitors get <monitor-id> --include-policy

Update a policy

devhelm monitors update-policy <monitor-id> \
  --trigger-type consecutive_failures \
  --trigger-count 3 \
  --trigger-scope per_region \
  --trigger-severity down \
  --confirmation-min-regions 2 \
  --confirmation-max-wait 120 \
  --recovery-successes 3 \
  --recovery-min-regions 2 \
  --recovery-cooldown 10

Next steps

  • Incidents overview: Understand the full incident lifecycle and statuses.
  • Monitoring regions: Learn how multi-region checks interact with confirmation policies.
  • Alerting overview: Configure notifications for confirmed incidents.
  • Maintenance windows: Suppress incidents during planned downtime.