By the end of this guide, you’ll have monitors that detect slow endpoints before they become outages — using layered response time thresholds at warn and fail severity.

Why response time budgets matter

An endpoint that’s technically “up” but responding in 10 seconds is effectively broken for users. Response time assertions let you define performance budgets at multiple levels, catching degradation early.

Set up layered thresholds

Use two assertions — a warning for early detection and a failure for critical slowdowns:
monitors:
  - name: API Health
    type: HTTP
    config:
      url: https://api.example.com/health
    frequencySeconds: 60
    regions:
      - us-east
      - eu-west
    assertions:
      - config:
          type: response_time
          thresholdMs: 500
        severity: warn
      - config:
          type: response_time
          thresholdMs: 2000
        severity: fail
This gives you:
  • Warning at 500ms — the endpoint is slower than expected; investigate
  • Failure at 2000ms — the endpoint is critically slow; this can open a DEGRADED incident
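Conceptually, the two assertions form a ladder: each check's response time is compared against both thresholds and the most severe breach wins. A minimal Python sketch of that evaluation (the function and dict shapes mirror the YAML above but are illustrative, not the product's API):

```python
# Sketch of layered assertion evaluation; evaluate_assertions is a
# hypothetical helper, not part of the product.
def evaluate_assertions(response_ms, assertions):
    """Return the highest severity breached, or None if all pass."""
    order = {"warn": 1, "fail": 2}
    breached = [a["severity"] for a in assertions if response_ms > a["thresholdMs"]]
    return max(breached, key=order.get) if breached else None

assertions = [
    {"thresholdMs": 500, "severity": "warn"},
    {"thresholdMs": 2000, "severity": "fail"},
]
print(evaluate_assertions(300, assertions))   # None
print(evaluate_assertions(800, assertions))   # warn
print(evaluate_assertions(2500, assertions))  # fail
```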

Combine with trigger rules

Response time assertions detect individual slow checks. For incident creation, use a response_time trigger rule to aggregate across checks:
incidentPolicy:
  triggerRules:
    - type: consecutive_failures
      count: 3
      scope: per_region
      severity: down
    - type: response_time
      thresholdMs: 2000
      aggregationType: p95
      scope: any_region
      severity: degraded
  confirmation:
    type: multi_region
    minRegionsFailing: 2
    maxWaitSeconds: 120
This creates a DEGRADED incident when the p95 response time exceeds 2 seconds across any region, while still opening a DOWN incident for complete failures.
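To make the consecutive_failures rule concrete, here is a rough per-region sketch in Python; the real evaluation happens server-side, and all names here are illustrative. The multi_region confirmation step from the YAML is omitted for brevity.

```python
from collections import defaultdict

# Hypothetical evaluator for a per-region consecutive_failures rule.
def should_open_down(check_results, count=3):
    """check_results: (region, passed) tuples in time order."""
    streak = defaultdict(int)
    for region, passed in check_results:
        streak[region] = 0 if passed else streak[region] + 1
        if streak[region] >= count:
            return True, region
    return False, None

results = [
    ("us-east", True), ("us-east", False), ("eu-west", False),
    ("us-east", False), ("eu-west", True), ("us-east", False),
]
print(should_open_down(results))  # (True, 'us-east')
```

A passing check resets that region's streak, so intermittent flakiness does not accumulate toward an incident.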

Aggregation types

Type         Behavior
-----------  ---------------------------------------------------
all_exceed   Every check in the window must exceed the threshold
average      Average response time exceeds the threshold
p95          95th percentile exceeds the threshold
max          Maximum response time exceeds the threshold
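As a sketch, the four behaviors can be expressed over a window of recent response times in milliseconds. Nearest-rank p95 is used here for simplicity; the product's exact percentile method may differ, and the function name is an assumption.

```python
import math
import statistics

# Illustrative comparison of the four aggregation behaviors.
def p95(values):
    """Nearest-rank 95th percentile (simplified)."""
    ranked = sorted(values)
    return ranked[math.ceil(0.95 * len(ranked)) - 1]

def exceeds(window, threshold_ms, aggregation):
    if aggregation == "all_exceed":
        return all(t > threshold_ms for t in window)
    if aggregation == "average":
        return statistics.mean(window) > threshold_ms
    if aggregation == "p95":
        return p95(window) > threshold_ms
    if aggregation == "max":
        return max(window) > threshold_ms
    raise ValueError(f"unknown aggregation: {aggregation}")

window = [400, 450, 480, 500, 2200]
print(exceeds(window, 2000, "max"))         # True: one spike past 2000ms
print(exceeds(window, 2000, "p95"))         # True: nearest-rank p95 is 2200ms
print(exceeds(window, 2000, "average"))     # False: mean is 806ms
print(exceeds(window, 2000, "all_exceed"))  # False: most checks are fast
```

Note how a single spike trips max but not average or all_exceed, which is why p95 is a common middle ground for degradation alerts.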

Route differently by severity

Use notification policies to handle DEGRADED and DOWN incidents differently:
Severity   Alert channel   Urgency
---------  --------------  ---------------------------------------------
DEGRADED   Slack channel   Awareness — investigate during business hours
DOWN       PagerDuty       Page on-call immediately
{
  "name": "Degraded to Slack",
  "matchRules": [
    { "type": "severity_gte", "value": "DEGRADED" }
  ],
  "escalation": {
    "steps": [{
      "delayMinutes": 0,
      "channelIds": ["<slack-channel-id>"]
    }]
  },
  "priority": 5
}
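The severity_gte match rule fires for the named severity and anything more severe. A hypothetical sketch of that comparison, assuming the ordering DEGRADED < DOWN (the evaluator function is not the product's API):

```python
# Assumed severity ordering; higher means more severe.
SEVERITY_ORDER = {"DEGRADED": 1, "DOWN": 2}

def matches(policy, incident_severity):
    """True if every severity_gte rule in the policy is satisfied."""
    return all(
        SEVERITY_ORDER[incident_severity] >= SEVERITY_ORDER[rule["value"]]
        for rule in policy["matchRules"]
        if rule["type"] == "severity_gte"
    )

slack_policy = {"matchRules": [{"type": "severity_gte", "value": "DEGRADED"}]}
pager_policy = {"matchRules": [{"type": "severity_gte", "value": "DOWN"}]}

print(matches(slack_policy, "DEGRADED"))  # True
print(matches(slack_policy, "DOWN"))      # True: DOWN also clears the bar
print(matches(pager_policy, "DEGRADED"))  # False: not severe enough to page
```

Because severity_gte DEGRADED also matches DOWN incidents, the JSON above sets a priority, which presumably arbitrates when several policies match the same incident.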

Choosing thresholds

Endpoint type          Warn threshold   Fail threshold
---------------------  ---------------  ---------------
Health check / status  200ms            1000ms
Public API             500ms            2000ms
Dashboard page         1000ms           5000ms
Background webhook     2000ms           10000ms
Base your thresholds on baseline measurements. Use the Dashboard’s response time charts to understand normal performance for each monitor.
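One way to turn baseline measurements into concrete numbers: set the warn threshold near the observed p95 and the fail threshold at a multiple of it. The helper below is a hypothetical sketch; the 4x multiplier and 50ms rounding are illustrative choices, not product recommendations.

```python
import math

# Hypothetical helper: derive warn/fail thresholds from a baseline sample.
def suggest_thresholds(baseline_ms, fail_multiplier=4):
    ranked = sorted(baseline_ms)
    p95 = ranked[math.ceil(0.95 * len(ranked)) - 1]  # nearest-rank p95
    warn = round(p95 / 50) * 50  # snap to a tidy 50ms step
    return {"warnMs": warn, "failMs": warn * fail_multiplier}

baseline = [120, 125, 130, 135, 140, 145, 150, 155, 160, 165,
            170, 180, 190, 200, 210, 230, 260, 300, 380, 900]
print(suggest_thresholds(baseline))  # {'warnMs': 400, 'failMs': 1600}
```

Anchoring warn at p95 means roughly one check in twenty triggers a warning under normal load, so anything noisier than that signals a real shift from baseline.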

Next steps

Incident policies

Configure trigger rules and response time aggregation.

HTTP assertions

Full list of HTTP assertion types.

Alert routing by tag

Send degraded and down alerts to different teams.