Why false positives happen
- Transient network issues — a single packet drop between probe and server
- DNS propagation delays — intermittent resolution failures during changes
- Load balancer health checks — brief unhealthy periods during deployments
- Rate limiting — monitoring probes get throttled
- Server restarts — momentary unavailability during graceful shutdown
Strategy 1: Multi-region confirmation
The most effective approach. Require failures from multiple probe regions before creating an incident:
- Single-region failure → likely a network path issue, not an outage
- Multi-region failure → likely a real problem with your service
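
The rule above can be sketched in a few lines. This is a minimal illustration, not DevHelm's implementation: the function name, the region names, and the default of two failing regions are all assumptions made for the example.

```python
from __future__ import annotations


def confirm_outage(region_results: dict[str, bool], min_failed_regions: int = 2) -> bool:
    """Return True when enough probe regions report a failed check.

    region_results maps a region name to True (check passed) or
    False (check failed). min_failed_regions is a hypothetical knob.
    """
    failed = [region for region, ok in region_results.items() if not ok]
    return len(failed) >= min_failed_regions


# One region failing looks like a network path issue: no incident.
single = confirm_outage({"us-east": False, "eu-west": True, "ap-south": True})

# Two regions failing looks like a real problem: open an incident.
multi = confirm_outage({"us-east": False, "eu-west": False, "ap-south": True})
```

A threshold of two keeps detection fast while filtering out problems local to one probe location. Raising it trades detection speed for confidence.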
Strategy 2: Consecutive failure thresholds
Require multiple consecutive failed checks before alerting:
| Threshold | Behavior |
|---|---|
| 1 failure | Alert immediately (noisy) |
| 2 consecutive | Filters out single transient failures |
| 3 consecutive | High confidence, but slower detection |
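
As a rough sketch of how a consecutive-failure threshold behaves, the class below keeps a running streak and resets it on any passing check. The class and method names are hypothetical and are not taken from any monitoring product.

```python
class ConsecutiveFailureTracker:
    """Tracks the current run of failed checks for one monitor."""

    def __init__(self, threshold: int = 2) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.streak = 0

    def record(self, ok: bool) -> bool:
        """Record one check result.

        Returns True while the failure streak is at or above the
        threshold, meaning the alert condition is met.
        """
        self.streak = 0 if ok else self.streak + 1
        return self.streak >= self.threshold


tracker = ConsecutiveFailureTracker(threshold=2)
# fail, pass, fail, fail: only the last check completes a streak of two.
outcomes = [tracker.record(ok) for ok in [False, True, False, False]]
```

Detection delay grows with the threshold: at a 60-second check interval, a threshold of 3 cannot alert until roughly two minutes after the first failed check.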
Strategy 3: Confirmation windows
Instead of counting consecutive failures, count failures within a time window:
- “3 failures in the last 5 minutes” catches intermittent issues
- More flexible than consecutive-only thresholds
- Handles scenarios where checks alternate between pass and fail
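
A sliding-window counter is one way to implement this. The sketch below assumes timestamps in seconds and uses illustrative names. It stores only the failure times and drops those that have aged out of the window.

```python
from collections import deque


class WindowedFailureCounter:
    """Counts failed checks inside a trailing time window."""

    def __init__(self, max_failures: int = 3, window_seconds: float = 300.0) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = deque()  # timestamps of failed checks, oldest first

    def record(self, ok: bool, now: float) -> bool:
        """Record one result at time `now`; return True if the alert condition is met."""
        if not ok:
            self._failures.append(now)
        # Discard failures that are no longer inside the window.
        cutoff = now - self.window_seconds
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures


counter = WindowedFailureCounter(max_failures=3, window_seconds=300)
# Checks every 60 s that alternate fail/pass: never two failures in a row,
# yet the third failure within five minutes still meets the condition.
flapping = [
    counter.record(ok, now)
    for now, ok in [(0, False), (60, True), (120, False), (180, True), (240, False)]
]
```

A consecutive-only threshold of 2 would never fire on this sequence, which is the flapping case the window is meant to catch.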
Strategy 4: Smart assertions
Overly strict assertions cause false positives.
Too strict:
- Response body must exactly match a snapshot (breaks on any content change)
- Response time must be under 200ms (fails during normal load spikes)
Better:
- Response body contains "status": "healthy" (tolerates other field changes)
- Response time p95 under 2 seconds (allows occasional slow requests)
- Status code is in the 2xx range (not just exactly 200)
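
To make the tolerant versions concrete, here is a minimal sketch. It assumes the health endpoint returns a JSON object, which the text above does not state, and the function names are illustrative. The p95 helper uses the nearest-rank method, one of several common percentile definitions.

```python
from __future__ import annotations

import json


def check_response(status_code: int, body: str) -> list[str]:
    """Return a list of assertion failures; an empty list means the check passed."""
    problems = []
    # Accept any 2xx status rather than exactly 200.
    if not 200 <= status_code <= 299:
        problems.append(f"unexpected status code {status_code}")
    # Look only at the "status" field so other fields can change freely.
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = None
    if not isinstance(payload, dict) or payload.get("status") != "healthy":
        problems.append('body does not report "status": "healthy"')
    return problems


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of response times."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


ok_problems = check_response(201, '{"status": "healthy", "version": "2.1"}')
bad_problems = check_response(500, '{"status": "down"}')
# Nineteen fast requests and one 10 s outlier: p95 stays at 0.1 s.
latency_p95 = p95([0.1] * 19 + [10.0])
```

A single slow request out of twenty does not push the p95 over a 2-second limit, whereas a per-request 200ms assertion would have failed on it.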
Strategy 5: Separate warning and failure severities
Use two-tier assertions:
- Warning (severity: warn): response time > 1s → log but don’t alert
- Failure (severity: fail): response time > 5s → create incident
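
The two tiers amount to checking the stricter limit first. The sketch below reuses the 1s and 5s limits from the list above; the function name and the "ok" return value are assumptions for illustration.

```python
def classify_response_time(seconds: float, warn_after: float = 1.0, fail_after: float = 5.0) -> str:
    """Map a response time to "ok", "warn", or "fail"."""
    if seconds > fail_after:
        return "fail"  # create an incident
    if seconds > warn_after:
        return "warn"  # log only, no alert
    return "ok"


levels = [classify_response_time(t) for t in (0.4, 2.0, 6.0)]
```

Only the "fail" result should feed the incident pipeline. "warn" results are still useful for spotting slow degradation in dashboards.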
Strategy 6: Maintenance windows
Schedule alert suppression during planned maintenance:
- Deployments
- Database migrations
- Infrastructure changes
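
Suppression during a window comes down to a time-range check before an alert is sent. The following is a sketch under assumed names (`MaintenanceWindow`, `alerts_suppressed`), using timezone-aware UTC datetimes so windows compare correctly across regions.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MaintenanceWindow:
    start: datetime  # inclusive
    end: datetime    # exclusive
    reason: str


def alerts_suppressed(now: datetime, windows: list[MaintenanceWindow]) -> bool:
    """Return True if `now` falls inside any scheduled window."""
    return any(w.start <= now < w.end for w in windows)


windows = [
    MaintenanceWindow(
        start=datetime(2024, 6, 1, 2, 0, tzinfo=timezone.utc),
        end=datetime(2024, 6, 1, 3, 0, tzinfo=timezone.utc),
        reason="database migration",
    )
]
during = alerts_suppressed(datetime(2024, 6, 1, 2, 30, tzinfo=timezone.utc), windows)
after = alerts_suppressed(datetime(2024, 6, 1, 3, 0, tzinfo=timezone.utc), windows)
```

Keeping the end bound exclusive means alerting resumes at the exact scheduled end time, so a window that overruns is noticed instead of silently extended.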
Measuring false positive rate
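A common definition, assumed here, is the share of alerts that did not correspond to a real problem. In the sketch below the counts come from your own incident review, and the function name is illustrative.

```python
def false_positive_rate(false_alerts: int, total_alerts: int) -> float:
    """Fraction of alerts that were not real incidents, between 0.0 and 1.0."""
    if false_alerts < 0 or total_alerts < 0 or false_alerts > total_alerts:
        raise ValueError("counts must satisfy 0 <= false_alerts <= total_alerts")
    if total_alerts == 0:
        return 0.0  # no alerts in the period, so nothing was a false positive
    return false_alerts / total_alerts


# Example: 3 of 20 alerts in a period were closed as "not a real outage".
rate = false_positive_rate(3, 20)
```
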
Track your false positive rate over time.
DevHelm configuration
Incident policies
Configure trigger rules and multi-region confirmation.
Multi-region monitoring
Set up checks from multiple locations.