The monitoring challenge
In a monolith, one health check covers the entire application. In microservices, you need to monitor:
- Each service independently: is the user service up? The payment service?
- Service dependencies: can service A reach service B?
- Aggregate health: is the overall system working for end users?
- Shared infrastructure: databases, message queues, caches
Layer your monitoring
Layer 1: External endpoint monitoring
Monitor the endpoints your users actually hit. These are your most important checks because they represent real user experience:
- API gateway health
- Public website availability
- Authentication endpoints
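As a rough sketch of an external check, the probe below requests a URL and reports whether it answered with a 2xx status. It uses only Python's standard library. The `probe` name, the default timeout, and the result shape are illustrative assumptions, not part of any particular monitoring product.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout_s: float = 5.0) -> dict:
    """Request a public endpoint and report whether it returned a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            status = response.status
            return {"url": url, "up": 200 <= status < 300, "status": status}
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return {"url": url, "up": False, "status": err.code}
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar network errors.
        return {"url": url, "up": False, "status": None}
```

Run on a schedule against the gateway, the public site, and the login endpoint, a probe like this approximates what users actually experience.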
Layer 2: Individual service health
Each microservice should expose a health endpoint that tests its own dependencies.
Layer 3: Inter-service communication
Monitor service-to-service calls:
- Can the order service reach the inventory service?
- Is the message queue processing events?
- Are internal API latencies within budget?
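One way to check both reachability and latency budget in a single step is to time the call itself. The sketch below is a minimal illustration: `check_latency`, its parameters, and the result keys are hypothetical names, and `call` stands in for whatever client function your order service would use to reach the inventory service.

```python
import time
from typing import Callable


def check_latency(call: Callable[[], object], budget_ms: float) -> dict:
    """Time one service-to-service call and compare it with a latency budget."""
    start = time.perf_counter()
    try:
        call()
        reachable = True
    except Exception:
        # Any exception from the client counts as "could not reach the service".
        reachable = False
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {
        "reachable": reachable,
        "elapsed_ms": elapsed_ms,
        "within_budget": reachable and elapsed_ms <= budget_ms,
    }
```

A call that succeeds but exceeds the budget is reported as reachable yet out of budget, which lets you alert on slowness before it turns into an outage.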
Layer 4: Infrastructure
Monitor the shared components that multiple services depend on:
- Database clusters
- Message brokers (NATS, RabbitMQ, Kafka)
- Cache layers (Redis, Memcached)
- Service mesh / load balancers
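For components that do not speak HTTP, a plain TCP connect is often the simplest first signal. The following is a minimal sketch using Python's standard `socket` module. The `tcp_reachable` name and the default timeout are assumptions for illustration, and a successful connect only shows that the port accepts connections, not that the database or broker behind it is healthy.

```python
import socket


def tcp_reachable(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

You would point this at each shared component in turn, for example a database port or a broker port, and pair it with a deeper protocol-level check where one is available.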
Health endpoint design
A good microservice health endpoint:
- Tests real dependencies: actually queries the database, pings the cache
- Is fast: returns in under 500ms (use timeouts on dependency checks)
- Is unauthenticated: monitoring probes shouldn’t need service credentials
- Returns structured data: JSON with component-level status
- Distinguishes readiness from liveness: “can accept traffic” vs “process is alive”
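The properties above can be sketched in a framework-agnostic way. The example below is a minimal illustration, not a definitive implementation: `run_checks`, `liveness`, `readiness`, the 0.4 second budget, and the status strings are all assumed names and values. Each check is a callable that raises on failure. In a real service these would query the database or ping the cache, and the returned status code and JSON body would be served from unauthenticated routes.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as CheckTimeout
from typing import Callable, Dict, Tuple

# Assumed budget for all dependency checks combined, chosen to stay under 500 ms.
CHECK_BUDGET_S = 0.4


def run_checks(checks: Dict[str, Callable[[], object]],
               budget_s: float = CHECK_BUDGET_S) -> Dict[str, str]:
    """Run dependency checks in parallel; a check passes if it returns without raising."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    futures = {name: pool.submit(check) for name, check in checks.items()}
    deadline = time.monotonic() + budget_s
    results: Dict[str, str] = {}
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            future.result(timeout=remaining)
            results[name] = "up"
        except CheckTimeout:
            results[name] = "timeout"
        except Exception:
            results[name] = "down"
    pool.shutdown(wait=False)  # do not hold the response for slow checks
    return results


def liveness() -> Tuple[int, str]:
    """Liveness: the process is alive and able to answer."""
    return 200, json.dumps({"status": "alive"})


def readiness(checks: Dict[str, Callable[[], object]]) -> Tuple[int, str]:
    """Readiness: every dependency answered, so the service can accept traffic."""
    components = run_checks(checks)
    ready = all(state == "up" for state in components.values())
    body = {"status": "ready" if ready else "not_ready", "components": components}
    return (200 if ready else 503), json.dumps(body)
```

Sharing one deadline across all checks, rather than giving each its own timeout, keeps the worst-case response time bounded no matter how many dependencies the service has.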
Resource groups for composite health
Group related monitors into a single health view:
- Health threshold: “degraded if 2+ members are down”
- Group-level alerts: notify when the group’s health drops
- Suppress member alerts: avoid alert storms from cascading failures
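To make the interaction between the threshold and alert suppression concrete, here is a small model of a resource group. It is a sketch under stated assumptions and does not reflect any product's actual API: the `ResourceGroup` class, its field names, and the rule that member alerts are suppressed only while the group itself is degraded are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ResourceGroup:
    name: str
    degraded_threshold: int = 2          # "degraded if 2+ members are down"
    suppress_member_alerts: bool = True  # assumed default for this sketch


def group_health(group: ResourceGroup, member_up: Dict[str, bool]) -> str:
    """Return "degraded" once the number of down members reaches the threshold."""
    down_count = sum(1 for up in member_up.values() if not up)
    return "degraded" if down_count >= group.degraded_threshold else "healthy"


def alerts_to_send(group: ResourceGroup, member_up: Dict[str, bool]) -> List[str]:
    """List the alerts that would fire for the current member states."""
    down = sorted(name for name, up in member_up.items() if not up)
    alerts: List[str] = []
    if group_health(group, member_up) == "degraded":
        alerts.append(f"group:{group.name}")
        if group.suppress_member_alerts:
            return alerts  # one group alert instead of one per member
    alerts.extend(f"member:{name}" for name in down)
    return alerts
```

With three members and two of them down, this model emits a single `group:checkout` alert. With only one down, the group stays healthy and the individual member alert still goes out.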
Avoiding alert storms
When a shared dependency (database, DNS) fails, every service that depends on it fires an alert simultaneously. Mitigate this with:
- Resource groups with suppressMemberAlerts: one group alert instead of ten
- Notification policy routing: route by tag so infrastructure alerts go to the right team
- Confirmation delays: wait 30–60 seconds before confirming, allowing cascading failures to consolidate
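A confirmation delay can be modeled as a small state machine: remember when each monitor first failed, and confirm the incident only if the failure is still present after the delay. The sketch below assumes a hypothetical `ConfirmationDelay` class and takes timestamps as arguments so that the logic is easy to follow and test.

```python
from typing import Dict


class ConfirmationDelay:
    """Confirm a failure only after it has persisted for `delay_s` seconds."""

    def __init__(self, delay_s: float = 30.0) -> None:
        self.delay_s = delay_s
        self._first_failure: Dict[str, float] = {}

    def observe(self, monitor: str, is_up: bool, now: float) -> bool:
        """Record one check result; return True once the failure is confirmed."""
        if is_up:
            # Recovery clears the pending failure, so brief blips never alert.
            self._first_failure.pop(monitor, None)
            return False
        started = self._first_failure.setdefault(monitor, now)
        return now - started >= self.delay_s
```

Because recovery resets the timer, a monitor that fails, recovers, and fails again has to stay down for the full delay before it is confirmed.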
DevHelm for microservices
- Resource groups: build composite health views.
- Monitor types: HTTP, TCP, DNS, and heartbeat checks.