Microservices need monitoring at multiple levels — individual service health, inter-service communication, and aggregate system status.

The monitoring challenge

In a monolith, one health check covers the entire application. In microservices, you need to monitor:
  • Each service independently — is the user service up? The payment service?
  • Service dependencies — can service A reach service B?
  • Aggregate health — is the overall system working for end users?
  • Shared infrastructure — databases, message queues, caches

Layer your monitoring

Layer 1: External endpoint monitoring

Monitor the endpoints your users actually hit. These are your most important checks because they represent real user experience:
  • API gateway health
  • Public website availability
  • Authentication endpoints

Layer 2: Individual service health

Each microservice should expose a health endpoint that tests its own dependencies:
GET /health
{
  "status": "healthy",
  "dependencies": {
    "database": "connected",
    "cache": "connected",
    "messageQueue": "connected"
  }
}
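A response like the one above could be assembled from a set of per-dependency check functions. The following is a minimal sketch, not a prescribed implementation: the names `build_health_report` and `http_status_for`, the "unhealthy" and "disconnected" labels, and the 200/503 mapping are illustrative assumptions, and the lambdas stand in for real database, cache, and queue probes.

```python
import json
from typing import Callable, Dict


def build_health_report(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each dependency check and assemble a /health response body."""
    dependencies = {}
    for name, check in checks.items():
        try:
            connected = bool(check())
        except Exception:
            # A check that raises is treated the same as one that reports failure.
            connected = False
        # "disconnected" is an assumed label; the sample only shows "connected".
        dependencies[name] = "connected" if connected else "disconnected"

    all_ok = all(state == "connected" for state in dependencies.values())
    return {
        "status": "healthy" if all_ok else "unhealthy",  # "unhealthy" is assumed
        "dependencies": dependencies,
    }


def http_status_for(report: dict) -> int:
    """Assumed convention: 200 when healthy, 503 otherwise."""
    return 200 if report["status"] == "healthy" else 503


if __name__ == "__main__":
    report = build_health_report({
        "database": lambda: True,      # stand-in for a real query such as SELECT 1
        "cache": lambda: True,         # stand-in for a cache ping
        "messageQueue": lambda: True,  # stand-in for a broker connection check
    })
    print(http_status_for(report))
    print(json.dumps(report, indent=2))
```

Returning a non-2xx status on failure lets a simple HTTP monitor detect the problem without parsing the JSON body, while the body still tells a human which dependency failed.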

Layer 3: Inter-service communication

Monitor service-to-service calls:
  • Can the order service reach the inventory service?
  • Is the message queue processing events?
  • Are internal API latencies within budget?
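One way to make "within budget" concrete is to compare a high percentile of recent call latencies against a per-route budget. This is a rough sketch under stated assumptions: the nearest-rank percentile method, the 95th-percentile default, and the function names are choices made for illustration, and the samples would come from whatever the calling service already records.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def within_budget(samples_ms: Sequence[float], budget_ms: float, pct: float = 95.0) -> bool:
    """True if the chosen percentile of observed latencies fits the budget."""
    return percentile(samples_ms, pct) <= budget_ms


if __name__ == "__main__":
    # Hypothetical latencies for order service -> inventory service calls.
    samples = [12.0] * 19 + [480.0]
    print(within_budget(samples, budget_ms=200.0))            # checks p95
    print(within_budget(samples, budget_ms=200.0, pct=99.0))  # checks p99
```

Checking a percentile rather than the mean keeps a single slow outlier from hiding behind many fast calls, and lets the budget be tightened by raising `pct`.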

Layer 4: Infrastructure

Monitor the shared components that multiple services depend on:
  • Database clusters
  • Message brokers (NATS, RabbitMQ, Kafka)
  • Cache layers (Redis, Memcached)
  • Service mesh / load balancers
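For shared components that do not expose an HTTP endpoint, the simplest probe is a TCP connect check. The sketch below uses only the Python standard library; `tcp_check` is a name chosen for illustration, and the host and port in the usage lines are placeholders.

```python
import socket


def tcp_check(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder target: a PostgreSQL port on a hypothetical host.
    print(tcp_check("db.internal.example", 5432, timeout_s=1.0))
```

A successful connect only shows that something is accepting connections on the port. It does not show that the database can answer queries, which is why the service-level health endpoints should still run a real query.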

Health endpoint design

A good microservice health endpoint:
  1. Tests real dependencies — actually queries the database, pings the cache
  2. Is fast, returning in under 500 ms (put a timeout on each dependency check so one slow dependency cannot stall the response)
  3. Is unauthenticated — monitoring probes shouldn’t need service credentials
  4. Returns structured data — JSON with component-level status
  5. Distinguishes readiness from liveness — “can accept traffic” vs “process is alive”
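Points 2 and 5 can be sketched together: a liveness handler that touches no dependencies, and a readiness handler that runs all dependency checks concurrently under one shared deadline. Everything named here is an assumption for illustration (`liveness`, `readiness`, the status strings, and the 0.4 s default chosen to leave headroom under a 500 ms target).

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Dict


def liveness() -> dict:
    """Liveness: the process can run this handler. No dependency calls."""
    return {"status": "alive"}


def readiness(checks: Dict[str, Callable[[], bool]], timeout_s: float = 0.4) -> dict:
    """Readiness: every dependency check passed before the shared deadline."""
    results = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    try:
        # Start all checks at once so the total time is bounded by the deadline,
        # not by the sum of the individual checks.
        futures = {name: pool.submit(check) for name, check in checks.items()}
        deadline = time.monotonic() + timeout_s
        for name, future in futures.items():
            remaining = max(0.0, deadline - time.monotonic())
            try:
                results[name] = "ok" if future.result(timeout=remaining) else "failed"
            except FutureTimeout:
                results[name] = "timeout"
            except Exception:
                results[name] = "failed"
    finally:
        # Return on time; a timed-out check keeps running in its worker thread.
        pool.shutdown(wait=False)
    ready = all(state == "ok" for state in results.values())
    return {"status": "ready" if ready else "not_ready", "checks": results}


if __name__ == "__main__":
    print(liveness())
    print(readiness({
        "database": lambda: True,                  # stand-in for a real query
        "cache": lambda: time.sleep(1.0) or True,  # simulated slow dependency
    }))
```

Because a timed-out check is abandoned rather than cancelled, real checks should also set their own client-level timeouts so worker threads do not pile up behind a hung dependency.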

Resource groups for composite health

Group related monitors into a single health view:
Payment Service (resource group)
  ├── Payment API health (HTTP)
  ├── Payment database (TCP:5432)
  ├── Stripe webhook handler (HTTP)
  └── Stripe status (dependency)

A resource group can define:
  • Health threshold — “degraded if 2+ members are down”
  • Group-level alerts — notify when the group’s health drops
  • Suppress member alerts — avoid alert storms from cascading failures
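The interaction between the threshold and member-alert suppression can be sketched as follows. This is an illustrative model only, not DevHelm's actual schema or evaluation logic: the `ResourceGroup` fields, the "healthy"/"degraded" states, and the rule that member alerts are suppressed only while the group alert is firing are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ResourceGroup:
    name: str
    members: List[str]
    degraded_threshold: int = 2          # "degraded if 2+ members are down"
    suppress_member_alerts: bool = True  # mirrors the suppressMemberAlerts idea


def down_members(group: ResourceGroup, member_up: Dict[str, bool]) -> List[str]:
    """Members reported down; a member with no report counts as down."""
    return [m for m in group.members if not member_up.get(m, False)]


def group_health(group: ResourceGroup, member_up: Dict[str, bool]) -> str:
    down = down_members(group, member_up)
    return "degraded" if len(down) >= group.degraded_threshold else "healthy"


def alerts_for(group: ResourceGroup, member_up: Dict[str, bool]) -> List[str]:
    """Decide which alerts to emit for the current member states."""
    down = down_members(group, member_up)
    member_alerts = [f"member:{m}" for m in down]
    if group_health(group, member_up) != "degraded":
        return member_alerts
    if group.suppress_member_alerts:
        return [f"group:{group.name}"]  # one alert instead of one per member
    return [f"group:{group.name}"] + member_alerts


if __name__ == "__main__":
    payment = ResourceGroup(
        name="Payment Service",
        members=["Payment API health", "Payment database",
                 "Stripe webhook handler", "Stripe status"],
    )
    state = {m: True for m in payment.members}
    state["Payment database"] = False
    state["Payment API health"] = False
    print(group_health(payment, state), alerts_for(payment, state))
```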

Avoiding alert storms

When a shared dependency (database, DNS) fails, every service that depends on it fires an alert simultaneously. Mitigate this with:
  • Resource groups with suppressMemberAlerts — one group alert instead of ten
  • Notification policy routing — route by tag so infrastructure alerts go to the right team
  • Confirmation delays of 30–60 seconds before an incident is confirmed, so alerts from a cascading failure can consolidate into one
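A confirmation delay can be modelled as a small gate that reports a failure only once it has persisted for the whole window. The sketch below is an assumption about how such a gate might work, not a description of any particular product: `ConfirmationGate` and `observe` are hypothetical names, and the clock value is passed in explicitly so the behaviour is easy to test.

```python
from typing import Dict


class ConfirmationGate:
    """Confirms a failure only after it has persisted for delay_s seconds."""

    def __init__(self, delay_s: float = 30.0):
        self.delay_s = delay_s
        self._failing_since: Dict[str, float] = {}

    def observe(self, monitor: str, is_up: bool, now: float) -> bool:
        """Record one check result; return True while the failure is confirmed."""
        if is_up:
            # Recovery inside the window clears the pending failure: no alert.
            self._failing_since.pop(monitor, None)
            return False
        # Remember when this monitor first started failing.
        first_seen = self._failing_since.setdefault(monitor, now)
        return now - first_seen >= self.delay_s


if __name__ == "__main__":
    gate = ConfirmationGate(delay_s=30.0)
    for t, up in [(0, False), (15, False), (30, False)]:
        print(t, gate.observe("payment-db", up, now=float(t)))
```

The gate returns True on every observation after confirmation, so a caller would still need to deduplicate in order to send a single notification per incident.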

DevHelm for microservices

Resource groups

Build composite health views.

Monitor types

HTTP, TCP, DNS, and heartbeat checks.