Monitoring Microservices

Microservices need monitoring at multiple levels — individual service health, inter-service communication, and aggregate system status.

The monitoring challenge

In a monolith, a single health check can stand in for the whole application. With microservices, no single check tells the whole story, so you need to monitor:

Each service independently — is the user service up? The payment service?
Service dependencies — can service A reach service B?
Aggregate health — is the overall system working for end users?
Shared infrastructure — databases, message queues, caches

Layer your monitoring

Layer 1: External endpoint monitoring

Monitor the endpoints your users actually hit. These are your most important checks because they represent real user experience:

API gateway health
Public website availability
Authentication endpoints

Layer 2: Individual service health

Each microservice should expose a health endpoint that tests its own dependencies:

GET /health
{
  "status": "healthy",
  "dependencies": {
    "database": "connected",
    "cache": "connected",
    "messageQueue": "connected"
  }
}

Layer 3: Inter-service communication

Monitor service-to-service calls:

Can the order service reach the inventory service?
Is the message queue processing events?
Are internal API latencies within budget?
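One way to make "within budget" concrete is to compare a latency percentile against a per-call budget. The sketch below is illustrative only: the function names, the nearest-rank percentile method, and the choice of the 95th percentile are assumptions, not part of any particular monitoring tool.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def within_budget(latencies_ms, budget_ms, pct=95):
    """True if the chosen percentile of observed latencies fits the budget."""
    return percentile(latencies_ms, pct) <= budget_ms
```

For example, within_budget([10, 12, 11, 13, 250], budget_ms=100) is False, because with only five samples the single 250 ms call is the 95th-percentile sample.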

Layer 4: Infrastructure

Monitor the shared components that multiple services depend on:

Database clusters
Message brokers (NATS, RabbitMQ, Kafka)
Cache layers (Redis, Memcached)
Service mesh / load balancers
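For components that do not speak HTTP, a plain TCP connect is often the simplest reachability probe. The following is a minimal sketch using only the Python standard library; the function name and the 2-second default timeout are assumptions, and a successful connect only shows that the port accepts connections, not that the database or broker behind it is healthy.

```python
import socket


def tcp_reachable(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        # create_connection resolves the host and applies the timeout to the connect.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

A probe for a PostgreSQL cluster could call tcp_reachable("db.internal", 5432), where "db.internal" is a hypothetical hostname.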

Health endpoint design

A good microservice health endpoint:

Tests real dependencies: actually queries the database and pings the cache
Is fast: returns in under 500 ms, with a timeout on every dependency check
Is unauthenticated: monitoring probes shouldn’t need service credentials
Returns structured data: JSON with component-level status
Distinguishes readiness from liveness: “can accept traffic” vs “process is alive”, with dependency checks kept on the readiness side so that a failing database does not cause otherwise healthy processes to be restarted
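The properties above can be sketched in a framework-agnostic way. This is one possible shape, not a prescribed implementation: the function names, the "timeout" and "unreachable" state strings, the 0.4-second default, and the use of HTTP 503 for an unhealthy readiness result are all assumptions. Each check is any callable that raises on failure.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_checks(checks, timeout_s=0.4):
    """Run dependency checks in parallel under one shared deadline."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    futures = {name: pool.submit(fn) for name, fn in checks.items()}
    deadline = time.monotonic() + timeout_s
    results = {}
    for name, fut in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            fut.result(timeout=remaining)
            results[name] = "connected"
        except FutureTimeout:
            results[name] = "timeout"      # assumed label for a slow dependency
        except Exception:
            results[name] = "unreachable"  # assumed label for a failing check
    # Do not block the response on checks that are still running.
    pool.shutdown(wait=False)
    return results


def readiness(checks, timeout_s=0.4):
    """Readiness: can this instance accept traffic right now?"""
    deps = run_checks(checks, timeout_s)
    healthy = all(state == "connected" for state in deps.values())
    body = {"status": "healthy" if healthy else "unhealthy", "dependencies": deps}
    return (200 if healthy else 503), json.dumps(body)


def liveness():
    """Liveness: the process is up; deliberately checks no dependencies."""
    return 200, json.dumps({"status": "alive"})
```

Running the checks in parallel under a single deadline keeps the worst-case response time close to timeout_s, regardless of how many dependencies there are.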

Resource groups for composite health

Group related monitors into a single health view:

Payment Service (resource group)
  ├── Payment API health (HTTP)
  ├── Payment database (TCP:5432)
  ├── Stripe webhook handler (HTTP)
  └── Stripe status (dependency)

A resource group can define:

Health threshold — “degraded if 2+ members are down”
Group-level alerts — notify when the group’s health drops
Suppress member alerts — avoid alert storms from cascading failures
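As a rough illustration of how a health threshold might be evaluated, the sketch below models a group as a set of member monitors with up/down state. The class and field names are hypothetical and do not reflect any product's actual API; it also assumes that a group with fewer down members than the threshold still reports "healthy".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ResourceGroup:
    name: str
    members: Dict[str, bool]            # monitor name -> True if the monitor is up
    degraded_threshold: int = 2         # "degraded if 2+ members are down"
    suppress_member_alerts: bool = True

    def down_members(self) -> List[str]:
        """Names of members currently down, sorted for stable output."""
        return sorted(name for name, up in self.members.items() if not up)

    def status(self) -> str:
        """'degraded' once the number of down members reaches the threshold."""
        if len(self.down_members()) >= self.degraded_threshold:
            return "degraded"
        return "healthy"


group = ResourceGroup(
    name="Payment Service",
    members={
        "Payment API health": True,
        "Payment database": False,
        "Stripe webhook handler": True,
        "Stripe status": True,
    },
)
```

With one member down and a threshold of two, group.status() returns "healthy"; marking a second member as down changes it to "degraded".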

Avoiding alert storms

When a shared dependency (database, DNS) fails, every service that depends on it fires an alert simultaneously. Mitigate this with:

Resource groups with suppressMemberAlerts: one group alert instead of ten
Notification policy routing: route by tag so infrastructure alerts go to the right team
Confirmation delays: wait 30–60 seconds before confirming an incident, so that cascading failures consolidate into a single alert
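To show how member suppression and a confirmation delay combine, here is a small sketch of the decision logic. Everything in it is an assumption made for illustration: the function name, the alert string format, and the rule that any confirmed member failure produces exactly one group-level alert.

```python
def alerts_to_send(first_failed_at, groups, now, confirm_after_s=30.0):
    """Decide which alerts to emit at time `now` (all times in seconds).

    first_failed_at: monitor name -> time the monitor first started failing
    groups: group name -> list of member monitor names (member alerts suppressed)
    """
    # Confirmation delay: only failures that have persisted long enough count.
    confirmed = {
        monitor
        for monitor, started in first_failed_at.items()
        if now - started >= confirm_after_s
    }

    alerts = []
    covered = set()
    for group, members in groups.items():
        down = sorted(confirmed & set(members))
        if down:
            # One group alert replaces the individual member alerts.
            alerts.append(f"group:{group} ({len(down)} down)")
            covered.update(down)

    # Monitors outside any affected group still alert on their own.
    alerts.extend(f"monitor:{m}" for m in sorted(confirmed - covered))
    return alerts
```

The delay trades a slightly later first notification for fewer, better-grouped alerts, since failures that begin within the same window are evaluated together.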
