SLIs measure your service. SLOs set internal targets. SLAs make external promises. Together they form the language of reliability.

Service Level Indicator (SLI)

An SLI is a quantitative measurement of a specific aspect of your service:
SLI          | What it measures                  | Example
Availability | Percentage of successful requests | 99.95% of HTTP checks pass
Latency      | Response time distribution        | p99 latency is 800ms
Error rate   | Percentage of failed requests     | 0.1% of API calls return 5xx
Throughput   | Requests processed per second     | 10,000 req/s sustained
SLIs should be measurable, meaningful to users, and derived from actual monitoring data.
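The availability and error-rate SLIs above are simple ratios over monitoring data. A minimal sketch (the request counts are illustrative, not from a real service):

```python
# Computing ratio-based SLIs from raw request counts.
# Numbers are illustrative examples, not real monitoring data.

def availability_sli(successful: int, total: int) -> float:
    """Percentage of successful requests."""
    return 100.0 * successful / total

def error_rate_sli(failed: int, total: int) -> float:
    """Percentage of failed requests."""
    return 100.0 * failed / total

print(round(availability_sli(999_500, 1_000_000), 4))  # 99.95
print(round(error_rate_sli(1_000, 1_000_000), 4))      # 0.1
```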

Service Level Objective (SLO)

An SLO is an internal target for an SLI:
  • “Our API will have 99.9% availability measured over a rolling 30-day window”
  • “p95 latency will be below 500ms”
  • “Error rate will stay below 0.1%”
SLOs are targets your team commits to. They’re more aggressive than SLAs and give you a buffer before you violate external promises.
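Checking a measured SLI against an SLO target is a one-line comparison; the only subtlety is direction (latency and error rate are "lower is better"). A sketch using the example thresholds above (function name is illustrative):

```python
# Compare a measured SLI against an SLO target. For availability,
# higher is better; for latency and error rate, lower is better.

def meets_slo(measured: float, target: float, lower_is_better: bool = False) -> bool:
    """True when the measured SLI satisfies the SLO target."""
    return measured <= target if lower_is_better else measured >= target

print(meets_slo(99.97, 99.9))                          # availability: True
print(meets_slo(450.0, 500.0, lower_is_better=True))   # p95 latency: True
```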

Error budgets

The error budget is the inverse of your SLO — the amount of unreliability you’re allowed:
Error budget = 1 − SLO target
For a 99.9% availability SLO over 30 days:
  • Error budget = 0.1% of the 43,200 minutes in the window ≈ 43 minutes of downtime
When the error budget is consumed, the team should prioritize reliability over new features.
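The arithmetic above can be sketched directly: the budget is the complement of the SLO target, converted into minutes over the measurement window.

```python
# Error budget = 1 - SLO target, expressed as minutes of allowed
# downtime over the measurement window.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unreliability allowed at the given SLO target."""
    window_minutes = window_days * 24 * 60   # 43,200 for a 30-day window
    return (1.0 - slo_target) * window_minutes

print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes over 30 days
```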

Service Level Agreement (SLA)

An SLA is a contract with customers that defines minimum service levels and consequences for violations:
  • “We guarantee 99.95% uptime per month. If we fall below this, affected customers receive a 10% service credit.”
SLAs are legal commitments. They should always be less aggressive than your SLOs — your SLO is the early warning system that prevents SLA violations.

How they relate

SLI (measurement) → SLO (target) → SLA (promise)

"Our availability is 99.97%" → "We target 99.95%" → "We guarantee 99.9%"
Layer | Audience               | Consequence of miss
SLI   | Engineering team       | Data point for dashboards
SLO   | Engineering + product  | Triggers the error budget policy
SLA   | Customers              | Financial penalties, credits
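The layering implies an ordering invariant worth checking when you set numbers: the measured SLI should sit above the SLO target, which should sit above the SLA guarantee. A tiny illustrative check (function name is an assumption):

```python
# Sanity-check the SLI >= SLO >= SLA ordering described above.

def layers_consistent(sli_measured: float, slo_target: float, sla_guarantee: float) -> bool:
    """True when measurement, internal target, and external promise are ordered correctly."""
    return sli_measured >= slo_target >= sla_guarantee

print(layers_consistent(99.97, 99.95, 99.9))  # True
```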

Setting effective SLOs

Start with user impact

What does the user experience when the SLI degrades? If users don’t notice p99 latency increasing from 200ms to 400ms, an aggressive latency SLO wastes engineering effort.

Use meaningful time windows

  • Rolling windows (last 30 days) give a continuous view
  • Calendar windows (this month) align with business reporting
  • Rolling windows are generally better for engineering decision-making
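A rolling window just filters check results by timestamp before computing the ratio. A sketch, assuming per-check results are stored as (timestamp, passed) pairs (the storage format is an assumption, not DevHelm's actual schema):

```python
# Availability over a rolling window: keep only checks newer than
# `now - window_days`, then compute the pass ratio.
from datetime import datetime, timedelta

def rolling_availability(checks, now, window_days=30):
    """checks: iterable of (datetime, bool) pairs. Returns availability %."""
    cutoff = now - timedelta(days=window_days)
    recent = [ok for ts, ok in checks if ts >= cutoff]
    if not recent:
        return 100.0          # no data in window; treat as fully available
    return 100.0 * sum(recent) / len(recent)

# 60 days of daily checks, with failures on days 3 and 10
now = datetime(2024, 6, 30)
checks = [(now - timedelta(days=d), d not in (3, 10)) for d in range(60)]
print(round(rolling_availability(checks, now), 2))
```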

Not everything needs 99.99%

Service type       | Typical availability target
Payment processing | 99.99%
User-facing API    | 99.9%
Admin dashboard    | 99.5%
Development tools  | 99%
Batch processing   | "Completes within SLA window"
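Converting the availability targets above into allowed downtime per 30-day window makes the gap between the tiers concrete:

```python
# Minutes of downtime permitted per 30-day window at each target.

def allowed_downtime_minutes(target_pct: float, window_days: int = 30) -> float:
    """Downtime budget in minutes at the given availability percentage."""
    return (100.0 - target_pct) / 100.0 * window_days * 24 * 60

for target in (99.99, 99.9, 99.5, 99.0):
    print(f"{target}%: {allowed_downtime_minutes(target):.1f} min/month")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why aggressive targets get expensive quickly.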

Monitoring SLIs with DevHelm

Uptime monitors provide the data for availability and latency SLIs:
  • Availability SLI = percentage of passing checks over the window
  • Latency SLI = response time percentile from check results
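The latency SLI is a percentile over the recorded response times. A sketch using the nearest-rank method (the sample data is illustrative, not real check output):

```python
# Nearest-rank percentile over response times, e.g. pct=95 for p95.
import math

def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

response_times_ms = [120, 95, 110, 480, 130, 105, 90, 850, 115, 100]
print(percentile(response_times_ms, 95))  # 850
```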

Uptime reporting guide

Generate uptime reports from monitoring data.

MTTR & MTTD

Incident response metrics that feed into SLOs.