SLA, SLO, and SLI

SLIs measure your service. SLOs set internal targets. SLAs make external promises. Together they form the language of reliability.

Service Level Indicator (SLI)

An SLI is a quantitative measurement of a specific aspect of your service:

SLI	What it measures	Example
Availability	Percentage of successful requests	99.95% of HTTP checks pass
Latency	Response time distribution	p99 latency is 800ms
Error rate	Percentage of failed requests	0.1% of API calls return 5xx
Throughput	Requests processed per second	10,000 req/s sustained

SLIs should be measurable, meaningful to users, and derived from actual monitoring data.
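As a rough illustration, the four SLIs in the table can be derived from raw request records. The sketch below assumes a hypothetical `Request` record with a status code and a latency, treats any 5xx response as a failure, and uses the nearest-rank percentile definition. Real monitoring tools may define each of these differently.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape, for illustration)."""
    status: int        # HTTP status code
    latency_ms: float  # time to respond, in milliseconds


def error_rate(requests: list[Request]) -> float:
    """Percentage of requests that returned a 5xx status."""
    failed = sum(1 for r in requests if r.status >= 500)
    return 100.0 * failed / len(requests)


def availability(requests: list[Request]) -> float:
    """Percentage of requests that did not fail with a 5xx status."""
    return 100.0 - error_rate(requests)


def latency_percentile(requests: list[Request], pct: float) -> float:
    """Nearest-rank percentile of latency (one of several common definitions)."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def throughput(requests: list[Request], window_seconds: float) -> float:
    """Average requests per second over the observation window."""
    return len(requests) / window_seconds


# Made-up sample: 999 fast successes and one slow 503, observed over 100 seconds.
sample = [Request(200, 120.0)] * 999 + [Request(503, 900.0)]
print(error_rate(sample), availability(sample))
print(latency_percentile(sample, 99), throughput(sample, 100.0))
```

Note that the single slow failure does not move p99 in this sample, because it is only 1 request in 1,000. That is one reason to track error rate and latency as separate SLIs.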

Service Level Objective (SLO)

An SLO is an internal target for an SLI:

“Our API will have 99.9% availability measured over a rolling 30-day window”
“p95 latency will be below 500ms”
“Error rate will stay below 0.1%”

SLOs are targets your team commits to. They’re more aggressive than SLAs and give you a buffer before you violate external promises.
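One way to make targets like these checkable is to store each SLO as data: the SLI it applies to, the threshold, and whether the SLI must stay above or below it. The `SLO` class and its field names below are hypothetical and only meant as a sketch.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class SLO:
    """A target for one SLI (hypothetical structure, for illustration)."""
    sli_name: str
    target: float
    # "min": the SLI must be at least the target (e.g. availability).
    # "max": the SLI must stay below the target (e.g. latency, error rate).
    kind: Literal["min", "max"]

    def is_met(self, measured: float) -> bool:
        """Return True if the measured SLI value satisfies this objective."""
        if self.kind == "min":
            return measured >= self.target
        return measured < self.target


# The three example objectives from above, expressed as data.
availability_slo = SLO("availability_pct", 99.9, "min")
latency_slo = SLO("p95_latency_ms", 500.0, "max")
error_rate_slo = SLO("error_rate_pct", 0.1, "max")

print(availability_slo.is_met(99.95))  # measured value is made up
print(latency_slo.is_met(620.0))       # measured value is made up
```

Recording the direction explicitly avoids a common mistake: applying a "higher is better" comparison to latency or error rate.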

Error budgets

The error budget is the complement of your SLO, the amount of unreliability you’re allowed:

Error budget = 1 − SLO target

For a 99.9% availability SLO over 30 days:

Error budget = 0.1% of 43,200 minutes ≈ 43 minutes of downtime

When the error budget is consumed, the team should prioritize reliability over new features.
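The arithmetic above can be sketched in a few lines. The function names are illustrative, and the sketch assumes a time-based budget where downtime is measured in minutes.

```python
def error_budget_fraction(slo_percent: float) -> float:
    """Error budget = 1 - SLO target, with the target given as a percentage."""
    return 1.0 - slo_percent / 100.0


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime the budget permits over the window."""
    window_minutes = window_days * 24 * 60
    return error_budget_fraction(slo_percent) * window_minutes


def budget_remaining_fraction(
    slo_percent: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Share of the budget still unspent; negative once it is overspent."""
    allowed = allowed_downtime_minutes(slo_percent, window_days)
    if allowed <= 0:
        raise ValueError("a 100% SLO leaves no error budget to spend")
    return 1.0 - downtime_minutes / allowed


print(round(allowed_downtime_minutes(99.9), 1))         # about 43.2 minutes
print(round(budget_remaining_fraction(99.9, 30.0), 2))  # about 0.31 left after 30 minutes down
```

A negative result from `budget_remaining_fraction` means the budget is overspent, which is the point at which reliability work should take priority.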

Service Level Agreement (SLA)

An SLA is a contract with customers that defines minimum service levels and consequences for violations:

“We guarantee 99.5% uptime per month. If we fall below this, affected customers receive a 10% service credit.”

SLAs are legal commitments. They should always be less aggressive than your SLOs, so that a missed SLO acts as an early warning before an SLA is violated.
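A flat-credit clause of this kind reduces to a single comparison. The sketch below is an assumption about how such a clause might be evaluated: the function name, the idea that the credit is a share of the monthly fee, and all numbers in the usage lines are illustrative. An actual contract defines these terms.

```python
def service_credit(
    measured_uptime_pct: float,
    guaranteed_uptime_pct: float,
    credit_pct: float,
    monthly_fee: float,
) -> float:
    """Credit owed for one month: a flat share of the fee if uptime fell below the guarantee."""
    if measured_uptime_pct < guaranteed_uptime_pct:
        return monthly_fee * credit_pct / 100.0
    return 0.0


guarantee = 99.5  # illustrative; use the figure from your own contract
print(service_credit(99.2, guarantee, credit_pct=10.0, monthly_fee=200.0))
print(service_credit(99.7, guarantee, credit_pct=10.0, monthly_fee=200.0))
```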

How they relate

SLI (measurement) → SLO (target) → SLA (promise)

"Our availability is 99.97%" → "We target 99.95%" → "We guarantee 99.9%"

Layer	Audience	Consequence of miss
SLI	Engineering team	Data point for dashboards
SLO	Engineering + product	Trigger error budget policy
SLA	Customers	Financial penalties, credits
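The layering can be read as two thresholds applied to one measurement. The following sketch uses the example figures above (SLO 99.95%, SLA 99.9%); the function name and status strings are made up for illustration.

```python
def reliability_status(sli_pct: float, slo_pct: float, sla_pct: float) -> str:
    """Classify a measured availability against the internal and external thresholds."""
    if slo_pct < sla_pct:
        raise ValueError("the SLO should be at least as strict as the SLA")
    if sli_pct >= slo_pct:
        return "within SLO"
    if sli_pct >= sla_pct:
        return "SLO missed, SLA still met"
    return "SLA breached"


print(reliability_status(99.97, slo_pct=99.95, sla_pct=99.9))
print(reliability_status(99.92, slo_pct=99.95, sla_pct=99.9))
print(reliability_status(99.85, slo_pct=99.95, sla_pct=99.9))
```

The middle state is the buffer: the error budget policy kicks in, but customers are not yet owed anything.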

Setting effective SLOs

Start with user impact

What does the user experience when the SLI degrades? If users don’t notice p99 latency increasing from 200ms to 400ms, an aggressive latency SLO wastes engineering effort.

Use meaningful time windows

Rolling windows (last 30 days) give a continuous view
Calendar windows (this month) align with business reporting
Rolling windows are generally better for engineering decision-making
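The difference between the two window types shows up near a month boundary. The sketch below assumes a hypothetical `Check` record with a timestamp and a pass/fail flag, and all timestamps are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Check:
    """One monitoring check result (hypothetical record shape)."""
    at: datetime
    passed: bool


def rolling_window(now: datetime, days: int = 30) -> tuple[datetime, datetime]:
    """The last `days` days, ending at `now`."""
    return now - timedelta(days=days), now


def calendar_month_window(now: datetime) -> tuple[datetime, datetime]:
    """From the first instant of the current month up to `now`."""
    start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    return start, now


def availability(checks: list[Check], window: tuple[datetime, datetime]) -> float:
    """Percentage of passing checks whose timestamp falls inside the window."""
    start, end = window
    in_window = [c for c in checks if start <= c.at <= end]
    if not in_window:
        raise ValueError("no checks in window")
    return 100.0 * sum(c.passed for c in in_window) / len(in_window)


utc = timezone.utc
now = datetime(2024, 3, 2, 12, 0, tzinfo=utc)
checks = [
    Check(datetime(2024, 2, 28, 12, 0, tzinfo=utc), False),  # failure late last month
    Check(datetime(2024, 2, 29, 12, 0, tzinfo=utc), True),
    Check(datetime(2024, 3, 1, 12, 0, tzinfo=utc), True),
    Check(datetime(2024, 3, 2, 6, 0, tzinfo=utc), True),
]

print(availability(checks, rolling_window(now)))         # still includes the February failure
print(availability(checks, calendar_month_window(now)))  # only sees March checks
```

In this sample the calendar window resets on March 1 and no longer reflects a failure from three days earlier, while the rolling window still counts it. That is why rolling windows tend to suit engineering decisions.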

Not everything needs a 99.99%

Service type	Typical availability target
Payment processing	99.99%
User-facing API	99.9%
Admin dashboard	99.5%
Development tools	99%
Batch processing	“Completes within SLA window”
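To see what each tier costs, it helps to convert the percentages into permitted downtime. This sketch assumes a 30-day window and time-based availability; the targets are the ones from the table above.

```python
TARGETS = {
    "Payment processing": 99.99,
    "User-facing API": 99.9,
    "Admin dashboard": 99.5,
    "Development tools": 99.0,
}

WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window


def downtime_allowance(target_pct: float) -> float:
    """Minutes of downtime a target permits over 30 days."""
    return (100.0 - target_pct) / 100.0 * WINDOW_MINUTES


for service, target in TARGETS.items():
    minutes = downtime_allowance(target)
    print(f"{service}: {target}% allows about {minutes:.1f} min per 30 days")
```

The allowances come out to roughly 4.3, 43.2, 216, and 432 minutes. Each extra nine cuts the allowance by a factor of ten, so a 99.99% target leaves only a few minutes per month for deploys, failovers, and incidents combined.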

Monitoring SLIs with DevHelm

Uptime monitors provide the data for availability and latency SLIs:

Availability SLI = percentage of passing checks over the window
Latency SLI = response time percentile from check results

Uptime reporting guide

Generate uptime reports from monitoring data.

MTTR & MTTD

Incident response metrics that feed into SLOs.
