Service Level Indicator (SLI)
An SLI is a quantitative measurement of a specific aspect of your service:

| SLI | What it measures | Example |
|---|---|---|
| Availability | Percentage of successful requests | 99.95% of HTTP checks pass |
| Latency | Response time distribution | p99 latency is 800ms |
| Error rate | Percentage of failed requests | 0.1% of API calls return 5xx |
| Throughput | Requests processed per second | 10,000 req/s sustained |
Service Level Objective (SLO)
An SLO is an internal target for an SLI:

- “Our API will have 99.9% availability measured over a rolling 30-day window”
- “p95 latency will be below 500ms”
- “Error rate will stay below 0.1%”
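The targets above can be written down as data and checked against measured SLI values. The following is a minimal sketch, not a prescribed format: the `Slo` class, the metric names, and the `evaluate` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """One objective for one SLI (hypothetical structure for illustration)."""
    name: str
    target: float
    higher_is_better: bool  # True for availability, False for latency or error rate

    def is_met(self, measured: float) -> bool:
        # Availability must reach the target; latency and error rate must stay below it.
        if self.higher_is_better:
            return measured >= self.target
        return measured < self.target


# The three example objectives from the list above.
SLOS = [
    Slo("availability_pct", 99.9, higher_is_better=True),
    Slo("p95_latency_ms", 500.0, higher_is_better=False),
    Slo("error_rate_pct", 0.1, higher_is_better=False),
]


def evaluate(measured: dict[str, float]) -> dict[str, bool]:
    """Return, per SLO name, whether the measured SLI value meets the target."""
    return {slo.name: slo.is_met(measured[slo.name]) for slo in SLOS}


if __name__ == "__main__":
    print(evaluate({
        "availability_pct": 99.95,
        "p95_latency_ms": 520.0,
        "error_rate_pct": 0.05,
    }))
```

In this example the availability and error-rate objectives are met, while the latency objective is missed because 520 ms is not below 500 ms.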
Error budgets
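The error budget is the complement of your SLO: the amount of unreliability you’re allowed within the measurement window. For example, with a 99.9% availability SLO over a rolling 30-day window:

- Error budget = 100% − 99.9% = 0.1% of 43,200 minutes ≈ 43 minutes of downtime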
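The arithmetic behind the 43-minute figure can be sketched as follows, assuming a 99.9% availability SLO and a 30-day window. The function names are illustrative only.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining_minutes(
    slo_percent: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Budget left after the downtime already observed in the window.

    A negative result means the budget is exhausted and the SLO is missed.
    """
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


if __name__ == "__main__":
    print(round(error_budget_minutes(99.9), 1))          # 43.2
    print(round(budget_remaining_minutes(99.9, 30), 1))  # 13.2
```

Tracking the remaining budget rather than the raw SLI makes it easier to see how much room is left before the objective is missed.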
Service Level Agreement (SLA)
An SLA is a contract with customers that defines minimum service levels and consequences for violations:

- “We guarantee 99.95% uptime per month. If we fall below this, affected customers receive a 10% service credit.”
How they relate
| Layer | Audience | Consequence of miss |
|---|---|---|
| SLI | Engineering team | Data point for dashboards |
| SLO | Engineering + product | Trigger error budget policy |
| SLA | Customers | Financial penalties, credits |
Setting effective SLOs
Start with user impact
What does the user experience when the SLI degrades? If users don’t notice p99 latency increasing from 200ms to 400ms, an aggressive latency SLO wastes engineering effort.
Use meaningful time windows
- Rolling windows (last 30 days) give a continuous view
- Calendar windows (this month) align with business reporting
- Rolling windows are generally better for engineering decision-making
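The difference between the two window types is easiest to see on the same data. The sketch below assumes check results are available as `(timestamp, passed)` pairs; that shape and the function names are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def rolling_window(now: datetime, days: int = 30) -> tuple[datetime, datetime]:
    """The last `days` days, ending now."""
    return now - timedelta(days=days), now


def calendar_month_window(now: datetime) -> tuple[datetime, datetime]:
    """From the first instant of the current month up to now."""
    start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    return start, now


def availability(checks, start: datetime, end: datetime):
    """Percentage of passing checks inside [start, end], or None if there are none."""
    in_window = [passed for ts, passed in checks if start <= ts <= end]
    if not in_window:
        return None
    return 100.0 * sum(in_window) / len(in_window)


if __name__ == "__main__":
    utc = timezone.utc
    now = datetime(2024, 3, 3, 12, 0, tzinfo=utc)
    checks = [
        (datetime(2024, 2, 20, 6, 0, tzinfo=utc), False),  # failure late last month
        (datetime(2024, 3, 1, 6, 0, tzinfo=utc), True),
        (datetime(2024, 3, 2, 6, 0, tzinfo=utc), True),
    ]
    print(availability(checks, *rolling_window(now)))         # includes the February failure
    print(availability(checks, *calendar_month_window(now)))  # starts fresh on March 1
```

Here the calendar window reports 100% because it resets at the start of the month, while the rolling window still reflects the failure from twelve days earlier. That continuity is one reason rolling windows suit engineering decisions.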
Not everything needs 99.99%
| Service type | Typical availability target |
|---|---|
| Payment processing | 99.99% |
| User-facing API | 99.9% |
| Admin dashboard | 99.5% |
| Development tools | 99% |
| Batch processing | “Completes within SLA window” |
Monitoring SLIs with DevHelm
Uptime monitors provide the data for availability and latency SLIs:

- Availability SLI = percentage of passing checks over the window
- Latency SLI = response time percentile from check results
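Both calculations can be sketched over a list of check results. This is not DevHelm's API: the `CheckResult` record and its fields are a hypothetical shape for exported check data, and the percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """One uptime check (hypothetical record shape, not a DevHelm type)."""
    passed: bool
    response_ms: float


def availability_sli(results: list[CheckResult]) -> float:
    """Percentage of passing checks over the supplied window."""
    if not results:
        raise ValueError("no check results in window")
    return 100.0 * sum(r.passed for r in results) / len(results)


def latency_percentile(results: list[CheckResult], pct: float) -> float:
    """Nearest-rank percentile of response times, in milliseconds.

    Uses every result, including failed checks; excluding them is an
    equally valid choice that should be made explicitly.
    """
    timings = sorted(r.response_ms for r in results)
    if not timings:
        raise ValueError("no check results in window")
    rank = max(1, math.ceil(pct / 100.0 * len(timings)))
    return timings[rank - 1]


if __name__ == "__main__":
    window = [
        CheckResult(True, 100.0),
        CheckResult(True, 200.0),
        CheckResult(False, 900.0),
        CheckResult(True, 300.0),
    ]
    print(availability_sli(window))        # 75.0
    print(latency_percentile(window, 50))  # 200.0
    print(latency_percentile(window, 95))  # 900.0
```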
Uptime reporting guide
Generate uptime reports from monitoring data.
MTTR & MTTD
Incident response metrics that feed into SLOs.