Mean Time to Detect (MTTD)
The average time between when an incident starts and when your team becomes aware of it.
What affects MTTD
| Factor | Impact |
|---|---|
| Check frequency | 30s checks detect faster than 5min checks |
| Alert routing | Direct PagerDuty pages are noticed faster than email |
| Monitoring coverage | Unmonitored services have effectively unbounded MTTD: failures surface only when someone reports them |
| Confirmation windows | Multi-region confirmation adds detection time |
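
To make the interaction between check frequency and confirmation windows concrete, here is a rough sketch that estimates a worst-case detection delay. The function name and parameters are illustrative, and it assumes that an alert fires only after a number of consecutive failed checks; your monitoring tool may confirm failures differently (for example, by checking from several regions in parallel).

```python
def worst_case_detection_seconds(check_interval_s: float,
                                 confirmations: int = 1,
                                 alert_delivery_s: float = 0.0) -> float:
    """Upper bound on the time from failure start to alert delivery.

    Assumption (hypothetical model): an alert fires only after
    `confirmations` consecutive failed checks, spaced
    `check_interval_s` seconds apart.
    """
    if check_interval_s <= 0 or confirmations < 1:
        raise ValueError("interval must be positive and confirmations >= 1")
    # Worst case: the failure begins right after a passing check, so a full
    # interval elapses before the first failed check, and every additional
    # confirmation costs one more interval.
    return check_interval_s * confirmations + alert_delivery_s


print(worst_case_detection_seconds(30))      # 30.0
print(worst_case_detection_seconds(300))     # 300.0
print(worst_case_detection_seconds(30, 3))   # 90.0
```

Under this model, a 30s check with three confirmations still detects faster than a single 5min check.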
Improving MTTD
- Increase check frequency for critical services
- Monitor all user-facing endpoints (not just the main one)
- Use aggressive alerting (short confirmation windows, direct paging) for revenue-critical paths
- Track third-party dependencies so upstream failures are detected automatically
Mean Time to Resolve (MTTR)
The average time between when an incident is detected and when the service is fully restored.
What affects MTTR
| Factor | Impact |
|---|---|
| Playbooks | Documented procedures speed up resolution |
| On-call response time | Faster acknowledgement means faster start |
| Rollback capability | A one-click rollback restores service faster than a manual fix |
| System complexity | Distributed architectures such as microservices usually take longer to debug than a monolith |
| Monitoring detail | Rich context (logs, traces, check results) reduces investigation |
Improving MTTR
- Write playbooks for common failure modes
- Invest in fast rollback mechanisms
- Reduce mean time to acknowledge (MTTA) with clear escalation
- Provide context in alerts (which monitor, which region, recent changes)
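
As one possible illustration of the last point, the sketch below builds an alert message that carries the monitor, region, and recent changes. The `AlertContext` class, its field names, and the sample values are hypothetical and not tied to any particular alerting tool.

```python
from dataclasses import dataclass, field


@dataclass
class AlertContext:
    # Hypothetical fields mirroring the context suggested above.
    monitor: str
    region: str
    recent_changes: list = field(default_factory=list)


def format_alert(ctx: AlertContext) -> str:
    """Render the context as a short, human-readable alert body."""
    lines = [f"Monitor: {ctx.monitor}", f"Region: {ctx.region}"]
    # Only mention recent changes when there are any to report.
    if ctx.recent_changes:
        lines.append("Recent changes: " + "; ".join(ctx.recent_changes))
    return "\n".join(lines)


ctx = AlertContext("checkout-api", "eu-west", ["deploy at 14:02"])
print(format_alert(ctx))
```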
Related metrics
MTTA (Mean Time to Acknowledge)
Time from alert to human acknowledgement. Tracks on-call responsiveness.
MTBF (Mean Time Between Failures)
Time between incidents. Higher is better: it indicates system stability.
Failure rate
Percentage of checks that fail over a time period. Tracks overall reliability trend.
Tracking these metrics
| Metric | How to measure |
|---|---|
| MTTD | Monitor timestamp vs incident creation timestamp |
| MTTA | Incident creation vs first human response |
| MTTR | Incident creation vs resolution timestamp |
| MTBF | Time between consecutive incident resolutions |
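
The measurements in the table can be computed from incident timestamps along the following lines. This is a minimal sketch: the `Incident` class and its field names are assumptions, so map them to whatever your incident tracker exports. It treats the first failed check as the incident start and, following the table, measures MTBF between consecutive resolutions.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # Hypothetical field names; adapt them to your own incident data.
    started_at: datetime       # monitor timestamp (first failed check)
    created_at: datetime       # incident created, i.e. detected
    acknowledged_at: datetime  # first human response
    resolved_at: datetime      # service fully restored


def _mean_minutes(deltas) -> float:
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def incident_metrics(incidents) -> dict:
    """Return MTTD, MTTA, MTTR (and MTBF when possible) in minutes."""
    if not incidents:
        raise ValueError("at least one incident is required")
    ordered = sorted(incidents, key=lambda i: i.started_at)
    metrics = {
        "MTTD": _mean_minutes([i.created_at - i.started_at for i in ordered]),
        "MTTA": _mean_minutes([i.acknowledged_at - i.created_at for i in ordered]),
        "MTTR": _mean_minutes([i.resolved_at - i.created_at for i in ordered]),
    }
    # MTBF needs at least two incidents: average the gap between
    # consecutive resolution timestamps.
    gaps = [b.resolved_at - a.resolved_at for a, b in zip(ordered, ordered[1:])]
    if gaps:
        metrics["MTBF"] = _mean_minutes(gaps)
    return metrics


incidents = [
    Incident(datetime(2024, 1, 15, 10, 0), datetime(2024, 1, 15, 10, 2),
             datetime(2024, 1, 15, 10, 7), datetime(2024, 1, 15, 10, 32)),
    Incident(datetime(2024, 1, 15, 14, 0), datetime(2024, 1, 15, 14, 4),
             datetime(2024, 1, 15, 14, 7), datetime(2024, 1, 15, 14, 24)),
]
print(incident_metrics(incidents))
# {'MTTD': 3.0, 'MTTA': 4.0, 'MTTR': 25.0, 'MTBF': 232.0}
```

Means are sensitive to outliers, so a single long incident can dominate these numbers; reporting a median alongside them is a reasonable extension.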
Benchmarks
These vary widely by industry and team maturity:
| Metric | Good | Great |
|---|---|---|
| MTTD | < 5 minutes | < 1 minute |
| MTTA | < 10 minutes | < 5 minutes |
| MTTR | < 1 hour | < 15 minutes |
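
If you want to compare your own numbers against these benchmarks, a small lookup like the one below may help. The thresholds are copied from the table; the function name and the "needs work" label for values outside the "Good" range are illustrative choices, not part of any standard.

```python
# Thresholds in minutes, copied from the benchmark table: (good, great).
BENCHMARKS = {
    "MTTD": (5, 1),
    "MTTA": (10, 5),
    "MTTR": (60, 15),
}


def rate(metric: str, minutes: float) -> str:
    """Classify a measured value against the benchmark table."""
    good, great = BENCHMARKS[metric]
    if minutes < great:
        return "great"
    if minutes < good:
        return "good"
    # Hypothetical label for anything at or above the "Good" threshold.
    return "needs work"


print(rate("MTTD", 3.0))   # good
print(rate("MTTR", 12))    # great
print(rate("MTTA", 12))    # needs work
```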
DORA metrics
Broader software delivery performance metrics.
SLA/SLO/SLI
Service level agreements, objectives, and indicators.