MTTR and MTTD Explained

MTTD measures how quickly you spot problems. MTTR measures how quickly you fix them. Together they define your incident response effectiveness.

Mean Time to Detect (MTTD)

The average time between when an incident starts and when your team becomes aware of it.

MTTD = average of [(time alert fired) − (time problem actually started)] across all incidents

What affects MTTD

Factor	Impact
Check frequency	30s checks detect faster than 5min checks
Alert routing	Direct PagerDuty pages are noticed faster than email
Monitoring coverage	Unmonitored services are detected only when someone reports them, so MTTD is effectively unbounded
Confirmation windows	Multi-region confirmation adds detection time
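The check frequency and confirmation rows above can be put into rough numbers. The following is a minimal sketch, assuming a simplified polling model in which a check runs every `interval_s` seconds and an alert fires only after `confirmations` consecutive failed checks. The function name and the model are illustrative assumptions, not the behavior of any particular monitoring product, and the sketch ignores check duration and alert delivery time.

```python
def worst_case_detection_seconds(interval_s: float, confirmations: int = 1) -> float:
    """Upper bound on detection delay under a simple polling model.

    Assumes (hypothetically) that the failure begins just after a passing
    check, so the first failed check lands one full interval later, and each
    additional required confirmation costs one more interval.
    """
    if interval_s <= 0 or confirmations < 1:
        raise ValueError("interval_s must be positive and confirmations >= 1")
    return interval_s * confirmations


# Compare the two check frequencies from the table, with and without confirmation.
for interval in (30, 300):
    for confirmations in (1, 3):
        delay = worst_case_detection_seconds(interval, confirmations)
        print(f"{interval}s checks, {confirmations} confirmation(s): up to {delay:.0f}s")
```

Under this model, 30s checks with a single confirmation bound detection at about 30 seconds, while 5min checks that need three confirmations can take up to 15 minutes.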

Improving MTTD

Increase check frequency for critical services
Monitor all user-facing endpoints (not just the main one)
Use more aggressive alerting (tighter thresholds, shorter confirmation windows) for revenue-critical paths
Track third-party dependencies so upstream failures are detected automatically

Mean Time to Resolve (MTTR)

The average time between when an incident is detected and when the service is fully restored.

MTTR = average of [(time service restored) − (time alert fired)] across all incidents

MTTR includes triage, investigation, mitigation, and verification.
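Because MTTR spans several phases, recording when each phase ends shows where the time actually goes. The following is a minimal sketch for a single incident; the timeline values are made up for illustration, and treating each timestamp as the moment a phase finished is an assumption.

```python
from datetime import datetime

# Hypothetical timeline for one incident. Each timestamp marks when that
# phase finished; the phase names follow the list above.
timeline = [
    ("alert fired", datetime(2024, 1, 10, 9, 0)),
    ("triage", datetime(2024, 1, 10, 9, 5)),
    ("investigation", datetime(2024, 1, 10, 9, 25)),
    ("mitigation", datetime(2024, 1, 10, 9, 35)),
    ("verification", datetime(2024, 1, 10, 9, 40)),
]


def phase_minutes(events):
    """Return {phase: minutes spent}, measuring each phase from the previous event."""
    durations = {}
    for (_, started), (phase, ended) in zip(events, events[1:]):
        durations[phase] = (ended - started).total_seconds() / 60
    return durations


breakdown = phase_minutes(timeline)
time_to_resolve = sum(breakdown.values())
slowest = max(breakdown, key=breakdown.get)
print(breakdown)
print(f"time to resolve: {time_to_resolve:.0f} min, slowest phase: {slowest}")
```

Averaging the per-phase durations across many incidents points to the phase where playbooks or better tooling would help most.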

What affects MTTR

Factor	Impact
Playbooks	Documented procedures speed up resolution
On-call response time	Faster acknowledgement means faster start
Rollback capability	One-click rollback restores service faster than a manual fix
System complexity	Distributed systems such as microservices are typically harder to debug than monoliths
Monitoring detail	Rich context (logs, traces, check results) reduces investigation

Improving MTTR

Write playbooks for common failure modes
Invest in fast rollback mechanisms
Reduce mean time to acknowledge (MTTA) with clear escalation
Provide context in alerts (which monitor, which region, recent changes)

Related metrics

MTTA (Mean Time to Acknowledge)

The average time from an alert firing to a human acknowledging it. Tracks on-call responsiveness.

MTBF (Mean Time Between Failures)

The average time between incidents. Higher is better: it indicates a more stable system.

Failure rate

Percentage of checks that fail over a time period. Tracks overall reliability trend.
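As a rough illustration, failure rate can be computed directly from a list of check outcomes. This is a minimal sketch, assuming each check result is available as a boolean (`True` means the check passed); the function name and sample data are hypothetical.

```python
def failure_rate(check_results):
    """Percentage of failed checks; check_results is an iterable of booleans (True = passed)."""
    results = list(check_results)
    if not results:
        raise ValueError("no check results in this period")
    failed = sum(1 for passed in results if not passed)
    return 100 * failed / len(results)


# Hypothetical sample: 2 failures out of 8 checks.
sample = [True, True, False, True, True, True, False, True]
print(f"{failure_rate(sample):.1f}%")
```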

Tracking these metrics

Metric	How to measure
MTTD	Monitor timestamp (first failed check) vs incident creation timestamp
MTTA	Incident creation vs first human response
MTTR	Incident creation vs resolution timestamp
MTBF	Time between consecutive incident resolutions
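The timestamp pairs in the table can be turned into numbers with a few lines of code. The following is a minimal sketch, assuming each incident record carries four timestamps; the `Incident` class and its field names are hypothetical and would need to be mapped to whatever your incident tooling exports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Field names are hypothetical; map them to your own tooling's export.
    monitor_failed_at: datetime  # monitor timestamp
    created_at: datetime         # incident creation
    acknowledged_at: datetime    # first human response
    resolved_at: datetime        # resolution timestamp


def mean_minutes(deltas):
    """Average an iterable of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def incident_metrics(incidents):
    """Compute the four metrics using the timestamp pairs from the table above."""
    ordered = sorted(incidents, key=lambda i: i.resolved_at)
    metrics = {
        "MTTD": mean_minutes(i.created_at - i.monitor_failed_at for i in ordered),
        "MTTA": mean_minutes(i.acknowledged_at - i.created_at for i in ordered),
        "MTTR": mean_minutes(i.resolved_at - i.created_at for i in ordered),
    }
    # MTBF needs at least two incidents to form one gap between resolutions.
    gaps = [b.resolved_at - a.resolved_at for a, b in zip(ordered, ordered[1:])]
    metrics["MTBF"] = mean_minutes(gaps) if gaps else None
    return metrics


# Two made-up incidents, two days apart.
t0 = datetime(2024, 1, 10, 9, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=2),
             t0 + timedelta(minutes=6), t0 + timedelta(minutes=32)),
    Incident(t0 + timedelta(days=2), t0 + timedelta(days=2, minutes=4),
             t0 + timedelta(days=2, minutes=6), t0 + timedelta(days=2, minutes=54)),
]
print(incident_metrics(incidents))  # all values are in minutes
```

Sorting by resolution time keeps the MTBF gaps positive even if the records arrive out of order.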

Benchmarks

These vary widely by industry and team maturity:

Metric	Good	Great
MTTD	< 5 minutes	< 1 minute
MTTA	< 10 minutes	< 5 minutes
MTTR	< 1 hour	< 15 minutes

The goal isn’t perfection but continuous improvement. Track trends over months, not individual incidents.
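One way to follow the trend is to rate each month's averages against the benchmark table. This is a minimal sketch: the thresholds are copied from the table, while the `rate` function, the "needs work" label, and the sample values are assumptions made for illustration.

```python
# Thresholds in minutes, copied from the benchmark table: (good, great).
BENCHMARKS = {
    "MTTD": (5, 1),
    "MTTA": (10, 5),
    "MTTR": (60, 15),
}


def rate(metric: str, minutes: float) -> str:
    """Label a monthly average as 'great', 'good', or 'needs work'."""
    good, great = BENCHMARKS[metric]
    if minutes < great:
        return "great"
    if minutes < good:
        return "good"
    return "needs work"  # hypothetical label for anything outside the table


# Hypothetical monthly averages, in minutes.
monthly = {"MTTD": 3.0, "MTTA": 4.0, "MTTR": 75.0}
for metric, value in monthly.items():
    print(metric, rate(metric, value))
```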

DORA metrics

Broader software delivery performance metrics.

SLA/SLO/SLI

Service level objectives and indicators.
