MTTD measures how quickly you spot problems. MTTR measures how quickly you fix them. Together they define your incident response effectiveness.

Mean Time to Detect (MTTD)

The average time between when an incident starts and when your team becomes aware of it.
For a single incident: time to detect = (time alert fired) − (time problem actually started). MTTD is the average of that value across incidents.
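
Written out for N incidents, the definition above is an average of per-incident detection delays. The symbols are notation introduced here for illustration, and the numbers in the second line are made up:

```latex
\[
\mathrm{MTTD} = \frac{1}{N} \sum_{i=1}^{N} \left( t_i^{\text{alert}} - t_i^{\text{start}} \right)
\]
% Illustrative only: three incidents detected after 2, 4, and 9 minutes
\[
\mathrm{MTTD} = \frac{2 + 4 + 9}{3} = 5 \text{ minutes}
\]
```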

What affects MTTD

| Factor | Impact |
| --- | --- |
| Check frequency | 30s checks detect faster than 5min checks |
| Alert routing | Direct PagerDuty pages are noticed faster than email |
| Monitoring coverage | Unmonitored services have infinite MTTD |
| Confirmation windows | Multi-region confirmation adds detection time |
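
To get a feel for how check frequency and confirmation interact, here is a rough sketch. It assumes a simplified model that is not taken from any particular monitoring tool: checks run at a fixed interval, a failure starts at a random point between two checks, and the alert fires after a given number of consecutive failed checks. The function name and parameters are hypothetical, and check duration and notification delivery are ignored.

```python
def detection_delay_seconds(check_interval_s: float, confirmations: int = 1) -> tuple[float, float]:
    """Return (expected, worst_case) seconds from failure start to alert.

    Assumed model: the failure begins uniformly at random between two checks,
    and the alert fires on the `confirmations`-th consecutive failed check.
    """
    if check_interval_s <= 0 or confirmations < 1:
        raise ValueError("interval must be positive and confirmations >= 1")
    # Worst case: the failure starts right after a passing check.
    worst_case = confirmations * check_interval_s
    # On average the first failed check arrives half an interval after the failure.
    expected = (confirmations - 0.5) * check_interval_s
    return expected, worst_case


fast = detection_delay_seconds(30)                        # 30s checks
slow = detection_delay_seconds(300)                       # 5min checks
confirmed = detection_delay_seconds(30, confirmations=3)  # 30s checks, 3 confirmations
print(fast, slow, confirmed)
```

Under these assumptions, requiring three confirmations at a 30s interval still detects faster than a single 5min check, which is one way to weigh false-positive protection against detection time.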

Improving MTTD

  • Increase check frequency for critical services
  • Monitor all user-facing endpoints (not just the main one)
  • Use aggressive alerting for revenue-critical paths
  • Track third-party dependencies so upstream failures are detected automatically

Mean Time to Resolve (MTTR)

The average time between when an incident is detected and when the service is fully restored.
For a single incident: time to resolve = (time service restored) − (time alert fired). MTTR is the average of that value across incidents.
MTTR includes triage, investigation, mitigation, and verification.
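
Because resolution time is made up of those phases, breaking it down per phase can show where the time actually goes. The sketch below uses hypothetical phase durations in minutes; the data layout is an assumption for illustration, not a format any specific tool exports.

```python
from statistics import mean

PHASES = ("triage", "investigation", "mitigation", "verification")

# Hypothetical per-incident phase durations in minutes (illustrative values).
incidents = [
    {"triage": 5, "investigation": 30, "mitigation": 10, "verification": 5},
    {"triage": 3, "investigation": 12, "mitigation": 20, "verification": 5},
    {"triage": 10, "investigation": 40, "mitigation": 5, "verification": 5},
]

# Time to resolve one incident is the sum of its phases.
time_to_resolve = [sum(incident[p] for p in PHASES) for incident in incidents]
mttr_minutes = mean(time_to_resolve)

# The average per phase points at the phase worth improving first.
mean_by_phase = {p: mean(incident[p] for incident in incidents) for p in PHASES}
slowest_phase = max(mean_by_phase, key=mean_by_phase.get)

print(f"MTTR: {mttr_minutes} min, slowest phase: {slowest_phase}")
```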

What affects MTTR

| Factor | Impact |
| --- | --- |
| Playbooks | Documented procedures speed up resolution |
| On-call response time | Faster acknowledgement means faster start |
| Rollback capability | One-click rollback vs manual fix |
| System complexity | Microservices are harder to debug than monoliths |
| Monitoring detail | Rich context (logs, traces, check results) reduces investigation |

Improving MTTR

  • Write playbooks for common failure modes
  • Invest in fast rollback mechanisms
  • Reduce mean time to acknowledge (MTTA) with clear escalation
  • Provide context in alerts (which monitor, which region, recent changes)

MTTA (Mean Time to Acknowledge)

The average time from when an alert fires to when a human acknowledges it. It tracks on-call responsiveness.

MTBF (Mean Time Between Failures)

The average time between incidents. Higher is better, because it indicates a more stable system.
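
As a minimal sketch, MTBF can be computed as the average gap between consecutive incident timestamps. This example uses resolution timestamps, following the measurement suggested in the tracking table on this page; the timestamps and the function name are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical resolution timestamps for four incidents (illustrative values).
resolved_at = [
    datetime(2024, 1, 3, 9, 0),
    datetime(2024, 1, 10, 9, 0),
    datetime(2024, 1, 24, 9, 0),
    datetime(2024, 2, 14, 9, 0),
]


def mean_time_between(timestamps: list[datetime]) -> timedelta:
    """Average gap between consecutive timestamps, in chronological order."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        raise ValueError("need at least two incidents to measure a gap")
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


mtbf = mean_time_between(resolved_at)
print(mtbf.days, "days")
```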

Failure rate

The percentage of checks that fail over a given time period. It tracks the overall reliability trend.
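
The calculation is a simple ratio. In this sketch each check result is represented as a boolean (True for a passing check), which is an assumption about how results are stored; the counts are illustrative.

```python
def failure_rate_percent(check_results: list[bool]) -> float:
    """Percentage of failed checks; each entry is True for a passing check."""
    if not check_results:
        raise ValueError("no checks in the period")
    failed = sum(1 for ok in check_results if not ok)
    return 100.0 * failed / len(check_results)


# Hypothetical day of 30s checks: 2,880 checks in total, 36 of them failed.
results = [True] * 2844 + [False] * 36
rate = failure_rate_percent(results)
print(f"{rate}%")
```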

Tracking these metrics

| Metric | How to measure |
| --- | --- |
| MTTD | Monitor timestamp vs incident creation timestamp |
| MTTA | Incident creation vs first human response |
| MTTR | Incident creation vs resolution timestamp |
| MTBF | Time between consecutive incident resolutions |
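
The timestamp pairs above map naturally onto one record per incident. The following sketch assumes a hypothetical `Incident` record with four timestamps; the field names are illustrative and would need to be mapped to whatever your incident tooling actually stores.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical field names, chosen to mirror the table above.
    first_failed_check: datetime  # the "monitor timestamp"
    created: datetime             # incident creation
    acknowledged: datetime        # first human response
    resolved: datetime            # resolution timestamp


def _mean(deltas: list[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)


def summarize(incidents: list[Incident]) -> dict[str, timedelta]:
    """Compute MTTD, MTTA, and MTTR from incident timestamps."""
    if not incidents:
        raise ValueError("no incidents to summarize")
    return {
        "MTTD": _mean([i.created - i.first_failed_check for i in incidents]),
        "MTTA": _mean([i.acknowledged - i.created for i in incidents]),
        "MTTR": _mean([i.resolved - i.created for i in incidents]),
    }


# Two illustrative incidents, expressed as offsets from an arbitrary start time.
t = datetime(2024, 3, 1, 12, 0)
m = lambda minutes: t + timedelta(minutes=minutes)
incidents = [
    Incident(m(0), m(2), m(6), m(42)),
    Incident(m(0), m(4), m(12), m(64)),
]
summary = summarize(incidents)
print(summary)
```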

Benchmarks

These vary widely by industry and team maturity:
| Metric | Good | Great |
| --- | --- | --- |
| MTTD | < 5 minutes | < 1 minute |
| MTTA | < 10 minutes | < 5 minutes |
| MTTR | < 1 hour | < 15 minutes |
The goal isn’t perfection — it’s continuous improvement. Track trends over months, not individual incidents.
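
One way to look at trends rather than single incidents is to group resolution times by calendar month. This is a minimal sketch with made-up data; the `(resolved_at, minutes_to_resolve)` pair layout and the function name are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical (resolved_at, minutes_to_resolve) pairs (illustrative values).
records = [
    (datetime(2024, 1, 5), 80),
    (datetime(2024, 1, 20), 60),
    (datetime(2024, 2, 11), 50),
    (datetime(2024, 2, 25), 40),
    (datetime(2024, 3, 9), 30),
]


def monthly_mttr(records):
    """Return {"YYYY-MM": mean minutes to resolve}, ordered by month."""
    by_month = defaultdict(list)
    for resolved_at, minutes in records:
        by_month[resolved_at.strftime("%Y-%m")].append(minutes)
    return {month: mean(values) for month, values in sorted(by_month.items())}


trend = monthly_mttr(records)
for month, minutes in trend.items():
    print(month, minutes)
```

The same grouping applies to MTTD and MTTA by swapping in the corresponding per-incident durations.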
