Looking beyond MTTR

Stuart Rimell

August 25, 2023


In the previous post, we explored how the ubiquitous MTTR (Mean Time To Restore/Recover) metric may not be as useful as you’d hope if your aim is to measure IT incident response effectiveness. We described how Courtney Nash, Stepan Davidovic and others demonstrated that the typically skewed statistical distribution of organisational incident resolution times renders the mean (the 'M' in MTTR) unhelpful, and even misleading, as a signal of change in performance over time.

So if MTTR isn’t useful, what else can we measure to demonstrate the effectiveness, or otherwise, of our efforts to improve resilience?

First, the bad news. Distilling a complex socio-technical phenomenon such as incident response into a single number is likely to be an unfulfilling exercise. The huge variety of factors contributing to IT incidents tends to render the signal of incident response performance invisible amongst the noise of variation outside of our control.

If a single metric such as MTTR is too blunt an instrument, perhaps a collection of measures will be more effective? There are plenty of 'MTTx' measures that we could deploy in aggregate to provide a more nuanced illustration of incident response. Examples include:

MTTD - Mean Time To Detect: The time between an incident starting and the organisation detecting it. Detecting an incident quickly is better than slowly.
MTTA - Mean Time To Acknowledge: The time between an incident starting and responders starting to work on the response. Starting the response quickly is better than slowly, and indicates an ability to assemble the response effort effectively.
MTBF - Mean Time Between Failures: The time between incidents. This is less a measure of incident response performance and more a measure of stability.
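
As a rough illustration, the three measures listed above could be computed from incident timestamps along the following lines. This is a minimal sketch, not a reference implementation: the `Incident` record, its field names and the sample data are all hypothetical, and MTBF is simplified here to the gap between consecutive incident start times. The median is reported next to the mean to show how far a single slow incident can pull the two apart.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median


@dataclass
class Incident:
    # Hypothetical record shape; field names are illustrative only.
    started: datetime       # when the incident began
    detected: datetime      # when the organisation detected it
    acknowledged: datetime  # when responders started working on it


def minutes(delta: timedelta) -> float:
    """Convert a timedelta to minutes."""
    return delta.total_seconds() / 60


def summarise(incidents: list[Incident]) -> dict[str, float]:
    """Compute mean and median detection/acknowledgement times, plus mean
    time between incident starts, all in minutes."""
    ordered = sorted(incidents, key=lambda i: i.started)
    detect = [minutes(i.detected - i.started) for i in ordered]
    ack = [minutes(i.acknowledged - i.started) for i in ordered]
    # Assumption: "time between failures" is measured start-to-start.
    gaps = [minutes(b.started - a.started) for a, b in zip(ordered, ordered[1:])]
    return {
        "mttd_mean": mean(detect),
        "mttd_median": median(detect),
        "mtta_mean": mean(ack),
        "mtta_median": median(ack),
        "mtbf_mean": mean(gaps) if gaps else float("nan"),
    }


def at(day: int, minute: int = 0) -> datetime:
    """Helper for building made-up sample timestamps."""
    return datetime(2023, 1, 1) + timedelta(days=day, minutes=minute)


# Made-up sample data: two quick detections and one slow outlier.
example = [
    Incident(started=at(0), detected=at(0, 5), acknowledged=at(0, 10)),
    Incident(started=at(1), detected=at(1, 7), acknowledged=at(1, 15)),
    Incident(started=at(3), detected=at(3, 120), acknowledged=at(3, 130)),
]
summary = summarise(example)
# Mean detection time is 44 minutes, while the median is 7 minutes.
```

With only three incidents, the one slow detection moves the mean well away from what a typical incident looked like, which is the same skew problem described for MTTR.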

This InfoQ article does a good job of dissecting such metrics. However, it’s unclear without further analysis whether such metrics are less susceptible to the statistical variance that renders MTTR futile. It is clear that these metrics are lagging indicators, and that, as means, they need large sample sizes before they settle into a reliable value. The last thing we want is a large number of incidents just to obtain a trustworthy average.

Incident response lends itself more naturally to qualitative analysis. This answer is unlikely to satisfy those who crave the deterministic comfort of a single number, or a graph that trends in a positive direction, but that doesn’t make it less true. Savvy organisations already do a lot of qualitative analysis during post-incident reviews (PIRs), where responders and stakeholders gather to share their experience of the incident and what they learned, from multiple perspectives. Such reviews can also give rise to quantitative data that illustrates a team’s efforts to learn and improve following incidents. For example:

How many times has a PIR been read, and by how many people?
How many people attend PIRs?
How soon after an incident was the PIR held?
What percentage of incidents were followed by PIRs?
How many times were PIRs referenced in commit messages - indicating learnings being applied?
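
Several of the questions above can be answered from fairly simple records. The sketch below assumes a hypothetical `IncidentRecord` shape and a hypothetical convention of citing reviews in commit messages as `PIR-<number>`; neither comes from a real tool, and the sample data is made up.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median
from typing import Optional


@dataclass
class IncidentRecord:
    # Hypothetical record shape; field names are illustrative only.
    incident_id: str
    resolved: datetime
    pir_held: Optional[datetime] = None  # None if no PIR took place
    pir_attendees: int = 0


def pir_summary(records: list[IncidentRecord]) -> dict[str, float]:
    """Summarise PIR coverage, delay after resolution, and attendance."""
    reviewed = [r for r in records if r.pir_held is not None]
    if not reviewed:
        return {"pir_coverage_pct": 0.0}
    delays_days = [
        (r.pir_held - r.resolved).total_seconds() / 86400 for r in reviewed
    ]
    return {
        "pir_coverage_pct": 100 * len(reviewed) / len(records),
        "median_days_to_pir": median(delays_days),
        "mean_attendees": mean(r.pir_attendees for r in reviewed),
    }


def commits_referencing_pirs(messages: list[str]) -> int:
    """Count commit messages that cite at least one PIR.

    Assumes the made-up 'PIR-<number>' citation convention."""
    pattern = re.compile(r"\bPIR-\d+\b")
    return sum(1 for message in messages if pattern.search(message))


# Made-up sample data.
base = datetime(2023, 8, 1)
records = [
    IncidentRecord("INC-1", base, base + timedelta(days=2), 6),
    IncidentRecord("INC-2", base, base + timedelta(days=5), 4),
    IncidentRecord("INC-3", base, base + timedelta(days=3), 8),
    IncidentRecord("INC-4", base),  # no PIR held
]
pir_stats = pir_summary(records)

commit_messages = [
    "Fix retry storm (see PIR-12)",
    "Bump dependencies",
    "Add alert suggested in PIR-12 and PIR-15",
]
pir_commit_count = commits_referencing_pirs(commit_messages)
```

Numbers like these say how much reviewing and reading is happening, not how good the learning is, so they sit alongside the qualitative review rather than replacing it.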

Qualitative analysis also allows you to dive into subtler aspects of behaviour during incident response that can make the difference between a slow, inflexible response and a collaborative, agile one. Dr Laura Maguire’s research into the costs of coordination highlights several behaviours that contribute positively to effective incident response, including:

Taking Initiative
Updating
Sharing Info
Deciding
Adjusting
Being Recruitable
Recruiting Others
Backfilling Incident Commander Tasks

Such attributes may be more difficult to measure than “time to resolve”, but they do represent aspects of incident response that teams and organisations would do well to nurture, monitor and improve.

So while there are alternatives to MTTR if you’re looking to track your incident response improvement, these measures are best combined with a qualitative analysis approach that allows you to reflect on, and learn from, how your responders act together under conditions of surprise, uncertainty and ambiguity. This is precisely what Uptime Labs is designed to help you to experience.

Guest writer: Stuart Rimell

Stuart Rimell

Stuart is a product & technology leader at Uptime Labs. He previously led IG’s largest efficiency program, optimising client services, and built enterprise architecture, agile frameworks, and real-time trading platforms. He also advises startups on product management, specialising in fintech and high-performance systems.
