In the previous post, we explored how the ubiquitous MTTR (Mean Time To Restore/Recover) metric may not be as useful as you’d hope if your aim is to measure IT incident response effectiveness. We described how Courtney Nash, Stepan Davidovic and others demonstrated that the typically skewed statistical distribution of organisational incident resolution times renders the mean (the ‘M’ in MTTR) unhelpful, and even misleading, as a signal of change in performance over time.

So if MTTR isn’t useful, what else can we measure to demonstrate the effectiveness, or otherwise, of our efforts to improve resilience?

First, the bad news. Distilling a complex socio-technical phenomenon such as incident response into a single number is likely to be an unfulfilling exercise. The huge variety of influencing factors contributing to IT incidents tends to render the signal of incident response performance invisible amongst the noise of variation outside of our control.

If a single metric such as MTTR is too blunt an instrument, perhaps a collection of measures will be more effective? There are plenty of MTT’X’ measures that we could deploy in aggregate to provide a more nuanced illustration of incident response. Examples include:

 

  • MTTD – Mean Time To Detect: The time between an incident starting and the organisation detecting it. Detecting an incident quickly is better than slowly.
  • MTTA – Mean Time To Acknowledge: The time between an incident starting and responders starting to work on the response. Starting the response quickly is better than slowly, and indicates an ability to assemble the response effort effectively.
  • MTBF – Mean Time Between Failures: The time between incidents. This is less a measure of incident response performance and more a measure of stability.
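
As a rough illustration, the three measures above could be computed from incident timestamps along these lines. This is a minimal sketch rather than a standard implementation: the `Incident` record, its field names and the sample data are all assumptions made for the example, and the definitions follow the wording of the list above (both MTTD and MTTA are measured from the start of the incident).

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; real incident tooling will differ.
    started: datetime
    detected: datetime
    acknowledged: datetime


def mttd_minutes(incidents):
    """Mean time from incident start to detection, in minutes."""
    return mean((i.detected - i.started).total_seconds() / 60 for i in incidents)


def mtta_minutes(incidents):
    """Mean time from incident start to responders starting work, in minutes."""
    return mean((i.acknowledged - i.started).total_seconds() / 60 for i in incidents)


def mtbf_hours(incidents):
    """Mean gap between consecutive incident start times, in hours."""
    starts = sorted(i.started for i in incidents)
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return mean(g.total_seconds() / 3600 for g in gaps)


# Made-up sample data, purely for illustration.
incidents = [
    Incident(datetime(2023, 1, 1, 10, 0), datetime(2023, 1, 1, 10, 5), datetime(2023, 1, 1, 10, 15)),
    Incident(datetime(2023, 1, 3, 10, 0), datetime(2023, 1, 3, 10, 15), datetime(2023, 1, 3, 10, 25)),
    Incident(datetime(2023, 1, 4, 10, 0), datetime(2023, 1, 4, 10, 10), datetime(2023, 1, 4, 10, 20)),
]

print("MTTD (min):", mttd_minutes(incidents))
print("MTTA (min):", mtta_minutes(incidents))
print("MTBF (h):  ", mtbf_hours(incidents))
```

Note that `mtbf_hours` needs at least two incidents, since it averages the gaps between them.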

 

This InfoQ article does a good job of dissecting such metrics. However, it’s unclear without further analysis whether such metrics are less susceptible to the statistical variance that renders MTTR futile. What is clear is that these metrics are lagging indicators, and that, as means, they need large sample sizes before they settle on a stable value. The last thing we want is a large number of incidents just to obtain a trustworthy average.
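
The sample-size concern can be illustrated with a small simulation. The sketch below assumes, purely for illustration, that incident durations follow a log-normal distribution (one common way to model skewed, long-tailed data); the parameters `mu` and `sigma`, the sample sizes and the helper names are all hypothetical and not drawn from any real incident dataset. It draws many samples of a given size and measures how much the sample mean varies from one sample to the next.

```python
import random
from statistics import pstdev


def simulate_sample_means(sample_size, trials=200, mu=3.0, sigma=1.5, seed=42):
    """Draw `trials` samples of skewed durations and return each sample's mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(trials):
        durations = [rng.lognormvariate(mu, sigma) for _ in range(sample_size)]
        means.append(sum(durations) / len(durations))
    return means


def relative_spread(values):
    """Standard deviation of the values divided by their average."""
    average = sum(values) / len(values)
    return pstdev(values) / average


# Compare how much the mean wobbles with few incidents versus many.
small_spread = relative_spread(simulate_sample_means(sample_size=20))
large_spread = relative_spread(simulate_sample_means(sample_size=500))

print(f"Relative spread of the mean, 20 incidents per sample:  {small_spread:.2f}")
print(f"Relative spread of the mean, 500 incidents per sample: {large_spread:.2f}")
```

Under these assumptions the mean of a small sample moves around far more than the mean of a large one, which is why a month-on-month change in an MTT’X’ figure may say little about actual performance.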

Incident response lends itself more naturally to qualitative analysis. This is an answer that’s unlikely to be satisfying to those who crave the deterministic satisfaction of a single number, or a graph that trends in a positive direction, but this doesn’t make it less true. Savvy organisations already do a lot of qualitative analysis during post-incident reviews (PIRs), where responders and stakeholders gather to share their incident experience and learnings from multiple different perspectives. Such reviews can also give birth to quantitative data that can illustrate a team’s efforts to learn and improve following incidents. For example:

 

  • How many times has a PIR been read, and by how many people?
  • How many people attend PIRs?
  • How soon after an incident was the PIR held?
  • What percentage of incidents were followed by PIRs?
  • How many times were PIRs referenced in commit messages – indicating learnings being applied?
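
Some of these questions can be answered with very little tooling. The following is a minimal sketch, assuming a hypothetical `IncidentRecord` shape and a hypothetical convention where commit messages cite reviews as `PIR-<number>`; neither comes from any particular tool, so both would need adapting to your own data.

```python
import re
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Optional


@dataclass
class IncidentRecord:
    # Hypothetical fields: the incident date and the date its PIR was held, if any.
    occurred_on: date
    pir_held_on: Optional[date] = None


def pir_coverage_percent(records):
    """Percentage of incidents that were followed by a PIR."""
    if not records:
        return 0.0
    reviewed = sum(1 for r in records if r.pir_held_on is not None)
    return 100.0 * reviewed / len(records)


def days_to_pir(records):
    """Days between each incident and its PIR, for incidents that had one."""
    return [(r.pir_held_on - r.occurred_on).days for r in records if r.pir_held_on is not None]


# Assumed naming convention for PIR references in commit messages.
PIR_REFERENCE = re.compile(r"\bPIR-\d+\b")


def commits_referencing_pirs(messages):
    """Number of commit messages that cite at least one PIR."""
    return sum(1 for message in messages if PIR_REFERENCE.search(message))


# Made-up sample data, purely for illustration.
records = [
    IncidentRecord(date(2023, 9, 1), date(2023, 9, 4)),
    IncidentRecord(date(2023, 9, 10), date(2023, 9, 17)),
    IncidentRecord(date(2023, 10, 2)),
    IncidentRecord(date(2023, 10, 20), date(2023, 10, 30)),
]
commit_messages = [
    "Add retry with backoff to payment client (PIR-12)",
    "Bump dependency versions",
    "Alert on queue depth, follow-up from PIR-14",
]

print("PIR coverage (%):", pir_coverage_percent(records))
print("Median days to PIR:", median(days_to_pir(records)))
print("Commits referencing PIRs:", commits_referencing_pirs(commit_messages))
```

The median is used for days-to-PIR because, like resolution times, these gaps are likely to be skewed by the occasional long delay.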

 

Qualitative analysis also allows you to dive into subtler aspects of behaviour during incident response that can make the difference between a slow, inflexible response and a collaborative, agile response. Dr Laura Maguire’s research into the Cost of Coordination highlights several behaviours that positively contribute to effective incident response, including:

 

  • Taking Initiative
  • Updating
  • Sharing Info
  • Deciding
  • Adjusting
  • Being Recruitable
  • Recruiting Others
  • Backfilling Incident Commander Tasks

 

Such attributes may be more difficult to measure than “time to resolve” but they do represent aspects of incident response that teams and organisations would do well to nurture, monitor and improve.

So while there are alternatives to MTTR if you’re looking to track your incident response improvement, these measures are best combined with a qualitative analysis approach that allows you to reflect on, and learn from, how your responders act together under conditions of surprise, uncertainty and ambiguity. This is precisely what Uptime Labs is designed to help you to experience.

 

Guest writer: Stuart Rimmel

 
