Uptime Labs hosted a breakfast discussion for senior technology executives from the finance and retail sectors at The Shard in London in February. The discussion focussed on nurturing an incident-aware culture, and on the direct and indirect consequences of growing effective incident response skills. A consensus emerged among participants that while the typical first-order effects of incident awareness, such as reduced incident resolution times and impact, remain important, second-order effects, such as improved resilient software techniques and DevOps practices, are equally valuable for modern organisations.
What keeps technology execs up at night?
Most incident response-related insomnia appears, perhaps counterintuitively, to concern people rather than technology. The main concern can be summarised as ‘ensuring appropriately skilled people are available during an incident’. This can be further subdivided into three main themes:
- Ensuring sufficient strength-in-depth of incident responders to avoid key person dependencies.
- Ensuring appropriately skilled individuals can be mobilised immediately.
- Ensuring the swift response of 3rd party service providers.
It’s common for effective incident response skills to be concentrated in a few key individuals such as dedicated incident managers and SREs. It can be challenging to grow the kind of strength in depth necessary to have confidence in one’s resilience in the face of a large variety of possible incident scenarios. Thankfully, severe incidents happen infrequently, but this can result in complacency and risk exposure without a coherent strategy to maintain skills and preparedness levels.
Participants volunteered their own preparedness strategies, which included running incident drills in staging environments and introducing non-client-impacting degradations into production systems. Such methods are helpful, especially in allowing practitioners to experience incident scenarios within their own domain. However, these methods can be expensive, disruptive to development, and carry the risk of customer impact. They are also somewhat restrictive in the variety of scenarios that can be practised.
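As a purely illustrative sketch of the second approach, a degradation drill might inject artificial latency into a small slice of internally tagged traffic so that responders see realistic symptoms while customers remain unaffected. The header name, rate, and helper below are assumptions for illustration, not a description of any participant’s actual setup.

```python
import random
import time

# Illustrative knobs for a latency-injection drill; names and values are assumptions.
INJECTION_RATE = 0.01        # degrade roughly 1% of eligible requests
ADDED_LATENCY_SECONDS = 0.5  # extra latency added to each degraded request


def is_internal(request) -> bool:
    """Only internal/synthetic traffic is eligible, keeping real customers unaffected."""
    return request.headers.get("X-Traffic-Class") == "internal"


def maybe_degrade(request) -> None:
    """Add artificial latency to a small fraction of internal traffic."""
    if is_internal(request) and random.random() < INJECTION_RATE:
        time.sleep(ADDED_LATENCY_SECONDS)
```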
If you saw a 50% improvement in your incident response capability, what would you notice that indicated the improvement?
Organisations commonly measure their incident response with metrics such as MTTR (mean time to restore) and MTTA (mean time to acknowledge). Participants differentiated between such direct, first-order metrics and the more indirect, second-order effects of incident awareness on overall engineering practices. Metrics such as MTTR are clearly essential and would be expected to improve, but participants were equally enthused by the promise of ‘incident awareness’ feeding back into engineering practices, resulting in continuous improvement in resilience and an overall reduction in incident frequency and severity.
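For concreteness, a minimal sketch of how these first-order metrics are typically derived from incident records is shown below; the record structure and field names are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Each incident record carries the timestamps needed for first-order metrics.
# Field names here are illustrative assumptions.
incidents = [
    {"opened": datetime(2024, 2, 1, 9, 0),
     "acknowledged": datetime(2024, 2, 1, 9, 5),
     "restored": datetime(2024, 2, 1, 9, 40)},
    {"opened": datetime(2024, 2, 8, 14, 0),
     "acknowledged": datetime(2024, 2, 8, 14, 2),
     "restored": datetime(2024, 2, 8, 14, 30)},
]


def mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations, expressed in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60


mtta = mean_minutes([i["acknowledged"] - i["opened"] for i in incidents])
mttr = mean_minutes([i["restored"] - i["opened"] for i in incidents])
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```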
“Never waste a good incident”
Participants noted the learning potential of incidents, emphasising both the experience of practising incident resolution and the importance of PIRs (post-incident reviews) in establishing a feedback loop that directly informs ongoing engineering improvements. If this feedback loop is to be established, incident response experience needs to be distributed broadly throughout development teams, giving developers the best chance of proactively addressing common failure scenarios. “Never waste a good incident” was mentioned as a memorable mantra for turning the negatives associated with incidents into positive learning opportunities.
If your incident response capability was as good as it could practically be, what improvement in uptime might you experience?
The assumed benefit of effective incident response capabilities is rooted in the belief that improvements will result in a tangible reduction of incident impact. This would most commonly be measured as an improvement in uptime or, its counterpart, a decrease in downtime. So how much uptime are companies leaving on the table by failing to optimise their response capabilities?
This is a very difficult question to answer, and participants differed in their responses. One financial services exec suggested downtime could fall by as much as 75%, citing an incident that lasted 40 minutes but could have been resolved within 10. Others suggested that a reduction in resolution time for individual incidents was unlikely; however, incident awareness would likely result in fewer incidents, leading to an increase in overall uptime measured over a longer timeframe. Participants noted that a reduction in incidents whose root cause lay in the organisation’s own engineering practices was an especially likely outcome of improved incident response skills. Incidents caused by 3rd party failures were seen as less likely to be reduced, although it was noted that practices for building resiliency and redundancy into 3rd party integrations were likely to improve with a widespread organisational culture of incident awareness. Regardless of an incident’s root cause or location, awareness of the variety of potential failure modes gives organisations the best chance to proactively defend against them.
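To make the arithmetic behind that example explicit: resolving a 40-minute incident in 10 minutes is a 75% reduction in that incident’s downtime. The monthly availability figures in the sketch below are illustrative assumptions (a single incident in a 30-day month), not numbers cited at the event.

```python
# Downtime saved when a 40-minute incident is resolved in 10 minutes.
actual, achievable = 40, 10  # minutes of downtime
reduction = (actual - achievable) / actual
print(f"Downtime reduction: {reduction:.0%}")  # 75%

# Effect on monthly availability, assuming this was the month's only incident.
minutes_per_month = 30 * 24 * 60
for downtime in (actual, achievable):
    availability = 1 - downtime / minutes_per_month
    print(f"{downtime} min down -> {availability:.3%} availability")
```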
The take home…
The differentiation between first-order aspects of incident response (such as MTTR) and second-order effects, in which incident experience improves overall engineering practices and culture, was the big take home from this breakfast briefing. It’s clear that executives are thinking beyond simply minimising downtime, towards establishing a culture of learning that benefits the whole organisation.
We at Uptime Labs are continually excited by, and dedicated to, giving organisations the best tools with which to experience, and learn from, an infinite variety of authentically simulated incident scenarios in a safe and trusting environment.