Discover the ONE Thing You Can Do to Avoid Future Incidents

Stuart Rimell

April 16, 2024


Peer into the post-incident toolkit of many a self-respecting Incident Manager and you’ll undoubtedly find a collection of Root Cause Analysis (RCA) tools. You’ll likely find fishbone diagrams, fault trees and an assortment of “whys” (which by convention tend to come in sets of 5). Such tools are wielded in pursuit of the root cause, which lurks hidden beneath more proximate causes, offering the tantalising promise of long-term, systemic fixes rather than shallow “sticking plaster” remedies.

Root cause analysis is hungry business, however, and there’s nothing better for one’s post-RCA recovery than a cheese sandwich. I personally favour Emmental, but any cheese with holes in it will do. Such moments of dairy- and carb-fuelled downtime offer a satisfying opportunity to ponder the Swiss Cheese Model of Accident Causation.

The Swiss Cheese Model, coined by Psychology Professor James T Reason, imagines systems and their defences as multiple slices of Swiss cheese, stacked on top of each other. Each slice contains holes, representing risks, vulnerabilities or flaws, but for a threat (accidental or intentional) to pass through all the layers and endanger the system as a whole, the holes in each layer need to line up. Most holes in one layer will lie on top of a solid, cheesy barrier, protecting the system from failure.
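For the more programmatically minded, the stacked slices can be sketched in a few lines of Python. This is a toy illustration only, not part of Reason's model: each slice is represented as a set of hole positions, and the function names (aligned_holes, breaches), the layer labels and the positions are all hypothetical, invented for this example.

```python
def aligned_holes(layers):
    """Return the positions where every layer has a hole (assumed toy model)."""
    layer_sets = [set(layer) for layer in layers]
    if not layer_sets:
        return set()
    # A threat only gets through where the holes in ALL slices line up.
    return set.intersection(*layer_sets)


def breaches(layers, threat_position):
    """True if a threat arriving at threat_position passes through every layer."""
    return threat_position in aligned_holes(layers)


# Hypothetical defences; the numbers are arbitrary hole positions.
defences = [
    {1, 4, 7},  # slice 1, e.g. monitoring
    {2, 4, 9},  # slice 2, e.g. change review
    {4, 5},     # slice 3, e.g. failover
]

print(aligned_holes(defences))  # {4}
print(breaches(defences, 7))    # False: slices 2 and 3 are solid at position 7

# Closing the hole at position 4 in any one slice stops the threat.
patched = [defences[0], defences[1] - {4}, defences[2]]
print(aligned_holes(patched))   # set()
```

The sketch treats holes as fixed and independent of one another. Real systems are rarely so obliging: holes open, close and move over time, and conditions elsewhere in the organisation can make them more likely to line up.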

The model serves as an appetising reminder that catastrophe requires multiple failures; single point failures are not enough. Even if an incident appears on the surface to be caused by a single point failure, e.g. a hardware failure or ‘human error’, it’s rare that such events are the single, solitary ‘cause’ in a system that is impervious in every other aspect. What led the hardware failure to result in an outage? What system properties led to the human error?

Of course, this is the precise purpose of root cause analysis: to peer beyond proximate causes to find the fundamental ‘root’ of the problem. The issue with this tends to be twofold:

Root cause analysis efforts seldom go deep enough.
There is rarely, if ever, a single root. Rather, there are multiple contributing factors.

A Swiss cheese metaphor is useful here too. Imagine a typical triangular wedge of cheese. It has a sharp end and a blunt end. The sharp end represents the events leading directly to the incident (that hardware failure again). In pathological or bureaucratic organisations, such events typically receive a painful poke from the bony finger of blame. The blunt end, on the other hand, represents system attributes such as culture, policies, procedures, resources and constraints. Examples include management style, hiring policies, training, performance management, salaries, working hours, pressure etc. Events at the sharp end emerge from the conditions at the blunt end. It’s also worth noting that events at the sharp end feed back into attributes at the blunt end, sometimes quickly (knee-jerk management actions) and sometimes more slowly.

It’s an unfortunate reality that many root cause analysis efforts following failure tend to stop near the sharp end (she did this, they failed to do that, process x did the other). In contrast, most efforts to understand the contributing factors of success tend to land at the blunt end (it was well managed, it was resourced well, the processes were just divine!). This tendency says more about the common motivations behind RCA than about the practice itself. Even when motivations are benevolent, though, the idea that a single cause can be found is misguided.

That’s not to say that root cause analysis should be consigned to the bin. When done effectively, RCA can surface the myriad sharp and blunt end conditions that created the disposition or propensity for the holes to align, resulting in an outage or incident. The ‘root cause’ is not really what you’re looking for anyway. You’re looking for the contributing factors (plural), such that you can learn from events and nudge your improvement efforts in the right direction.

So the next time someone asks, “What’s the ONE THING we can do to avoid future incidents?”, maybe your answer should be, “Eat a cheese sandwich”.

References

This blog draws repeatedly on Dr Richard Cook’s paper How Complex Systems Fail. This very short, readable paper isn’t specifically about resilience in technology, but if you read it, you’ll notice that it may as well have been.

Stuart Rimell

Stuart is a product & technology leader at Uptime Labs. He previously led IG’s largest efficiency program, optimising client services, and built enterprise architecture, agile frameworks, and real-time trading platforms. He also advises startups on product management, specialising in fintech and high-performance systems.
