Peer into the post-incident toolkit of many a self-respecting Incident Manager and you’ll undoubtedly find a collection of Root Cause Analysis (RCA) tools: fishbone diagrams, fault trees and an assortment of “whys” (which by convention tend to come in sets of 5). Such tools are wielded in pursuit of the root cause, which lurks hidden beneath more proximate causes, offering the tantalising promise of long-term, systemic fixes rather than shallow “sticking plaster” remedies.

Root cause analysis is hungry business, however, and there’s nothing better for one’s post-RCA recovery than a cheese sandwich. I personally favour Emmental, but any cheese with holes in it will do. Such moments of dairy- and carb-fuelled downtime offer a satisfying opportunity to ponder the Swiss Cheese Model of Accident Causation.

The Swiss Cheese Model, coined by Psychology Professor James T Reason, imagines systems and their defences as multiple slices of Swiss cheese, stacked on top of each other. Each slice contains holes, representing risks, vulnerabilities or flaws, but for a threat (accidental or intentional) to pass through all the layers and endanger the system as a whole, the holes in each layer need to line up. Most holes in one layer will lie on top of a solid, cheesy barrier in the next, protecting the system from failure.

Each layer has holes, but it’s unlikely anything’s going to get through this stack…
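For the numerically inclined, here is a minimal sketch of why the stack holds. It assumes, purely for illustration, that each slice lets a threat through with some fixed probability and that the slices fail independently of one another. Real defences are rarely independent (blunt end conditions such as time pressure can open holes in several layers at once), so treat this as a best case. The function names and the per-layer probabilities are hypothetical.

```python
import random
from math import prod


def breach_probability(hole_probs):
    """Chance that a threat passes every layer.

    Assumption: layers fail independently, so the per-layer
    probabilities simply multiply.
    """
    return prod(hole_probs)


def simulate(hole_probs, trials=100_000, seed=0):
    """Estimate the same chance by throwing random threats at the stack."""
    rng = random.Random(seed)
    breaches = 0
    for _ in range(trials):
        # A threat gets through only if it finds a hole in every slice.
        if all(rng.random() < p for p in hole_probs):
            breaches += 1
    return breaches / trials


# Hypothetical stack of four slices, each with its own chance of a hole.
layers = [0.1, 0.2, 0.05, 0.1]

exact = breach_probability(layers)
estimate = simulate(layers)

print(f"Leakiest single layer lets through: {max(layers):.0%}")
print(f"All four holes line up (exact):     {exact:.4%}")
print(f"All four holes line up (simulated): {estimate:.4%}")
```

Under these assumptions the leakiest slice lets through one threat in five, yet the whole stack lets through roughly one in ten thousand. Anything that correlates the holes pushes the real figure up.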

 

The model serves as an appetising reminder that catastrophe requires multiple failures – single point failures are not enough. Even if an incident appears on the surface to be caused by a single point failure e.g., a hardware failure or ‘human error’, it’s rare that such events are the single, solitary ‘cause’ in a system that is impervious in every other aspect. What led the hardware failure to result in an outage? What system properties led to the human error?

Of course, this is the precise purpose of root cause analysis: to peer beyond proximate causes to find the fundamental ‘root’ of the problem. The issue with this tends to be twofold:

  1. Root cause analysis efforts seldom go deep enough.

  2. There is rarely, if ever, a single root. Rather, there are multiple contributing factors.

A Swiss cheese metaphor is useful here too. Imagine a typical triangular wedge of cheese. It has a sharp end, and a blunt end. The sharp end represents the events leading directly to the incident (that hardware failure again). Such events in pathological or bureaucratic organisations typically receive a painful poke from the bony finger of blame. The blunt end on the other hand represents system attributes such as culture, policies, procedures, resources, constraints. Examples include management style, hiring policies, training, performance management, salaries, working hours, pressure etc. Events at the sharp end emerge from the conditions at the blunt end. It’s also worth noting that events at the sharp end feed back into attributes at the blunt end, sometimes quickly (knee-jerk management actions) and sometimes more slowly.

The sharp end and the blunt end

It’s an unfortunate reality that many root cause analysis efforts following failure tend to stop near the sharp end (she did this, they failed to do that, process x did the other). In contrast, most efforts to understand the contributing factors of success tend to land at the blunt end (it was well managed, it was resourced well, the processes were just divine!). This tendency says more about the common motivations behind RCA than about the practice itself, but even when motivations are benevolent, the idea that a single cause can be found is misguided.

That’s not to say that root cause analysis should be consigned to the bin. When done effectively, RCA can surface the myriad sharp and blunt end conditions that created the disposition or propensity for the holes to align, resulting in an outage or incident. The ‘root cause’ is not really what you’re looking for anyway. You’re looking for the contributing factors (plural), such that you can learn from events and nudge your improvement efforts in the right direction.

So the next time someone asks, “What’s the ONE THING we can do to avoid future incidents?”, maybe your answer should be, “Eat a cheese sandwich”.

References

This blog draws repeatedly on Dr Richard Cook’s paper How Complex Systems Fail. This very short, readable paper isn’t specifically about resilience in technology, but if you read it, you’ll notice that it may as well have been.

 
