I’ve been the poor soul on overnight support, trying to make sense of the world while being harassed and distracted by managers. I have also been the frustrated manager caught between a rock and a hard place, managing pressure from the C-level while wanting to give responders space. Depending on how the organisation sets up incident management, major incidents often feel like a lose-lose game for mid-level managers.

Today, my story takes a different turn. Our business revolves around creating incidents and witnessing their resolution! This is how we earn our bread, so for once, as both a business leader and an engineer, I want a lot more incidents :)

We have set ourselves the hardship of starting a new business and running incident drills because we believe incident response requires specific skills different from those we, as engineers, use daily. Once these skills are well understood and regularly practiced, incident response becomes much less stressful and an amazing opportunity for learning and team bonding!

In this post, I’ll touch on two of the skills we watch for during incident drills and help users master through practice:

Communication:

This covers the skills needed to keep senior IT management informed (and bring them into decision-making if needed), maintain a common ground of understanding with fellow responders, and manage business stakeholders.

Each audience has a different primary expectation from incident communication and uses communication to serve a different purpose:

  • Business stakeholders: Inform them that there is a problem with a super clear scope of impact on key business services (knowing what is working is as important as knowing what is broken), assure them that it is being handled competently, give them enough information to manage customers, and keep them away from incident managers and responding teams (do not ring us!).
  • Senior IT management: The primary objective is to keep them as far away as possible from incident responders. Inform them that there is a problem with a clear scope of impact (nothing is more humiliating than someone from the business telling them that there is a problem), provide the key technical facts known about the incident, leave no room for rumors or side-channel information, give assurance that the risk is understood and being handled, and assure them that lessons will be learned. The comms should provide answers rather than raise more questions.
  • Fellow responders: Maintain a common ground of understanding, give clear context and facts when escalating and engaging other responders, frequently update on thought processes and new information, and clearly distinguish facts from opinions.
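
One way to picture the three audiences above is a single shared incident record rendered differently for each of them. The sketch below is purely illustrative: the class, field, and function names and the sample incident are all assumptions made up for this example, not part of any real tool.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentUpdate:
    """Hypothetical shared record of what is known about an incident."""
    summary: str
    services_impacted: list[str]
    services_working: list[str]
    facts: list[str] = field(default_factory=list)     # verified observations
    opinions: list[str] = field(default_factory=list)  # hunches, not yet verified
    next_update_minutes: int = 30


def for_business(u: IncidentUpdate) -> str:
    # Scope of impact, what still works, and reassurance; no technical detail.
    return "\n".join([
        f"Issue: {u.summary}",
        f"Impacted: {', '.join(u.services_impacted) or 'none identified'}",
        f"Working normally: {', '.join(u.services_working) or 'none confirmed'}",
        f"The incident team is engaged. Next update in {u.next_update_minutes} minutes.",
    ])


def for_senior_it(u: IncidentUpdate) -> str:
    # Same scope plus verified facts only, so there is no room for rumors.
    lines = for_business(u).split("\n")
    lines.insert(3, "Known facts: " + ("; ".join(u.facts) or "none confirmed yet"))
    return "\n".join(lines)


def for_responders(u: IncidentUpdate) -> str:
    # Full context, with facts and opinions kept visibly separate.
    lines = [f"Context: {u.summary}"]
    lines += [f"FACT: {f}" for f in u.facts]
    lines += [f"OPINION: {o}" for o in u.opinions]
    return "\n".join(lines)


# Made-up sample incident, for illustration only.
update = IncidentUpdate(
    summary="Card payments are failing at checkout",
    services_impacted=["Checkout"],
    services_working=["Browsing", "Login"],
    facts=["Payment gateway returns HTTP 503 since 02:10"],
    opinions=["Possibly related to last night's certificate rotation"],
)
print(for_business(update))
print(for_responders(update))
```

The point of the design is that opinions never leak into the business or senior IT messages, while responders always see which statements are facts and which are still guesses.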

Ability to progress the incident:

This comes down to feeling comfortable with ambiguity (more on this topic: Navigating the High Seas of Incident Management: Insights from a Decade of Experience) and engaging in an iterative exercise of evolving a working theory of what is going on. We have observed that, regardless of their level of domain knowledge, people who start by building a factual picture of the incident’s impact (sizing up the issue) and who draw on the knowledge available to them (their own or gained by asking others) to understand the flow are more effective at building a working theory. The working theory is the foundation for good communication and teamwork. Once it is formed, the responding team will naturally try to validate its assumptions and evolve it until the service is restored.

In future posts, I’ll delve into teamwork (calmness, positive attitude, promptness, proactiveness, generosity with knowledge, and willingness to stray from an assigned specific role) and domain understanding.

The above skills are only acquired through practice and experience. It normally takes several years of exposure to enough incidents for an individual to build muscle memory for these skills. The alternative is to gain that exposure by participating in frequent incident drills.
