Difficult incidents and the magical power of in-between areas

I’m using the term “difficult incident” cautiously here, as it raises the question of what makes an incident easy or hard in the first place. For the purposes of this post, let’s assume a difficult incident involves multiple teams and isn’t easily traced to a single cause.

If you’ve been on support duty for a while, you’ve probably seen exchanges like this in incident chat logs:

Engineer 1: “Anyone from Networks online? Can’t connect ClientData service to its database.”

Network engineer: “I need source and destination IP addresses and port numbers.”

Engineer 1: “We only have service and dependency names in the config. I don’t know the IP addresses.”

This leads to a delay of at least ten minutes while Engineer 1 tracks down the information the network engineer needs.

Now, imagine a better scenario:

Engineer 1: “Anyone from Networks online? The ClientData service on HOSTXXX can’t connect to its database at 192.168.1.20:1521. I’m also seeing connections stuck in SYN_SENT in the netstat output.”

Network engineer: “Thanks for the info. SYN_SENT usually points to a firewall issue.”

This second conversation moves faster and is far more likely to get the problem fixed quickly.
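To make the point concrete, here is a minimal Python sketch of the kind of check behind the second message. The hostname, port, and service names are purely illustrative assumptions; the idea is simply to turn the service name from the config into the source/destination details a network engineer will ask for, and to surface the timeout that shows up as SYN_SENT in netstat.

```python
import socket

# Illustrative endpoint only -- in reality this comes from the service config.
DB_HOST = "clientdata-db.internal"
DB_PORT = 1521


def check_db_connectivity(host: str, port: int, timeout: float = 5.0) -> None:
    """Resolve the endpoint and attempt a TCP connection, printing the details
    (source/destination IP and port) a network engineer will ask for."""
    dest_ip = socket.gethostbyname(host)  # turn the config's service name into an IP
    try:
        with socket.create_connection((dest_ip, port), timeout=timeout) as sock:
            src_ip, src_port = sock.getsockname()
            print(f"Connected: {src_ip}:{src_port} -> {dest_ip}:{port}")
    except socket.timeout:
        # A timeout matches the SYN_SENT state seen in netstat: the SYN went out
        # and nothing came back, which often means a firewall is dropping it.
        print(f"Timed out connecting to {dest_ip}:{port} - check firewall rules")
    except OSError as exc:
        print(f"Connection to {dest_ip}:{port} failed: {exc}")


if __name__ == "__main__":
    check_db_connectivity(DB_HOST, DB_PORT)
```

Thirty seconds spent running something like this turns “I can’t connect” into a report the network engineer can act on immediately.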

The key difference is that the application engineer understands a little about networking, and the network engineer is familiar with how applications are deployed. A gap in this shared understanding, known as a “fundamental common ground breakdown,” often turns into delays, or into much larger disasters, during incident resolution.

Here’s a personal example:

I was new to a job at a financial trading firm when I encountered an issue with high memory usage and a growing number of connections on a load balancer. I suggested a rolling restart of the instances to head off the memory problem. It led to a full outage instead, because I didn’t realise the load balancer took time to mark restarted servers as up again.

The problem? I assumed a rolling restart meant moving quickly from one instance to the next, and the platform support team never clarified how long the load balancer took to mark a restarted server as healthy.
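A safer rolling restart waits for each instance to pass the same health check the load balancer uses before touching the next one. The sketch below is a hypothetical illustration, not the procedure from that incident: the instance names, the health-check URL, and the restart_instance helper are all assumptions standing in for whatever your platform provides.

```python
import time
import urllib.request

# Hypothetical fleet and health endpoint -- replace with your platform's reality.
INSTANCES = ["app01", "app02", "app03"]
HEALTH_URL = "http://{host}:8080/health"


def restart_instance(host: str) -> None:
    """Placeholder for the platform-specific restart call (SSH, API, orchestrator...)."""
    print(f"Restarting {host}...")


def wait_until_healthy(host: str, timeout: float = 300.0, interval: float = 10.0) -> bool:
    """Poll the instance's health endpoint until it responds with 200,
    mirroring the check the load balancer performs before marking it 'up'."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(HEALTH_URL.format(host=host), timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # not up yet (or still refusing connections); keep polling
        time.sleep(interval)
    return False


def rolling_restart(instances: list[str]) -> None:
    for host in instances:
        restart_instance(host)
        # Only move on once the load balancer would see this instance as 'up',
        # so serving capacity never drops below N-1 during the rollout.
        if not wait_until_healthy(host):
            raise RuntimeError(f"{host} did not become healthy; stopping the rollout")


if __name__ == "__main__":
    rolling_restart(INSTANCES)
```

The expensive part is the waiting between instances, and that is exactly the part I skipped.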

Expanding our knowledge beyond our own area of expertise can prevent incidents, or at least resolve them faster: database engineers who understand networking and operating systems, for example, or application developers who understand databases, networks, and operating systems.

We can leave this overlap of skills to chance, or we can create it deliberately. One approach is rotating engineers across functions; another is running incident drills that pull people out of their specialism. We can’t make everyone’s mental model identical, but we can foster enough overlap between team members’ mental models to make incident response noticeably more effective.


TL;DR:

In complex incidents, knowledge that spans different areas of expertise is crucial for swift resolution. Bridging the gaps between application design, networking, and operating systems prevents delays and bigger disasters. Rather than leaving that overlap to chance, organisations can actively foster it among team members’ mental models, for example by rotating engineers across functions and running incident drills.

Join a meet-up dedicated to IT Incident Response (OOPS)
