Top 10 Incident Management Problems

Stuart Rimell

December 2, 2022

The Semicircle of Doom

You know the story. The incident is kicking off, the clock is ticking, all hands are on deck, and the focussed, purposeful wheels of the Incident Response playbook are turning. Soon, however, the audience arrives. Senior managers swoop in one by one, like ravens onto a telephone wire, searching for scraps of information to allay their increasing anxiety. Meanwhile, the incident responders struggle to think while the impotent stares of the highly paid onlookers burn holes in their concentration. Get rid of them. Anyone who’s not playing an active role in the incident response should be gone, with the expectation that they’ll be regularly informed.

I need a hero

I’m holding out for a hero ‘til the end of the night. They’ve gotta be strong and they’ve gotta be fast and they’d better not be on holiday when the incident happens or we’re doomed… Every organisation has their heroes, their rockstars, their ninjas, and we love them and appreciate them dearly. But to depend on them entirely is to invite failure. Make sure you have a breadth of experience and skills on your team so you can have confidence in your incident response without risking a single point of failure. Even better, find those true heroes who are generous and selfless with their knowledge and can help grow your strength in depth.

Failing to keep stakeholders informed

Failing to communicate with your stakeholders is a sure way to invite a Semicircle of Doom. Keep the ravens in their nest by setting an expectation of regular, timely communication and sticking to your promise.

Usual Suspects

It’s a network issue. It’s always a network issue. It’s definitely a network issue. Not so fast. Common problems are common, and case-based reasoning is a powerful diagnostic tool, but jumping to conclusions without evidence is a form of premature convergence that can result in a long journey down the wrong diagnostic rabbit hole. What’s the evidence? What does the data say?

It’s been a while…

The last major incident was back in 2017. Since then we’ve grown complacent, overly confident that our resiliency efforts have resulted in invincibility. Then suddenly we’re reminded that incidents can occur, and we’re not prepared. Incident response skills need to be embedded into muscle memory, and if you don’t use it, you lose it. How are you keeping your incident response capability sharp so you’re ready at all times? Uptime Labs can help with that.

How big?

Failing to effectively size the issue is the first opportunity for your incident response to slip off the rails. Is the issue minor, perhaps non-customer-impacting or affecting a non-critical feature, or is it major? Who’s affected? Everyone or a specific segment? Global or local? Degradation or outage? Your assessment of the size and severity of the issue will shape your response and communication strategy, so get it right.
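The sizing questions above can be captured as a small checklist so that every incident is assessed the same way. The Python sketch below is only an illustration: the `Assessment` fields, the three `Severity` levels and the scoring thresholds are all assumptions made for this example, not an established standard, and should be replaced by whatever your own playbook defines.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # Hypothetical levels; use the ones defined in your own playbook.
    MINOR = "minor"
    MODERATE = "moderate"
    MAJOR = "major"


@dataclass
class Assessment:
    customer_impacting: bool
    critical_feature: bool
    everyone_affected: bool  # False means a specific segment
    is_global: bool          # False means local
    outage: bool             # False means degradation


def size_incident(a: Assessment) -> Severity:
    """Map the answers to the sizing questions onto a severity level.

    The thresholds here are assumptions chosen for illustration only.
    """
    if not a.customer_impacting and not a.critical_feature:
        return Severity.MINOR
    # Count how many of the "breadth and depth" answers point to the worse case.
    score = sum([a.everyone_affected, a.is_global, a.outage])
    return Severity.MAJOR if score >= 2 else Severity.MODERATE
```

For example, `size_incident(Assessment(True, True, True, True, True))` returns `Severity.MAJOR` under these assumed thresholds, which would then drive who gets paged and how often stakeholders are updated.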

To fail to plan is to plan to fail

It’s said that no plan survives first contact with the enemy, and while this may be true, that’s not to say that planning isn’t critically important. Your incident protocol or playbook is your “break glass here” action plan or checklist that will help you to get the basics done on autopilot, leaving your conscious brain free to work with the agility that a complex, emerging incident scenario demands. How well established is your incident response playbook within your team?

Authority Through Seniority

So you’ve just got off the phone with the CTO and she’s convinced that the incident’s caused by a load balancer issue. You feel like you have to focus your triage in this direction due to the respect you have for the CTO’s seniority. Be careful. Outside opinions provided at a distance may be useful and they’re worth listening to, but the basic principles of effective triage remain. What’s your evidence? What does the data say? You’re in charge of this incident and the seniority of an opinion should make no difference to how valid it is.

Get to the point

The incident bridge isn’t the place for your life story. Leave that for the retrospective. Your job on a communications bridge is to keep the signal-to-noise ratio as high as possible. You can use helpful acronyms such as C.A.N. to help focus your communication:

C - Conditions - Describe what is happening
A - Actions - What has been done
N - Needs - What you need to have happen
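One way to make C.A.N. stick is to give bridge updates a fixed shape. The Python sketch below is a minimal illustration of that idea; the `CanUpdate` class and its `render` method are hypothetical names invented for this example, not part of any tool mentioned in this article.

```python
from dataclasses import dataclass


@dataclass
class CanUpdate:
    """A single bridge update in C.A.N. form (hypothetical helper)."""

    conditions: str  # what is happening
    actions: str     # what has been done
    needs: str       # what you need to have happen

    def render(self) -> str:
        # One line per element keeps the update short and scannable.
        return (
            f"C - Conditions: {self.conditions}\n"
            f"A - Actions: {self.actions}\n"
            f"N - Needs: {self.needs}"
        )


update = CanUpdate(
    conditions="Checkout requests are timing out for some customers",
    actions="Rolled back the latest release",
    needs="A database engineer on the bridge",
)
print(update.render())
```

Forcing every update through the same three fields leaves little room for the life story, which is the point.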

Stuart Rimell

Stuart is a product & technology leader at Uptime Labs. He previously led IG’s largest efficiency program, optimising client services, and built enterprise architecture, agile frameworks, and real-time trading platforms. He also advises startups on product management, specialising in fintech and high-performance systems.
