Commentary
By Hamed Silatani

 

I think this is a fantastic initiative from the Department for Education to make it easier for practitioners to follow set processes. One of the most common reasons procedures are hard to follow is simply that people forget them. Beyond that, people are often unaware of the process and do not know what is expected of them during incident response. Incident response is often not the primary duty of those responsible for uptime; after a period of infrequent incidents, you will need to dig up the guidance, which often lives in a rarely updated document, as this author discusses. This adds to resolution time, which adds to the cost to the business and, ultimately, to the stress responders are under.

Having a process and a playbook is essential; however, effective incident response requires far more than a well-documented, accessible process. If it were as simple as following a process, all incident response would already be automated. An incident, by its very nature, is a situation of uncertainty. An incident happens because our assumptions about how systems should work break down, and we face a new reality that we need to adjust to before we can resolve it. Resolving incidents requires certain human qualities. That is why people fix incidents, not tools or processes. For instance, incident responders need to be able to share information with each other effectively and promptly, or take a decision to relax certain rules to restore the service faster. Qualities like the ability to make a decision, share information, or adjust to new information come with experience and with exposure to situations that demand them.

Engineers and IT operations staff will either get the experience they need through high-cost real-life incidents, or they can use incident simulations, which are a safe and low-cost way to gain it. Safety-critical industries like aviation, where the cost of incidents is very high, have long used simulations to build resilience in their staff. It is time for us to reevaluate how we prepare our own responders.

Picture this – you’re an aspiring teacher and you’ve been working hard on your application form. You’re busy but organised, and you want your application to be the best it can be.

You set aside an evening to get it finished. You sign in to your application and you read “Sorry, there’s a problem with the service”.

You keep checking the site, and when it eventually comes back online it’s too late: the evening has disappeared and you’re now forced to rush your application.

So, what happened? This message means there’s been an incident behind the scenes, and the team hasn’t been able to get the service back up and running in time.

This isn’t acceptable. As the owners of the service, we have a duty to keep it reliable, and that includes how we manage incidents. So we’ve recently improved how we do this by building our very own Slack bot to manage incidents.

Handling incidents can be a challenge

As a developer, I found the old incident process painful. The instructions lived in a Google Doc that frequently went out of date, so those not familiar with the process would often be unsure who to contact or how to assess the level of severity.

We spent longer working out what to do than dealing with the incident, and when we did finally trigger ‘the process’, it was just a Slack thread and a video call.

We wanted to improve this, so we could respond more effectively and provide our users with a better experience. We knew the process needed to:

  • be easy and accessible, so everyone in the team would feel confident about raising an incident
  • be automated and always follow a consistent set of actions
  • produce better incident documentation

We wanted the process to remain in Slack and we considered using a pre-built app from the Slack Marketplace, but as developers we decided to build the bot ourselves. That way we could have something that completely met our needs and would be ours to customise and develop further.

As part of our development process we follow certain principles: we like our code to be well tested, open source and well documented. It was important to us that we applied these principles to building the bot too.

We were keen to build the bot in C# or Ruby, our main programming languages. We found a great Slack-Ruby community on GitHub, which inspired us.

Bringing it to life

Now, when a developer notices something is wrong with the service, what do they do? Dig out that old Google Doc? Not any more. This is our new process.

To trigger the incident bot you enter a single Slack command: /incident. A window then appears asking for information.

You fill in the required fields and the bot instantly carries out a series of actions:

  • creating a brand new Slack channel for the incident
  • inviting the incident leads to that channel
  • setting the channel topic, with information such as a summary of the incident and severity level
  • tagging useful documentation to the incident channel, so it’s easy to find

Once the incident has been resolved you simply issue another Slack command: /closeincident

The bot then alerts everyone that the incident is now closed, allowing the team to focus on the post-mortem.
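Closing an incident can be sketched in a similar way. Again, this is a hedged illustration in Python rather than the team's implementation: the `Incident` record, its field names, the duration calculation and the wording of the announcement are all assumptions. The only behaviour taken from the article is that closing an incident notifies everyone involved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    """Hypothetical record of one incident; field names are illustrative."""
    channel: str
    opened_at: datetime
    closed_at: Optional[datetime] = None


def close_incident(incident, now, post_message):
    """Mark the incident closed and announce it in the incident channel.

    `post_message(channel, text)` is any callable that delivers the text,
    for example a thin wrapper around a Slack client.
    """
    if incident.closed_at is not None:
        raise ValueError("incident is already closed")
    incident.closed_at = now
    # Whole minutes between opening and closing, for the announcement.
    minutes = int((now - incident.opened_at).total_seconds() // 60)
    text = f"Incident closed after {minutes} minutes. Next step: post-mortem."
    post_message(incident.channel, text)
    return text


if __name__ == "__main__":
    opened = datetime(2022, 2, 3, 9, 0)
    incident = Incident("incident-sign-in-failing", opened)
    close_incident(incident, opened + timedelta(minutes=42),
                   lambda channel, text: print(f"#{channel}: {text}"))
```

Recording the open and close times on the incident itself is one way such a bot could feed the better incident documentation the team wanted, although the article does not say how theirs does this.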

What’s next

We do what we can to keep incidents to a minimum, and so far we’ve only used our bot a few times. But when incidents do occur, we’re clear about what we need to do.

Developers in other teams have welcomed our bot, and it’s improved the process for all of us. We want to encourage more DfE service teams to use it. And we’re keen to get feedback so we can develop the bot to better meet user needs.

Author: James Glenn

This content has been rehosted with permission from the Department for Education. Please check the source link to see if the original content has changed since publication: https://dfedigital.blog.gov.uk/2022/02/03/incident-bot/
