You’re in the middle of a production incident, with a code change ready to address the issue, and there’s a critical decision to make: should you run the automated test suite before deploying the fix?

Of course you should. It’s not just a recommendation in the run-book; it’s a fundamental practice. Running tests before any production change helps to ensure that the fix doesn’t inadvertently cause more problems.

However, the incident has plunged the system into a total outage for over 45 minutes, with significant financial losses accruing by the minute. Angry customers are flooding the phones, and regulatory requirements mandate reporting outages exceeding 60 minutes, a threshold already surpassed twice this year.

So, do you run the tests?


It’s no longer a simple decision, despite ‘best practice’.

Questions arise:

  • How long does the test suite take to run?
  • Could the code change exacerbate the situation if it fails?
  • How confident are you in the fix’s effectiveness?
  • Is there a backup plan if the fix fails?
  • Is the testing environment impacted by the outage?
  • Are the tests comprehensive and reliable?
  • What are the repercussions of reporting to the regulator for the third time?
  • Will the deployment system allow a release without passing tests?
  • And, perhaps most soberingly, what happened to the last person who deployed without running tests?

In this pressure-cooker scenario, the decision is not clear-cut. Each consideration weighs heavily on the outcome. Ultimately, you must make a call.

Fast forward to the post-incident review: it becomes evident that the run-book wasn’t followed; the tests were skipped. The accompanying realisation is that blindly adhering to “best practice” without context is akin to navigating with a map without knowing your location.

So, what’s the takeaway?

Next time, you’ll recognise that in the heat of an incident, rigid adherence to protocol might not always be feasible or wise. Context matters. You’ll strive to understand the nuances of each situation, balancing the urgency of the moment with the principles of so-called ‘best practice’. And with this experience, you’ll be better equipped to navigate future challenges with confidence and composure.

Join a meet-up dedicated to IT Incident Response (OOPS)
