On 28 August 2023, the UK found itself in the midst of a crisis as a widespread National Air Traffic Services (NATS) outage led to thousands of cancellations and many more flights delayed by up to 12 hours. This unexpected disruption served as a stark reminder of how vulnerable our digital infrastructure is.

Outages of this nature are both impossible to predict and unavoidable. The only thing a responsible organisation can do is ensure its teams are prepared and ready to react in such an emergency.

According to aviation analysts Cirium, up to 27% of expected departures and arrivals across the UK were cancelled over the period affected by the outage. IATA estimated that airlines alone lost over £100 million in revenue. The cost to the nation, however, will be many times higher, as both holidaymakers and businesses faced major disruption as a result.

As incident response professionals, it is unpleasantly easy to imagine the stress those responders faced. We have all been in similar situations, although perhaps not on this scale: a costly outage occurs and the clock starts ticking to find the problem and fix it.

Hamed Silatani, CEO of Uptime Labs, is all too familiar with the stress and pressure that accompany these outages. He commented, “Incidents are inevitable and this particular incident really drives home how costly they can be. Although they are inevitable, it is impossible to prepare for every eventuality. NATS chief executive Martin Rolfe claimed this fault was a 1 in 15 million chance, but there are always 15 million other faults.

“The timeline of the incident appears to be that a fault occurred early on the Monday morning. The team responsible then only had a few hours to solve the issue before resorting to cancellations. Our audience will be intimately familiar with how stressful it was for them to make the significant decision to shut down UK airspace in the face of incomplete information to prioritise safety. Behind that difficult decision must have been a wealth of experience and practice and ultimately, it must be praised.”

The immediate cause of the outage was initially attributed to a technical failure in NATS's communication systems. The NATS report to the UK’s Civil Aviation Authority later clarified that the issue lay in a flight plan processing sub-system called Flight Plan Reception Suite Automated – Replacement (FPRSA-R). It encountered a rare set of circumstances: a flight plan that included two identically named but separate waypoint markers outside UK airspace, which caused both the main system and the back-up system to fail.
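To make the failure mode concrete, here is a minimal sketch of how a processor might detect a route in which one waypoint name refers to two different locations, and set that single plan aside for manual review rather than halting. This is an illustration only, not how FPRSA-R works: the `Waypoint` type, the function names, the "quarantined" outcome and the waypoint names and coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Waypoint:
    name: str   # identifier as filed (hypothetical example values below)
    lat: float
    lon: float


def ambiguous_names(route):
    """Return names that map to more than one distinct position in the route."""
    positions = {}
    for wp in route:
        positions.setdefault(wp.name, set()).add((wp.lat, wp.lon))
    return sorted(name for name, locs in positions.items() if len(locs) > 1)


def process_plan(route):
    """Set an ambiguous plan aside for manual handling instead of failing."""
    dupes = ambiguous_names(route)
    if dupes:
        return ("quarantined", dupes)
    return ("accepted", [])


# Hypothetical route: "AAA" appears twice at two different positions.
route = [
    Waypoint("AAA", 10.0, 20.0),
    Waypoint("BBB", 30.0, 40.0),
    Waypoint("AAA", 50.0, 60.0),
]
status, names = process_plan(route)
```

The design point is that one unexpected input is isolated and flagged, while the processing of every other plan continues.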

The ripple effect of the NATS outage was felt not only by passengers but also by the wider economy. Airlines faced a logistical nightmare, with crews and aircraft out of position, leading to further delays in the days following the incident. Airport staff were stretched as they tried to manage the influx of stranded travellers, providing accommodation, rescheduling flights and manually inputting flight information.

Despite the chaos and inconvenience, the incident shed light on the importance of the rigorous safety protocols and contingency plans in place within the aviation sector. NATS, in collaboration with airlines and airports, swiftly enacted its emergency procedures, rerouting flights to nearby airports and ensuring that no aircraft took off or landed without the essential guidance of air traffic control.

The NATS outage was undoubtedly a challenging and disruptive event, and the decisions taken by those involved must be praised. However, it is important to reflect on the nature of this crisis before the case is considered closed. While this particular event may be unlikely to occur again anytime soon, the next unforeseen disruption is just around the corner. The key is to reinforce the vital skills required to tackle any incident.

 

By Patrick Aquilina
