Top 10 Disruptions of 2022

The worst network and service outages of 2022 had far-reaching consequences: flights were grounded, virtual meetings were disrupted, and communications were blocked.

The culprits that brought down major infrastructure and service providers were just as diverse, according to analysis by ThousandEyes, a Cisco-owned network intelligence company that tracks internet and cloud traffic. Maintenance-related errors were cited more than once: Canadian carrier Rogers Communications suffered a massive nationwide outage that was traced to a maintenance update, and a maintenance script error caused days of problems for software maker Atlassian.

BGP configuration errors also appear in the top outage reports. Border Gateway Protocol (BGP) determines the routes that traffic takes across the internet; if the routing information is incorrect, traffic can be diverted down the wrong path, which is what happened to Twitter. (Read more about the U.S. and global outages in our weekly Internet Health Check.)

Here are the top 10 disruptions of the year, in chronological order.

British Airways loses online system: February 25

On February 25, British Airways’ online services were inaccessible for several hours, causing hundreds of flight cancellations and disrupting airline operations. Flights could not be booked and travelers could not check in electronically. The airline was reportedly forced to revert to paper-based processes when its online systems became inaccessible, and the impact was felt around the world. “Our monitoring shows that the network path to the airline’s online services (and servers) is accessible, but the server and site responses are timing out,” ThousandEyes said in its outage analysis, which blamed the failure on an unresponsive application server rather than a network problem.

“The nature of the problem and the airline’s response to it suggest that the root cause may be a central back-end repository that multiple front-end services rely on. If that is the case, this incident may prompt British Airways to re-architect or decouple its back end to avoid single points of failure and reduce the likelihood of recurrence. However, it is equally likely that the sequence of events leading to the outage rarely occurs and can be largely controlled in the future. Time will tell,” ThousandEyes said.

Twitter Hijacked by BGP: March 28

On March 28, Russian internet and satellite communications provider JSC RTComm.RU improperly announced one of Twitter’s prefixes (104.244.42.0/24), causing traffic bound for Twitter to be misrouted and dropped for some users, who could not reach the service. After RTComm’s BGP announcement was withdrawn, affected users regained access to Twitter. ThousandEyes points out that BGP misconfiguration can also be used to block traffic deliberately, and it is not always easy to tell whether a given incident is accidental or intentional.

“We know that the March 28 Twitter incident was caused by RTComm announcing itself as the origin of the Twitter prefix and then withdrawing that announcement. While we do not know what led to the announcement, it is important to understand that accidental BGP misconfigurations are not uncommon, and given that the ISP withdrew the route, it is likely that RTComm had no intention of causing an outage of global impact to Twitter’s service. That said, ISPs in some regions have used localized manipulation of BGP to block traffic in order to enforce local access policies,” ThousandEyes said in its outage analysis.

One way organizations deal with route leaks and hijacks is to monitor their routes so problems are detected quickly, and to protect BGP with security mechanisms such as Resource Public Key Infrastructure (RPKI), a cryptographic framework used to enforce route origin authorization. RPKI is effective against BGP hijacks and leaks, but adoption is not yet widespread. “While your company may have implemented RPKI to protect against BGP threats, your telco may not. This is something to consider when choosing an ISP,” ThousandEyes said.
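To make route origin validation concrete, here is a minimal sketch that asks whether an origin AS is authorized to announce the hijacked Twitter prefix. It assumes the public RIPEstat Data API’s rpki-validation endpoint and response layout work as documented, and uses AS13414 as Twitter’s origin ASN for illustration; verify both against current documentation before depending on them.

```python
import json
import urllib.request

# Assumption: RIPEstat's documented rpki-validation endpoint, which takes an
# origin ASN and a prefix and reports the RPKI validation state.
RIPESTAT_URL = (
    "https://stat.ripe.net/data/rpki-validation/data.json"
    "?resource={asn}&prefix={prefix}"
)

def rpki_status(asn: str, prefix: str) -> str:
    """Return the RPKI validation state, e.g. 'valid', 'invalid_asn', 'unknown'."""
    with urllib.request.urlopen(RIPESTAT_URL.format(asn=asn, prefix=prefix)) as resp:
        data = json.load(resp)
    # Assumption: the result is nested under data -> status.
    return data["data"]["status"]

if __name__ == "__main__":
    # The prefix RTComm hijacked on March 28; AS13414 is assumed to be
    # Twitter's legitimate origin ASN. A hijacker's ASN would come back
    # in an 'invalid' state wherever a covering ROA exists.
    print(rpki_status("AS13414", "104.244.42.0/24"))
```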

Atlassian exaggerates outage impact: April 5

Atlassian reported issues with several of its largest development tools on the morning of April 5, including Jira, Confluence and OpsGenie. A maintenance script error caused these services to be disrupted for several days, but only affected approximately 400 Atlassian customers.

In analyzing the outage, ThousandEyes emphasized the role of vendor status pages in reporting issues: Atlassian’s status page showed a “sea of orange and red indicators” signaling a severe outage, and the company said it would mobilize hundreds of engineers to resolve the incident, yet for most customers there was no problem at all.

Status pages often understate the extent of an outage, but they can also overstate its impact, ThousandEyes warns: “It’s a very difficult balance: say too little, or say it too late, and customers will feel uneasy about your responsiveness; say too much, and being overly transparent risks unnecessarily worrying a large number of unaffected customers, as well as wider stakeholders.”

Rogers outage cuts service across Canada: July 8

A botched maintenance update caused a lengthy nationwide outage on Canadian carrier Rogers Communications’ network. The outage affected phone and internet services for about 12 million customers and hampered many critical services across the country, including banking transactions, government services and emergency response capabilities.

According to ThousandEyes, Rogers withdrew its prefixes because of internal routing issues, leaving its network unreachable via Tier 1 providers for nearly 24 hours. “This incident appears to have been triggered by the withdrawal of a large number of Rogers prefixes, which made their network unreachable from the global internet. However, the behavior observed in their network during this time suggests that the withdrawal of external BGP routes may itself have been caused by internal routing issues,” ThousandEyes said in its analysis of the outage.

The Rogers outage is an important reminder of the need for redundancy in critical services; ThousandEyes recommends using multiple network providers, having backup plans in place in the event of an outage, and maintaining proactive visibility into your network. “No provider is immune to outages, no matter how large. So for critical services like hospitals and banks, plan for a backup network provider that can mitigate the length and scope of an outage,” ThousandEyes wrote.
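One inexpensive form of that proactive visibility is watching whether your own prefixes remain visible in the global routing table, since withdrawn prefixes like Rogers’ simply disappear from it. The sketch below polls the public RIPEstat routing-status endpoint; the endpoint name and response fields are assumptions based on RIPEstat’s documented API, and the prefix and alert threshold are placeholders.

```python
import json
import urllib.request

# Assumption: RIPEstat's documented routing-status endpoint, which reports
# how many RIS full-table peers currently see a given prefix.
URL = "https://stat.ripe.net/data/routing-status/data.json?resource={prefix}"

def visible_peer_count(prefix: str) -> int:
    """Number of RIS peers that currently see the prefix in their tables."""
    with urllib.request.urlopen(URL.format(prefix=prefix)) as resp:
        data = json.load(resp)
    # Assumption: visibility is nested under data -> visibility -> v4.
    return data["data"]["visibility"]["v4"]["ris_peers_seeing"]

if __name__ == "__main__":
    # Hypothetical prefix; replace with one of your own announced prefixes.
    seen = visible_peer_count("192.0.2.0/24")
    if seen == 0:
        print("ALERT: prefix gone from the global table (possible withdrawal)")
    else:
        print(f"prefix visible to {seen} RIS peers")
```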

AWS US East Region Outage: July 28

A power outage on July 28 disrupted service within Amazon Web Services (AWS) Availability Zone 1 (AZ1) in the US East 2 region. “The outage affected connectivity to the region and brought down Amazon’s EC2 instances, which impacted applications such as Webex, Okta, Splunk, BambooHR, and others,” ThousandEyes reported in its outage analysis. Not all users or services were affected equally; for example, Webex components located in Cisco data centers continued to function normally. AWS reported that the outage lasted only about 20 minutes, but it took up to three hours for some of its customers’ services and applications to be restored.

It’s important to design a degree of physical redundancy into cloud-delivered applications and services, writes ThousandEyes: “There’s no soft landing with a data center outage: when the power goes out, it’s hard on dependent systems. Whether it’s a grid outage or a failure in related systems, in times like these the architectural resiliency and redundancy of digital services are crucial.”
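As a small illustration of that kind of redundancy check, the sketch below uses the AWS SDK for Python (boto3) to confirm that a service’s running EC2 instances are spread across more than one Availability Zone in us-east-2. The tag key and value are hypothetical placeholders for however you label your fleet.

```python
import boto3  # AWS SDK for Python

ec2 = boto3.client("ec2", region_name="us-east-2")

def zones_in_use(tag_key: str, tag_value: str) -> set[str]:
    """Return the set of AZs hosting running instances with the given tag."""
    paginator = ec2.get_paginator("describe_instances")
    zones: set[str] = set()
    for page in paginator.paginate(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                zones.add(instance["Placement"]["AvailabilityZone"])
    return zones

if __name__ == "__main__":
    # "service" / "web" are hypothetical tags, not an AWS convention.
    zones = zones_in_use("service", "web")
    if len(zones) < 2:
        print(f"WARNING: all instances in {zones or 'no AZ'}; add a second AZ")
    else:
        print(f"instances spread across {sorted(zones)}")
```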

Google Search, Google Maps go down: August 9

A brief outage left Google Search and Google Maps, two of the most widely used Google services, unavailable to users around the world for about an hour. “Attempts to access these services resulted in error messages from Google edge servers, including HTTP 500 and 502 server responses, which typically indicate internal server or application issues,” ThousandEyes reported.

According to reports, the root cause was a software update gone wrong. Not only were end users unable to access Google Search and Google Maps, but applications that relied on Google software functionality also stopped working during the outage.

The outage is interesting to IT professionals for several reasons, ThousandEyes noted. “First, it highlights the fact that even the most stable services, such as Google Search, with which we rarely experience problems or hear about outages, are still subject to the same forces that can disrupt any complex digital system. Second, the event revealed how pervasive some software systems are, intertwined with the many digital services we consume every day, often without our being aware of those software dependencies.”
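Client applications that depend on such services can at least absorb brief error bursts. Below is a minimal sketch of retrying on transient 5xx responses with exponential backoff; the URL, attempt count, and delays are illustrative only, and no amount of retrying will ride out an hour-long outage.

```python
import time
import urllib.error
import urllib.request

def fetch_with_backoff(url: str, attempts: int = 4, base_delay: float = 1.0) -> bytes:
    """Fetch a URL, retrying 5xx and connection errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # 4xx errors won't improve on retry; give up on the final attempt too.
            if err.code < 500 or attempt == attempts - 1:
                raise
        except urllib.error.URLError:
            if attempt == attempts - 1:
                raise
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")

if __name__ == "__main__":
    print(len(fetch_with_backoff("https://www.google.com/")))
```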

Zoom outage disrupts virtual meetings: September 15

During the September 15 outage, Zoom users around the world received Bad Gateway (502) errors for approximately an hour. They were unable to log in or join meetings, and in some cases users who were already in a meeting were kicked out of it.

The root cause has not been confirmed, “but it appears to be within Zoom’s backend systems, surrounding their ability to resolve, route or redistribute traffic,” ThousandEyes said in its analysis of the outage.

Zscaler proxies suffer 100% packet loss: October 25

On October 25, traffic to a subset of Zscaler proxy endpoints experienced 100% packet loss, impacting customers using the Zscaler Internet Access (ZIA) service on Zscaler’s Cloud 2 network. According to ThousandEyes’ outage analysis, the most severe packet loss lasted about 30 minutes, although accessibility issues and packet-loss spikes at certain user locations persisted intermittently over the next three hours.

Zscaler referred to the issue as a “traffic forwarding issue” on its status page; when a proxy’s virtual IP is unreachable, traffic cannot be forwarded through it.

ThousandEyes explained how the situation prevented some customers using the Zscaler security service from reaching critical business tools and SaaS applications: “This may have affected various applications for enterprise customers using the Zscaler service, because it is typical in security service edge (SSE) implementations to proxy not only web traffic but also other critical business tools and SaaS services such as Salesforce.com, ServiceNow and Microsoft Office 365. The proxy therefore sits in the user’s data path, and when the proxy is not reachable, access to these tools is affected; remediation often requires manual intervention to route affected users to an alternate gateway.”
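A rough sketch of automating that failover is shown below: probe the primary gateway and switch to an alternate when it stops accepting connections. The hostnames and port are hypothetical placeholders rather than real Zscaler endpoints; in practice this logic usually lives in a PAC file or the vendor’s client connector.

```python
import socket

# Hypothetical gateway hostnames and proxy port, for illustration only.
GATEWAYS = ["primary-gw.example.com", "backup-gw.example.com"]
PROXY_PORT = 8080

def pick_gateway(hosts: list[str], port: int, timeout: float = 3.0) -> str:
    """Return the first gateway accepting TCP connections on the proxy port."""
    for host in hosts:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host
        except OSError:
            continue  # unreachable or timed out; try the next gateway
    raise RuntimeError("no proxy gateway reachable")

if __name__ == "__main__":
    print("using gateway:", pick_gateway(GATEWAYS, PROXY_PORT))
```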

WhatsApp cuts off messaging: October 25

A two-hour outage on October 25 left WhatsApp users unable to send or receive messages on the platform. The free app, owned by Meta, is the world’s most popular messaging service: 31% of the world’s population uses WhatsApp, according to 2022 data from digital intelligence platform Similarweb.

According to ThousandEyes’ outage analysis, the outage was related to a backend application service failure rather than a network failure. It happened during peak hours in India, where the app has a user base of hundreds of millions.

AWS US East Region Hit Again: December 5

Amazon Web Services (AWS) suffered a second outage in its US East 2 region in early December. According to AWS, the outage lasted approximately 75 minutes and caused Internet connectivity issues to and from the US East 2 region.

ThousandEyes observed packet loss between global vantage points and AWS’s US-East-2 region for over an hour. The incident affected end users who connected to AWS services through their ISPs. “The loss occurred only between end users connecting through their ISPs and did not appear to affect connectivity between instances within or between regions,” ThousandEyes said in its outage analysis.

Later that day, AWS posted a blog post saying the issue had been resolved. “Connections between instances within a zone, between zones, and direct connections are not affected by this issue. The issue has been resolved and connectivity has been fully restored,” the post said.
