How Chaos Engineering Improves Cyber Resilience

Conventional wisdom says, “If it ain’t broke, don’t fix it.” Chaos Engineering says, “Let’s try to break it anyway and see what happens.”

The online group Chaos Community defines chaos engineering as “the discipline of conducting experiments on systems to build confidence in their ability to withstand turbulent conditions in production.”

Chaos engineering practitioners essentially stress-test systems and then compare what they think might happen to what actually happens. The goal is to increase resiliency.

For network practitioners who have spent their entire careers focusing on keeping networks up and running, the idea of deliberately trying to shut them down may seem a little crazy.

Why chaos engineering makes sense

But David Mooter, a senior analyst at Forrester Research, believes that chaos engineering is a logical response to an environment where networks are distributed across multi-cloud platforms and are increasingly vulnerable to cyberattacks.

“The problem is that distributed systems are too complex for us to fully understand,” Mooter said. “This means they violate our assumptions and do unexpected things. Modern resilience efforts must be built on the assumption that we cannot fully understand or predict how our systems will behave.”

“The network is not always reliable,” adds Nora Jones, founder and CEO of incident management software provider Jeli, who pioneered chaos engineering while working at streaming service Netflix.

“The concept of testing a network is the same as testing a CPU or anything else – simulating adverse events and revealing unknown unknowns,” Jones said. Chaos engineering supports continuous verification, on the premise that nothing is ever completely reliable and failure is always around the corner. “It’s a constant battle to stay ahead of the eight ball, and it requires a mindset shift in how you approach operations,” she said.

What is an example of chaos engineering?

Mooter said he worked with a company that ran a simple chaos experiment involving misconfigured ports. “The hypothesis is that a misconfigured port will be detected and blocked by the firewall, and then logged to immediately alert the security team,” Mooter said.

The company ran the experiment by regularly introducing misconfigured ports into production. Half the time, the firewall did what was expected; the rest of the time, it failed to block the port. However, a secondary cloud configuration tool always blocked it.

“The problem is that the secondary tool didn’t alert the security team, so the team was blind to these incidents,” Mooter said. “As a result, the experiment exposed not only a flaw in the firewall, but also a flaw in the security team’s ability to detect and respond to incidents.”
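The hypothesis in this example has three parts (the firewall blocks the port, the port does not stay exposed, and the security team is alerted), and each can be checked separately. The following Python sketch is illustrative only: the `Observation` fields and the `evaluate` function are hypothetical names, and the run shown is filled in by hand rather than read from any real firewall or cloud tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """What was seen after one misconfigured port was introduced (hypothetical fields)."""
    blocked_by_firewall: bool
    blocked_by_secondary_tool: bool
    security_team_alerted: bool


def evaluate(obs: Observation) -> List[str]:
    """Compare one observation with the hypothesis and list every deviation."""
    findings = []
    if not obs.blocked_by_firewall:
        findings.append("firewall did not block the port")
    if not (obs.blocked_by_firewall or obs.blocked_by_secondary_tool):
        findings.append("port stayed exposed")
    if not obs.security_team_alerted:
        findings.append("security team was not alerted")
    return findings


# A run like the failing half described above: the firewall misses the port,
# the secondary tool blocks it, and nobody is alerted.
run = Observation(blocked_by_firewall=False,
                  blocked_by_secondary_tool=True,
                  security_team_alerted=False)
for finding in evaluate(run):
    print(finding)
```

Checking each part on its own is what surfaces the second finding: the port was still blocked, so a single pass/fail check on exposure would have hidden the missing alert.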

There is a method to the madness

Chaos engineering is useless if it randomly introduces faults that the network or security team is unaware of and actually shuts down the production network or causes performance issues.

Chaos engineering methods are very specific. First, chaos engineering is primarily performed in non-production environments, Mooter said.

He added: “You don’t break things haphazardly, but you intelligently identify unacceptable risks, form a hypothesis about that risk, and conduct chaos experiments to confirm that the hypothesis is correct.

“You’ll have a test group and a control group so that you can be 100 percent sure that any problems that arise are due to errors you injected into the test group and not unrelated things that happened by coincidence while you were running the experiment.”
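The test-versus-control comparison Mooter describes can be sketched in a few lines of Python. This is a minimal illustration, not a statistical method: the function names are hypothetical, the outcome lists are made-up data, and a real experiment would also need enough samples and a significance check before drawing conclusions.

```python
def error_rate(outcomes):
    """Fraction of requests that failed; outcomes is a list of booleans (True = failed)."""
    return sum(outcomes) / len(outcomes)


def excess_error_rate(control_group, test_group):
    """Error rate in the test group beyond what the control group also saw."""
    return error_rate(test_group) - error_rate(control_group)


# Made-up illustrative data: 1 of 10 control requests failed,
# while 4 of 10 requests in the fault-injected test group failed.
control_group = [True] + [False] * 9
test_group = [True] * 4 + [False] * 6

print(round(excess_error_rate(control_group, test_group), 2))
```

Subtracting the control group's rate removes failures that would have happened anyway, so what remains is the portion that can plausibly be attributed to the injected fault.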

Like a scientific experiment, a hypothesis should be falsifiable, Mooter said. “Every time I run an experiment and it works, I become more convinced that my hypothesis is correct,” he said. “If it fails, then I discover new information about my system that corrects my incorrect assumptions.”

One of the main benefits of this approach is that it can identify problems before they have a significant impact on the business.

“Suppose there is some obscure condition that causes your payment service to go offline,” Mooter said. “Do you want to find that condition in a controlled environment, possibly a non-production environment, where the failure can be shut down immediately and people are actively monitoring the situation? Or do you want it to happen unexpectedly on a Friday night while some critical operations employee happens to be on vacation?”

Best Practices in Chaos Engineering

There are several best practices organizations can apply when experimenting with chaos engineering:

Include application developers: Mooter said, “With complex distributed architectures, developers don’t have good intuition about the limitations of their applications. When chaos engineering becomes part of software delivery, developers will see more and more examples showing that their assumptions are wrong. This will create the habit of questioning your assumptions more proactively.”

Improve communication: Netflix built its own chaos engineering tool and later made it open source. The idea “was to create a mandatory capability for engineers to build resilient systems,” Jones said. “Everyone knows that servers randomly shut down, and systems need to be able to handle it. Not only that, but people need to know how to communicate with the right parties when this happens.”

Choose the right experiment: Network chaos experiments are “arguably the most popular test for simulating the disruptions that cause unplanned outages in today’s complex distributed systems,” said Uma Mukkara, director of chaos engineering at Harness, which provides chaos engineering tools and support services. Enterprises can use chaos engineering to conduct specific experiments such as verifying network latency between two services, examining resiliency mechanisms in code, dropping traffic on service calls to understand the impact on any upstream dependencies, or introducing packet corruption into the network stream to learn about application or service resiliency, Mukkara said.

Loop in security teams: Chaos engineering can be applied to any complex distributed system, including cybersecurity, Mooter said. “For security, our mindset is to assume that no matter how hard you try to be perfect, security controls will fail,” he said. For example, one bank used chaos engineering to change the metrics it measured. Rather than simply tracking time without a security incident, the bank started measuring which specific security protections are known to be effective, Mooter said.
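As an illustration of the network latency experiments mentioned above, the following Python sketch delays a service call and checks whether a timeout-and-fallback mechanism copes with it. Everything here is an assumption made for the example: `get_price` is a stand-in for a downstream call, the delay is simulated with `time.sleep` rather than injected into a real network, and the timeout values are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def with_injected_latency(func, delay_seconds):
    """Wrap func so every call is delayed, simulating a slow network hop."""
    def delayed(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return delayed


def call_with_fallback(func, timeout_seconds, fallback):
    """Resilience mechanism under test: give up after a timeout and use a fallback."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(func)
        try:
            return future.result(timeout=timeout_seconds)
        except FutureTimeout:
            return fallback


def get_price():
    """Stand-in for a downstream service call (hypothetical)."""
    return "live price"


slow_get_price = with_injected_latency(get_price, delay_seconds=0.2)

# Baseline call without the fault, then the same call with latency injected.
print(call_with_fallback(get_price, timeout_seconds=0.1, fallback="cached price"))
print(call_with_fallback(slow_get_price, timeout_seconds=0.1, fallback="cached price"))
```

Running the undelayed call alongside the delayed one gives a baseline, so a fallback result in the second call can be attributed to the injected latency rather than to the service itself.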

Tips for controlling chaos

Chaos engineering can come with risks, such as shutting down networks during busy or even less busy times. That’s why it’s important to follow these guidelines.

Limit chaos engineering projects. “I don’t think you should give every engineer a key to break things,” Jones said. “This is a discipline—more specifically, it’s a human discipline, not a tool discipline—so instilling the appropriate psychological safety and learning culture are prerequisites for chaos engineering to be effective.”

Learn from existing incident response systems. Jones said organizations should take the time to ensure they learn from incidents that have already occurred. “If you’re thinking about chaos engineering, I guarantee you already have a lot of incident information,” she said. Studying those incidents first and surfacing patterns will help identify the best types of experiments to run.

Have a way to stop a chaos experiment. Mooter says it’s a good idea to use automated means to immediately halt disruptive activity when necessary. “Every chaos experiment should be designed to minimize the blast radius if something goes wrong,” he said. “This can be at the infrastructure, application or business layer.” For example, at the infrastructure layer, faults can be isolated to a limited set of connections.
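An automated halt of this kind can be sketched as a loop that widens the fault one step at a time and rolls back as soon as a health metric crosses a limit. The Python below is a minimal sketch under stated assumptions: the function names, the simulated error-rate metric, and the thresholds are all hypothetical and stand in for whatever monitoring and fault-injection tooling an organization actually uses.

```python
def run_experiment(inject, rollback, read_error_rate, max_error_rate, max_steps):
    """Inject a fault step by step; roll back as soon as the error rate exceeds the limit."""
    for step in range(1, max_steps + 1):
        inject(step)
        if read_error_rate() > max_error_rate:
            rollback()
            return f"halted at step {step}"
    rollback()
    return "completed"


# Simulated system state, used only for this illustration.
state = {"affected_connections": 0}


def inject(step):
    """Widen the fault to one more connection per step."""
    state["affected_connections"] = step


def rollback():
    """Remove the fault from every connection."""
    state["affected_connections"] = 0


def read_error_rate():
    """Simulated metric: each affected connection adds 2 percent errors."""
    return 0.02 * state["affected_connections"]


result = run_experiment(inject, rollback, read_error_rate,
                        max_error_rate=0.05, max_steps=10)
print(result)
```

Starting with a single connection and growing gradually keeps the blast radius small, and the rollback runs on both the halted and the completed path so the fault never outlives the experiment.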

Federate the chaos engineering program. “Centralized chaos engineering teams don’t scale,” Mooter said. “If delivery teams aren’t directly involved, they won’t learn and build resilience intuition, so if you centralize, you lose the benefit of culture change.” Mooter says creating an “us versus them” dynamic between the central chaos team and the delivery teams makes no sense.

“For example, one software company found that in the past, the development team would blame the infrastructure for not providing enough disk space, and the infrastructure team would come back and ask why the code written by the developers was taking up so much space,” he said.

Mooter said that after embracing the chaos engineering mindset, both parties stopped arguing about why disks were full and instead asked how to make systems resilient to full disks.

Change the culture. Organizations using chaos engineering would be wise to create a culture of experimentation, Mukkara said.

“No system can be 100 percent reliable,” Mukkara said. “However, your customers want it to be available when they need it. You need to build a system that can withstand common failures and train your team to deal with unknown failures. This starts with trying to understand the behavior and functionality of your system, and continuously improving it over time.”

It’s also important to have visibility and transparency, Mukkara added: “Report and share with multiple stakeholders the issues you identify and the reliability improvements you’re making to the system to keep the business on board.”

For example, teams can report to product management leadership the failure modes the system is protected against and how resiliency mechanisms were successfully tested. “This will give them the confidence to understand the system and the availability it should maintain,” Mukkara said. “You can also let them know what failure modes your system is susceptible to, so the issue can be prioritized or at least identified as an acceptable risk.”
