Amazon Web Services (AWS) is the world's most popular cloud computing platform. Much of this success is due to the scalability and reliability it enables for customers and their applications. However, these benefits don't come out of the box: AWS provides the reliability and scalability features, but it's up to its users to implement them in a way that best supports their workloads and applications.
Knowing how to configure and apply AWS' tools is one thing; verifying that your configuration works is an entirely separate project. Fortunately, we can use Chaos Engineering—the practice of injecting faults into systems to check for weaknesses—to test this. This way, we can make sure our AWS-hosted applications can withstand any conflicts or adversity that our production environment might throw at them.
In this article, we'll explain how this works and how you can apply it to your own systems.
Reliability is critical for software companies, but the complexity of cloud platforms like AWS—combined with the fast pace of development expected of modern DevOps teams—makes it difficult for teams to guarantee reliability out of the gate. The goal of Chaos Engineering is to test and verify application resiliency in your production environments so that your applications can withstand real-world failures.
AWS recognizes this need, which is why they made reliability a pillar of their Well-Architected Framework (WAF). The WAF is a guide for AWS customers on how to optimize applications for AWS, all the way from the design phase through operations and monitoring. Operational Excellence and Reliability are the key pillars for Chaos Engineering, as they encompass running and managing applications.
Although the WAF provides extensive guidance and instruction, it doesn't provide a way to test whether your newly configured applications meet its standards. In other words, how do you know whether your applications are truly well-architected after you've put in all the effort to make them so? The answer is to test for the conditions that the WAF aims to prevent, and the way to do that is by intentionally injecting faults into your AWS workloads. Doing this reveals behaviors in your applications that are unexpected, undesirable, or otherwise deviate from the WAF.
“You can’t consider your workload to be resilient until you hypothesize how your workload will react to failures, inject those failures to test your design, and then compare your hypothesis to the testing results.” — AWS Reliability Pillar announcement blog
Chaos Engineering does more than just highlight operational issues. It also helps identify potential gaps in your monitoring and alerting setup by giving you a way to trigger those alerts. For instance, if you have an Amazon CloudWatch alert designed to notify you when CPU usage reaches critically high levels, you can run a Chaos Engineering experiment designed to consume CPU to test whether that alert fires.
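As a minimal boto3 sketch, an alarm like the one below (the instance ID and SNS topic ARN are placeholders) would fire when a CPU-consumption experiment pushes utilization past its threshold:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical alarm: notify when average CPU on one EC2 instance
# stays above 90% for two consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-utilization",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=90.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder SNS topic
)
```

If a CPU experiment that should have tripped this alarm doesn't, you've found a monitoring gap before an incident finds it for you.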
Chaos Engineering isn't strictly about reliability, either. Performance is also a critical aspect of cloud systems. After you've tested your assumptions about reliability, you can use Chaos Engineering to test your assumptions about the performance of your AWS deployments. You can find the parts of your application that aren't scaling properly, aren't optimized for low CPU or memory availability, or are especially susceptible to noisy neighbors.
It's impossible to list all of the ways an AWS deployment can fail. This isn't because AWS is inherently unreliable, but rather because the sheer number of different services, configuration options, and workflows means there's always the potential for unexpected and unpredicted failure modes. For example, these could include:
Many teams develop incident response procedures for handling these types of failures, and while incident response is important, it's a reactive process. Response plans can only be created after a failure has already happened. And even if you have a response plan, are you certain that it's effective? Have you and your team already tested it? An untested plan is an unanswered question that could result in outages.
Implementing Chaos Engineering on AWS shares many of the same challenges as implementing Chaos Engineering on other platforms:
The first step is to identify which AWS services and features you're using. Your applications might consist of several AWS services: ECS containers, EKS clusters, Lambda functions, EC2 instances, etc. Each of these services has different failure modes and requires different types of tests. For example, Elastic Kubernetes Service (EKS) abstracts away the Kubernetes control plane, so you don't need to worry about the reliability of your control plane nodes. However, you do need to consider what happens to your applications when one of your worker nodes (the nodes that run your Kubernetes workloads) fails.
Multi-cloud testing is also an important consideration. Nearly 90% of organizations use a multi-cloud approach (Flexera), which means any Chaos Engineering tools they adopt should ideally support testing on more than one cloud. There are tools like AWS Fault Injection Simulator (FIS) that make Chaos Engineering on AWS more accessible, but if your workloads are spread across AWS, Azure, Google Cloud Platform (GCP), or others, you'll either need to adopt additional tools or forgo Chaos Engineering on those platforms altogether.
Observability is a key part of Chaos Engineering; without visibility into EC2 instance performance, cluster health, HTTP request/response status, and other metrics, you might not notice when a problem occurs, let alone why it happened.
You can set up monitoring relatively easily using Amazon CloudWatch, or using other tools such as Dynatrace, Datadog, New Relic, etc. One major benefit of CloudWatch is that it natively supports collecting observability data from AWS services, and usually does so by default.
The steady state is how a system performs under normal or ideal conditions. When running in pre-production, your steady state is typically the average load your application handles when it's fully up and running. It's important to collect steady state metrics because these provide a baseline for comparison. Running Chaos Engineering experiments will almost certainly change your metrics, and measuring the difference between the changed metric's value and its steady state value gives you the experiment's impact.
This process is usually done just before running your first attack, as it reduces the time frame that other factors (like load) have to affect your systems.
You've identified the systems that you want to test. You've developed a hypothesis. And, you've measured your steady state. How do you go about running Chaos Engineering experiments?
This step requires a Chaos Engineering tool. A good Chaos Engineering tool doesn't just support fault injection on the AWS services you use but also provides control plane management and reporting. It should make it easy for you to initiate experiments on select systems, observe their effect, and easily stop or roll back experiments if needed.
There are a few approaches you can take to Chaos Engineering on AWS:
Once you've selected a tool, the next step is to design and run your first experiment. Running a Chaos Engineering experiment consists of six steps:
Setting a small blast radius is especially important when getting started with Chaos Engineering. The blast radius is the scale of the test. A small blast radius would be a single application or server, while a large blast radius would be an entire Availability Zone or region. Starting with a small blast radius helps isolate the impact of the experiment to a limited area, making it easier to observe and measure the effects. It also prevents any unexpected problems from impacting other systems. As you become more confident in running experiments, or if the current blast radius is too small to provide meaningful results, increase the blast radius step-by-step.
Lastly, for each test, you should set abort conditions. Abort conditions are the system conditions under which the test should be stopped, regardless of whether it's finished running. They're used to prevent accidental damage to the systems being tested if they enter an undesirable or unexpected state. For example, if you're testing a single application on a single server, you might set your abort conditions to:
When your experiment finishes running, compare the metrics gathered during the experiment to your baseline. Is the impact what you expected? Were there any unusual results, such as simultaneous spikes across multiple different metrics? Do the results prove your hypothesis, do they refute it, or do they not provide a clear answer? And most importantly, did the experiment reveal problems in your systems?
From your observations, create a list of fixes to implement on your systems. Once the fixes are in place, repeat the same experiments to validate that the fixes are working as intended.
A fix, like any change to a complex system, can impact the system beyond just the single service it was implemented for. That's why it's important to repeatedly run chaos experiments while gradually increasing the blast radius. This ensures that your fixes don't just improve the system(s) they were implemented for, but also improve reliability at a larger scale.
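Continuing the earlier FIS sketch, widening the blast radius can be as simple as updating the target's selectionMode once the single-instance experiment passes (the template ID is a placeholder, and the targets map shown here replaces the previous one):

```python
import boto3

fis = boto3.client("fis")

# Widen the blast radius from one instance to 25% of tagged instances.
fis.update_experiment_template(
    id="EXT123456789abcdef",  # placeholder template ID
    targets={
        "target-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos-ready": "true"},  # same placeholder tag as before
            "selectionMode": "PERCENT(25)",
        }
    },
)
```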
This is also an opportunity to improve your monitoring. If you notice any gaps in your monitoring/alerting setup, work on creating and deploying fixes for them. Remember, observability is key to Chaos Engineering, and having a robust monitoring setup makes running experiments much easier and much more effective. Some questions to ask of your monitoring setup are:
Running Chaos Engineering experiments on AWS can seem daunting at first, but with the right tools and procedures, you can quickly start making your systems more reliable. Gremlin provides an AWS-ready platform that supports experimenting on the most popular AWS services including Amazon EC2 and Amazon EKS. Gremlin also helps you validate continued adherence to the AWS Well-Architected Framework (WAF) so you can get the most value out of the tools available to you.
Learn how Charter Communications uses Gremlin to ensure the reliability of their customer data platforms in AWS.
When you have a Chaos Engineering tool selected, watch our webinar: Continuous validation of the AWS Well-Architected Framework with Chaos Engineering.