Getting Started with Chaos Engineering

Last Updated: July 8, 2020

Introduction

Chaos Engineering is a disciplined approach to finding failures before they become outages. You literally "break things on purpose" to learn how to build more resilient systems.

If you're curious to try Chaos Engineering for yourself, but want to practice in a demo environment first, this tutorial is for you.

In this tutorial, we'll walk through 3 chaos experiments to test the reliability of our demo app. We'll do this using Gremlin, a simple, safe and secure service for performing Chaos Engineering experiments through a SaaS-based platform.

After completing this tutorial, you'll have hands-on experience running chaos experiments in a demo environment and be able to run them with confidence on your own infrastructure.

Prerequisites

To successfully complete this tutorial, you’ll need:

An AWS account
Console access with AWS CLI, kubectl and eksctl
A Gremlin account (request a free trial)

But first, what is Chaos Engineering?

Chaos Engineering is a disciplined approach to identifying failures before they become outages. By testing how a system responds under stress, we can proactively fix vulnerabilities and make our systems more reliable.

So how does this work in practice? The best way to start is by creating a thoughtful, planned chaos experiment to validate expected behavior. First, ask yourself, "What could go wrong?" (For example, what happens when one of the third-party services we rely on goes down?) Then, use the scientific method to create a hypothesis, run a controlled experiment simulating the failure, and measure the impact.

After running your experiment, you'll have one of two outcomes. Either you’ve verified that your system is resilient to the failure you introduced, or you’ve found a problem you need to fix. Both of these are good outcomes: in the first case, you’ve increased your confidence in the system and its behavior; in the second, you can fix the problem before it causes an outage.

Tutorial overview

Will our demo app be resilient in the face of failure or will we experience an outage? Let’s find out with Chaos Engineering.

We’ll use Chaos Engineering to test how our demo app handles the following failure scenarios:

High CPU
Dependency outage
Service container failure

When we inject these failures into the demo app, we'll be able to see if the system maintains its functionality or if its services degrade under stress.

Infrastructure and demo environment

We'll use an open-source microservices application with eCommerce functionality as our demo environment. We'll refer to it as the Sock Shop going forward.

Getting set up

Set up eksctl and EKS cluster

Follow the installation instructions for each tool to configure the AWS CLI, kubectl and eksctl.

In your preferred region, launch an EKS cluster. For example:

```bash
eksctl create cluster --name sockshop-eks-cluster --version 1.15 --region us-west-2 --nodegroup-name standard-workers --node-type t3.medium --nodes 3 --nodes-min 1 --nodes-max 4
```

Note: It’s important to use --version 1.15, as version 1.16 breaks Sock Shop’s YAML due to incompatibilities.

Deploy Sock Shop

Using kubectl, deploy Sock Shop.

Clone the repo below and go into the deploy/kubernetes folder.

```bash
git clone https://github.com/microservices-demo/microservices-demo
cd microservices-demo/deploy/kubernetes
```

```bash
kubectl create namespace sock-shop
```

```bash
kubectl apply -f complete-demo.yaml
```
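
Before moving on, it may help to confirm that the Sock Shop pods have started. The sketch below is one possible check and is not part of the original steps: it filters standard `kubectl get pods --no-headers` output for pods whose status is not Running. The function name `pods_not_running` and the sample pod names are made up for illustration.

```shell
# List pods whose STATUS column is anything other than "Running".
# Reads `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE).
pods_not_running() {
  awk '$3 != "Running" { print $1 }'
}

# Hypothetical usage against the cluster:
#   kubectl get pods -n sock-shop --no-headers | pods_not_running
# Made-up sample input so the sketch can be tried without a cluster:
sample_pods='carts-5d4f8c7b9-abcde      1/1   Running             0   2m
carts-db-7c9f6d5b8-fghij   0/1   ContainerCreating   0   2m
front-end-6b8d7c9f5-klmno  1/1   Running             0   2m'

printf '%s\n' "$sample_pods" | pods_not_running
```

An empty result means every listed pod reports Running.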

Access Gremlin

If you haven't already, request a free trial of Gremlin.
Activate your account using the link sent to your email.
- This will open a browser tab to https://app.gremlin.com.

Retrieve your Team ID and Secret Key

To install the Gremlin Kubernetes agent, you will need your Gremlin Team ID and Secret Key. If you already know what those are, you can skip ahead to installing the agent. If you don’t know what your Team ID and Secret Key are, you can get them from the Gremlin web app.

Visit the Teams page in Gremlin, and then click on your team’s name in the list.
On the Teams screen click on Configuration. Make a note of your Team ID.
If you don’t know your Secret Key, you will need to reset it. Click the Reset button. You’ll get a popup reminding you that any running agents using the current Secret Key will need to be configured with the new key. Hit Continue. Next you’ll see a popup screen that will show you the new Secret Key. Make a note of it.

Create a Kubernetes secret from Gremlin certificates

(Skip this step if you are using secret-based authentication)

Download the Gremlin certificates (you need at least team manager access)
Unzip certificates.zip
Rename the files in the certificates folder. Team Name.pub_cert.pem becomes gremlin.cert. Team Name.priv_key.pem becomes gremlin.key.
Create a gremlin namespace: kubectl create namespace gremlin

Create a Kubernetes secret by running the following:

```bash
kubectl -n gremlin create secret generic gremlin-team-cert --from-file=/path/to/gremlin.cert --from-file=/path/to/gremlin.key
```

kubectl

Download and apply the Gremlin configuration manifest

Download the Gremlin configuration manifest by running the following:

```bash
wget https://k8s.gremlin.com/resources/gremlin-conf.yaml
```
Open the file and update the following:
Replace the following line with your team ID: "YOUR TEAM ID GOES HERE"
Replace the following line with your team secret: "YOUR TEAM SECRET GOES HERE"
(If you are using certificate-based authentication, remove this line.)
Replace the following line with a string that you will use to identify your cluster: "YOUR UNIQUE CLUSTER NAME GOES HERE"
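
If you would rather script these replacements than edit the file by hand, something like the following sketch could work. It relies only on the three placeholder strings named above. The function name `fill_gremlin_conf`, the sample file, and its keys (`team_id`, `team_secret`, `cluster_name`) are illustrative assumptions, not the real layout of gremlin-conf.yaml.

```shell
# Fill in the three placeholder strings named in the steps above and
# write the result to stdout.
# Arguments: <manifest file> <team id> <team secret> <cluster name>
# Assumes the values contain no "|", "&" or backslash characters, which
# sed would treat specially in this substitution.
fill_gremlin_conf() {
  sed -e "s|YOUR TEAM ID GOES HERE|$2|" \
      -e "s|YOUR TEAM SECRET GOES HERE|$3|" \
      -e "s|YOUR UNIQUE CLUSTER NAME GOES HERE|$4|" \
      "$1"
}

# Stand-in file for illustration only; the keys are hypothetical.
sample_conf=$(mktemp)
cat > "$sample_conf" <<'EOF'
team_id: "YOUR TEAM ID GOES HERE"
team_secret: "YOUR TEAM SECRET GOES HERE"
cluster_name: "YOUR UNIQUE CLUSTER NAME GOES HERE"
EOF

fill_gremlin_conf "$sample_conf" "example-team-id" "example-secret" "sockshop-eks-cluster"
```

Redirect the output to a new file and review it before applying, since the secret ends up in plain text on disk.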

Apply the manifest with this command:

```bash
kubectl apply -f /path/to/gremlin-conf.yaml
```

Download and apply the Gremlin agent manifest

If you are using certificate-based authentication:

Download and apply the Gremlin agent manifest for your Kubernetes cluster by running the following:

```bash
kubectl apply -f https://k8s.gremlin.com/resources/gremlin-client.yaml
```

If you are using secret-based authentication:

Download and apply the Gremlin agent manifest for your Kubernetes cluster by running the following:

```bash
kubectl apply -f https://k8s.gremlin.com/resources/gremlin-client-secret.yaml
```

Enabling Gremlin on the Kubernetes Master

Most Kubernetes deployments configure master nodes with the node-role.kubernetes.io/master:NoSchedule taint. You can run the following command to see if any of your nodes have this taint:

```shell
kubectl get no -o=custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
NAME      TAINTS
kube-01   [map[effect:NoSchedule key:node-role.kubernetes.io/master]]
kube-02   <none>
```

If you wish to install Gremlin on a Kubernetes master that has been tainted, add a tolerations section to the PodSpec of the Gremlin Agent Manifest.

```yaml
tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
```

You will need to reapply the Gremlin agent manifest after making this change.
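
On a larger cluster, scanning the taint listing by eye gets tedious. As a rough sketch, the custom-columns output shown earlier can be filtered for nodes that carry the master NoSchedule taint. The function name `tainted_masters` is an illustrative assumption; the sample input is the output from the example above.

```shell
# Print the names of nodes whose TAINTS column carries the master NoSchedule taint.
# Reads the custom-columns output (header line first) on stdin.
tainted_masters() {
  awk 'NR > 1 && index($0, "node-role.kubernetes.io/master") && index($0, "NoSchedule") { print $1 }'
}

# Hypothetical usage against a cluster:
#   kubectl get no -o=custom-columns=NAME:.metadata.name,TAINTS:.spec.taints | tainted_masters
# Sample output from the tutorial, so the sketch runs without a cluster:
sample_nodes='NAME      TAINTS
kube-01   [map[effect:NoSchedule key:node-role.kubernetes.io/master]]
kube-02   <none>'

printf '%s\n' "$sample_nodes" | tainted_masters
```

Any node name printed here will only run the Gremlin agent if the tolerations section is present.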

Download and apply the K8s agent manifest

If you are using certificate-based authentication:

Download and apply the K8s agent manifest by running:

```bash
kubectl apply -f https://k8s.gremlin.com/resources/gremlin-chao.yaml
```

If you are using secret-based authentication:

Download and apply the K8s agent manifest by running:

```bash
kubectl apply -f https://k8s.gremlin.com/resources/gremlin-chao-secret.yaml
```

Access monitoring

Open CloudWatch in the AWS Console.
On the CloudWatch Dashboard, click on Overview and select Container Insights.
Make sure EKS Clusters is selected.

For more information on using Container Insights, see the CloudWatch Container Insights documentation.

Access Sock Shop

```bash
kubectl get svc -o wide -n sock-shop | grep LoadBalancer
```

Copy the load balancer’s DNS name.
- Sample: a34.us-east-1.elb.amazonaws.com
Paste this DNS name into a browser to access the Sock Shop front end.
Navigate around to get a feel for all the functions of the shop. Things to try out:
- Register and log in
- View various items
- Add items to cart
- Remove items from cart
- Check out items
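
To avoid copying the DNS name by hand, you could pull it straight out of the service listing. This is a minimal sketch that assumes the standard `kubectl get svc -o wide` column order. The function name `load_balancer_hosts`, the cluster IPs, and the `carts` row are made up for illustration; the hostname is the sample from the step above.

```shell
# Print the EXTERNAL-IP column (4th) for every service of type LoadBalancer.
# Reads `kubectl get svc -o wide` output on stdin
# (columns: NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR).
load_balancer_hosts() {
  awk '$2 == "LoadBalancer" { print $4 }'
}

# Hypothetical usage against the cluster:
#   kubectl get svc -o wide -n sock-shop | load_balancer_hosts
# Illustrative sample input so the sketch runs without a cluster:
sample_svc='NAME        TYPE           CLUSTER-IP      EXTERNAL-IP                        PORT(S)        AGE   SELECTOR
carts       ClusterIP      10.100.20.11    <none>                             80/TCP         5m    name=carts
front-end   LoadBalancer   10.100.12.34    a34.us-east-1.elb.amazonaws.com    80:30001/TCP   5m    name=front-end'

printf '%s\n' "$sample_svc" | load_balancer_hosts
```

If the load balancer is still being provisioned, this column typically shows `<pending>` instead of a hostname.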

Run chaos experiments

Experiment 1: Validate auto scaling on CPU load

For this first experiment, we will check to see if this cluster has autoscaling policies dialed in correctly.

Select Recommended Scenarios in Gremlin, then click View Details for the “Validate Auto Scaling” scenario.
Scroll to the bottom and click Add targets and run.
Select all 3 hosts found and click Run Scenario.

Questions

In the AWS Console, look at Container Insights:

Was this failure detected?
Did the outcome of this failure result in expected behaviors?
Would the service be able to handle this failure?

Remediation: Set up auto scaling

Go to the AWS Console and select EC2 from Services.
On the left navigation bar, select Auto Scaling Groups.
Each cluster gets its own Auto Scaling Group. Select the one you need, and then in the lower navigation, select Scaling Policies.
Select Add Policy, then Create a simple scaling policy.
- We will create two of these: one to scale up and one to set the group back to the usual 3 instances.
Give the policy a name (we will call it “Cluster-ScaleUp”) and select Create New Alarm. Create the alarm to go off when CPU utilization is greater than or equal to (>=) 13% for at least 1 minute, and name it "Cluster-ScaleUp."
Press Create Alarm. Now you will be taken back to finish editing the policy.
Edit the values to add “1” instance and then wait 120 seconds before the next scaling activity. Press Create when finished.
Follow the same steps to create a second policy called “Cluster-ScaleDown” with a new alarm. This alarm should go off when CPU utilization is less than or equal to (<=) 13% within 15 minutes.
- Customize your scenario if you would like to see more auto scaling events kick off.
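
If you prefer the command line to the console, the scale-up policy and alarm described above could be created roughly as follows with the AWS CLI. Treat this as a sketch under assumptions rather than the tutorial's own method: the function name `create_scale_up_policy` is made up, you supply your Auto Scaling Group's name, and the Average statistic is an assumption because the console steps do not name one.

```shell
# Create the "Cluster-ScaleUp" simple scaling policy and its CloudWatch alarm.
# Argument: the name of the cluster's Auto Scaling Group (supplied by you).
create_scale_up_policy() {
  asg_name=$1

  # Add 1 instance per trigger, then wait 120 seconds before the next activity.
  policy_arn=$(aws autoscaling put-scaling-policy \
    --auto-scaling-group-name "$asg_name" \
    --policy-name Cluster-ScaleUp \
    --policy-type SimpleScaling \
    --adjustment-type ChangeInCapacity \
    --scaling-adjustment 1 \
    --cooldown 120 \
    --query PolicyARN --output text)

  # Fire when CPU is >= 13% over one 60-second period.
  # (The Average statistic is an assumption; the steps do not name one.)
  aws cloudwatch put-metric-alarm \
    --alarm-name Cluster-ScaleUp \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=AutoScalingGroupName,Value=$asg_name" \
    --statistic Average \
    --period 60 \
    --evaluation-periods 1 \
    --threshold 13 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions "$policy_arn"
}

# Hypothetical usage (replace with your group's name):
#   create_scale_up_policy my-eks-nodegroup-asg
```

A matching scale-down policy would likely use `--adjustment-type ExactCapacity --scaling-adjustment 3` together with `LessThanOrEqualToThreshold` on its alarm.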

Experiment 2: Dependency outage

For this experiment, we will discover what happens when a service dependency becomes unavailable while your primary service is still making requests to it.

In Gremlin, create a new attack.
Select the Containers tab.
In the Search bar, look up carts-db.
Scroll down and click Choose a Gremlin.
Select Network -> Blackhole Gremlin.
Change the length of the attack to 300 seconds.
Click Unleash Gremlin.
As the attack is running, try the following:
- Add items to the cart
- Remove items from the cart
- Update quantity of items in cart

Questions:

Is there any customer impact?
Are systems recovering gracefully?
Is there any way to mitigate this?
- In Gremlin, click Halt All Attacks to stop this attack.
Did systems recover?
What did we learn?
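
Clicking through the shop answers these questions qualitatively. To put a number on customer impact, you could also probe the shop while the attack runs and count failed responses. The sketch below only does the counting. The function name `summarize_status_codes` and the `PROBE_URL` variable are illustrative assumptions, and which page or endpoint to probe is left to you.

```shell
# Summarize HTTP status codes (one per line on stdin) as "<failed>/<total> failed".
# Any code outside 200-399 counts as a failure; curl reports 000 when it
# cannot get a response at all.
summarize_status_codes() {
  awk '{ total++; if ($1 < 200 || $1 >= 400) failed++ }
       END { printf "%d/%d failed\n", failed, total }'
}

# Hypothetical usage while the attack is running (PROBE_URL is yours to set):
#   for i in 1 2 3 4 5; do
#     curl -s -o /dev/null -m 5 -w '%{http_code}\n' "$PROBE_URL"
#     sleep 1
#   done | summarize_status_codes
# Made-up sample codes so the sketch runs without a cluster:
printf '200\n200\n500\n000\n200\n' | summarize_status_codes
```

Running the same probe before, during, and after the attack gives you a baseline to compare against.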

Experiment 3: Service container failure

For this experiment, we will test what happens if a container were to fail. Sometimes, especially in a containerized environment, your orchestration can automatically recover, but it takes time to detect and fix the issue, resulting in a potential partial outage.

In Gremlin, create a new attack.
Select the Containers tab.
In the Search bar, look up carts-db.
Scroll down and click Choose a Gremlin.
Select State -> Shutdown Gremlin.
Switch off Reboot.
Click Unleash Gremlin.
As the attack is running, try the following:
- Add items to cart
- Access items in cart
- Remove items in cart
- Check out

Questions:

Is there any customer impact?
Are systems recovering gracefully?
- How long did it take?
- Is full customer experience restored?
How might you mitigate this?
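
To answer "How long did it take?" with a number rather than a guess, one option is to log timestamped probe results during the attack and compute the gap between the first failure and the next success. This is a minimal sketch. The function name `recovery_seconds`, the `PROBE_URL` variable, and the log format (epoch seconds followed by an HTTP status code) are all assumptions for illustration.

```shell
# Read "<epoch seconds> <http status>" lines on stdin and print the number of
# seconds between the first failed probe and the first successful probe after it.
recovery_seconds() {
  awk '
    # Remember when the first failure (status outside 200-399) was observed.
    !down && ($2 < 200 || $2 >= 400) { down = 1; start = $1; next }
    # The first success after a failure marks recovery.
    down && $2 >= 200 && $2 < 400 { print $1 - start; found = 1; exit }
    END {
      if (!down) {
        print "no failure observed"
      } else if (!found) {
        print "not recovered"
      }
    }
  '
}

# Hypothetical way to collect the log (stop with Ctrl-C when done):
#   while true; do
#     printf '%s %s\n' "$(date +%s)" "$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$PROBE_URL")"
#     sleep 1
#   done | tee probe.log
#   recovery_seconds < probe.log
# Made-up sample log so the sketch runs without a cluster:
sample_probe='1000 200
1005 500
1010 000
1015 500
1020 200'

printf '%s\n' "$sample_probe" | recovery_seconds
```

The result is only as precise as the probe interval, so a shorter sleep gives a tighter estimate.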

Create your own experiment

Now that you’ve had a chance to run some pre-planned experiments, you can create your own experiment from start to finish. There is no wrong way to create an experiment, but it’s important to go through the full thought process.

How to create a chaos experiment:

Create a hypothesis
Contain the blast radius
Run the experiment
Measure the impact
Share results

Questions

Was this failure detected?
Did this failure have customer impact?
- If so, what was it?
Was the impact of this failure expected, and did it match your hypothesis?
- If not, what happened instead?
Can this failure be handled or mitigated?

Now increase the reliability of your own systems

While running experiments on a demo app is admittedly pretty fun, it doesn't improve the reliability of your systems. Start running experiments on your own infrastructure to test and validate your systems' response to failure and improve overall reliability.

Check out our documentation to install Gremlin anywhere, including bare-metal, on-prem, VMs, containers, serverless and Kubernetes environments.

If you'd like to try all Gremlin Attacks, including Packet Loss and Memory, request a demo and we'll set you up with a free trial of Gremlin.
